Help with parsing web page

Hi,

I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.

Many thanks

RiGga

Jul 18 '05 #1
RiGGa wrote:
[...]

Anyone? I have found out I can use sgmllib, but I find the documentation
unclear. If anyone knows of a tutorial or howto, it would be
appreciated.

thanks

R
Jul 18 '05 #2
Hello RiGGa,
Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAG>, end_<TAG> and handle_data.
Since you don't know *when* each callback function is called you need to
keep an internal state. It can be a simple variable or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python
# Python 2 code: the htmllib module does not exist in Python 3.

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""
        self.data = ""

    def start_title(self, attrs):
        # Entering <title>: start collecting text from scratch
        self.state = "title"
        self.data = ""

    def end_title(self):
        print "Title:", self.data.strip()
        self.state = ""  # stop collecting text once </title> is seen

    def handle_data(self, data):
        # Called for every run of text; keep it only while inside <title>
        if self.state:
            self.data += data

if __name__ == "__main__":
    from sys import argv

    parser = TitleParser()
    parser.feed(open(argv[1]).read())
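
The example above targets Python 2. For readers on Python 3, where htmllib is no longer available, the same state-tracking idea can be sketched with the standard html.parser module. This is only a rough equivalent, not the original poster's code; the helper name extract_title and the attribute names in_title and title are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # internal state: are we inside <title>?
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text can arrive in several pieces, so accumulate it
        if self.in_title:
            self.title += data


def extract_title(html):
    """Hypothetical helper: feed a whole document, return its title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

Unlike htmllib, html.parser routes every tag through handle_starttag and handle_endtag, so the tag name has to be checked inside the callback. For nested tags the boolean flag would be replaced by a stack, as described above.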

HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <mi*********@zoran.com>
The only difference between children and adults is the price of the toys.

Jul 18 '05 #3
Miki Tebeka wrote:
[...]

Thanks for taking the time to help, it's appreciated. I am new to Python so I am a
little confused by what you have posted; however, I will go through it
again and see if it makes more sense.

Many thanks

Rigga
Jul 18 '05 #4
On Mon, 14 Jun 2004 17:48:33 +0100, RiGGa wrote:
[...]

Hi,

Since HTML can be broken in several ways, I would
pipe the HTML through tidy first. You can use the "-asxml"
option, and then parse the resulting XML.

http://tidy.sourceforge.net/
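
A minimal sketch of that two-step approach in Python 3, assuming the tidy command-line program is installed and on the PATH. The function names are made up for illustration, and the extra "-numeric" and "-quiet" options are my own choice (so that named entities such as &nbsp; do not trip the XML parser), not something suggested in this thread.

```python
import subprocess
import xml.etree.ElementTree as ET

# tidy -asxml emits XHTML, so element names live in this namespace
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_to_xhtml(html):
    """Hypothetical helper: run the external tidy program on an HTML string."""
    result = subprocess.run(
        ["tidy", "-asxml", "-numeric", "-quiet"],
        input=html,
        capture_output=True,
        text=True,
    )
    # Assumption: exit status 1 means warnings only, 2 means real errors
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


def table_cells(xhtml):
    """Return the text of every <td> in a well-formed XHTML document."""
    root = ET.fromstring(xhtml)
    return ["".join(td.itertext()).strip() for td in root.iter(XHTML_NS + "td")]
```

Something like table_cells(tidy_to_xhtml(page)) would then give a flat list of cell texts to pick values from before writing them to the database.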

Thomas

Jul 18 '05 #5
"RiGGa" <ri***@hasnomail.com> wrote in message
news:aF*********************@stones.force9.net...
[...]

RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexp's, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd + \
        tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL = "http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc
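
The post mentions that the same scan can be done with regular expressions if the whitespace is handled explicitly. A rough Python 3 sketch of that variant follows; the pattern and the function name find_servers are my own illustration and are not part of pyparsing or its examples.

```python
import re

# Same two-cell pattern as above, with \s* wherever whitespace
# (line breaks, tabs) may appear between or inside the cells.
ROW_RE = re.compile(
    r"<td>\s*(?P<ipAddr>\d+\.\d+\.\d+\.\d+)\s*</td>\s*"
    r"<td>\s*(?P<loc>[^<]+?)\s*</td>"
)


def find_servers(html):
    """Return (ip address, location) pairs found anywhere in the HTML."""
    return [(m.group("ipAddr"), m.group("loc")) for m in ROW_RE.finditer(html)]
```

The named groups play the role of setResultsName in the pyparsing version. The regular expression is more compact, but each optional whitespace position has to be spelled out by hand.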

Jul 18 '05 #6
RiGGa wrote:
[...]

RiGga,
Here is something that is, hopefully, not too simple. Frequently, you can
strip out the HTML, and the resulting list will have a label followed by
the piece of data you want to save.
Do you need MySQL code?
wes

def RemoveLessThanGreaterThanSectionsTokenize( s ):
    state = 0
    text = ""    # renamed from str/list to avoid shadowing the builtins
    tokens = []
    for ch in s:
        # grabbing good chars state
        if state == 0:  # s always starts with '<'
            if ch == '<':
                state = 1
                if len(text) > 0:
                    tokens.append(text)
                    text = ""
            else:
                text += ch
        # dumping bad chars state
        elif state == 1:  # looking for '>'
            if ch == '>':
                state = 0
    if len(text) > 0:  # keep any text left after the final tag
        tokens.append(text)
    return tokens
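
To illustrate the "label followed by the piece of data" idea, here is a small Python 3 sketch. It swaps the character loop for a regular-expression split to keep the example short, and both function names and the sample label are made up for illustration.

```python
import re


def strip_tags_tokenize(html):
    """Split on anything that looks like <...> and keep the non-blank text."""
    pieces = re.split(r"<[^>]*>", html)
    return [p.strip() for p in pieces if p.strip()]


def value_after(tokens, label):
    """Return the token that follows the first occurrence of label, if any."""
    for i, tok in enumerate(tokens[:-1]):
        if tok == label:
            return tokens[i + 1]
    return None
```

For a row such as "<tr><td>Price:</td><td>42</td></tr>", the token list is ["Price:", "42"], so looking up "Price:" yields "42". This only works while the page layout stays stable, which is the usual trade-off of tag stripping against a real parser.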

Jul 18 '05 #7
RiGGa wrote:
[...]

Many thanks for all your help, I will go away and digest it.

R
Jul 18 '05 #8
RiGGa wrote:
[...]

Said I would be back :)

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R
Jul 18 '05 #9
RiGGa <ri***@hasnomail.com> writes:
[...]

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R


http://www.python.org/doc/current/li...e-objects.html
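
The link above points at the file-object documentation, where the relevant method is tell(). A short Python 3 sketch of both readings of the question follows: the offset in a file object, and the position inside the HTML as seen by the parser via getpos(). The class and function names are hypothetical, and html.parser stands in for the Python 2 modules discussed earlier in the thread.

```python
import io
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Records where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() takes no arguments and returns (line number, column offset)
        self.positions.append((tag, self.getpos()))


def tag_positions(html):
    parser = PositionParser()
    parser.feed(html)
    parser.close()
    return parser.positions


def offset_after_read(stream, count):
    """Read count bytes from a binary file object and report the offset."""
    stream.read(count)
    return stream.tell()
```

Note that getpos() is a method of the parser, so inside a callback it is called as self.getpos(), while tell() belongs to the file object. Once the whole file has been read into a string and passed to feed(), tell() only reports the end of the file.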
Jul 18 '05 #10
