Help with parsing web page

Hi,

I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib,
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.

Many thanks

RiGga

Jul 18 '05 #1


RiGGa wrote:
[...]

Anyone? I have found out I can use sgmllib, but I find the documentation is
not that clear. If anyone knows of a tutorial or howto, it would be
appreciated.

thanks

R
Jul 18 '05 #2

Hello RiGGa,
Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAG>, end_<TAG> and handle_data.
Since you don't know *when* each callback function is called, you need to
keep an internal state. It can be a simple variable, or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python
# Python 2 only: htmllib was removed in Python 3.

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""
        self.data = ""

    def start_title(self, attrs):
        # Entering <title>: start collecting text.
        self.state = "title"
        self.data = ""

    def end_title(self):
        # Leaving </title>: stop collecting, then report.
        self.state = ""
        print "Title:", self.data.strip()

    def handle_data(self, data):
        # Only keep text while we are inside a tag we care about.
        if self.state:
            self.data += data

if __name__ == "__main__":
    from sys import argv

    parser = TitleParser()
    parser.feed(open(argv[1]).read())
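
The example above keeps a single flag as its state. For the nested-tag case mentioned earlier, a stack of open tags is one option. What follows is only a rough sketch, and it swaps in Python 3's html.parser module (htmllib is not available there), so the callbacks are handle_starttag and handle_endtag rather than start_<TAG> and end_<TAG>. The class name CellTextParser and the choice of <td> cells are made up for illustration.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td> cell, tracking open tags on a stack.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.cells = []   # text collected for each <td> seen so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "td":
            self.cells.append("")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this also discards
        # unclosed inner tags such as <br>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Text belongs to a cell if a <td> is open anywhere on the stack,
        # so text inside <td><b>...</b></td> is still collected.
        if "td" in self.stack:
            self.cells[-1] += data


parser = CellTextParser()
parser.feed("<table><tr><td>132.163.4.101</td>"
            "<td>NIST, <b>Boulder</b>, Colorado</td></tr></table>")
print(parser.cells)
```

Checking the whole stack rather than only its top entry is what lets the text inside the nested <b> tag end up in the enclosing cell.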

HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <mi*********@zoran.com>
The only difference between children and adults is the price of the toys.

Jul 18 '05 #3

Miki Tebeka wrote:
[...]

Thanks for taking the time to help; it's appreciated. I am new to Python, so I
am a little confused by what you have posted, but I will go through it
again and see if it makes more sense.

Many thanks

Rigga
Jul 18 '05 #4

On Mon, 14 Jun 2004 17:48:33 +0100, RiGGa wrote:
[...]

Hi,

Since HTML can be broken in several ways, I would
pipe the HTML through tidy first. You can use the "-asxml"
option, and then parse the XML.

http://tidy.sourceforge.net/
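
A rough sketch of that pipeline in Python 3, assuming the tidy executable is installed and on the PATH. The function names tidy_to_xhtml and table_cells are invented for this example, and the exit-code handling reflects my understanding that tidy returns 1 for warnings and 2 for errors; check your tidy version before relying on it.

```python
import subprocess
import xml.etree.ElementTree as ET

# tidy's XHTML output puts every element in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_to_xhtml(html):
    """Pipe HTML text through the external `tidy` program (assumed installed).

    -asxml asks for well-formed XHTML; -numeric writes entities such as
    &nbsp; as numeric references so a plain XML parser accepts them.
    """
    result = subprocess.run(
        ["tidy", "-q", "-asxml", "-numeric"],
        input=html, capture_output=True, text=True,
    )
    # Assumption: exit status 1 means warnings only, 2 means errors.
    if result.returncode > 1:
        raise RuntimeError("tidy failed: " + result.stderr)
    return result.stdout


def table_cells(xhtml):
    """Return the stripped text of every <td> in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    return ["".join(td.itertext()).strip() for td in root.iter(XHTML_NS + "td")]


# Typical use (needs tidy, so it is left commented out):
#   cells = table_cells(tidy_to_xhtml(open("page.html").read()))
```

Once the markup is well-formed, any XML parser will do; ElementTree is used here only because it ships with Python.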

Thomas

Jul 18 '05 #5

"RiGGa" <ri***@hasnomail.com> wrote in message
news:aF*********************@stones.force9.net...
[...]

RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexps, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd + \
    tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL = \
    "http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc
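
For comparison, the regular-expression route mentioned in this post might look roughly like the sketch below (Python 3, standard library only). The pattern and the function name find_time_servers are illustrative, not taken from pyparsing or its examples; the \s* pieces are the "re fields for the whitespace" that pyparsing makes unnecessary. The sample HTML is the two-cell example from the post, not a live page.

```python
import re

# <td>ip-address</td><td>location text</td>, allowing whitespace
# (line breaks, tabs) around and between the cells.
TD_PAIR = re.compile(
    r"<td>\s*(?P<ipAddr>\d+\.\d+\.\d+\.\d+)\s*</td>\s*"
    r"<td>\s*(?P<loc>[^<]+?)\s*</td>"
)


def find_time_servers(html):
    """Return {ip address: location} for every matching pair of cells."""
    return {m.group("ipAddr"): m.group("loc") for m in TD_PAIR.finditer(html)}


sample = """
<tr>
    <td>132.163.4.101</td>
    <td>NIST, Boulder, Colorado</td>
</tr>
"""

for ip, loc in find_time_servers(sample).items():
    print(ip, "-", loc)
```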

Jul 18 '05 #6

RiGGa wrote:
[...]

RiGga,
If you want something simple (hopefully not too simple): frequently, you can
strip out the HTML tags, and the resulting list will have a label followed by
the piece of data you want to save.
Do you need MySQL code?
wes

def RemoveLessThanGreaterThanSectionsTokenize( s ):
    state = 0
    token = ""     # named token/tokens so the built-in str and list are not shadowed
    tokens = []
    for ch in s:
        # grabbing good chars state
        if state == 0:
            if ch == '<':
                state = 1
                if len(token) > 0:
                    tokens.append(token)
                    token = ""
            else:
                token += ch
        # dumping bad chars state
        elif state == 1: # looking for '>'
            if ch == '>':
                state = 0
    # keep any text that follows the last tag
    if len(token) > 0:
        tokens.append(token)
    return tokens
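
To show the "label followed by the piece of data" idea, here is a small Python 3 sketch. It uses re.split instead of the character-by-character state machine above, which is a substitution on my part, and it assumes the page really does alternate label, value, label, value. The function names and the sample table are hypothetical.

```python
import re


def text_between_tags(html):
    """Split on <...> sections and keep the non-blank pieces of text."""
    pieces = re.split(r"<[^>]*>", html)
    return [piece.strip() for piece in pieces if piece.strip()]


def label_value_pairs(tokens):
    """Pair up tokens as label, value, label, value, ...

    Assumes strict alternation; a trailing unpaired token is dropped.
    """
    return dict(zip(tokens[0::2], tokens[1::2]))


page = ("<table><tr><td>Name</td><td>Widget</td></tr>"
        "<tr><td>Price</td><td>9.99</td></tr></table>")

tokens = text_between_tags(page)
print(tokens)
print(label_value_pairs(tokens))
```

The resulting dictionary is a convenient shape for building the rows to insert into the database.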

Jul 18 '05 #7

RiGGa wrote:
[...]

Many thanks for all your help, I will go away and digest it.

R
Jul 18 '05 #8

RiGGa wrote:
[...]

Said I would be back :)

How do I get my current position (offset) in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R
Jul 18 '05 #9

RiGGa <ri***@hasnomail.com> writes:
[...]

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R


http://www.python.org/doc/current/li...e-objects.html
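
The link above points at the file object documentation, where the relevant method is tell(). A short Python 3 sketch of both options follows: tell() on a file object, and getpos() on a parser, which reports a (line number, column) pair rather than a byte offset. io.BytesIO stands in for a real file opened with open(path, "rb"), and the PositionParser class is made up for this example; it uses html.parser, not the htmllib parser from earlier in the thread.

```python
import io
from html.parser import HTMLParser

# 1) File objects: tell() reports the current offset.
#    io.BytesIO stands in for open(path, "rb") to keep this self-contained.
page = io.BytesIO(b"<html>\n<title>Demo</title>\n</html>")
page.read(6)
offset = page.tell()
print("file offset:", offset)


# 2) Parsers: getpos() is a method of the parser object, so inside a
#    callback it is called as self.getpos().
class PositionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # (line number, column) where the tag being handled starts
        self.positions.append((tag, self.getpos()))


parser = PositionParser()
parser.feed("<html>\n<title>Demo</title>\n</html>")
print(parser.positions)
```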
Jul 18 '05 #10
