Help with parsing web page

Hi,

I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.

Many thanks

RiGga

Jul 18 '05 #1
Anyone? I have found out I can use sgmllib, but I find the documentation is
not that clear. If anyone knows of a tutorial or howto, it would be
appreciated.

thanks

R
Jul 18 '05 #2
Hello RiGGa,
Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<tag>, end_<tag> and handle_data.
Since you don't know *when* each callback function will be called, you need to
keep an internal state. It can be a simple variable, or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python
# Python 2 example: htmllib was removed in Python 3.0.

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""   # name of the tag we are currently inside, if any
        self.data = ""    # text collected for that tag

    def start_title(self, attrs):
        self.state = "title"
        self.data = ""

    def end_title(self):
        print "Title:", self.data.strip()
        self.state = ""   # stop collecting once the title is closed

    def handle_data(self, data):
        if self.state:
            self.data += data

if __name__ == "__main__":
    from sys import argv

    parser = TitleParser()
    parser.feed(open(argv[1]).read())
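
For readers on Python 3, where htmllib is no longer available, roughly the same idea can be sketched with the standard html.parser module. This is an illustration and not the code from the post: the `stack` attribute and the `extract_title` helper are names made up for the example, and the stack is there to show the nested-tag state mentioned above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside <title>, tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the tags that are currently open
        self.title = ""   # text collected while <title> is on top

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag. This tolerates tags that
        # are never closed, such as <meta> or <br>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "title":
            self.title += data


def extract_title(html):
    """Hypothetical helper: return the stripped <title> text of html."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

Calling `extract_title(open(path).read())` would play the role of the `__main__` block in the original example.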

HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <mi*********@zoran.com>
The only difference between children and adults is the price of the toys.

Jul 18 '05 #3
Thanks for taking the time to help, it's appreciated. I am new to Python, so
I am a little confused by what you have posted; however, I will go through it
again and see if it makes more sense.

Many thanks

Rigga
Jul 18 '05 #4
On Mon, 14 Jun 2004 17:48:33 +0100, RiGGa wrote:
[...]

Hi,

Since HTML can be broken in several ways, I would
pipe the HTML through tidy first. You can use the "-asxml"
option, and then parse the XML.

http://tidy.sourceforge.net/

Thomas
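
Once tidy has produced well-formed XHTML, the parsing step could look roughly like the sketch below, which uses Python 3's xml.etree.ElementTree. The `SAMPLE` string stands in for tidy's output and `table_cells` is a made-up helper name; real tidy output may also contain named entities that this parser does not know about, which the sketch does not handle.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"   # namespace used by XHTML documents
NS = {"x": XHTML}

# Stand-in for the output of running the page through tidy (assumption).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Servers</title></head>
<body><table>
<tr><td>132.163.4.101</td><td>NIST, Boulder, Colorado</td></tr>
</table></body></html>"""


def table_cells(xhtml):
    """Return the text of every <td>, grouped by table row."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.iter("{%s}tr" % XHTML):
        rows.append([(td.text or "").strip() for td in tr.findall("x:td", NS)])
    return rows
```

Each inner list is then one candidate row for the database insert.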

Jul 18 '05 #5
"RiGGa" <ri***@hasnomail.com> wrote in message
news:aF*********************@stones.force9.net...
[...]

RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexps, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd + \
        tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL = "http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc
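
The regular-expression alternative mentioned earlier in this post might look something like the following Python 3 sketch. It is not part of the pyparsing examples: `TD_PAIR` and `find_servers` are names invented here, and the explicit `\s*` fields are the whitespace handling that pyparsing otherwise does for you.

```python
import re

# An IP-address cell followed by a free-text location cell, with
# optional whitespace (line breaks, tabs) allowed around each part.
TD_PAIR = re.compile(
    r"<td>\s*(?P<ipAddr>\d+\.\d+\.\d+\.\d+)\s*</td>\s*"
    r"<td>\s*(?P<loc>[^<]+?)\s*</td>"
)


def find_servers(html):
    """Return a dict mapping each IP address found to its location text."""
    return {m.group("ipAddr"): m.group("loc") for m in TD_PAIR.finditer(html)}
```

The named groups mirror the `ipAddr` and `loc` result names used in the pyparsing version.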

Jul 18 '05 #6
RiGga,
Here is something that is, hopefully, not too simple. Frequently, you can
strip out the HTML, and the resulting list will have a label followed by
the piece of data you want to save.
Do you need MySQL code?
wes

def RemoveLessThanGreaterThanSectionsTokenize( s ):
    # Split s into the text runs found between <...> sections.
    state = 0        # 0 = collecting text, 1 = skipping inside <...>
    text = ""
    tokens = []
    for ch in s:
        # grabbing good chars state
        if state == 0:
            if ch == '<':
                state = 1
                if len(text) > 0:
                    tokens.append(text)
                    text = ""
            else:
                text += ch
        # dumping bad chars state
        elif state == 1:  # looking for '>'
            if ch == '>':
                state = 0
    if len(text) > 0:    # keep any text after the last tag
        tokens.append(text)
    return tokens
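
To illustrate the "label followed by the piece of data" idea, here is a small Python 3 sketch. It swaps the character loop for a regular-expression split and also drops whitespace-only pieces, so it is a variation on the function above and not a copy of it; `text_tokens` and `value_after` are made-up names.

```python
import re


def text_tokens(html):
    """Split on <...> sections, dropping empty and whitespace-only pieces."""
    return [t.strip() for t in re.split(r"<[^>]*>", html) if t.strip()]


def value_after(tokens, label):
    """Return the token that follows label, or None if there is none."""
    for i, tok in enumerate(tokens[:-1]):
        if tok == label:
            return tokens[i + 1]
    return None
```

For a row such as `<tr><td>Price:</td><td>42.50</td></tr>`, looking up the label `Price:` in the token list yields the value that would go into the database.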

Jul 18 '05 #7
Many thanks for all your help, I will go away and digest it.

R
Jul 18 '05 #8
Said I would be back :)

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R
Jul 18 '05 #9
RiGGa <ri***@hasnomail.com> writes:
[...]

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R


http://www.python.org/doc/current/li...e-objects.html
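
Assuming the link above points at the file object documentation, the relevant call is tell(), which reports the current offset in an open file. The Python 3 sketch below shows that, alongside getpos() on an html.parser.HTMLParser subclass; the Python 2 parsers discussed in this thread may behave differently. `offset_after_read`, `PositionParser` and `tag_positions` are hypothetical names.

```python
import io
from html.parser import HTMLParser


def offset_after_read(data, nbytes):
    """Read nbytes from a file-like object and report its offset via tell()."""
    f = io.BytesIO(data)   # stands in for open(path, "rb")
    f.read(nbytes)
    return f.tell()


class PositionParser(HTMLParser):
    """Record (tag, (line, column)) for each start tag using getpos()."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() is a method, so it needs the parentheses and self.
        self.positions.append((tag, self.getpos()))


def tag_positions(html):
    parser = PositionParser()
    parser.feed(html)
    parser.close()
    return parser.positions
```

Note that getpos() returns a (line number, column) pair for the parser's input, not a byte offset into the file.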
Jul 18 '05 #10
