Help with parsing web page

RiGGa

Hi,

I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.

Many thanks

RiGga

Jul 18 '05 #1

RiGGa

Anyone? I have found out I can use sgmllib, but the documentation is
not that clear. If anyone knows of a tutorial or howto, it would be
appreciated.

thanks

R

Jul 18 '05 #2

Miki Tebeka

Hello RiGGa,

Anyone? I have found out I can use sgmllib, but the documentation is
not that clear. If anyone knows of a tutorial or howto, it would be
appreciated.

I'm not an expert, but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAG>, end_<TAG> and handle_data.
Since you don't know *when* each callback function is called, you need to
keep an internal state. It can be a simple variable, or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python
# Python 2 code: htmllib was removed in Python 3.0.

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""   # name of the tag we are currently inside, if any
        self.data = ""    # text collected so far for that tag

    def start_title(self, attrs):
        self.state = "title"
        self.data = ""

    def end_title(self):
        print "Title:", self.data.strip()
        self.state = ""   # stop collecting once the title is closed

    def handle_data(self, data):
        if self.state:
            self.data += data

if __name__ == "__main__":
    from sys import argv

    parser = TitleParser()
    parser.feed(open(argv[1]).read())
    parser.close()
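
For readers on Python 3, where htmllib no longer exists, the same idea can be sketched with the standard html.parser module. This is only a sketch under that assumption, not the code from the post: the names TitleParser.stack, TitleParser.titles and extract_titles are made up for illustration, and the stack is one possible way to keep the "internal state" for nested tags described above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside <title>, tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the tags that are currently open
        self.titles = []     # text of every <title> element seen so far
        self._chunks = []    # pieces of text inside the current <title>

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "title":
            self._chunks = []

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass
        if tag == "title":
            self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        # Only keep text while a <title> is open somewhere on the stack.
        if "title" in self.stack:
            self._chunks.append(data)


def extract_titles(html_text):
    """Hypothetical helper: feed a whole document and return its titles."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles
```

Calling extract_titles() on the text of a page returns a list, so a page with no title gives an empty list rather than an error.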

HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <mi*********@zoran.com>
The only difference between children and adults is the price of the toys.

Jul 18 '05 #3

RiGGa

Thanks for taking the time to help, it's appreciated. I am new to Python, so
I am a little confused by what you have posted; however, I will go through it
again and see if it makes more sense.

Many thanks

Rigga

Jul 18 '05 #4

Thomas Guettler

On Mon, 14 Jun 2004 17:48:33 +0100, RiGGa wrote:

Hi,

I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.

Hi,

Since HTML can be broken in several ways, I would
pipe the HTML through tidy first. You can use the "-asxml"
option, and then parse the XML.

http://tidy.sourceforge.net/
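
A rough sketch of that pipeline in Python 3, assuming the tidy program is installed and on the PATH. The helper names tidy_to_xhtml and table_cells are hypothetical, and the extra flags (-q for quiet, -n for numeric entities, -utf8 for the encoding) are assumptions about the installed tidy version; only -asxml comes from the suggestion above. The second function shows the "then parse the XML" step with the standard xml.etree.ElementTree module.

```python
import subprocess
import xml.etree.ElementTree as ET

# tidy -asxml emits XHTML, whose elements live in this namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def tidy_to_xhtml(html_bytes):
    """Pipe raw HTML through the external tidy program, return XHTML bytes."""
    # tidy exits with status 1 when it only issued warnings, so the return
    # code is deliberately not treated as a failure here.
    result = subprocess.run(
        ["tidy", "-q", "-n", "-utf8", "-asxml"],
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout


def table_cells(xhtml):
    """Return the stripped text of every <td> in a well-formed XHTML document."""
    root = ET.fromstring(xhtml)
    return ["".join(td.itertext()).strip()
            for td in root.iterfind(".//x:td", XHTML_NS)]
```

The point of the detour is that ElementTree refuses malformed input, so letting tidy repair the page first means the parsing code only ever sees well-formed markup.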

Thomas

Jul 18 '05 #5

Paul McGuire

RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexps, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd + \
        tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL = "http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc
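
The regular-expression alternative mentioned in the post might look roughly like the sketch below (Python 3, standard library only). It is not part of the pyparsing examples; TIME_SERVER_RE and find_time_servers are names invented here, and the explicit \s* pieces are the "re fields for the whitespace" that pyparsing makes unnecessary.

```python
import re

# One <td> holding a dotted-quad address, then one <td> holding free text.
# Each \s* stands in for whitespace that pyparsing would skip on its own.
TIME_SERVER_RE = re.compile(
    r"<td>\s*(?P<ipAddr>\d+\.\d+\.\d+\.\d+)\s*</td>\s*"
    r"<td>\s*(?P<loc>[^<]+?)\s*</td>"
)


def find_time_servers(html_text):
    """Return {ip_address: location} for every matching pair of cells."""
    return {m.group("ipAddr"): m.group("loc")
            for m in TIME_SERVER_RE.finditer(html_text)}
```

Like the pyparsing version, this only recognises plain <td> tags with no attributes, so it would need adjusting for a page that styles its table cells.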

Jul 18 '05 #6

wes weston

RiGga,
Here is something that is, hopefully, not too simple. Frequently, you can
strip out the HTML tags, and the resulting list will have a label followed by
the piece of data you want to save.
Do you need mysql code?
wes

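def RemoveLessThanGreaterThanSectionsTokenize( s ):
    state = 0      # 0 = collecting text, 1 = skipping a <...> section
    text = ""      # renamed from "str" so the builtin is not shadowed
    tokens = []    # renamed from "list" for the same reason
    for ch in s:
        # grabbing good chars state
        if state == 0:  # s is expected to start with '<'
            if ch == '<':
                state = 1
                if len(text) > 0:
                    tokens.append(text)
                    text = ""
            else:
                text += ch
        # dumping bad chars state
        elif state == 1:  # looking for '>'
            if ch == '>':
                state = 0
    if len(text) > 0:  # keep any text that follows the last tag
        tokens.append(text)
    return tokens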
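
To illustrate the "label followed by the piece of data" idea, here is a small Python 3 sketch. It swaps the character-by-character state machine above for a re.split shortcut and also strips surrounding whitespace, so it is a variation rather than the posted function; strip_tags_tokenize and value_after are hypothetical names.

```python
import re


def strip_tags_tokenize(html_text):
    """Split on anything that looks like <...> and keep the non-empty text."""
    pieces = re.split(r"<[^>]*>", html_text)
    return [p.strip() for p in pieces if p.strip()]


def value_after(tokens, label):
    """Return the token that follows `label`, or None if there is none."""
    for i, tok in enumerate(tokens[:-1]):
        if tok == label:
            return tokens[i + 1]
    return None
```

With the tokens in hand, looking up a value is just a matter of finding its label and taking the next item, which is the step that would feed the database insert.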

Jul 18 '05 #7

RiGGa

Many thanks for all your help, I will go away and digest it.

R

Jul 18 '05 #8

RiGGa

Said I would be back :)

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R

Jul 18 '05 #9

David Fisher

RiGGa <ri***@hasnomail.com> writes:
[...]

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R

http://www.python.org/doc/current/li...e-objects.html
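
The question can be read two ways, so this Python 3 sketch covers both, on the assumption that the linked page is about file objects and their tell() method. PositionParser and read_with_offsets are names invented for the example. getpos() is a method of the parser object and reports a line and column within the text fed so far, while tell() is a method of the file object and reports an offset in the file.

```python
import io
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Records where each start tag was seen, as (tag, line, column)."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() must be called on the parser itself, hence self.getpos().
        line, column = self.getpos()   # line is 1-based, column is 0-based
        self.positions.append((tag, line, column))


def read_with_offsets(fileobj, size):
    """Read `size` bytes and report the file offset before and after."""
    before = fileobj.tell()
    data = fileobj.read(size)
    return before, data, fileobj.tell()


if __name__ == "__main__":
    parser = PositionParser()
    parser.feed("<html>\n  <body>\n</body></html>")
    parser.close()
    print(parser.positions)
    print(read_with_offsets(io.BytesIO(b"<html><body>"), 6))
```

Note that if the whole file is read in one go and then fed to the parser, tell() will simply report the end of the file; the parser's own getpos() is the one that follows the parse.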

Jul 18 '05 #10
