understanding htmllib

I'm trying to understand how to use the HTMLParser in htmllib but I'm not
seeing enough examples.

I just want to grab the contents of everything enclosed in a '<body>' tag,
i.e. items from where <body> begins to where </body> ends. I start by doing

class HTMLBody(HTMLParser):
    def __init__(self):
        self.contents = []

    def handle_starttag()..

Now I'm stuck. I can't see that there is a method on handle_starttag that
would return everything to the end tag. And I haven't seen anything on how
to define my own handle_unknowntag..

Any pointers would be greatly appreciated. The documentation on this module
at python.org seems to assume a great deal about what the reader would
already know about which methods they should subclass.

--
David Bear
-- let me buy your intellectual property, I want to own your thoughts --
Oct 4 '06 #1
David Bear wrote:
> I'm trying to understand how to use the HTMLParser in htmllib but I'm not
> seeing enough examples.
>
> I just want to grab the contents of everything enclosed in a '<body>' tag,
> i.e. items from where <body> begins to where </body> ends. I start by doing
>
> class HTMLBody(HTMLParser):
>     def __init__(self):
>         self.contents = []
>
>     def handle_starttag()..
>
> Now I'm stuck. I can't see that there is a method on handle_starttag that
> would return everything to the end tag. And I haven't seen anything on how
> to define my own handle_unknowntag..

htmllib is designed to be used together with a formatting object. if
you just want to work with tags, use sgmllib instead. some variation of
the SGMLFilter example on this page might be what you need:

http://effbot.org/librarybook/sgmllib.htm
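
For readers on Python 3, where htmllib and sgmllib are no longer available, the same event-stream idea can be sketched with html.parser.HTMLParser from the standard library. This is not the SGMLFilter example from the page above, only a minimal illustration under a few assumptions: the names BodyExtractor and body_contents are made up for this sketch, the input has a single <body> element, and comments inside the body are simply dropped.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect the markup found between <body> and </body> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unconverted,
        # so they reach handle_entityref/handle_charref instead of handle_data.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as written in the source
            self.contents.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> arrive here, not in handle_starttag
        if self.in_body:
            self.contents.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.contents.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.contents.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.contents.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.contents.append("&#%s;" % name)


def body_contents(html):
    """Return everything between <body> and </body> as one string."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.contents)


page = ("<html><head><title>T</title></head>"
        "<body>\n<h1>Body Title</h1>\n<p>Paragraph</p>\n</body></html>")
print(repr(body_contents(page)))
# prints '\n<h1>Body Title</h1>\n<p>Paragraph</p>\n'
```

The point of the sketch is that no single handler returns "everything up to the end tag": the parser only reports events, so the subclass keeps a flag while it is inside <body> and rebuilds the text from each event it sees.
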

if you want a DOM-like structure instead of an event stream, use

http://www.crummy.com/software/BeautifulSoup/

usage:
>>> import BeautifulSoup as BS
>>> soup = BS.BeautifulSoup(open("page.html"))
>>> str(soup.body)
'<body>\n<h1>Body Title</h1>\n<p>Paragraph</p>\n</body>'
>>> soup.body.renderContents()
'\n<h1>Body Title</h1>\n<p>Paragraph</p>\n'

</F>

Oct 4 '06 #2
