
understanding htmllib

I'm trying to understand how to use the HTMLParser in htmllib, but I'm not
finding enough examples.

I just want to grab the contents of everything enclosed in a '<body>' tag,
i.e. everything from where <body> begins to where </body> ends. I started with

class HTMLBody(HTMLParser):
    def __init__(self):
        self.contents = []

    def handle_starttag()..

Now I'm stuck. I can't see any way for handle_starttag to return
everything up to the end tag. And I haven't seen anything on how
to define my own handle_unknowntag..

Any pointers would be greatly appreciated. The documentation on this module
at python.org seems to assume a great deal about what the reader would
already know about which methods they should subclass.

--
David Bear
-- let me buy your intellectual property, I want to own your thoughts --
Oct 4 '06 #1
1 Reply


David Bear wrote:

> I'm trying to understand how to use the HTMLParser in htmllib, but I'm not
> finding enough examples.
>
> I just want to grab the contents of everything enclosed in a '<body>' tag,
> i.e. everything from where <body> begins to where </body> ends. I started with
>
> class HTMLBody(HTMLParser):
>     def __init__(self):
>         self.contents = []
>
>     def handle_starttag()..
>
> Now I'm stuck. I can't see any way for handle_starttag to return
> everything up to the end tag. And I haven't seen anything on how
> to define my own handle_unknowntag..

htmllib is designed to be used together with a formatting object. if
you just want to work with tags, use sgmllib instead. some variation of
the SGMLFilter example on this page might be what you need:

http://effbot.org/librarybook/sgmllib.htm
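
For readers on Python 3, where sgmllib and htmllib no longer exist, the same event-stream idea can be sketched with html.parser.HTMLParser from the standard library. This is not the SGMLFilter example from the page linked above, only a rough illustration of the approach; the names BodyExtractor and extract_body are made up for this sketch.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Hypothetical helper: collect the markup between <body> and </body>."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers unchanged
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as written in the source
            self.contents.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>
        if self.in_body:
            self.contents.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.contents.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.contents.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.contents.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.contents.append("&#%s;" % name)


def extract_body(html):
    """Feed the whole document and return the body contents as one string."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.contents)
```

There is no single handler that returns everything up to the end tag; the parser only reports events, so the subclass keeps a flag and collects pieces while the flag is set. Comments and declarations inside the body are dropped in this sketch, since it does not override handle_comment or handle_decl.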

if you want a DOM-like structure instead of an event stream, use

http://www.crummy.com/software/BeautifulSoup/

usage:
>>> import BeautifulSoup as BS
>>> soup = BS.BeautifulSoup(open("page.html"))
>>> str(soup.body)
'<body>\n<h1>Body Title</h1>\n<p>Paragraph</p>\n</body>'
>>> soup.body.renderContents()
'\n<h1>Body Title</h1>\n<p>Paragraph</p>\n'

</F>

Oct 4 '06 #2
