Bytes IT Community

Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be
getting the actual text in the HTML document. I've implemented the
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
it never seems to receive any data. Is there another way to access the
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different "levels" of abstractness are
necessary, nor how to use them. As an example, the HTML <i>text</i>
should be converted to ''text'' (double single-quotes at each end) in my
MediaWiki markup output. This would obviously be easy to achieve if I
simply had an HTML parser that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefits
of a writer before generators gave the opportunity to simply yield
chunks of output to be processed by external code.) In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted tags, and then content, and then closing converted tags,
written out.

Please feel free to point to examples, code, etc. Probably the simplest
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken
Jul 7 '06 #1


import urllib

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # handle_data (not do_data) is the callback that receives text chunks
        data = data.strip()
        if data and len(data) > 0:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    print "bad read"
    raise SystemExit

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
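For the <i>-to-'' conversion described in the original question, a minimal sketch of the start-tag/text/end-tag approach follows. It is written for modern Python, where htmllib no longer exists and HTMLParser lives in html.parser; the WIKI_MARKUP table and the WikiConverter class name are illustrative, not part of any library:

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tag names to MediaWiki delimiters.
WIKI_MARKUP = {"i": "''", "b": "'''"}

class WikiConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Emit the opening wiki delimiter for known tags, nothing otherwise.
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # The MediaWiki quote delimiters are symmetric, so reuse the table.
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text chunks pass through unchanged.
        self.out.append(data)

    def result(self):
        return "".join(self.out)

p = WikiConverter()
p.feed("before <i>text</i> after")
p.close()
print(p.result())  # before ''text'' after
```

The same three-callback pattern extends to block tags (headings, lists) by keeping a small amount of state in the subclass.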
Kenneth McDonald wrote:
I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. [...]
Jul 7 '06 #2
