Parsing HTML -- looking for info/comparison of HTMLParser vs. htmllib modules

I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be
getting the actual text in the HTML document. I've implemented the
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
it never seems to receive any data. Is there another way to access the
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different "levels" of abstractness are
necessary, nor how to use them. As an example, the HTML <i>text</i>
should be converted to ''text'' (double single-quotes at each end) in my
MediaWiki markup output. This would obviously be easy to achieve if I
simply had an HTML parser that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefits
of a writer before generators gave the opportunity to simply yield
chunks of output to be processed by external code.) In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted tags, and then content, and then closing converted tags,
written out.

Please feel free to point to examples, code, etc. Probably the simplest
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken
Jul 7 '06 #1
import urllib
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # The parser calls handle_data (not do_data) for each text chunk.
        data = data.strip()
        if data:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    print "bad read"
    raise SystemExit(1)

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
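
For the <i>text</i> to ''text'' conversion asked about above, the same callback approach can probably be extended with handle_starttag and handle_endtag. The following is only a rough sketch, not a complete converter: the class name WikiConverter, the helper html_to_wiki, and the small tag mapping are made up for illustration, and it is written for Python 3, where the module is html.parser (on Python 2 the import would be `from HTMLParser import HTMLParser`).

```python
# Python 3 spelling; on Python 2 the import would be
# `from HTMLParser import HTMLParser`.
from html.parser import HTMLParser


class WikiConverter(HTMLParser):
    """Hypothetical converter: maps a few HTML tags to MediaWiki markup."""

    # Illustrative mapping only; a real converter would need many more tags.
    MARKUP = {"i": "''", "em": "''", "b": "'''", "strong": "'''"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; unknown tags contribute nothing.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # MediaWiki uses the same marker to open and close italic/bold.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Called for each chunk of text between tags.
        self.chunks.append(data)

    def result(self):
        return "".join(self.chunks)


def html_to_wiki(html):
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result()


print(html_to_wiki("Some <i>text</i> and <b>bold</b>."))
```

Tags that are not in the mapping are silently dropped here while their text is kept, which may or may not be what a real conversion needs.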
Kenneth McDonald wrote:
[snip]
Jul 7 '06 #2
