Bytes IT Community

Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be
getting the actual text in the HTML document. I've implemented the
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
it never seems to receive any data. Is there another way to access the
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different "levels" of abstractness are
necessary, nor how to use them. As an example, the HTML <i>text</i>
should be converted to ''text'' (double single-quotes at each end) in my
MediaWiki markup output. This would obviously be easy to achieve if I
simply had an HTML parser that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefits
of a writer before generators gave the opportunity to simply yield
chunks of output to be processed by external code.) In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted tags, and then content, and then closing converted tags,
written out.

Please feel free to point to examples, code, etc. Probably the simplest
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken
Jul 7 '06 #1


import urllib

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # handle_data (not do_data) is the callback that receives text chunks
        data = data.strip()
        if data and len(data) > 0:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    print "bad read"
    raise SystemExit

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
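For the <i>-to-'' conversion described in the original question, a minimal sketch of the start-tag/text/end-tag approach follows. It is written for modern Python, where htmllib no longer exists and HTMLParser lives in html.parser; the WIKI_MARKUP table and the WikiConverter class name are illustrative, not part of any library:

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tag names to MediaWiki delimiters.
WIKI_MARKUP = {"i": "''", "b": "'''"}

class WikiConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Emit the opening wiki delimiter for known tags, nothing otherwise.
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # The MediaWiki quote delimiters are symmetric, so reuse the table.
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text chunks pass through unchanged.
        self.out.append(data)

    def result(self):
        return "".join(self.out)

p = WikiConverter()
p.feed("before <i>text</i> after")
p.close()
print(p.result())  # before ''text'' after
```

The same three-callback pattern extends to block tags (headings, lists) by keeping a small amount of state in the subclass.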
Kenneth McDonald wrote:
I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. [...]
Jul 7 '06 #2
