Unexpected behaviour with HTMLParser...

HTMLParser is behaving in, what I find to be, strange ways and I would
like to better understand what it is doing and why.

First, it doesn't appear to translate HTML escape characters. I don't
know the actual terminology but things like &amp; don't get translated into
& as one would like. Furthermore, not only does HTMLParser not translate it
properly, it seems to omit it altogether! This prevents me from even doing
the translation myself, so I can't even work around the issue.
Why is it doing this? Is there some mode I need to set? Can anyone
else duplicate this behaviour? Is it a bug?
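
A likely explanation, sketched here under assumptions: the parser does not drop references such as &amp;, it reports them through separate callbacks, handle_entityref() and handle_charref(), which do nothing unless overridden. The sketch below assumes Python 3's html.parser with convert_charrefs=False, which approximates the behaviour described above (recent Python 3 versions convert references by default). The class name EntityAwareParser and the sample markup are made up for illustration.

```python
from html import unescape
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Collects text and translates entity/character references itself."""

    def __init__(self):
        # Assumption: convert_charrefs=False makes the parser report
        # references through their own callbacks instead of converting them.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Plain text between tags and references.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; arrives here with name == "amp".
        self.chunks.append(unescape("&%s;" % name))

    def handle_charref(self, name):
        # Numeric reference such as &#38; arrives here with name == "38".
        self.chunks.append(unescape("&#%s;" % name))


parser = EntityAwareParser()
parser.feed("<p>Fish &amp; chips &#38; peas</p>")
parser.close()
text = "".join(parser.chunks)
print(text)
```

If only handle_data() is overridden, the references appear to vanish, because the default handle_entityref() and handle_charref() discard them.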

Secondly, HTMLParser often calls handle_data() consecutively, without
any calls to handle_starttag() in between. I did not expect this. In HTML,
you either have text or you have tags. Why split up my text into successive
handle_data() calls? This makes no sense to me. At the very least, it does
this in response to text with &amp;-like escape sequences (or whatever
they're called), so that it may successively avoid those translations.
Again, why is it doing this? Is there some mode I need to set? Can
anyone else duplicate this behaviour? Is it a bug?
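
One way to cope with the split callbacks is to buffer the pieces and join them whenever a tag starts or ends. This is only a sketch, again assuming Python 3's html.parser with convert_charrefs=False to reproduce the splitting; BufferingParser, its attributes, and the sample markup are hypothetical names chosen for the example.

```python
from html import unescape
from html.parser import HTMLParser


class BufferingParser(HTMLParser):
    """Joins consecutive text callbacks into one string per text run."""

    def __init__(self):
        # Assumption: convert_charrefs=False reproduces the split callbacks.
        super().__init__(convert_charrefs=False)
        self.calls = []    # every raw callback, kept for inspection
        self.texts = []    # completed text runs
        self._buffer = []  # pieces of the run currently being read

    def _flush(self):
        # Close the current text run, if there is one.
        if self._buffer:
            self.texts.append("".join(self._buffer))
            self._buffer = []

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def handle_data(self, data):
        self.calls.append(("data", data))
        self._buffer.append(data)

    def handle_entityref(self, name):
        # A named reference interrupts the text, hence the split data calls.
        self.calls.append(("entityref", name))
        self._buffer.append(unescape("&%s;" % name))

    def close(self):
        super().close()
        self._flush()


p = BufferingParser()
p.feed("<p>Tom &amp; Jerry</p><p>done</p>")
p.close()
print(p.calls)
print(p.texts)
```

Here handle_data() is called twice for the first paragraph, once on each side of the reference, yet the buffer yields a single string per paragraph.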

These are serious problems for me and I would greatly appreciate a
deeper understanding of these issues.
Thank you...


Oct 9 '07 #1
Just Another Victim of the Ambient Morality wrote:
> HTMLParser is behaving in, what I find to be, strange ways and I would
> like to better understand what it is doing and why.

In case you also want an HTML library that is easy to use (and powerful and
flexible and...), look at lxml.html.

http://codespeak.net/lxml/dev/lxmlhtml.html

It's part of lxml 2.0, which is currently in alpha status (which does not mean
it's unstable or something, just not as complete as its authors want it to be).

http://codespeak.net/lxml/dev/

Stefan
Oct 10 '07 #2
