htmllib.py and parsing malformed HTML

I have written a parser using htmllib.HTMLParser and it functions fine
unless the HTML is malformed. For example, in some instances, the
provider of the HTML leaves out the <TR> tags but includes the </TR> tags.

Apparently, htmllib and more likely sgmllib do not parse an end tag if a
corresponding start tag was not found. Does anyone know a way to "fool"
the parser into handling the end tag if a start tag was not found?

Thanks,

Kevin

Jul 18 '05 #1

Thomas Güttler

KC wrote:

I have written a parser using htmllib.HTMLParser and it functions fine
unless the HTML is malformed. For example, in some instances, the
provider of the HTML leaves out the <TR> tags but includes the </TR> tags.

Apparently, htmllib and more likely sgmllib do not parse an end tag if a
corresponding start tag was not found. Does anyone know a way to "fool"
the parser into handling the end tag if a start tag was not found?

Hi,

You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
parse the html.

thomas
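
For reference, a minimal sketch of that preprocessing step in Python 3, assuming the tidy command-line tool is installed and on PATH. The function name tidy_html and the logger name are hypothetical. Diagnostics from tidy go to the standard logging module, so an unattended job still leaves a record of what was repaired.

```python
import logging
import shutil
import subprocess

log = logging.getLogger("tidy_wrapper")  # hypothetical logger name


def tidy_html(html: str) -> str:
    """Return html cleaned by the tidy CLI, or the input unchanged if tidy is missing."""
    exe = shutil.which("tidy")
    if exe is None:
        log.warning("tidy not found on PATH; passing HTML through unchanged")
        return html
    # With no file argument tidy reads stdin and writes the cleaned markup to stdout.
    proc = subprocess.run(
        [exe, "-q", "-utf8", "--force-output", "yes"],
        input=html,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    # tidy exits with 0 when the input was clean, 1 on warnings, 2 on errors.
    if proc.returncode != 0:
        log.warning("tidy exit status %d: %s", proc.returncode, proc.stderr.strip())
    return proc.stdout if proc.stdout else html


cleaned = tidy_html("<table><td>a</td></tr></table>")
```

The cleaned string can then be fed to the parser in place of the raw page.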

Jul 18 '05 #2

Thomas Güttler wrote:

Hi,

You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
parse the html.

I appreciate the suggestion but unfortunately this will not work well
for me as the parser runs as part of a cron job. I wouldn't be able to
review the tidy error log in a timely fashion if there was a problem.

What would be really nice is a way to tell the parser it was "inside" a
<TR> when I encountered a <TD> after a closing </TR>. Browsers still
display the HTML correctly without a starting <TR>, but if the closing
</TR> is omitted everything gets mangled.

Any other suggestions?

Jul 18 '05 #3

KC wrote:

What would be really nice is a way to tell the parser it was "inside" a
<TR> when I encountered a <TD> after a closing </TR>. Browsers still
display the HTML correctly without a starting <TR>, but if the closing
</TR> is omitted everything gets mangled.

I solved this problem, perhaps not the most elegant way, but it is still
solved. Any suggestions on improvements are welcome. I added the
following method to my parser class to make this work:
def parse_endtag(self, i):
    # rawdata[i:] starts with '</', so the tag name begins at i+2.
    rawdata = self.rawdata
    tag = rawdata[i+2:i+4].strip().lower()
    if tag == 'tr':
        # Emit the </TR> ourselves before handing over to the base class.
        self.fmtr.writer.send_tag('</TR>')
    return htmllib.HTMLParser.parse_endtag(self, i)

I should also mention that I added the send_tag method to my writer
implementation which simply writes the given text to the output stream.
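
A rough modern equivalent, offered as a sketch and not as the poster's actual code: htmllib is not available in Python 3, so this swaps in html.parser.HTMLParser, which passes every end tag to handle_endtag whether or not a start tag was seen. The class name RowRepairParser and its events list are hypothetical, and the row tracking assumes tables are not nested. It implements the idea from the earlier post of treating a <TD> that appears outside a row as if a <TR> had just been opened. (The sgmllib documentation also describes a report_unbalanced(tag) hook, called for end tags with no matching open element, which may be a cleaner override point in the original Python 2 setting.)

```python
from html.parser import HTMLParser


class RowRepairParser(HTMLParser):
    """Collect (event, tag) pairs, adding a start 'tr' where the markup omitted it."""

    def __init__(self):
        super().__init__()
        self.in_row = False   # True between a (real or implied) <tr> and its </tr>
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag in ("td", "th") and not self.in_row:
            # A cell with no open row: act as if the missing <tr> had been seen.
            self.events.append(("start", "tr"))
            self.in_row = True
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False
        self.events.append(("end", tag))


parser = RowRepairParser()
parser.feed("<table><td>a</td></tr><td>b</td></tr></table>")
parser.close()
```

After the feed, parser.events holds a balanced start/end pair for each row, even though the input never contained a <tr>.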

Jul 18 '05 #4

John J. Lee

KC <ns********@bellsouth.net> writes:

Thomas Güttler wrote:
Hi,
You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
parse the html.

I appreciate the suggestion but unfortunately this will not work well
for me as the parser runs as part of a cron job. I wouldn't be able
to review the tidy error log in a timely fashion if there was a
problem.

[...]

So, what about *your* code's error log (or the equivalent --
presumably an unhandled traceback)?? It's not obvious that your
solution (in a later post) will be any more robust than just piping
everything through HTMLTidy. In fact, since you will find a great
variety of nonsense in 'HTML as deployed', it seems likely that
HTMLTidy will do the better job.

John

Jul 18 '05 #5

John J. Lee wrote:

So, what about *your* code's error log (or the equivalent --
presumably an unhandled traceback)?? It's not obvious that your
solution (in a later post) will be any more robust than just piping
everything through HTMLTidy. In fact, since you will find a great
variety of nonsense in 'HTML as deployed', it seems likely that
HTMLTidy will do the better job.

If this parser was handling a "great variety of nonsense" I would
wholeheartedly agree with you. However, since this HTML is from a
single vendor and that vendor is a government entity, this solution was
better than integrating a third-party product. As with most
organizations, changing *our* code is much more acceptable to the powers
that be, than bringing in a third-party product that will have to be
evaluated and have countless meetings over its approval. For many of
us, business and policy decisions often forge the direction for
technology usage within our organizations.

Jul 18 '05 #6

John J. Lee

KC <ns********@bellsouth.net> writes:

John J. Lee wrote:
So, what about *your* code's error log (or the equivalent --
presumably an unhandled traceback)?? It's not obvious that your
[...]

If this parser was handling a "great variety of nonsense" I would
wholeheartedly agree with you. However, since this HTML is from a
single vendor and that vendor is a government entity, this solution

Oh, got you. Fair enough.

[...] for technology usage within our organizations.

You can always tell when someone's 'business button' has been pushed
when they use the word 'within' ;-)

John

Jul 18 '05 #7

Jeremy Bowers

On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:

As with most organizations,
changing *our* code is much more acceptable to the powers that be, than
bringing in a third-party product that will have to be evaluated and have
countless meetings over its approval. For many of us, business and policy
decisions often forge the direction for technology usage within our
organizations.

If you are having real problems with poor HTML, HTMLTidy may be worth
going to bat over. If you can find a simple solution that works on the
HTML you are processing, great, go with it, and it's worth researching in
your situation first. But HTML can go bad in more ways than you can
imagine (which is in fact part of the problem); if you are getting HTML
that's bad in a lot of little ways, you'll find the "apply a hack to fix
this file, apply a hack to fix that file" will start stepping on its own
toes.

HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
functionality that you can *not* replicate in a reasonable amount of time;
it's one of those packages that isn't so much a program that "does
something" as a program that represents many, many man-years of "knowledge
acquired".

I'm not trying to push anything, since I don't know your situation, but
HTMLTidy is one of those rare projects that you really shouldn't allow NMH
to scuttle unless you *really* need to. (Again, I mention if there's some
simple way you can characterize the bad HTML coming out of one single
program, go ahead and try to fix it; maybe you'll get lucky and a regex
will be enough.)
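
If the damage really is limited to missing <TR> start tags, a regex along these lines might be enough. This is a sketch under that assumption only: the names MISSING_TR and add_missing_tr are hypothetical, and the pattern assumes a cell that directly follows </tr> or the opening <table> tag (with only whitespace in between) has lost its <tr>. Anything else, such as a <tbody> in between or nested tables, is outside what it handles.

```python
import re

# A <td>/<th> that directly follows </tr> or <table ...> is missing its <tr>.
MISSING_TR = re.compile(
    r"(</tr\s*>|<table\b[^>]*>)(\s*)(?=<t[dh]\b)",
    re.IGNORECASE,
)


def add_missing_tr(html: str) -> str:
    """Insert <tr> in front of cells that start a row without one."""
    return MISSING_TR.sub(r"\1\2<tr>", html)


fixed = add_missing_tr("<TABLE><TD>a</TD></TR><TD>b</TD></TR></TABLE>")
```

Rows that already have their <tr> are left alone, because the lookahead only matches a cell tag.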

Jul 18 '05 #8

Jeremy Bowers wrote:

On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
....
that's bad in a lot of little ways, you'll find the "apply a hack to fix
this file, apply a hack to fix that file" will start stepping on its own
toes.

Oh yeah, I couldn't agree more. Any more requests for "hacks" and
HTMLTidy gets brought into the picture.

HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
functionality that you can *not* replicate in a reasonable amount of time;
it's one of those packages that isn't so much a program that "does
something" as a program that represents many, many man-years of "knowledge
acquired".

Agreed. I like HTMLTidy very much and it's obvious it could save us
developers a lot of effort.

Jul 18 '05 #9
