HTMLParser fragility

Lawrence D'Oliveiro

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.
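
The restart-and-skip workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: `find_bad_lines`, `ParseError`, and `ToyStrictParser` are hypothetical names, and the toy parser merely stands in for a strict parser whose exception carries a `lineno` attribute (as HTMLParseError does). Skipped lines are blanked rather than deleted so that later error line numbers still refer to the original text.

```python
class ParseError(Exception):
    """Hypothetical stand-in for HTMLParseError; records the failing line."""

    def __init__(self, msg, lineno):
        super().__init__(msg)
        self.lineno = lineno


class ToyStrictParser:
    """Toy strict parser (an assumption for the demo): rejects any line with '<<'."""

    def feed(self, text):
        for lineno, line in enumerate(text.splitlines(), 1):
            if "<<" in line:
                raise ParseError("malformed tag", lineno)

    def close(self):
        pass


def find_bad_lines(text, make_parser, error_type=ParseError):
    """Return the set of 1-based line numbers a strict parser cannot digest."""
    lines = text.splitlines(keepends=True)
    skip = set()
    while True:
        # Blank out the lines noted so far instead of deleting them, so
        # line numbers in later errors still refer to the original text.
        filtered = "".join(
            "\n" if n in skip else line for n, line in enumerate(lines, 1)
        )
        parser = make_parser()  # fresh parser: the old one may be confused
        try:
            parser.feed(filtered)
            parser.close()
        except error_type as exc:
            if exc.lineno in skip:
                raise  # no progress possible; avoid looping forever
            skip.add(exc.lineno)
        else:
            return skip


doc = "<p>ok</p>\n<bad<<\n<p>fine</p>\n<<worse\n"
bad = find_bad_lines(doc, ToyStrictParser)
print(sorted(bad))  # prints [2, 4]
```

Once the set of bad lines is known, the real (subclassed) parser would be fed the same text with those lines blanked out.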

Apr 5 '06 #1


Rene Pijlman

Lawrence D'Oliveiro:

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not.

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

--
René Pijlman

Apr 5 '06 #2

Daniel Dittmar

Lawrence D'Oliveiro wrote:

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well-formed HTML.

Daniel

Apr 5 '06 #3

Richie Hindle

[Daniel]

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well-formed HTML.

But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:

from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello world!</pre></body></html>")
print results[3]  # Tidy's error report, shown below

line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?

--
Richie

Apr 5 '06 #4

Walter Dörwald

Rene Pijlman wrote:

Lawrence D'Oliveiro:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not.

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.

Bye,
Walter Dörwald

Apr 6 '06 #5

Paul Boddie

Richie Hindle wrote:

But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello world!</pre></body></html>")
[Various error messages]

Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:

import libxml2dom
d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
print d.toString()

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom
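
The kind of repair shown in that output can be illustrated with a small stack-based sketch. This is not how libxml2 does it internally; it is a simplified stand-in built on Python 3's standard html.parser, and the names `TagBalancer` and `rebalance` are made up for the example. It drops end tags that match nothing open, closes whatever is still open when an enclosing element ends, and discards comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser


class TagBalancer(HTMLParser):
    """Re-emit HTML with balanced tags (simplified; drops comments/doctypes)."""

    # Elements that never take an end tag (partial list, for illustration).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open elements
        self.out = []    # output fragments

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, e.g. </pre> with no open <pre>
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def rebalance(html_text):
    parser = TagBalancer()
    parser.feed(html_text)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


fixed = rebalance("<html><body><pree>Hello world!</pre></body></html>")
print(fixed)
# prints <html><body><pree>Hello world!</pree></body></html>
```

A real tidier has to know far more about HTML (implied end tags, script and style content, nesting rules), so this is only meant to show the basic idea.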

Paul

Apr 6 '06 #6

Lawrence D'Oliveiro

In article <fr********************************@4ax.com>,
Rene Pijlman <re********************@my.address.is.invalid> wrote:

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

That sounds like what I'm after!

Apr 7 '06 #7

Richie Hindle

[Richie]

But Tidy fails on huge numbers of real-world HTML pages. [...]
Is there a Python HTML tidier which will do as good a job as a browser?
[Walter] You can also use the HTML parser from libxml2
[Paul] libxml2 will attempt to parse HTML if asked to [...] See how it fixes
up the mismatching tags.

Great! Many thanks.

--
Richie Hindle
ri****@entrian.com

Apr 7 '06 #8

John J. Lee

"Lawrence D'Oliveiro" <ld*@geek-central.gen.new_zealand> writes:

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.

John
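
A note for readers on Python 3, where sgmllib and htmllib no longer exist in the standard library: the successor module, html.parser, dropped its strict mode and HTMLParseError in Python 3.5, so its HTMLParser works through malformed markup instead of raising. A minimal sketch, with `LinkCollector` and the sample markup being made-up illustrations:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Misspelled tag, mismatched end tags, and unclosed elements.
messy = (
    '<html><body><pree>Hello <a href="/one">one</pre>'
    '<a href="/two">two</b></body>'
)

collector = LinkCollector()
collector.feed(messy)
collector.close()
print(collector.links)  # prints ['/one', '/two']
```

The parser reports tags as it sees them and does not repair the document, so code that depends on correct nesting still needs its own bookkeeping or a more forgiving library.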

Apr 10 '06 #9
