BeautifulSoup vs. real-world HTML comments

The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:
<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>
Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser
level problem.
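
A minimal reproduction, assuming the Python 2 / BeautifulSoup 3 toolchain of
the time (the page text is abbreviated from the example above):

    # Sketch: the bad "<!...>" declaration makes BeautifulSoup treat
    # everything after it as one text chunk, so no tags are found.
    from BeautifulSoup import BeautifulSoup

    page = ('<!Hello there! Welcome to The Environment Directory!>\n'
            '<HTML><HEAD><TITLE>Environment Web Directory</TITLE>'
            '</HEAD></HTML>')
    soup = BeautifulSoup(page)
    print soup.title   # prints None instead of the <title> tag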
John Nagle
Apr 4 '07 #1
On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
>
> <HTML><HEAD>
> <TITLE>Environment Web Directory</TITLE>
>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
>
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk. It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
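
For instance, one could run downloaded pages through the command-line tool
before handing them to a parser (a sketch; the flags are standard HTML Tidy
options):

    # Sketch: pipe raw HTML through command-line "tidy".
    # -q silences chatter, -asxhtml requests well-formed XHTML output,
    # and --force-output yes emits output even for very bad input.
    import subprocess

    def tidy_filter(raw_html):
        p = subprocess.Popen(
            ['tidy', '-q', '-asxhtml', '--force-output', 'yes'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        cleaned, _errors = p.communicate(raw_html)
        return cleaned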
Carl Banks

Apr 4 '07 #2
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #3

Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
That's a good suggestion. In fact, it looks like there's a Python API
for tidy:

http://utidylib.berlios.de/

Tried it; it seems to get rid of <! comments just fine.
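
A sketch of that cleanup through uTidylib (option names follow HTML Tidy's
configuration keys, with underscores in place of hyphens):

    # Sketch: clean a downloaded page with uTidylib before parsing.
    import urllib
    import tidy

    raw = urllib.urlopen('http://www.webdirectory.com').read()
    options = dict(output_xhtml=1, tidy_mark=0, force_output=1)
    cleaned = str(tidy.parseString(raw, **options))
    # 'cleaned' can now be handed to BeautifulSoup or a strict parser.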

Apr 4 '07 #4
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

eGenix have produced the mxTidy library that handily incorporates these
features in a way that makes them easy for Python programmers to use.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 4 '07 #5
On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
> Carl Banks wrote:
>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>> BeautifulSoup can't parse this page usefully at all.
>>> It treats the entire page as a text chunk. It's actually
>>> HTMLParser that parses comments, so this is really an HTMLParser
>>> level problem.
>> Google for a program called "tidy". Install it, and run it as a
>> filter on any HTML you download. "tidy" has invested in it quite a
>> bit of work understanding common bad HTML and how browsers deal with
>> it. It would be pointless to duplicate that work in the Python
>> standard library; let HTMLParser be small and tight, and outsource the
>> handling of floozy input to a dedicated program.
>
> Well, BeautifulSoup is just such a dedicated library.

No, not really.

> However, it defers its
> handling of comments to HTMLParser. That's the problem.

Well, it's up to the writers of Beautiful Soup to decide how much bad
HTML they want to accept. ISTM they're happy to live with the
limitations of HTMLParser, meaning that they do not consider Beautiful
Soup to be a library dedicated to reading every piece of bad HTML out
there.

(Though it's not like I read their mailing list. Maybe they aren't
happy with HTMLParser.)
Carl Banks

Apr 4 '07 #6
John Nagle wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
Anything based on libxml2 and its HTML parser will handle such broken
HTML just fine, even if it just ignores these erroneous attempts at
comments, discarding them as the plain nonsense they clearly are.
Certainly, libxml2dom seems to deal with the page:

import libxml2dom
d = libxml2dom.parseURI("http://www.webdirectory.com", html=1,
                        htmlencoding="iso-8859-1")

I guess lxml and the original libxml2 bindings work at least as well.
Note that some browsers won't be as happy if you give them such
content as XHTML.
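
A sketch of the lxml route (lxml.html sits on the same libxml2 HTML parser,
so it should tolerate the malformed declarations too):

    # Sketch: parse the page with lxml.html and pull out the title.
    from lxml import html

    doc = html.parse('http://www.webdirectory.com').getroot()
    print doc.findtext('.//title')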

Paul

Apr 4 '07 #7
Carl Banks wrote:
> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>> BeautifulSoup can't parse this page usefully at all.
>>>> It treats the entire page as a text chunk. It's actually
>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>> level problem.
>>> Google for a program called "tidy". Install it, and run it as a
>>> filter on any HTML you download. "tidy" has invested in it quite a
>>> bit of work understanding common bad HTML and how browsers deal with
>>> it. It would be pointless to duplicate that work in the Python
>>> standard library; let HTMLParser be small and tight, and outsource the
>>> handling of floozy input to a dedicated program.
>> Well, BeautifulSoup is just such a dedicated library.
>
> No, not really.

Yes, it is. Whether it succeeds in all particulars is beside the point. The
only mission of BeautifulSoup is to handle bad HTML. That it doesn't
successfully handle some other subset of bad HTML doesn't mean it's not a
dedicated program for handling bad HTML.

>> However, it defers its
>> handling of comments to HTMLParser. That's the problem.
>
> Well, it's up to the writers of Beautiful Soup to decide how much bad
> HTML they want to accept. ISTM they're happy to live with the
> limitations of HTMLParser, meaning that they do not consider Beautiful
> Soup to be a library dedicated to reading every piece of bad HTML out
> there.
Sorry, let me be clearer: the problem is that BeautifulSoup hasn't overridden
SGMLParser's handling of comments (not HTMLParser's, sorry) the way it has
overridden many other parts of SGMLParser. Yes, any fix should go into
BeautifulSoup and not SGMLParser.

All it takes is someone to code up their desired behavior for these perverse
comments and submit it to Leonard Richardson.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #8
On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
> Carl Banks wrote:
>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>> Carl Banks wrote:
>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>> It treats the entire page as a text chunk. It's actually
>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>> level problem.
>>>> Google for a program called "tidy". Install it, and run it as a
>>>> filter on any HTML you download. "tidy" has invested in it quite a
>>>> bit of work understanding common bad HTML and how browsers deal with
>>>> it. It would be pointless to duplicate that work in the Python
>>>> standard library; let HTMLParser be small and tight, and outsource the
>>>> handling of floozy input to a dedicated program.
>>> Well, BeautifulSoup is just such a dedicated library.
>> No, not really.
>
> Yes, it is. Whether it succeeds in all particulars is beside the point. The
> only mission of BeautifulSoup is to handle bad HTML.

I think the authors of BeautifulSoup have the right to decide what
their own mission is.
Carl Banks

Apr 4 '07 #9
Carl Banks wrote:
> On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>>> Carl Banks wrote:
>>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>>> level problem.
>>>>> Google for a program called "tidy". Install it, and run it as a
>>>>> filter on any HTML you download. "tidy" has invested in it quite a
>>>>> bit of work understanding common bad HTML and how browsers deal with
>>>>> it. It would be pointless to duplicate that work in the Python
>>>>> standard library; let HTMLParser be small and tight, and outsource the
>>>>> handling of floozy input to a dedicated program.
>>>> Well, BeautifulSoup is just such a dedicated library.
>>> No, not really.
>> Yes, it is. Whether it succeeds in all particulars is beside the point. The
>> only mission of BeautifulSoup is to handle bad HTML.
>
> I think the authors of BeautifulSoup have the right to decide what
> their own mission is.
Yes, and he's stated it pretty clearly:

"""You didn't write that awful page. You're just trying to get some data out of
it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."""

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #10
Robert Kern wrote:
> Carl Banks wrote:
>> On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>> Carl Banks wrote:
>>>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>>>> Carl Banks wrote:
>>>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>>>> level problem.
>> I think the authors of BeautifulSoup have the right to decide what
>> their own mission is.
>
> Yes, and he's stated it pretty clearly:
>
> """You didn't write that awful page. You're just trying to get some data out of
> it. Right now, you don't really care what HTML is supposed to look like.
>
> Neither does this parser."""
That's a good summary of the issue. It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup.

The problem actually is in BeautifulSoup, in parse_declaration:

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k + 3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

Note what happens when a bad declaration is found. SGMLParser.parse_declaration
raises SGMLParseError, and the exception handler just sucks up the rest of the
input (note that "rawdata[i:]"), treats it as unparsed data, and advances
the position to the end of input.

That's too brutal. One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level
to get the parser back on track. Maybe suck up input until the next ">",
treat that as data, then continue parsing from that point. That will do
the right thing most of the time, although bad declarations containing
a ">" will still be misparsed.

How about this patch?

            except SGMLParseError:               # bad decl, must recover
                k = self.rawdata.find('>', i)    # find next ">"
                if k == -1:                      # if no find
                    k = len(self.rawdata)        # use entire string
                toHandle = self.rawdata[i:k]     # take up to ">" as data
                self.handle_data(toHandle)       # treat as data
                j = i + len(toHandle)            # pick up parsing after ">"

This is untested, but this or something close to it should make
BeautifulSoup much more robust.

It might make sense to catch SGMLParseError in some other places too,
advancing past the next ">" and restarting parsing there as well.
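
In the meantime, the recovery can be tried without editing BeautifulSoup.py
by overriding the method in a subclass. A sketch (it omits the CDATA branch
of the original method for brevity):

    # Sketch: a subclass in which a bad declaration only consumes text
    # up to the next ">" instead of the rest of the document.
    from sgmllib import SGMLParser, SGMLParseError
    from BeautifulSoup import BeautifulSoup

    class RecoveringSoup(BeautifulSoup):
        def parse_declaration(self, i):
            try:
                return SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                k = self.rawdata.find('>', i)    # scan to the next ">"
                if k == -1:
                    k = len(self.rawdata)        # no ">": take the rest
                self.handle_data(self.rawdata[i:k])
                return k                         # resume parsing at the ">"

    soup = RecoveringSoup('<!This is an invalid comment>'
                          '<html><head><title>Ok</title></head></html>')
    print soup.title   # the <title> tag is found instead of None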

John Nagle
May 14 '07 #11
John Nagle wrote:
> Note what happens when a bad declaration is found.
> SGMLParser.parse_declaration raises SGMLParseError, and the exception
> handler just sucks up the rest of the input (note that "rawdata[i:]"),
> treats it as unparsed data, and advances the position to the end of input.
>
> That's too brutal. One bad declaration and the whole parse is messed up.
> Something needs to be done at the BeautifulSoup level
> to get the parser back on track. Maybe suck up input until the next ">",
> treat that as data, then continue parsing from that point. That will do
> the right thing most of the time, although bad declarations containing
> a ">" will still be misparsed.
>
> How about this patch?
>
>             except SGMLParseError:               # bad decl, must recover
>                 k = self.rawdata.find('>', i)    # find next ">"
>                 if k == -1:                      # if no find
>                     k = len(self.rawdata)        # use entire string
>                 toHandle = self.rawdata[i:k]     # take up to ">" as data
>                 self.handle_data(toHandle)       # treat as data
>                 j = i + len(toHandle)            # pick up parsing after ">"
I've been testing this, and it's improved parsing considerably. Now,
common lines like

<!This is an invalid comment>

don't stop parsing.

John Nagle
May 14 '07 #12
