BeautifulSoup vs. real-world HTML comments

The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:
<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>
Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser
level problem.
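
A minimal reproduction, assuming the Python 2 / BeautifulSoup 3 toolchain of
the time (the page text is abbreviated from the example above):

    # Sketch: the bad "<!...>" declaration makes BeautifulSoup treat
    # everything after it as one text chunk, so no tags are found.
    from BeautifulSoup import BeautifulSoup

    page = ('<!Hello there! Welcome to The Environment Directory!>\n'
            '<HTML><HEAD><TITLE>Environment Web Directory</TITLE>'
            '</HEAD></HTML>')
    soup = BeautifulSoup(page)
    print soup.title   # prints None instead of the <title> tag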
John Nagle
Apr 4 '07 #1
On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
>
> <HTML><HEAD>
> <TITLE>Environment Web Directory</TITLE>
>
> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
> without problems.
>
> BeautifulSoup can't parse this page usefully at all.
> It treats the entire page as a text chunk. It's actually
> HTMLParser that parses comments, so this is really an HTMLParser
> level problem.
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
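
For instance, one could run downloaded pages through the command-line tool
before handing them to a parser (a sketch; the flags are standard HTML Tidy
options):

    # Sketch: pipe raw HTML through command-line "tidy".
    # -q silences chatter, -asxhtml requests well-formed XHTML output,
    # and --force-output yes emits output even for very bad input.
    import subprocess

    def tidy_filter(raw_html):
        p = subprocess.Popen(
            ['tidy', '-q', '-asxhtml', '--force-output', 'yes'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        cleaned, _errors = p.communicate(raw_html)
        return cleaned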
Carl Banks

Apr 4 '07 #2
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #3

Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
That's a good suggestion. In fact, it looks like there's a Python API
for tidy:

http://utidylib.berlios.de/

Tried it; it seems to get rid of <! comments just fine.
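
A sketch of that cleanup through uTidylib (option names follow HTML Tidy's
configuration keys, with underscores in place of hyphens):

    # Sketch: clean a downloaded page with uTidylib before parsing.
    import urllib
    import tidy

    raw = urllib.urlopen('http://www.webdirectory.com').read()
    options = dict(output_xhtml=1, tidy_mark=0, force_output=1)
    cleaned = str(tidy.parseString(raw, **options))
    # 'cleaned' can now be handed to BeautifulSoup or a strict parser.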

Apr 4 '07 #4
Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
>> without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

eGenix have produced the mxTidy library that handily incorporates these
features in a way that makes them easy for Python programmers to use.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 4 '07 #5
On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
> Carl Banks wrote:
>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>> BeautifulSoup can't parse this page usefully at all.
>>> It treats the entire page as a text chunk. It's actually
>>> HTMLParser that parses comments, so this is really an HTMLParser
>>> level problem.
>> Google for a program called "tidy". Install it, and run it as a
>> filter on any HTML you download. "tidy" has invested in it quite a
>> bit of work understanding common bad HTML and how browsers deal with
>> it. It would be pointless to duplicate that work in the Python
>> standard library; let HTMLParser be small and tight, and outsource the
>> handling of floozy input to a dedicated program.
>
> Well, BeautifulSoup is just such a dedicated library.

No, not really.

> However, it defers its
> handling of comments to HTMLParser. That's the problem.

Well, it's up to the writers of Beautiful Soup to decide how much bad
HTML they want to accept. ISTM they're happy to live with the
limitations of HTMLParser, meaning that they do not consider Beautiful
Soup to be a library dedicated to reading every piece of bad HTML out
there.

(Though it's not like I read their mailing list. Maybe they aren't
happy with HTMLParser.)
Carl Banks

Apr 4 '07 #6
John Nagle wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
>
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
Anything based on libxml2 and its HTML parser will handle such broken
HTML just fine, even if it just ignores these erroneous attempts at
comments, discarding them as the plain nonsense they clearly are.
Certainly, libxml2dom seems to deal with the page:

import libxml2dom
d = libxml2dom.parseURI("http://www.webdirectory.com", html=1,
                        htmlencoding="iso-8859-1")

I guess lxml and the original libxml2 bindings work at least as well.
Note that some browsers won't be as happy if you give them such
content as XHTML.
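
A sketch of the lxml route (lxml.html sits on the same libxml2 HTML parser,
so it should tolerate the malformed declarations too):

    # Sketch: parse the page with lxml.html and pull out the title.
    from lxml import html

    doc = html.parse('http://www.webdirectory.com').getroot()
    print doc.findtext('.//title')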

Paul

Apr 4 '07 #7
Carl Banks wrote:
> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>> BeautifulSoup can't parse this page usefully at all.
>>>> It treats the entire page as a text chunk. It's actually
>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>> level problem.
>>> Google for a program called "tidy". Install it, and run it as a
>>> filter on any HTML you download. "tidy" has invested in it quite a
>>> bit of work understanding common bad HTML and how browsers deal with
>>> it. It would be pointless to duplicate that work in the Python
>>> standard library; let HTMLParser be small and tight, and outsource the
>>> handling of floozy input to a dedicated program.
>> Well, BeautifulSoup is just such a dedicated library.
>
> No, not really.

Yes, it is. Whether it succeeds in all particulars is beside the point. The
only mission of BeautifulSoup is to handle bad HTML. That it doesn't
successfully handle some other subset of bad HTML doesn't mean it's not a
dedicated program for handling bad HTML.

>> However, it defers its
>> handling of comments to HTMLParser. That's the problem.
>
> Well, it's up to the writers of Beautiful Soup to decide how much bad
> HTML they want to accept. ISTM they're happy to live with the
> limitations of HTMLParser, meaning that they do not consider Beautiful
> Soup to be a library dedicated to reading every piece of bad HTML out
> there.
Sorry, let me be clearer: the problem is that BeautifulSoup hasn't overridden
SGMLParser's handling of comments (not HTMLParser's, sorry) the way it has
overridden many other parts of SGMLParser. Yes, any fix should go into
BeautifulSoup and not SGMLParser.

All it takes is someone to code up their desired behavior for these perverse
comments and submit it to Leonard Richardson.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #8
On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
> Carl Banks wrote:
>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>> Carl Banks wrote:
>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>> It treats the entire page as a text chunk. It's actually
>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>> level problem.
>>>> Google for a program called "tidy". Install it, and run it as a
>>>> filter on any HTML you download. "tidy" has invested in it quite a
>>>> bit of work understanding common bad HTML and how browsers deal with
>>>> it. It would be pointless to duplicate that work in the Python
>>>> standard library; let HTMLParser be small and tight, and outsource the
>>>> handling of floozy input to a dedicated program.
>>> Well, BeautifulSoup is just such a dedicated library.
>> No, not really.
>
> Yes, it is. Whether it succeeds in all particulars is beside the point. The
> only mission of BeautifulSoup is to handle bad HTML.

I think the authors of BeautifulSoup have the right to decide what
their own mission is.
Carl Banks

Apr 4 '07 #9
Carl Banks wrote:
> On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>>> Carl Banks wrote:
>>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>>> level problem.
>>>>> Google for a program called "tidy". Install it, and run it as a
>>>>> filter on any HTML you download. "tidy" has invested in it quite a
>>>>> bit of work understanding common bad HTML and how browsers deal with
>>>>> it. It would be pointless to duplicate that work in the Python
>>>>> standard library; let HTMLParser be small and tight, and outsource the
>>>>> handling of floozy input to a dedicated program.
>>>> Well, BeautifulSoup is just such a dedicated library.
>>> No, not really.
>> Yes, it is. Whether it succeeds in all particulars is beside the point. The
>> only mission of BeautifulSoup is to handle bad HTML.
>
> I think the authors of BeautifulSoup have the right to decide what
> their own mission is.
Yes, and he's stated it pretty clearly:

"""You didn't write that awful page. You're just trying to get some data out of
it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."""

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '07 #10
Robert Kern wrote:
> Carl Banks wrote:
>> On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>> Carl Banks wrote:
>>>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>>>> Carl Banks wrote:
>>>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>>>> level problem.
>> I think the authors of BeautifulSoup have the right to decide what
>> their own mission is.
>
> Yes, and he's stated it pretty clearly:
>
> """You didn't write that awful page. You're just trying to get some data out of
> it. Right now, you don't really care what HTML is supposed to look like.
>
> Neither does this parser."""
That's a good summary of the issue. It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup.

The problem actually is in BeautifulSoup, in parse_declaration:

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k + 3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

Note what happens when a bad declaration is found. SGMLParser.parse_declaration
raises SGMLParseError, and the exception handler just sucks up the rest of the
input (note that "rawdata[i:]"), treats it as unparsed data, and advances
the position to the end of input.

That's too brutal. One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level
to get the parser back on track. Maybe suck up input until the next ">",
treat that as data, then continue parsing from that point. That will do
the right thing most of the time, although bad declarations containing
a ">" will still be misparsed.

How about this patch?

            except SGMLParseError:               # bad decl, must recover
                k = self.rawdata.find('>', i)    # find next ">"
                if k == -1:                      # if no find
                    k = len(self.rawdata)        # use entire string
                toHandle = self.rawdata[i:k]     # take up to ">" as data
                self.handle_data(toHandle)       # treat as data
                j = i + len(toHandle)            # pick up parsing after ">"

This is untested, but this or something close to it should make
BeautifulSoup much more robust.

It might make sense to catch SGMLParseError in some other places too,
advancing past the next ">" and restarting parsing there as well.
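
In the meantime, the recovery can be tried without editing BeautifulSoup.py
by overriding the method in a subclass. A sketch (it omits the CDATA branch
of the original method for brevity):

    # Sketch: a subclass in which a bad declaration only consumes text
    # up to the next ">" instead of the rest of the document.
    from sgmllib import SGMLParser, SGMLParseError
    from BeautifulSoup import BeautifulSoup

    class RecoveringSoup(BeautifulSoup):
        def parse_declaration(self, i):
            try:
                return SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                k = self.rawdata.find('>', i)    # scan to the next ">"
                if k == -1:
                    k = len(self.rawdata)        # no ">": take the rest
                self.handle_data(self.rawdata[i:k])
                return k                         # resume parsing at the ">"

    soup = RecoveringSoup('<!This is an invalid comment>'
                          '<html><head><title>Ok</title></head></html>')
    print soup.title   # the <title> tag is found instead of None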

John Nagle
May 14 '07 #11
John Nagle wrote:
> Note what happens when a bad declaration is found.
> SGMLParser.parse_declaration raises SGMLParseError, and the exception
> handler just sucks up the rest of the input (note that "rawdata[i:]"),
> treats it as unparsed data, and advances the position to the end of input.
>
> That's too brutal. One bad declaration and the whole parse is messed up.
> Something needs to be done at the BeautifulSoup level
> to get the parser back on track. Maybe suck up input until the next ">",
> treat that as data, then continue parsing from that point. That will do
> the right thing most of the time, although bad declarations containing
> a ">" will still be misparsed.
>
> How about this patch?
>
>             except SGMLParseError:               # bad decl, must recover
>                 k = self.rawdata.find('>', i)    # find next ">"
>                 if k == -1:                      # if no find
>                     k = len(self.rawdata)        # use entire string
>                 toHandle = self.rawdata[i:k]     # take up to ">" as data
>                 self.handle_data(toHandle)       # treat as data
>                 j = i + len(toHandle)            # pick up parsing after ">"
I've been testing this, and it's improved parsing considerably. Now,
common lines like

<!This is an invalid comment>

don't stop parsing.

John Nagle
May 14 '07 #12
