Robert Kern wrote:
> Carl Banks wrote:
>> On Apr 4, 4:55 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>> Carl Banks wrote:
>>>> On Apr 4, 2:43 pm, Robert Kern <robert.k...@gmail.com> wrote:
>>>>> Carl Banks wrote:
>>>>>> On Apr 4, 2:08 pm, John Nagle <n...@animats.com> wrote:
>>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>>> HTMLParser that parses comments, so this is really an
>>>>>>> HTMLParser-level problem.
>>>>>> I think the authors of BeautifulSoup have the right to decide
>>>>>> what their own mission is.
>>>>> Yes, and he's stated it pretty clearly:
>>>>> """You didn't write that awful page. You're just trying to get
>>>>> some data out of it. Right now, you don't really care what HTML
>>>>> is supposed to look like. Neither does this parser."""
That's a good summary of the issue. It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup.
The problem actually is in BeautifulSoup, in parse_declaration:
    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k + 3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
Note what happens when a bad declaration is found. SGMLParser.parse_declaration
raises SGMLParseError, and the exception handler just sucks up the rest of the
input (note that "rawdata[i:]"), treats it as unparsed data, and advances
the position to the end of input.
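The swallow-everything behavior is easy to see in isolation. Here's a
minimal sketch (plain Python, no BeautifulSoup or sgmllib required;
"broken_recover" is a made-up name that just mirrors the except branch
above, not anything in the library):

```python
# Simulation of the except SGMLParseError branch above: on a bad
# declaration, everything from position i to the end of the input
# is handed to handle_data as one text chunk.

def broken_recover(rawdata, i):
    """Return (text treated as data, new parse position)."""
    to_handle = rawdata[i:]              # the rest of the document
    return to_handle, i + len(to_handle)

page = '<p>ok</p><!-bad comment-><p>this markup is lost</p>'
i = page.find('<!-')                     # position of the bad declaration
swallowed, j = broken_recover(page, i)
print(swallowed)       # '<!-bad comment-><p>this markup is lost</p>'
print(j == len(page))  # True: parsing resumes at end of input
```

Everything after the bad comment, markup included, comes back as one
opaque text chunk, which is exactly the misbehavior seen on the page.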
That's too brutal. One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level
to get the parser back on track. Maybe suck up input until the next ">",
treat that as data, then continue parsing from that point. That will do
the right thing most of the time, although bad declarations containing
a ">" will still be misparsed.
How about this patch?
    except SGMLParseError:                 # bad decl, must recover
        k = self.rawdata.find('>', i)      # find next ">"
        if k == -1:                        # if no find
            k = len(self.rawdata)          # use entire string
        toHandle = self.rawdata[i:k]       # take text up to ">" as data
        self.handle_data(toHandle)         # treat as data
        j = i + len(toHandle)              # resume parsing at the ">"
This is untested, but this or something close to it should make
BeautifulSoup much more robust.
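Since the patch is untested, the recovery logic can at least be
exercised in isolation by lifting the handler body into a standalone
function ("patched_recover" is an illustrative name, not part of
BeautifulSoup):

```python
# Standalone sketch of the patched recovery: consume only up to the
# next '>' (or end of input) and report where parsing should resume.

def patched_recover(rawdata, i):
    """Return (text treated as data, new parse position)."""
    k = rawdata.find('>', i)
    if k == -1:                          # no '>' at all
        k = len(rawdata)                 # use the entire remainder
    to_handle = rawdata[i:k]
    return to_handle, i + len(to_handle)

page = '<p>ok</p><!-bad comment-><p>still parsed</p>'
i = page.find('<!-')
data, j = patched_recover(page, i)
print(data)      # '<!-bad comment-' -- only the bad declaration
print(page[j:])  # '><p>still parsed</p>' -- later markup survives
```

Only the bad declaration becomes data; the markup after the ">" is
still there for the parser to continue with.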
It might make sense to catch SGMLParseError in some other
places too, advance past the next ">", and restart parsing.
John Nagle