
Help w/ HTMLParser lib

Hi all -

I'm somewhat new to Python (about 1 year), and I'm trying to write a program
that opens a file-like object w/ urllib.urlopen, and then parses the data by
passing it to a class that subclasses HTMLParser.HTMLParser. On the web
page, however, there is JavaScript - and I think that is causing an error
in parsing the data. Here's the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "html_helper.py", line 30, in parse_data
    p.feed(data)
  File "//usr/lib/python2.2/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "//usr/lib/python2.2/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "//usr/lib/python2.2/HTMLParser.py", line 329, in parse_endtag
    self.error("bad end tag: %s" % `rawdata[i:j]`)
  File "//usr/lib/python2.2/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 411, column 7

I've tried to use a try/except clause both w/in my class and w/in a function
that wraps the class for easy access, but to no avail. The code works on
other websites, so I know that it's not *completely* off. Any help would
be greatly appreciated! TIA :)

Kevin
Jul 18 '05 #1
4 Replies


You might be out of luck as far as HTMLParser goes. HTMLParser thinks
that's a closing tag (an illegal one), and there's no way to shut off
closing tags.

I suggest you work around it by removing the script tag before feeding
the file to HTMLParser. If you feed the file one line at a time, then
search for the string '<script>'. If it's there, feed only the part
of the line before it to HTMLParser, then scan for the closing tag
yourself, and when you find it, only feed the part after it to
HTMLParser, doing nothing with the stuff in between. Here is a HIGHLY
UNTESTED example:

    import re

    _scriptopen = re.compile(r"<\s*script[^<>]*>")
    _scriptclose = re.compile(r"</\s*script\s*>")

    m = _scriptopen.search(line)
    if m:
        # feed only the text before the opening <script> tag
        parserobject.feed(line[:m.start()])
        line = line[m.end():]
        while 1:
            # skip everything up to and including the closing tag
            m2 = _scriptclose.search(line)
            if m2:
                parserobject.feed(line[m2.end():])
                break
            line = urllibobject.readline()
            if not line:
                break
    else:
        parserobject.feed(line)

It's not good HTML, but (once it's debugged) it'll work most of the
time as a practical matter. If you feed the whole file at once, then
you could maybe do it with one regexp (again HIGHLY UNTESTED):

    _scripttag = re.compile(r"<\s*script[^<>]*>.*?</\s*script\s*>", re.DOTALL)
    buffer = _scripttag.sub('', buffer)
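
For anyone who wants to try the whole-file idea end to end, here is a minimal, self-contained sketch. It assumes a current Python 3, where the parser lives in html.parser rather than the Python 2.2 HTMLParser module used in this thread. The names strip_scripts, TextCollector and the sample page are made up for illustration, and the IGNORECASE flag is an addition so that <SCRIPT> is caught too.

```python
import re
from html.parser import HTMLParser

# Same idea as the one-regexp approach: drop whole <script>...</script> blocks.
SCRIPT_BLOCK = re.compile(
    r"<\s*script[^<>]*>.*?</\s*script\s*>", re.DOTALL | re.IGNORECASE
)


def strip_scripts(html_text):
    """Return html_text with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html_text)


class TextCollector(HTMLParser):
    """Hypothetical parser subclass that just records non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Sample page (invented) containing the same split end tag as in the traceback.
page = (
    "<html><body><p>before</p>\n"
    "<script type=\"text/javascript\">\n"
    "document.write('<scr' + 'ipt src=\"x.js\"></scr' + 'ipt>');\n"
    "</script>\n"
    "<p>after</p></body></html>"
)

collector = TextCollector()
collector.feed(strip_scripts(page))
collector.close()
print(collector.chunks)
```

The regexp only stops at a real </script>, so the "</scr' + 'ipt>" inside the JavaScript string is thrown away along with the rest of the block and never reaches the parser. Like the snippets above, this is a practical workaround rather than a proper HTML tokenizer.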
--
CARL BANKS http://www.aerojockey.com/software
"If you believe in yourself, drink your school, stay on drugs, and
don't do milk, you can get work."
-- Parody of Mr. T from a Robert Smigel Cartoon
Jul 18 '05 #2


"Kevin T. Ryan" <ke*********@yahoo.com> wrote in message
news:40**********************@news.rcn.com...
I'm somewhat new to python (about 1 year), and I'm trying to write a program
that opens a file like object w/ urllib.urlopen, and then parse the data by
passing it to a class that subclasses HTMLParser.HTMLParser. On the web
page, however, there is javascript - and I think that is causing an error
in parsing the data.


The trouble is there is so much junk HTML on the web, which only vaguely
follows the syntax. If you are feeding your program with a wide variety of pages,
I would recommend sanitising the page using Tidy or uTidylib first.

Jul 18 '05 #3

Thanks to both of you - I will try to incorporate the regexes and I'll check
out Tidy. Take care,

Kevin
Jul 18 '05 #4

Kevin T. Ryan <ke*********@yahoo.com> wrote:
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 411,


HTMLParser is correct. You can't include the sequence '</' in a <script>
block in standard HTML. The '</' (aka ETAGO) is taken as being the end
of the script; then the parser gets cross because "scr' + 'ipt" isn't a
valid element name.

Most browsers, on the other hand, will ignore any end-tag that isn't
</script>, hiding the problem.

The usual solution at the HTML side - for those who care - is to use a
JavaScript string literal escape like:

document.write('<\/script>');

At the Python side you can't hope to deal with all the crap markup on
the web, so unless you know the site you're reading uses valid HTML you
should use Tidy to sanitise the input first.
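
As a hedged aside on the browser behaviour described above: the sketch below assumes a current Python 3 and its html.parser module, not the Python 2.2 HTMLParser from the traceback. On the versions I'm aware of, that parser takes the browser-like route and treats everything inside <script> as raw text until it sees a real </script>. EventLogger and the sample page are invented for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records the parse events it receives."""

    def __init__(self):
        super().__init__()
        self.tags = []    # start and end tags, in the order they arrive
        self.script = []  # raw text seen while inside a <script> element

    def handle_starttag(self, tag, attrs):
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)

    def handle_data(self, data):
        # If the most recent tag event opened a script, this is script text.
        if self.tags and self.tags[-1] == "<script>":
            self.script.append(data)


# Invented sample with the same split end tag that tripped the old parser.
page = "<p>hi</p><script>document.write('</scr' + 'ipt>');</script><p>bye</p>"

logger = EventLogger()
logger.feed(page)
logger.close()
print(logger.tags)
print("".join(logger.script))
```

That hides the problem in the same way browsers do; it does not make the markup valid, so the '<\/script>' escape is still the right fix on the HTML side.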

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #5
