HTMLParser chokes on bad end tag in comment

The code below results in an exception (Python 2.4.2):

HTMLParser.HTMLParseError: bad end tag: "</foo' + 'bar>", at line 4,
column 6

Should it? The end tag it chokes on is in a comment, isn't it?

import HTMLParser
HTMLParser.HTMLParser().feed("""
<html><head><title></title></head><body><script>
<!--
x = '</foo' + 'bar>'
// -->
</script></body></html>
""")

--
René Pijlman
May 28 '06 #1
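
A minimal sketch for readers on Python 3, where the old `HTMLParser` module lives on as `html.parser`. The `ScriptProbe` class and the `DOC` name are hypothetical, made up for illustration. As far as I know the Python 3 parser is more lenient than 2.4's: it does not raise here and only ends the element at a matching `</script>`. What the sketch shows is that the "comment" inside the script element is reported as plain character data, never as a comment:

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module

DOC = """
<html><head><title></title></head><body><script>
<!--
x = '</foo' + 'bar>'
// -->
</script></body></html>
"""


class ScriptProbe(HTMLParser):
    """Hypothetical helper: records what the parser reports inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_data = []  # chunks passed to handle_data inside <script>
        self.comments = []     # anything the parser treats as a real comment

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_data.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


probe = ScriptProbe()
probe.feed(DOC)
probe.close()

# Data may arrive in several chunks, so join before inspecting it.
script_text = "".join(probe.script_data)
print(repr(script_text))
print(probe.comments)
```

If this runs as expected, `probe.comments` stays empty and the `<!--` marker shows up inside `script_text`, which is the point made in the replies below.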
Rene Pijlman wrote:
> The code below results in an exception (Python 2.4.2):
>
> HTMLParser.HTMLParseError: bad end tag: "</foo' + 'bar>", at line 4,
> column 6
>
> Should it? The end tag it chokes on is in a comment, isn't it?

no. STYLE and SCRIPT elements contain character data, not parsed
character data, so comments are treated as characters, and the first
"</" ends the element.

if you have broken documents, you can tweak this by setting the
CDATA_CONTENT_ELEMENTS parser attribute before you start parsing.

</F>

May 29 '06 #2
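
A hedged sketch of the `CDATA_CONTENT_ELEMENTS` tweak, written for Python 3's `html.parser` (on 2.4 the class comes from the `HTMLParser` module instead). The assumption here is that the tweak means emptying the tuple on a subclass, so that script content is parsed like ordinary markup; `LenientScriptParser` and `DOC` are hypothetical names:

```python
from html.parser import HTMLParser

DOC = """
<html><head><title></title></head><body><script>
<!--
x = '</foo' + 'bar>'
// -->
</script></body></html>
"""


class LenientScriptParser(HTMLParser):
    # Assumption: emptying this tuple is the tweak being described.
    # With no CDATA elements, <script> content is parsed as normal markup,
    # so "<!-- ... -->" is seen as a real comment again.
    CDATA_CONTENT_ELEMENTS = ()

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = LenientScriptParser()
parser.feed(DOC)
parser.close()
print(parser.comments)
```

With the tuple emptied, the whole script body should come back through `handle_comment`, and the `</foo` inside it is never looked at as an end tag. The trade-off is that any markup-like text in a script that is not wrapped in a comment gets parsed as tags.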
Fredrik Lundh:
> Rene Pijlman:
>> [end tag in html comment in script element]
>> The end tag it chokes on is in a comment, isn't it?
>
> no. STYLE and SCRIPT elements contain character data, not parsed
> character data, so comments are treated as characters, and the first
> "</" ends the element.

Ah, I see. I'll report the problem to the application that's generating
this broken code (vBulletin forum)...

> if you have broken documents, you can tweak this by setting the
> CDATA_CONTENT_ELEMENTS parser attribute before you start parsing.

... and in the meantime that's a good workaround.

Thank you very much Fredrik.

--
René Pijlman
May 29 '06 #3
Hello Rene,

You can also check out BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) which is less strict
than the regular HTML parser.

HTH,
Miki
http://pythonwise.blogspot.com/

May 29 '06 #4
Miki:
> You can also check out BeautifulSoup
> (http://www.crummy.com/software/BeautifulSoup/) which is less strict
> than the regular HTML parser.

Yes, thanks. In this case it was my sitechecker, which checks for syntax
errors and broken links, so it was supposed to find the syntax error.
BeautifulSoup is not very well suited for validators :-)

--
René Pijlman
May 29 '06 #5
Fredrik Lundh wrote:
>> Should it? The end tag it chokes on is in a comment, isn't it?
>
> no. STYLE and SCRIPT elements contain character data, not parsed
> character data, so comments are treated as characters, and the first
> "</" ends the element.

Rather than take your word for it, I checked the W3C HTML4 DTD and found
this:

http://www.w3.org/TR/html4/appendix/...pecifying-data

Element content

When script or style data is the content of an element (SCRIPT and STYLE),
the data begins immediately after the element start tag and ends at the
first ETAGO ("</") delimiter followed by a name start character ([a-zA-Z]);
note that this may not be the element's end tag. Authors should therefore
escape "</" within the content. Escape mechanisms are specific to each
scripting or style sheet language.

ILLEGAL EXAMPLE:
The following script data incorrectly contains a "</" sequence (as part of
"</EM>") before the SCRIPT end tag:

<SCRIPT type="text/javascript">
document.write ("<EM>This won't work</EM>")
</SCRIPT>

In JavaScript, this code can be expressed legally by hiding the ETAGO
delimiter before an SGML name start character:

<SCRIPT type="text/javascript">
document.write ("<EM>This will work<\/EM>")
</SCRIPT>

Guess you learn something new every day. Too bad there's so much illegal
code in the wild. :(

--
Edward Elliott
UC Berkeley School of Law (Boalt Hall)
complangpython at eddeye dot net
May 29 '06 #6
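
The rule quoted from the HTML 4 appendix can be sketched in a few lines of Python. This is only an illustration, not how any particular validator works: `find_etago` and `escape_etago` are hypothetical helpers, they expect the script body without its enclosing tags, and the escape is only safe where the `</` sits inside a JavaScript string literal, as in the W3C example.

```python
import re

# HTML 4 rule quoted above: "</" followed by a name start character
# ([a-zA-Z]) ends the script data, whether or not it is the real end tag.
ETAGO = re.compile(r"</[a-zA-Z]")


def find_etago(script_body):
    """Return the offset of every "</" + letter inside a script body."""
    return [m.start() for m in ETAGO.finditer(script_body)]


def escape_etago(script_body):
    """Rewrite "</" as "<\\/" (assumes it only occurs in JS string literals)."""
    return script_body.replace("</", "<\\/")


bad = 'document.write ("<EM>This won\'t work</EM>")'
good = escape_etago(bad)

print(find_etago(bad))   # offsets of the offending "</" sequences
print(good)
print(find_etago(good))  # nothing left to complain about
```

The same check applied to the script in the first post would flag `</foo`, which is what the 2.4 parser complained about.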
Edward Elliott wrote:
> Guess you learn something new every day. Too bad there's so much illegal
> code in the wild. :(

if more people learned something new every day, the wild would look a
lot different.

</F>
May 29 '06 #7
