HTML parsing bug?

Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

The html file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </html> - this is a comment in JavaScript, which is itself inside
an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>
The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())

Jan 30 '06 #1
G.
> // </html> - this is a comment in JavaScript, which is itself inside
an HTML comment


This is supposed to be one line. Got wrapped during posting.

Jan 30 '06 #2

<g_**************@yahoo.com> wrote in message
news:11**********************@g44g2000cwa.googlegroups.com...
> Python 2.3.5 seems to choke when trying to parse html files, because it
> doesn't realize that what's inside <!-- --> is a comment in HTML,
> even if this comment is inside <script> </script>, especially if it's a
> comment inside that script code too.


Actually, you are technically incorrect; try validating the code you posted.
Google found this explanation: http://lachy.id.au/log/2005/05/script-comments
Feeding even slightly invalid HTML to the standard library parser will often
choke it. If you can't guarantee clean sources, best use Tidy first or another
parser entirely.

Jan 30 '06 #3
> this is a comment in JavaScript, which is itself inside an HTML comment

Don't nest HTML comments. Occasionally it may break the browsers as
well.

(I remember this from one of the weirdest of bughunts: whenever the
number of characters between nested HTML comments was divisible by four
the page would render incorrectly ... or something of that sort)

i.

Jan 30 '06 #4
"Istvan Albert" <is***********@gmail.com> wrote:
>> this is a comment in JavaScript, which is itself inside an HTML comment
>
> Don't nest HTML comments. Occasionally it may break the browsers as
> well.


Did you read the post? He didn't nest HTML comments. He put a Javascript
comment inside an HTML comment, inside a <script></script> pair. Virtually
every page with Javascript does exactly the same thing.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Feb 1 '06 #5
g_**************@yahoo.com wrote:
> Python 2.3.5 seems to choke when trying to parse html files, because it
> doesn't realize that what's inside <!-- --> is a comment in HTML,
> even if this comment is inside <script> </script>, especially if it's a
> comment inside that script code too.


nope. what's inside <!-- --> is not a comment if it's inside a <script>
or <style> tag. read the spec:

http://www.w3.org/TR/REC-html40/types.html#type-cdata

"Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by
user agents. Markup and entities must be treated as raw text and
passed to the application as is. The first occurrence of the
character sequence "</" (end-tag open delimiter) is treated as
terminating the end of the element's content. In valid documents,
this would be the end tag for the element."

in your case, the first occurrence of "</" is not the end tag.
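
As a rough illustration of the rule quoted above (not how HTMLParser is implemented internally), the sketch below cuts the SCRIPT content at the first "</", the way a strict HTML 4 reading would. SAMPLE and strict_script_content are hypothetical names made up for this example, and the code is written in Python 3 syntax.

```python
# Hypothetical sample: the <script> element from the posted page,
# with the JavaScript comment on a single line as the poster intended.
SAMPLE = """<script language="JavaScript">
<!--
// </html> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>"""


def strict_script_content(html):
    """Return the SCRIPT content as a strict HTML 4 reading would see it.

    Assumes `html` starts with the <script ...> start tag.
    """
    start = html.index(">") + 1      # just past the end of the start tag
    end = html.find("</", start)     # the first "</" terminates the content
    if end == -1:
        return html[start:]
    return html[start:end]


content = strict_script_content(SAMPLE)
print(repr(content))
```

Under this reading the content stops at the "</" of the "</html>" inside the JavaScript comment, well before "</script>", so what follows is not the end tag a valid document would have there.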

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by default, it is
set to

CDATA_CONTENT_ELEMENTS = ("script", "style")

setting it to an empty tuple disables HTML-compliant handling for these
elements:

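from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.CDATA_CONTENT_ELEMENTS = ()  # treat <script>/<style> content as ordinary markup
p.feed(f.read())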

</F>
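
For readers on Python 3, where the module lives at html.parser, the following is a minimal sketch of the same experiment. Recorder and record are hypothetical helpers invented for this example, and the page is inlined as a string instead of being fetched with urlopen. It assumes a reasonably recent Python 3, where the parser appears to leave script mode only at the matching end tag, so the original choke may not reproduce there.

```python
from html.parser import HTMLParser  # Python 3 location of the old HTMLParser module

PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </html> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>
"""


class Recorder(HTMLParser):
    """Hypothetical helper: records script text and comments seen by the parser."""

    def __init__(self, cdata_elements=None):
        super().__init__()
        if cdata_elements is not None:
            # Same override as in the post above, set on the instance.
            self.CDATA_CONTENT_ELEMENTS = cdata_elements
        self.in_script = False
        self.script_text = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def record(page, cdata_elements=None):
    """Parse `page` and return the Recorder holding what was reported."""
    parser = Recorder(cdata_elements)
    parser.feed(page)
    parser.close()
    return parser


default = record(PAGE)                      # script content kept as raw text
relaxed = record(PAGE, cdata_elements=())   # script content parsed as markup
print("default comments:", default.comments)
print("relaxed comments:", relaxed.comments)
```

With the default setting the "<!-- ... -->" block should arrive through handle_data as script text; with the empty tuple it should be reported through handle_comment instead, which is the trade-off the workaround makes.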

Feb 1 '06 #6
>> this is a comment in JavaScript, which is itself inside an HTML comment
> Did you read the post?


misread it rather ...

Feb 2 '06 #7
