BeautifulSoup bug when ">>>" found in attribute value

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.

John Nagle

Dec 26 '06 #1
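
For anyone who wants to compare against a quote-aware tokenizer, here is a minimal sketch that feeds the same markup to Python's standard-library html.parser instead of BeautifulSoup. TagRecorder and record are hypothetical names made up for this illustration, and the sketch says nothing about how BeautifulSoup works internally; it only shows what "parsed right" would look like, namely one param tag and no stray text.

```python
from html.parser import HTMLParser

# The markup from the original post, with '>>>' inside a quoted attribute value.
SNIPPET = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer\n'
    'fantastic rates for selected weeks or days!!&blinkt=Click here\n'
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
)


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and any text outside tags."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag name, attribute dict)
        self.text = []  # character data seen outside of tags

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


def record(markup):
    """Parse markup and return (tags, text found outside tags)."""
    parser = TagRecorder()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.text)


tags, stray_text = record(SNIPPET)
print(tags[0][0], repr(stray_text))
```

If the tokenizer respects the quotes, the ">>>" stays inside the value attribute and nothing is left over as text after the tag.
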
John Nagle <na***@animats.com> wrote:
> And this came out, via prettify:
>
> <addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
> </param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
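
The two strategies described above can be sketched roughly as follows. Both functions are hypothetical illustrations written for this thread, not code from BeautifulSoup or any browser, and the quote-aware version is deliberately simplified (it treats every quote character inside the tag as an attribute delimiter). The tag string is a shortened version of the one in the original post.

```python
def naive_tag_end(markup, start=0):
    """Index of the first '>' after start, ignoring quotes entirely."""
    return markup.find(">", start)


def quote_aware_tag_end(markup, start=0):
    """Index of the first '>' that is not inside a quoted attribute value."""
    quote = None  # the quote character currently open, if any
    for i in range(start, len(markup)):
        ch = markup[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote found
        elif ch == ">":
            return i
    return -1  # no tag end found


tag = '<param name="movie" value="Click here >>>&linkurl=/Offer/2408" />'

# What each strategy would leave behind as "text after the tag".
naive_rest = tag[naive_tag_end(tag) + 1:]
aware_rest = tag[quote_aware_tag_end(tag) + 1:]
print(repr(naive_rest))
print(repr(aware_rest))
```

With the naive scan, the leftover begins with ">>&linkurl=", which looks a lot like the extra line in the prettify output; with the quote-aware scan nothing is left over.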

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.

Dec 27 '06 #2
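
To make the point about unterminated entities concrete, here is a small sketch using html.unescape from Python's standard library (not BeautifulSoup). It follows the HTML5 rules, under which a handful of legacy entity names such as "lt" and "amp" are accepted without the semicolon. "linkurl" is not an entity name at all, so a parser that appends a semicolon to it is guessing. The sample strings are shortened versions of the text in this thread.

```python
import html

# A known entity decodes with or without the terminating semicolon.
terminated = html.unescape("1 &lt; 2")
unterminated = html.unescape("1 &lt 2")

# "&linkurl" is just a query-string separator followed by a parameter name;
# it is not a known entity, so unescaping leaves the string untouched.
query = "?blinkt=Click&linkurl=/Offer/2408"
unchanged = html.unescape(query)

# Adding a semicolon does not turn it into a real entity either.
with_semicolon = html.unescape("&linkurl;=/Offer/2408")

print(terminated, unterminated, unchanged, with_semicolon, sep="\n")
```
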
Duncan Booth wrote:
> John Nagle <na***@animats.com> wrote:
>
>> And this came out, via prettify:
>>
>> <addresssnippet siteurl="http%3A//apartmentsapart.com"
>> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
>> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>> </param>
>>
>> BeautifulSoup seems to have become confused by the ">>>" within
>> a quoted attribute value. It first parsed it right, but then stuck
>> in an extra, totally bogus line. Note the entity "&linkurl;", which
>> appears nowhere in the original. It looks like code to handle a
>> missing quote mark did the wrong thing.
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

John Nagle

Dec 27 '06 #3
John Nagle <na***@animats.com> wrote:
> It's worse than that. Look at the last line of BeautifulSoup
> output:
>
> &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> That "/>" doesn't match anything. We're outside a tag at that point.
> And it was introduced by BeautifulSoup. That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen. This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:

  Similarly, authors should use "&gt;" (ASCII decimal 62) in text
  instead of ">" to avoid problems with older user agents that
  incorrectly perceive this as the end of a tag (tag close delimiter)
  when it appears in quoted attribute values.

Thank you, it looks like I just learned something new.

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.

Dec 27 '06 #4
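
On the authoring side, following the spec's "should" is cheap. A minimal sketch, assuming the page is generated from Python: html.escape from the standard library replaces ">", "<", "&" and (with quote=True) quote characters, so the attribute value no longer contains anything an older parser could mistake for the end of the tag. The value below is a shortened version of the one in the original post.

```python
import html

raw_value = "sw04.swf?blinkt=Click here >>>&linkurl=/Offer/2408"

# Escape the value before placing it inside a double-quoted attribute.
escaped = html.escape(raw_value, quote=True)
tag = '<param name="movie" value="%s" />' % escaped

print(tag)

# Unescaping gives back the original value, so nothing is lost.
round_trip = html.unescape(escaped)
```
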
Duncan Booth wrote:
> The /> was in the original input that you gave it:
>
> <param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
> offer fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].

> As I said before, it looks like BeautifulSoup decided that the tag ended
> at the first > although it took text beyond that up to the closing " as
> the value of the attribute. The remaining text was then simply treated
> as text content of the unclosed param tag. Finally it inserted a
> </param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/....html#h-13.3.2

> Mind you, the sentence before that says 'should' for quoting < characters
> which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Dec 28 '06 #5
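
The XML rule mentioned above can be checked with a short sketch using xml.etree.ElementTree from Python's standard library. is_well_formed is a hypothetical helper written for this illustration, and the three documents are made-up minimal cases rather than anything from the site in question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the standard-library XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A literal '>' is fine inside an attribute value and inside text.
attr_ok = is_well_formed('<param value="Click here >>> now" />')
text_ok = is_well_formed("<p>1 > 0</p>")

# The exception: '>' directly after ']]' in text is expected to be rejected.
after_brackets_ok = is_well_formed("<p>a ]]> b</p>")

print(attr_ok, text_ok, after_brackets_ok)
```
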
"Anne van Kesteren" <an*************@gmail.comwrote:
>Mind you, the sentence before that says 'should' for quoting <
characters which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).
Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
Dec 28 '06 #6
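
The same leniency can be seen outside browsers. Here is a minimal sketch with Python's standard-library html.parser, which also treats a "<" followed by a space as ordinary text rather than the start of a tag. TextAndTags and tokenize are hypothetical names for this illustration, and the input string is made up.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Hypothetical helper: collects start-tag names and character data."""

    def __init__(self):
        super().__init__()
        self.tags = []    # names of start tags, in order
        self.chunks = []  # pieces of character data, in order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(markup):
    """Parse markup and return (start-tag names, concatenated text)."""
    parser = TextAndTags()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.chunks)


tags, text = tokenize("<p>if a < b then stop</p>")
print(tags, repr(text))
```

The unescaped "<" comes through as part of the text, and only the p element is reported as a tag.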
