BeautifulSoup bug when ">>>" found in attribute value

John Nagle

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here

>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">

>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.

John Nagle

Dec 26 '06 #1

Duncan Booth

John Nagle <na***@animats.com> wrote:

And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie"
value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">

>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a
missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
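To make the two strategies concrete, here is a minimal sketch of both ways of finding the end of a tag. It is not BeautifulSoup's code; the function names and the shortened attribute value are made up for illustration.

```python
def naive_tag_end(html: str, start: int) -> int:
    """Return the index just past the first '>' after start, ignoring quotes."""
    return html.index(">", start) + 1


def quote_aware_tag_end(html: str, start: int) -> int:
    """Return the index just past the first '>' that is not inside quotes."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # entering a quoted attribute value
        elif ch == ">":
            return i + 1
    raise ValueError("unterminated tag")


# Shortened, hypothetical version of the tag from the original post.
tag = '<param name="movie" value="a.swf?t=Click here >>>&linkurl=/Offer/2408" />'

naive = tag[:naive_tag_end(tag, 0)]
aware = tag[:quote_aware_tag_end(tag, 0)]
print(naive)  # stops inside the attribute value
print(aware)  # the whole tag
```

The naive scan cuts the tag off in the middle of the value, which is the "missing quote" guess; the quote-aware scan is what IE and Firefox appear to do.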

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.
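For comparison, Python's standard library shows the same kind of leniency. This sketch uses `html.unescape` as a stand-in, not BeautifulSoup's own entity handling: a known entity name is accepted without its semicolon, while an unknown name such as the one in the post is left alone.

```python
import html

# "&lt" has no terminating semicolon but is still decoded.
lenient = html.unescape("1 &lt 2")

# "&linkurl" is not a known entity name, so the text comes back unchanged.
untouched = html.unescape("&linkurl=/Europe")

print(lenient)
print(untouched)
```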

Dec 27 '06 #2

John Nagle

Duncan Booth wrote:

John Nagle <na***@animats.com> wrote:

And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie"
value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">

>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a
missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.
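As a point of comparison, here is a small sketch using the standard library's `html.parser` (not BeautifulSoup) on a shortened, hypothetical version of the tag. The class name is invented for the example. The expectation is one start tag and no leftover text node.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records start tags and non-blank text nodes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_nodes = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text_nodes.append(data)


# Shortened stand-in for the tag in the original post.
snippet = '<param name="movie" value="a.swf?t=Click here >>>&linkurl=/Offer/2408" />'

collector = EventCollector()
collector.feed(snippet)
collector.close()
print(collector.start_tags)
print(collector.text_nodes)
```

If the parser had split the tag at the first ">", the remainder of the value and the trailing "/>" would show up in `text_nodes`.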

John Nagle

Dec 27 '06 #3

Duncan Booth

John Nagle <na***@animats.com> wrote:

It's worse than that. Look at the last line of BeautifulSoup
output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

The "/>" was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here

>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:

Similarly, authors should use "&gt;" (ASCII decimal 62) in text
instead of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.
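
An author who wants to follow that recommendation when generating markup could escape the value first. A minimal sketch with Python's `html.escape`; the attribute name and value here are invented for the example:

```python
import html

raw_value = "Click here >>>"

# quote=True also escapes " and ', which matters inside attribute values.
escaped = html.escape(raw_value, quote=True)
tag = f'<param name="label" value="{escaped}" />'
print(tag)

# Decoding gives back the original text.
round_trip = html.unescape(escaped)
```

After escaping, the only ">" left in the tag is the one that closes it, so even a parser that stops at the first ">" gets the right answer.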

Thank you, it looks like I just learned something new.

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.

Dec 27 '06 #4

Anne van Kesteren

Duncan Booth schreef:

The "/>" was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here

>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].
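
A quick way to check the XML rule is to hand both cases to a conforming parser. This is a sketch using Python's `xml.etree.ElementTree`; the tiny documents are made up for the example.

```python
import xml.etree.ElementTree as ET

# A bare ">" is accepted in both an attribute value and element text.
element = ET.fromstring('<a b="x > y">1 > 0</a>')
print(element.get("b"), "|", element.text)

# The sequence "]]>" in element text is the exception and is rejected.
try:
    ET.fromstring("<a>x ]]> y</a>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```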

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/....html#h-13.3.2

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Dec 28 '06 #5

Duncan Booth

"Anne van Kesteren" <an*************@gmail.com> wrote:

Mind you, the sentence before that says 'should' for quoting <
characters which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.

Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:

Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).

Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
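
Python's standard `html.parser` behaves the same way as those browsers, which makes for an easy check. A minimal sketch (the class name and sample markup are invented for the example):

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collects start-tag names and the text chunks between tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextAndTags()
parser.feed("<p>1 < 2</p>")
parser.close()

# The text may arrive in several chunks, so join them before comparing.
text = "".join(parser.chunks)
print(parser.tags, repr(text))
```

The "<" followed by a space comes through as ordinary text, and the only start tag reported is the p element.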

Dec 28 '06 #6
