BeautifulSoup bug when ">>>" found in attribute value

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.
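
For comparison, here is a minimal sketch that feeds the same tag to Python's standard-library `html.parser.HTMLParser` and records what it sees. This is not BeautifulSoup, and certainly not the 2006 version discussed here; the `Recorder` class and `parse` function are names made up for illustration.

```python
from html.parser import HTMLParser

# The tag from the report, with the ">>>" inside the quoted value.
SAMPLE = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer\n'
    'fantastic rates for selected weeks or days!!&blinkt=Click here\n'
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
)


class Recorder(HTMLParser):
    """Hypothetical helper: collects start tags and any stray text."""

    def __init__(self):
        super().__init__()
        self.tags = []  # (tag name, attribute dict) for each start tag
        self.text = []  # text chunks found outside tags

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    recorder = Recorder()
    recorder.feed(markup)
    recorder.close()
    return recorder


result = parse(SAMPLE)
print(result.tags)
print(result.text)
```

A parser that treats the quoted `>` as ordinary attribute text should report a single `param` entry whose `value` still contains `>>>`, with no leftover `&linkurl` fragment in `result.text`.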

John Nagle

Dec 26 '06 #1
John Nagle <na***@animats.com> wrote:
> And this came out, via prettify:
>
> <addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
> </param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
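
The two strategies described above can be sketched roughly as follows. Both function names are hypothetical, and neither is taken from BeautifulSoup or any browser; the tag is a shortened version of the one in the report.

```python
def naive_tag_end(markup, start=0):
    """Index of the first '>' after start, ignoring quotes entirely."""
    return markup.find(">", start)


def quote_aware_tag_end(markup, start=0):
    """Index of the first '>' that is not inside a quoted attribute value.

    Returns -1 if no such '>' exists.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(markup)):
        ch = markup[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


tag = '<param name="movie" value="Click here >>>&linkurl=/Offer/2408" />'
print(tag[: naive_tag_end(tag) + 1])        # cut short, inside the value
print(tag[: quote_aware_tag_end(tag) + 1])  # the whole tag
```

Mixing the two, reading the attribute value up to the closing quote but ending the tag at the first `>`, is what would leave the tail of the value behind as stray text.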

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.
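
As an illustration of lenient entity handling, Python's standard-library `html.unescape` follows the HTML5 rules, under which some legacy entity names are accepted without the semicolon. This is only a sketch of that general behaviour and says nothing about how BeautifulSoup itself decides where to add a semicolon.

```python
import html

# Legacy entity names are recognised even without the trailing semicolon.
print(html.unescape("&copy 2006"))

# "linkurl" is not an entity name, so the text passes through unchanged.
print(html.unescape("&linkurl=/Offer/2408"))
```

By this rule `&linkurl` in the original value is plain text rather than an unterminated entity, so appending a semicolon changes the data instead of repairing it.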

Dec 27 '06 #2
Duncan Booth wrote:
> John Nagle <na***@animats.com> wrote:
>
>> And this came out, via prettify:
>>
>> <addresssnippet siteurl="http%3A//apartmentsapart.com"
>> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
>> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>> </param>
>>
>> BeautifulSoup seems to have become confused by the ">>>" within
>> a quoted attribute value. It first parsed it right, but then stuck
>> in an extra, totally bogus line. Note the entity "&linkurl;", which
>> appears nowhere in the original. It looks like code to handle a
>> missing quote mark did the wrong thing.
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

John Nagle

Dec 27 '06 #3
John Nagle <na***@animats.com> wrote:
> It's worse than that. Look at the last line of BeautifulSoup
> output:
>
> &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> That "/>" doesn't match anything. We're outside a tag at that point.
> And it was introduced by BeautifulSoup. That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen. This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:

Similarly, authors should use "&gt;" (ASCII decimal 62) in text
instead of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.

Thank you, it looks like I just learned something new.
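
On the authoring side, following that "should" is cheap. A minimal sketch using the standard-library `html.escape` (the variable names are illustrative only):

```python
import html

raw_value = "Click here >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408"

# quote=True also escapes " and ', which matters inside attribute values.
escaped = html.escape(raw_value, quote=True)
tag = '<param name="movie" value="{}" />'.format(escaped)

print(tag)
```

With the value escaped this way, even a parser that stops at the first `>` sees the whole tag, and `html.unescape(escaped)` gives back the original string.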

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.
Dec 27 '06 #4
Duncan Booth wrote:
> The /> was in the original input that you gave it:
>
> <param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
> offer fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].
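
A quick way to check the XML rule, sketched with the standard-library `xml.etree.ElementTree` (backed by expat); `is_well_formed` is a made-up helper name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard-library XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed('<a href="x>y">1 > 0</a>'))  # bare ">" in attribute and text
print(is_well_formed("<a>x ]]> y</a>"))            # "]]>" in text content
print(is_well_formed("<a>x ]]&gt; y</a>"))         # same text with ">" escaped
```

The first and third documents should be accepted and the second rejected, since the sequence `]]>` is reserved for closing a CDATA section.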

> As I said before, it looks like BeautifulSoup decided that the tag ended
> at the first > although it took text beyond that up to the closing " as
> the value of the attribute. The remaining text was then simply treated
> as text content of the unclosed param tag. Finally it inserted a
> </param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/....html#h-13.3.2
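
A serializer can respect this by keeping a list of the elements that HTML 4.01 declares EMPTY. The sketch below is a hypothetical `render` function working on plain values, not BeautifulSoup's prettify:

```python
from html import escape

# Elements that HTML 4.01 declares EMPTY: start tag only, end tag forbidden.
HTML4_EMPTY_ELEMENTS = frozenset({
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
})


def render(tag, attrs, content=""):
    """Hypothetical serializer for a single element, given as plain values."""
    attr_text = "".join(
        ' {}="{}"'.format(name, escape(value, quote=True))
        for name, value in attrs.items()
    )
    if tag in HTML4_EMPTY_ELEMENTS:
        # No content and no closing tag for EMPTY elements.
        return "<{}{}>".format(tag, attr_text)
    return "<{}{}>{}</{}>".format(tag, attr_text, content, tag)


print(render("param", {"name": "movie", "value": "sw04.swf?t=Click here >>>"}))
```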

> Mind you, the sentence before that says 'should' for quoting < characters
> which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Dec 28 '06 #5
"Anne van Kesteren" <an*************@gmail.com> wrote:
>> Mind you, the sentence before that says 'should' for quoting <
>> characters which is just plain silly.
>
> For quoted attribute values it isn't silly at all. It's actually part
> of how HTML works.

Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:

Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).

Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
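
Python's standard-library `html.parser.HTMLParser` behaves the same way, which is easy to check. The sketch below uses made-up names (`TextAndTags`, `scan`) and joins the text chunks because the parser may deliver them in pieces:

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Hypothetical helper: records text chunks and start-tag names."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def scan(markup):
    parser = TextAndTags()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.tags


print(scan("<p>if a < b then</p>"))
```

The `<` followed by a space comes back as part of the text, and `p` is the only tag reported.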
Dec 28 '06 #6
