BeautifulSoup bug when ">>>" found in attribute value

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.

John Nagle

Dec 26 '06 #1
5 Replies


John Nagle <na***@animats.com> wrote:
> And this came out, via prettify:
>
> <addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
> </param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
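The two readings described above can be compared with a small sketch. This is not BeautifulSoup's code; the two regular expressions (`NAIVE_TAG`, `QUOTE_AWARE_TAG`) are hypothetical stand-ins for "end the tag at the first '>'" and "allow '>' inside a quoted value", applied to the markup from the original post joined onto one line.

```python
import re

SNIPPET = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer '
    'fantastic rates for selected weeks or days!!&blinkt=Click here '
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
)

# Naive rule (assumption for illustration): a tag runs from "<" to the first ">".
NAIVE_TAG = re.compile(r'<[^>]*>')

# Quote-aware rule: a ">" inside a quoted attribute value does not end the tag.
QUOTE_AWARE_TAG = re.compile(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''')

naive_tag = NAIVE_TAG.match(SNIPPET).group(0)
quote_aware_tag = QUOTE_AWARE_TAG.match(SNIPPET).group(0)

# Whatever follows the naive match would end up as text after the tag.
leftover = SNIPPET[len(naive_tag):]

print(repr(leftover))
print(quote_aware_tag == SNIPPET)
```

The leftover from the naive rule has the same shape as the stray line in the prettify output, apart from the added semicolon.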

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.
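As a rough illustration of lenient entity handling, the standard library's `html.unescape` can stand in for a browser-style decoder. This is an assumption for illustration only; it is not what BeautifulSoup does internally. It accepts well-known entities without the trailing semicolon, and leaves names such as `&linkurl` alone because they are not entities at all.

```python
import html

# Legacy entities such as &gt and &amp are recognised even without ";".
decoded_gt = html.unescape("Click here &gt&gt&gt")
decoded_amp = html.unescape("a &amp b")

# "&blinkt" and "&linkurl" are query-string parameters, not entity names,
# so the decoder leaves them untouched.
query = "binfot=x&blinkt=Click here&linkurl=/Europe/Spain/Madrid"
decoded_query = html.unescape(query)

print(decoded_gt)
print(decoded_amp)
print(decoded_query == query)
```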

Dec 27 '06 #2

Duncan Booth wrote:
> John Nagle <na***@animats.com> wrote:
>
>> And this came out, via prettify:
>>
>> <addresssnippet siteurl="http%3A//apartmentsapart.com"
>> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
>> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>> </param>
>>
>> BeautifulSoup seems to have become confused by the ">>>" within
>> a quoted attribute value. It first parsed it right, but then stuck
>> in an extra, totally bogus line. Note the entity "&linkurl;", which
>> appears nowhere in the original. It looks like code to handle a
>> missing quote mark did the wrong thing.
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.
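One way to check for this kind of leak is to log the events a parser reports and confirm that no text appears outside the tag. The sketch below uses Python's standard `html.parser.HTMLParser` rather than BeautifulSoup, so it only shows what a quote-aware parser should report for this input; the `EventLog` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

SNIPPET = (
    '<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer '
    'fantastic rates for selected weeks or days!!&blinkt=Click here '
    '>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />'
)


class EventLog(HTMLParser):
    """Record start tags and any non-whitespace text the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<param ... />" via the default handle_startendtag.
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data)


log = EventLog()
log.feed(SNIPPET)
log.close()

tag, attrs = log.start_tags[0]
print(tag, attrs["name"])
print(">>>&linkurl=" in attrs["value"])
print(log.text)  # any entry here would be text leaked from the attribute
```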

John Nagle

Dec 27 '06 #3

John Nagle <na***@animats.com> wrote:
> It's worse than that. Look at the last line of BeautifulSoup
> output:
>
> &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> That "/>" doesn't match anything. We're outside a tag at that point.
> And it was introduced by BeautifulSoup. That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen. This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:
Similarly, authors should use "&gt;" (ASCII decimal 62) in text
instead of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.
Thank you, it looks like I just learned something new.
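For anyone generating markup like this, the spec's advice can be followed by escaping the attribute value before writing it out. A minimal sketch, assuming the value from the original post and Python's standard `html.escape`; the `tag` template is only an illustration:

```python
import html

value = (
    "/images/offersBanners/sw04.swf?binfot=We offer fantastic rates "
    "for selected weeks or days!!&blinkt=Click here "
    ">>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408"
)

# quote=True also escapes quote characters, which matters inside an attribute.
escaped = html.escape(value, quote=True)
tag = '<param name="movie" value="{}" />'.format(escaped)

print(tag)
# Decoding the escaped form gives back the original value.
print(html.unescape(escaped) == value)
```

With the value escaped this way, even a parser that stops at the first '>' sees the whole tag.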

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.
Dec 27 '06 #4

Duncan Booth wrote:
> The /> was in the original input that you gave it:
>
> <param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
> offer fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].

> As I said before, it looks like BeautifulSoup decided that the tag ended
> at the first > although it took text beyond that up to the closing " as
> the value of the attribute. The remaining text was then simply treated
> as text content of the unclosed param tag. Finally it inserted a
> </param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/....html#h-13.3.2

> Mind you, the sentence before that says 'should' for quoting < characters
> which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Dec 28 '06 #5

"Anne van Kesteren" <an*************@gmail.com> wrote:
>> Mind you, the sentence before that says 'should' for quoting <
>> characters which is just plain silly.
>
> For quoted attribute values it isn't silly at all. It's actually part
> of how HTML works.

Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).
Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
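The same leniency can be seen outside browsers. As a hedged sketch, Python's standard `html.parser.HTMLParser` (used here only as a convenient stand-in, with a hypothetical `Collector` helper) treats "<" followed by a space as ordinary text, so the unescaped and escaped forms yield the same content:

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Collect start-tag names and all text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def parse(markup):
    collector = Collector()
    collector.feed(markup)
    collector.close()
    # Text may arrive in several chunks, so join before comparing.
    return collector.tags, "".join(collector.chunks)


unescaped = parse("<p>if a < b then</p>")
escaped = parse("<p>if a &lt; b then</p>")

print(unescaped)
print(unescaped == escaped)
```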
Dec 28 '06 #6
