BeautifulSoup bug when ">>>" found in attribute value

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.

John Nagle

Dec 26 '06 #1
John Nagle <na***@animats.com> wrote:
> And this came out, via prettify:
>
> <addresssnippet siteurl="http%3A//apartmentsapart.com"
> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
> <param name="movie"
> value="/images/offersBanners/sw04.swf?binfot=We offer
> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
> </param>
>
> BeautifulSoup seems to have become confused by the ">>>" within
> a quoted attribute value. It first parsed it right, but then stuck
> in an extra, totally bogus line. Note the entity "&linkurl;", which
> appears nowhere in the original. It looks like code to handle a
> missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
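
The two readings described above can be compared with a pair of regular expressions. This is only a sketch of the two rules, not how BeautifulSoup or any browser is implemented; the `TAG` string is a shortened stand-in for the real `<param>` tag, and both pattern names are made up for the illustration.

```python
import re

# Shortened stand-in for the <param> tag from the original post (illustrative only).
TAG = '<param name="movie" value="a.swf?blinkt=Click here >>>&linkurl=/Offer/2408" />'

# Naive rule: the tag ends at the first '>' wherever it appears.
NAIVE_TAG = re.compile(r'<[^>]*>')

# Quote-aware rule: a '>' inside a quoted attribute value does not end the tag.
QUOTE_AWARE_TAG = re.compile(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''')

naive = NAIVE_TAG.match(TAG).group(0)
quote_aware = QUOTE_AWARE_TAG.match(TAG).group(0)

print(naive)        # cut off inside the attribute value, at the first '>'
print(quote_aware)  # the whole tag, up to the closing '/>'
```

Under the naive rule everything after the first `>` is left over as text, which matches the stray line in the prettified output.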

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.
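
For comparison, Python's standard library applies the same kind of leniency when decoding entities. The sketch below uses `html.unescape`, which follows the HTML5 rules; it says nothing about what BeautifulSoup does internally, and the sample strings are invented.

```python
from html import unescape

# Well-known entities are decoded even when the terminating semicolon is missing.
decoded_arrows = unescape("Click here &gt&gt&gt")
decoded_amp = unescape("fish &amp chips")

# "linkurl" is not an entity name, so the text is left exactly as written.
untouched = unescape("&linkurl=/Europe/Spain")

print(decoded_arrows)
print(decoded_amp)
print(untouched)
```

The last case is the interesting one here: "&linkurl" only looks like an entity, so a decoder that follows the HTML5 rules leaves it alone, and a semicolon added after it does not turn it into a valid one.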

Dec 27 '06 #2
Duncan Booth wrote:
> John Nagle <na***@animats.com> wrote:
>
>> And this came out, via prettify:
>>
>> <addresssnippet siteurl="http%3A//apartmentsapart.com"
>> url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
>> <param name="movie"
>> value="/images/offersBanners/sw04.swf?binfot=We offer
>> fantastic rates for selected weeks or days!!&amp;blinkt=Click here
>> &gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
>> >>&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>> </param>
>>
>> BeautifulSoup seems to have become confused by the ">>>" within
>> a quoted attribute value. It first parsed it right, but then stuck
>> in an extra, totally bogus line. Note the entity "&linkurl;", which
>> appears nowhere in the original. It looks like code to handle a
>> missing quote mark did the wrong thing.
>
> I don't think I would quibble with what BeautifulSoup extracted from that
> mess. The input isn't valid HTML so any output has to be guessing at what
> was meant. A lot of code for parsing html would assume that there was a
> quote missing and the tag was terminated by the first '>'. IE and Firefox
> seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
> seems to have given you the best of both worlds: the attribute is parsed to
> the closing quote, but the tag itself ends at the first '>'.
>
> As for inserting a semicolon after linkurl, I think you'll find it is just
> being nice and cleaning up an unterminated entity. Browsers (or at least
> IE) will often accept entities without the terminating semicolon, so that's
> a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

John Nagle

Dec 27 '06 #3
John Nagle <na***@animats.com> wrote:
> It's worse than that. Look at the last line of BeautifulSoup
> output:
>
> &linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> That "/>" doesn't match anything. We're outside a tag at that point.
> And it was introduced by BeautifulSoup. That's both wrong and
> puzzling; given that this was created from a parse tree, that type
> of error shouldn't ever happen. This looks like the parser didn't
> delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here
>>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:
Similarly, authors should use "&gt;" (ASCII decimal 62) in text
instead of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.
Thank you, it looks like I just learned something new.
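
A parser that follows the rule quoted from the spec keeps the whole quoted value together and leaves no stray text behind. The sketch below uses Python's standard `html.parser` as one such parser; it is not BeautifulSoup, the `TagCollector` class is a hypothetical helper written for this illustration, and `SOURCE` is a shortened stand-in for the real tag.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text so the parse can be inspected."""

    def __init__(self):
        super().__init__()
        self.tags = []  # (tag name, dict of attributes) for each start tag
        self.text = []  # character data found outside tags

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


# Shortened stand-in for the <param> tag from the original post (illustrative only).
SOURCE = '<param name="movie" value="a.swf?blinkt=Click here >>>&linkurl=/Offer/2408" />'

collector = TagCollector()
collector.feed(SOURCE)
collector.close()

tag, attributes = collector.tags[0]
print(tag)
print(attributes["value"])  # the '>>>' stays inside the attribute value
print(collector.text)       # no leftover text after the tag
```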

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.
Dec 27 '06 #4
Duncan Booth wrote:
> The /> was in the original input that you gave it:
>
> <param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
> offer fantastic rates for selected weeks or days!!&blinkt=Click here
> >>>&linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408" />
>
> You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].
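
The XML rule can be checked with the standard library's `xml.etree.ElementTree`. This is a small sketch; the `offer` element and its `label` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# A bare '>' is well-formed both in an attribute value and in element text.
element = ET.fromstring('<offer label="Click here >>>">1 > 0</offer>')
print(element.get("label"), "|", element.text)

# The exception: the sequence "]]>" may not appear literally in character data.
try:
    ET.fromstring("<offer>a ]]> b</offer>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```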

> As I said before, it looks like BeautifulSoup decided that the tag ended
> at the first > although it took text beyond that up to the closing " as
> the value of the attribute. The remaining text was then simply treated
> as text content of the unclosed param tag. Finally it inserted a
> </param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/....html#h-13.3.2

> Mind you, the sentence before that says 'should' for quoting < characters
> which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Dec 28 '06 #5
"Anne van Kesteren" <an*************@gmail.com> wrote:
>> Mind you, the sentence before that says 'should' for quoting <
>> characters which is just plain silly.
>
> For quoted attribute values it isn't silly at all. It's actually part
> of how HTML works.

Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).
Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
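
The same tolerance can be seen outside browsers. As a rough check, and assuming Python's standard `html.parser` as the parser, a "<" followed by a space comes through as ordinary text; the `TextCollector` class and the sample markup are made up for this sketch.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag names and text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextCollector()
parser.feed("<p>rates < 100 euros</p>")
parser.close()

# The parser may deliver the text in several chunks, so join them before comparing.
text = "".join(parser.chunks)
print(parser.tags, text)
```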
Dec 28 '06 #6
