BeautifulSoup vs. loose & chars

John Nagle

I've been parsing existing HTML with BeautifulSoup, and occasionally
hit content which has something like "Design & Advertising", that is,
an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
to clean those up? There are various parsing options related to "&"
handling, but none of them seem to do quite the right thing.

If I write the BeautifulSoup parse tree back out with "prettify",
the loose "&" is still in there. So the output is
rejected by XML parsers. Which is why this is a problem.
I need valid XML out, even if what went in wasn't quite valid.

John Nagle

Dec 26 '06 #1
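
To make the problem concrete, here is a minimal sketch of how an XML parser reacts to a loose "&". It is written for Python 3 and uses only the standard library's xml.etree.ElementTree; the helper name is_well_formed is made up for this illustration and is not part of BeautifulSoup.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>Design & Advertising</p>"))      # loose ampersand
print(is_well_formed("<p>Design &amp; Advertising</p>"))  # escaped ampersand
```

The first fragment is rejected and the second is accepted, which is the behaviour described in the question.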


placid

John Nagle wrote:

> I've been parsing existing HTML with BeautifulSoup, and occasionally
> hit content which has something like "Design & Advertising", that is,
> an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
> to clean those up? There are various parsing options related to "&"
> handling, but none of them seem to do quite the right thing.
>
> If I write the BeautifulSoup parse tree back out with "prettify",
> the loose "&" is still in there. So the output is
> rejected by XML parsers. Which is why this is a problem.
> I need valid XML out, even if what went in wasn't quite valid.
>
> John Nagle

So do you want to remove "&" or replace them with "&amp;"? If you want
to replace them, try the following:

    import urllib, sys

    # Python 2 code, as posted; `url` must be set to the page address first.
    try:
        location = urllib.urlopen(url)
    except IOError, (errno, strerror):
        sys.exit("I/O error(%s): %s" % (errno, strerror))

    content = location.read()
    content = content.replace("&", "&amp;")

To do this with BeautifulSoup, I think you need to go through every
Tag, get its content, see if it contains an "&", and then replace the
Tag with the same Tag whose content has "&amp;" instead.

Hope this helps.
Cheers

Dec 26 '06 #2
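
One caveat with a blanket replace is that it also rewrites ampersands that already start an entity. A small sketch of that effect (Python 3; the function name naive_escape is hypothetical and only wraps the str.replace call from the post above):

```python
def naive_escape(text):
    # Blanket replacement: every '&' is rewritten, including the one
    # that begins an entity that was already written correctly.
    return text.replace("&", "&amp;")

print(naive_escape("Design & Advertising"))    # Design &amp; Advertising
print(naive_escape("This &amp; this & that"))  # This &amp;amp; this &amp; that
```

The second line shows an existing "&amp;" being escaped a second time, so the replacement needs to skip ampersands that are already part of an entity.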

Felipe Almeida Lessa

On 26 Dec 2006 04:22:38 -0800, placid <Bu****@gmail.com> wrote:

> So do you want to remove "&" or replace them with "&amp;" ? If you want
> to replace it try the following;

I think he wants to replace them, but just the invalid ones. I.e.,

    This &amp; this & that

would become

    This &amp; this &amp; that

No, I don't know how to do this efficiently. =/...
I think some kind of regex could do it.

--
Felipe.

Dec 26 '06 #3

Duncan Booth

"Felipe Almeida Lessa" <fe**********@gmail.com> wrote:

> On 26 Dec 2006 04:22:38 -0800, placid <Bu****@gmail.com> wrote:
>> So do you want to remove "&" or replace them with "&amp;" ? If you
>> want to replace it try the following;
>
> I think he wants to replace them, but just the invalid ones. I.e.,
>
> This &amp; this & that
>
> would become
>
> This &amp; this &amp; that
>
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.

Since he's asking for valid xml as output, it isn't sufficient just to
ignore entity definitions: HTML has a lot of named entities such as
&nbsp; but xml only has a very limited set of predefined named entities.
The safest technique is to convert them all to numeric escapes except
for the very limited set also guaranteed to be available in xml.

Try this:

    from cgi import escape
    import re
    from htmlentitydefs import name2codepoint

    # Work on a copy so the module's table is not modified, then add apos.
    name2codepoint = name2codepoint.copy()
    name2codepoint['apos'] = ord("'")

    # Group 1: decimal reference, group 2: hex reference, group 3: entity name.
    EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

    def decodeEntities(s, encoding='utf-8'):
        def unescape(match):
            code = match.group(1)
            if code:
                return unichr(int(code, 10))
            else:
                code = match.group(2)
                if code:
                    return unichr(int(code, 16))
                else:
                    # Raises KeyError for a name that is not in the table.
                    return unichr(name2codepoint[match.group(3)])
        return EntityPattern.sub(unescape, s)

    >>> escape(decodeEntities("This &amp; this & that &eacute;")).encode(
    ...     'ascii', 'xmlcharrefreplace')
    'This &amp; this &amp; that &#233;'

P.S. apos is handled specially as it isn't technically a
valid html entity (and Python doesn't include it in its entity
list), but it is an xml entity and recognised by many browsers so some
people might use it in html.

Dec 26 '06 #4
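
For anyone on Python 3, the same decode-then-escape idea can be sketched roughly as follows. This is an adaptation under stated assumptions, not the code from the post: html.escape and html.entities.name2codepoint stand in for cgi.escape and htmlentitydefs, chr replaces unichr, and the names ENTITY_CODEPOINTS, decode_entities and to_xml_safe are made up here. Two behaviours differ on purpose: entity names may contain digits (so "&frac12;" is recognised), and an unknown name is left alone instead of raising KeyError, which means its "&" ends up escaped.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy the table so the module-level dict is not mutated; add XML's apos.
ENTITY_CODEPOINTS = dict(name2codepoint)
ENTITY_CODEPOINTS["apos"] = ord("'")

# Group 1: decimal reference, group 2: hex reference, group 3: entity name.
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));"
)

def decode_entities(text):
    """Replace character references and known named entities with characters."""
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return chr(int(decimal, 10))
        if hexadecimal:
            return chr(int(hexadecimal, 16))
        if name in ENTITY_CODEPOINTS:
            return chr(ENTITY_CODEPOINTS[name])
        return match.group(0)  # unknown name: leave the text untouched
    return ENTITY_PATTERN.sub(unescape, text)

def to_xml_safe(text):
    """Decode entities, re-escape &, < and >, and emit ASCII bytes with
    numeric character references for everything else."""
    return escape(decode_entities(text), quote=False).encode(
        "ascii", "xmlcharrefreplace"
    )

print(to_xml_safe("This &amp; this & that &eacute;"))
```

Decoding first and escaping afterwards is what avoids the double-escaping problem: every "&" that survives decoding is a literal ampersand, so it is safe to escape all of them.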

John Nagle

Felipe Almeida Lessa wrote:

> On 26 Dec 2006 04:22:38 -0800, placid <Bu****@gmail.com> wrote:
>
>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>> to replace it try the following;
>
> I think he wants to replace them, but just the invalid ones. I.e.,
>
> This &amp; this & that
>
> would become
>
> This &amp; this &amp; that
>
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.

Yes, and the appropriate one is:

    krefindamp = re.compile(r'&(?!(\w|#)+;)')
    ...
    xmlsection = re.sub(krefindamp, '&amp;', xmlsection)

This will replace an '&' with '&amp;' if the '&' isn't
immediately followed by some combination of letters, numbers,
underscores, and '#' ending with a ';'. Admittedly this would let
something like '&xx#2;', which isn't a legal entity, through unmodified.

There's still a potential problem with unknown entities in the output XML, but
at least they're recognized as entities.

John Nagle

Dec 26 '06 #5
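
The negative-lookahead approach from the post above can be tried out as follows. This is a sketch for Python 3: LOOSE_AMP is the pattern as posted, while STRICT_LOOSE_AMP and the function escape_loose_ampersands are hypothetical additions that only accept "&name;", "&#123;" and "&#x7B;" shapes, so that '&xx#2;' no longer slips through.

```python
import re

# Pattern from the post: an '&' that is NOT followed by a run of
# word characters or '#' ending in ';'.
LOOSE_AMP = re.compile(r"&(?!(\w|#)+;)")

# Stricter variant (an assumption, not from the thread): only named,
# decimal and hexadecimal references count as already escaped.
STRICT_LOOSE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#[xX][0-9A-Fa-f]+);)"
)

def escape_loose_ampersands(text, pattern=STRICT_LOOSE_AMP):
    """Replace every '&' that does not start a reference with '&amp;'."""
    return pattern.sub("&amp;", text)

print(escape_loose_ampersands("This &amp; this & that"))
print(escape_loose_ampersands("&xx#2; and &#123;", LOOSE_AMP))
print(escape_loose_ampersands("&xx#2; and &#123;"))
```

Neither pattern checks whether a name such as "&nbsp;" is actually defined for the XML consumer, so the caveat about unknown entities still applies.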

Andreas Lysdal

Duncan Booth wrote:

> P.S. apos is handled specially as it isn't technically a
> valid html entity (and Python doesn't include it in its entity
> list), but it is an xml entity and recognised by many browsers so some
> people might use it in html.

Hey, I found this site:
http://www.htmlhelp.com/reference/ht...s/symbols.html

I hope that it's what you mean.

/Scripter47

Dec 26 '06 #6

Duncan Booth

Andreas Lysdal <an*****@riddergarn.dk> wrote:

>> P.S. apos is handled specially as it isn't technically a
>> valid html entity (and Python doesn't include it in its entity
>> list), but it is an xml entity and recognised by many browsers so some
>> people might use it in html.
>
> Hey, I found this site:
> http://www.htmlhelp.com/reference/ht...s/symbols.html
>
> I hope that it's what you mean.

Try
http://www.w3.org/TR/html4/sgml/entities.html#entities
for a more complete list.

Dec 26 '06 #7

Frederic Rentsch

John Nagle wrote:

> Felipe Almeida Lessa wrote:
>
>> On 26 Dec 2006 04:22:38 -0800, placid <Bu****@gmail.com> wrote:
>>
>>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>>> to replace it try the following;
>>
>> I think he wants to replace them, but just the invalid ones. I.e.,
>>
>> This &amp; this & that
>>
>> would become
>>
>> This &amp; this &amp; that
>>
>> No, i don't know how to do this efficiently. =/...
>> I think some kind of regex could do it.
>
> Yes, and the appropriate one is:
>
> krefindamp = re.compile(r'&(?!(\w|#)+;)')
> ...
> xmlsection = re.sub(krefindamp, '&amp;', xmlsection)
>
> This will replace an '&' with '&amp;' if the '&' isn't
> immediately followed by some combination of letters, numbers,
> and '#' ending with a ';' Admittedly this would let something
> like '&xx#2;', which isn't a legal entity, through unmodified.
>
> There's still a potential problem with unknown entities in the output XML, but
> at least they're recognized as entities.
>
> John Nagle

Here's another idea:

>>> s = '''<html>htm tag should not translate
> & should be &amp;
> &xx#2; isn't a legal entity and should translate
> &#123; is a legal entity and should not translate
</html>'''

>>> import SE  # http://cheeseshop.python.org/pypi/SE/2.3
>>> HTM_Escapes = SE.SE(definitions)  # See definitions below the dotted line
>>> print HTM_Escapes(s)
<html>htm tag should not translate
&gt; &amp; should be &amp;
&gt; &amp;xx#2; isn&quot;t a legal entity and should translate
&gt; &#123; is a legal entity and should not translate
</html>

Regards

Frederic
------------------------------------------------------------------------------
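definitions = '''

# Do            # Don't do
# " "=&nbsp;    &nbsp;==      #  32 20
(34)=&dquot;    &dquot;==     #  34 22
&=&amp;         &amp;==       #  38 26
'=&quot;        &quot;==      #  39 27
<=&lt;          &lt;==        #  60 3c
>=&gt;          &gt;==        #  62 3e
©=&copy;        &copy;==      # 169 a9
·=&middot;      &middot;==    # 183 b7
»=&raquo;       &raquo;==     # 187 bb
À=&Agrave;      &Agrave;==    # 192 c0
Á=&Aacute;      &Aacute;==    # 193 c1
Â=&Acirc;       &Acirc;==     # 194 c2
Ã=&Atilde;      &Atilde;==    # 195 c3
Ä=&Auml;        &Auml;==      # 196 c4
Å=&Aring;       &Aring;==     # 197 c5
Æ=&AElig;       &AElig;==     # 198 c6
Ç=&Ccedil;      &Ccedil;==    # 199 c7
È=&Egrave;      &Egrave;==    # 200 c8
É=&Eacute;      &Eacute;==    # 201 c9
Ê=&Ecirc;       &Ecirc;==     # 202 ca
Ë=&Euml;        &Euml;==      # 203 cb
Ì=&Igrave;      &Igrave;==    # 204 cc
Í=&Iacute;      &Iacute;==    # 205 cd
Î=&Icirc;       &Icirc;==     # 206 ce
Ï=&Iuml;        &Iuml;==      # 207 cf
Ð=&ETH;         &ETH;==       # 208 d0
Ñ=&Ntilde;      &Ntilde;==    # 209 d1
Ò=&Ograve;      &Ograve;==    # 210 d2
Ó=&Oacute;      &Oacute;==    # 211 d3
Ô=&Ocirc;       &Ocirc;==     # 212 d4
Õ=&Otilde;      &Otilde;==    # 213 d5
Ö=&Ouml;        &Ouml;==      # 214 d6
Ø=&Oslash;      &Oslash;==    # 216 d8
Ù=&Ugrave;      &Ugrave;==    # 217 d9
Ú=&Uacute;      &Uacute;==    # 218 da
Û=&Ucirc;       &Ucirc;==     # 219 db
Ü=&Uuml;        &Uuml;==      # 220 dc
Ý=&Yacute;      &Yacute;==    # 221 dd
Þ=&THORN;       &THORN;==     # 222 de
ß=&szlig;       &szlig;==     # 223 df
à=&agrave;      &agrave;==    # 224 e0
á=&aacute;      &aacute;==    # 225 e1
â=&acirc;       &acirc;==     # 226 e2
ã=&atilde;      &atilde;==    # 227 e3
ä=&auml;        &auml;==      # 228 e4
å=&aring;       &aring;==     # 229 e5
æ=&aelig;       &aelig;==     # 230 e6
ç=&ccedil;      &ccedil;==    # 231 e7
è=&egrave;      &egrave;==    # 232 e8
é=&eacute;      &eacute;==    # 233 e9
ê=&ecirc;       &ecirc;==     # 234 ea
ë=&euml;        &euml;==      # 235 eb
ì=&igrave;      &igrave;==    # 236 ec
í=&iacute;      &iacute;==    # 237 ed
î=&icirc;       &icirc;==     # 238 ee
ï=&iuml;        &iuml;==      # 239 ef
ð=&eth;         &eth;==       # 240 f0
ñ=&ntilde;      &ntilde;==    # 241 f1
ò=&ograve;      &ograve;==    # 242 f2
ó=&oacute;      &oacute;==    # 243 f3
ô=&ocirc;       &ocirc;==     # 244 f4
õ=&otilde;      &otilde;==    # 245 f5
ö=&ouml;        &ouml;==      # 246 f6
ø=&oslash;      &oslash;==    # 248 f8
ù=&ugrave;      &ugrave;==    # 249 f9
ú=&uacute;      &uacute;==    # 250 fa
û=&ucirc;       &ucirc;==     # 251 fb
ü=&uuml;        &uuml;==      # 252 fc
ý=&yacute;      &yacute;==    # 253 fd
þ=&thorn;       &thorn;==     # 254 fe
(xff)=&#255;                  # 255 ff
&#==                          # All numeric codes
"~<(.|\n)*?>~=="              # All HTM tags
'''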

If the ampersand is all you need to handle you can erase the others
in the first column. You need to keep the second column though, except
the last entry, because the tags don't need protection if '<' and
'>' in the first column are gone.
Definitions are easily edited and can be kept in text files.
The SE constructor accepts a file name instead of a definitions string:

>>> HTM_Escapes = SE.SE('definition_file_name')

-------------------------------------------------------------------

Dec 26 '06 #8
