URLs and ampersands

Steven D'Aprano

I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
snag with URLs containing ampersands:

http://www.example.com/parrot.php?x=1&y=2

Somewhere in the process, urls like the above are escaped to:

http://www.example.com/parrot.php?x=1&amp;y=2

which naturally fails to exist.

I could just do a string replace, but is there a "right" way to escape
and unescape URLs? I've looked through the standard lib, but I can't find
anything helpful.
--
Steven

Aug 4 '08 #1


Gabriel Genellina

On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
<st***@REMOVE-THIS-cybersource.com.au> wrote:

> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
> snag with URLs containing ampersands:
>
> http://www.example.com/parrot.php?x=1&y=2
>
> Somewhere in the process, urls like the above are escaped to:
>
> http://www.example.com/parrot.php?x=1&amp;y=2
>
> which naturally fails to exist.
>
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

This works fine for me:

py> import urllib
py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
py> open(fn,"rb").read()
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...

So it's not urlretrieve escaping the url, but something else in your
code...

--
Gabriel Genellina

Aug 5 '08 #2

Steven D'Aprano

On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:

> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
> <st***@REMOVE-THIS-cybersource.com.au> wrote:
>
>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>> snag with URLs containing ampersands:
>>
>> http://www.example.com/parrot.php?x=1&y=2
>>
>> Somewhere in the process, urls like the above are escaped to:
>>
>> http://www.example.com/parrot.php?x=1&amp;y=2
>>
>> which naturally fails to exist.
>>
>> I could just do a string replace, but is there a "right" way to escape
>> and unescape URLs? I've looked through the standard lib, but I can't
>> find anything helpful.
>
> This works fine for me:
>
> py> import urllib
> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
> py> open(fn,"rb").read()
> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>
> So it's not urlretrieve escaping the url, but something else in your
> code...

I didn't say urlretrieve was escaping the URL. I actually think the
URLs are pre-escaped when I scrape them from a HTML file. I have searched
for, but been unable to find, standard library functions that escape or
unescape URLs. Are there any such functions?

--
Steven

Aug 5 '08 #3

Wojtek Walczak

On 05 Aug 2008 09:59:20 GMT, Steven D'Aprano wrote:

> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from a HTML file. I have searched
> for, but been unable to find, standard library functions that escape or
> unescape URLs. Are there any such functions?

$ cd /usr/lib/python2.5/
$ grep "\&amp\;" *.py
BaseHTTPServer.py: return html.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
cgi.py: s = s.replace("&", "&amp;") # Must be done first!
cgitb.py: doc = doc.replace('&', '&amp;').replace('<', '&lt;')
difflib.py: text=text.replace("&","&amp;").replace(">","&gt;").replace("<","&lt;")
HTMLParser.py: s = s.replace("&amp;", "&") # Must be last
pydoc.py: return replace(text, '&', '&amp;', '<', '&lt;', '>', '&gt;')
xmlrpclib.py: s = replace(s, "&", "&amp;")

So it could be BaseHTTPServer, cgi, cgitb, difflib, HTMLParser,
pydoc or xmlrpclib. Do you use any of these? Or maybe some other
external module?

--
Regards,
Wojtek Walczak,
http://www.stud.umk.pl/~wojtekwa/

Aug 5 '08 #4

Richard Brodie

"Steven D'Aprano" <st***@REMOVE-THIS-cybersource.com.au> wrote in message
news:00**********************@news.astraweb.com...

> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs?

The right way is to parse your HTML with an HTML parser. URLs are not
exempt from the normal HTML escaping rules, although there are an awful lot
of pages that get this wrong.

You didn't post any code, so it's hard to tell, but maybe something like
ElementTree or lxml would be a better tool than the ones you are currently using.

Aug 5 '08 #5

Duncan Booth

Steven D'Aprano <st****@REMOVE.THIS.cybersource.com.au> wrote:

> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from a HTML file. I have
> searched for, but been unable to find, standard library functions that
> escape or unescape URLs. Are there any such functions?

Whenever you put a URL into an HTML file you need to escape it, so
naturally you will also need to unescape it when it is retrieved from the
file. However, whatever you use to parse the HTML ought to be unescaping
text and attributes as part of the parsing process, so you shouldn't need a
separate function for this.

e.g.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>''')
>>> soup.contents[0]['href']
u'http://www.example.com/parrot.php?x=1&y=2'
>>>

Even Python's builtin HTMLParser class will do this for you. What parser
are you using?
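
As an illustration of the point about the built-in parser, here is a minimal sketch, assuming a modern Python 3 where the class lives in html.parser rather than the Python 2 HTMLParser module discussed in this thread. LinkCollector and extract_links are hypothetical names made up for this example. The href value handed to handle_starttag has already been unescaped by the parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # unescaped character references in each value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


page = '<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>'
links = extract_links(page)
print(links)
```

The collected URL can then be passed to the download function as-is, with no separate unescaping step.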

--
Duncan Booth http://kupuguy.blogspot.com

Aug 5 '08 #6

Gabriel Genellina

On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <st****@remove.this.cybersource.com.au> wrote:

> On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
>
>> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
>> <st***@REMOVE-THIS-cybersource.com.au> wrote:
>>
>>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>>> snag with URLs containing ampersands:
>>>
>>> http://www.example.com/parrot.php?x=1&y=2
>>>
>>> Somewhere in the process, urls like the above are escaped to:
>>>
>>> http://www.example.com/parrot.php?x=1&amp;y=2
>>>
>>> which naturally fails to exist.
>>>
>>> I could just do a string replace, but is there a "right" way to escape
>>> and unescape URLs? I've looked through the standard lib, but I can't
>>> find anything helpful.
>>
>> This works fine for me:
>>
>> py> import urllib
>> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
>> py> open(fn,"rb").read()
>> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>>
>> So it's not urlretrieve escaping the url, but something else in your
>> code...
>
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from a HTML file.

(Ok, you didn't even mention you were scraping HTML pages...)

> I have searched
> for, but been unable to find, standard library functions that escape or
> unescape URLs. Are there any such functions?

Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
How are you scraping the HTML source? Both BeautifulSoup and ElementTree.HTMLTreeBuilder already do that work for you.

--
Gabriel Genellina

Aug 5 '08 #7

Matthew Woodcraft

Steven D'Aprano wrote:

> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
> snag with URLs containing ampersands:
>
> http://www.example.com/parrot.php?x=1&y=2
>
> Somewhere in the process, urls like the above are escaped to:
>
> http://www.example.com/parrot.php?x=1&amp;y=2
>
> which naturally fails to exist.
>
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

I don't believe there is a concept of 'escaping a URL' as such. How you
escape or unescape a URL depends on what context you're embedding it in
or extracting it from.

In this case, it looks like you have URLs which have been escaped to go
into an html CDATA attribute value (such as <a href="...">).

I believe there is no documented function in the Python standard library
which reverses this escaping (short of putting your string into a
larger document and parsing that with a full html or xml parser).
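
To make the "it depends on the context" point concrete, here is a rough sketch of the two separate layers, assuming a modern Python 3 (urllib.parse and the html module, neither of which existed under those names in the Python 2.5 discussed here). The variable names are made up for illustration. The query string gets URL encoding, and the finished URL gets HTML escaping only because it is being placed in an attribute value.

```python
from html import escape, unescape
from urllib.parse import parse_qs, urlencode, urlsplit

base = "http://www.example.com/parrot.php"

# Layer 1: URL encoding of the query parameters (percent-encoding rules).
url = base + "?" + urlencode({"x": "1", "y": "2"})

# Layer 2: HTML escaping, needed only because the URL goes into an attribute.
attr = escape(url, quote=True)
tag = '<a href="%s">link</a>' % attr

# Reversing layer 2 gives back the URL that should actually be requested.
recovered = unescape(attr)
params = parse_qs(urlsplit(recovered).query)

print(url)
print(tag)
print(params)
```

Only the HTML layer should be undone when a URL is pulled out of a page; the percent-encoding belongs to the URL itself.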

-M-

Aug 5 '08 #8

Matthew Woodcraft

Gabriel Genellina wrote:

> Steven D'Aprano wrote:
>
>> I have searched for, but been unable to find, standard library
>> functions that escape or unescape URLs. Are there any such
>> functions?
>
> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.

I don't see a cgi.unescape in the standard library.

I don't think xml.sax.saxutils.unescape will be suitable for Steven's
purpose, because it doesn't process numeric character references (which
are both legal and seen in the wild in /href/ attributes).
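
The difference can be seen in a short sketch, assuming a modern Python 3. The comparison with html.unescape is an addition for illustration, since that function did not exist in the standard library at the time of this thread. The sample strings are made up.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import unescape as sax_unescape

named = "parrot.php?x=1&amp;y=2"
numeric = "parrot.php?x=1&#38;y=2"       # decimal character reference
hexadecimal = "parrot.php?x=1&#x26;y=2"  # hexadecimal character reference

# saxutils.unescape only knows &amp;, &lt; and &gt; (plus any extra
# entities passed in explicitly), so numeric references pass through.
sax_results = [sax_unescape(s) for s in (named, numeric, hexadecimal)]

# html.unescape handles named, decimal and hexadecimal references.
html_results = [html_unescape(s) for s in (named, numeric, hexadecimal)]

print(sax_results)
print(html_results)
```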

-M-

Aug 5 '08 #9

Steven D'Aprano

On Tue, 05 Aug 2008 12:07:39 +0000, Duncan Booth wrote:

> Whenever you put a URL into an HTML file you need to escape it, so
> naturally you will also need to unescape it when it is retrieved from
> the file. However, whatever you use to parse the HTML ought to be
> unescaping text and attributes as part of the parsing process, so you
> shouldn't need a separate function for this.
>
> ....
>
> Even Python's builtin HTMLParser class will do this for you. What parser
> are you using?

A regex.

I know, I know, now I have two problems :-)

It's a quick and dirty hack, not a production piece of code, and I have a
quick and dirty fix by just using url.replace('&amp;', '&').

Thanks to everybody who replied. I guess I really have to bite the bullet
and learn how to use a proper HTML parser.

--
Steven

Aug 5 '08 #10

Paul Rubin

Steven D'Aprano <st***@REMOVE-THIS-cybersource.com.au> writes:

> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

xml.sax.saxutils.unescape()

Aug 6 '08 #11

Duncan Booth

Matthew Woodcraft <ma******@chiark.greenend.org.uk> wrote:

> Gabriel Genellina wrote:
>> Steven D'Aprano wrote:
>>
>>> I have searched for, but been unable to find, standard library
>>> functions that escape or unescape URLs. Are there any such
>>> functions?
>>
>> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
>
> I don't see a cgi.unescape in the standard library.
>
> I don't think xml.sax.saxutils.unescape will be suitable for Steven's
> purpose, because it doesn't process numeric character references (which
> are both legal and seen in the wild in /href/ attributes).

Here's the code I use. It handles decimal and hex entity references as well
as all html named entities.

import re
from htmlentitydefs import name2codepoint
name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')
def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)

--
Duncan Booth http://kupuguy.blogspot.com

Aug 6 '08 #12
