URLs and ampersands

I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
snag with URLs containing ampersands:

http://www.example.com/parrot.php?x=1&y=2

Somewhere in the process, urls like the above are escaped to:

http://www.example.com/parrot.php?x=1&amp;y=2

which naturally fails to exist.

I could just do a string replace, but is there a "right" way to escape
and unescape URLs? I've looked through the standard lib, but I can't find
anything helpful.
--
Steven
Aug 4 '08 #1
On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
<st***@REMOVE-THIS-cybersource.com.au> wrote:
> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
> snag with URLs containing ampersands:
>
> http://www.example.com/parrot.php?x=1&y=2
>
> Somewhere in the process, urls like the above are escaped to:
>
> http://www.example.com/parrot.php?x=1&amp;y=2
>
> which naturally fails to exist.
>
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

This works fine for me:

py> import urllib
py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
py> open(fn,"rb").read()
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...

So it's not urlretrieve escaping the url, but something else in your
code...

--
Gabriel Genellina

Aug 5 '08 #2
On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
> <st***@REMOVE-THIS-cybersource.com.au> wrote:
>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>> snag with URLs containing ampersands:
>>
>> http://www.example.com/parrot.php?x=1&y=2
>>
>> Somewhere in the process, urls like the above are escaped to:
>>
>> http://www.example.com/parrot.php?x=1&amp;y=2
>>
>> which naturally fails to exist.
>>
>> I could just do a string replace, but is there a "right" way to escape
>> and unescape URLs? I've looked through the standard lib, but I can't
>> find anything helpful.
>
> This works fine for me:
>
> py> import urllib
> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
> py> open(fn,"rb").read()
> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>
> So it's not urlretrieve escaping the url, but something else in your
> code...

I didn't say urlretrieve was escaping the URL. I actually think the
URLs are pre-escaped when I scrape them from an HTML file. I have searched
for, but been unable to find, standard library functions that escape or
unescape URLs. Are there any such functions?

--
Steven
Aug 5 '08 #3
On 05 Aug 2008 09:59:20 GMT, Steven D'Aprano wrote:
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file. I have searched
> for, but been unable to find, standard library functions that escape or
> unescape URLs. Are there any such functions?

$ cd /usr/lib/python2.5/
$ grep "\&amp\;" *.py
BaseHTTPServer.py: return html.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
cgi.py: s = s.replace("&", "&amp;") # Must be done first!
cgitb.py: doc = doc.replace('&', '&amp;').replace('<', '&lt;')
difflib.py: text=text.replace("&","&amp;").replace(">","&gt;").replace("<","&lt;")
HTMLParser.py: s = s.replace("&amp;", "&") # Must be last
pydoc.py: return replace(text, '&', '&amp;', '<', '&lt;', '>', '&gt;')
xmlrpclib.py: s = replace(s, "&", "&amp;")

So it could be BaseHTTPServer, cgi, cgitb, difflib, HTMLParser,
pydoc or xmlrpclib. Do you use any of these? Or maybe some other
external module?

--
Regards,
Wojtek Walczak,
http://www.stud.umk.pl/~wojtekwa/
Aug 5 '08 #4

"Steven D'Aprano" <st***@REMOVE-THIS-cybersource.com.au> wrote in message
news:00**********************@news.astraweb.com...
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs?

The right way is to parse your HTML with an HTML parser. URLs are not
exempt from the normal HTML escaping rules, although there are an awful lot
of pages that get this wrong.

You didn't post any code, so it's hard to tell, but maybe something like
ElementTree or lxml would be a better tool than the ones you are currently
using.
Aug 5 '08 #5
Steven D'Aprano <st****@REMOVE.THIS.cybersource.com.au> wrote:
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file. I have
> searched for, but been unable to find, standard library functions that
> escape or unescape URLs. Are there any such functions?

Whenever you put a URL into an HTML file you need to escape it, so
naturally you will also need to unescape it when it is retrieved from the
file. However, whatever you use to parse the HTML ought to be unescaping
text and attributes as part of the parsing process, so you shouldn't need a
separate function for this.

e.g.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>''')
>>> soup.contents[0]['href']
u'http://www.example.com/parrot.php?x=1&y=2'

Even Python's builtin HTMLParser class will do this for you. What parser
are you using?
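
For the standard-library route, a minimal sketch follows. It assumes Python 3, where the HTMLParser module mentioned above is spelled html.parser; the class name LinkCollector is made up for illustration. The parser hands attribute values to handle_starttag with character references already replaced, so no separate unescaping step is needed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper, not from the thread)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # turned references such as &amp; back into plain characters.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>')
print(collector.links)
```

The collected link contains a bare "&" and can be passed straight to a download function.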

--
Duncan Booth http://kupuguy.blogspot.com
Aug 5 '08 #6
On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <st****@remove.this.cybersource.com.au> wrote:
> On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
>> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
>> <st***@REMOVE-THIS-cybersource.com.au> wrote:
>>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>>> snag with URLs containing ampersands:
>>>
>>> http://www.example.com/parrot.php?x=1&y=2
>>>
>>> Somewhere in the process, urls like the above are escaped to:
>>>
>>> http://www.example.com/parrot.php?x=1&amp;y=2
>>>
>>> which naturally fails to exist.
>>>
>>> I could just do a string replace, but is there a "right" way to escape
>>> and unescape URLs? I've looked through the standard lib, but I can't
>>> find anything helpful.
>>
>> This works fine for me:
>>
>> py> import urllib
>> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
>> py> open(fn,"rb").read()
>> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>>
>> So it's not urlretrieve escaping the url, but something else in your
>> code...
>
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file.

(OK, you didn't even mention you were scraping HTML pages...)

> I have searched
> for, but been unable to find, standard library functions that escape or
> unescape URLs. Are there any such functions?

Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
How are you scraping the HTML source? Both BeautifulSoup and ElementTree.HTMLTreeBuilder already do that work for you.

--
Gabriel Genellina

Aug 5 '08 #7
Steven D'Aprano wrote:
> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
> snag with URLs containing ampersands:
>
> http://www.example.com/parrot.php?x=1&y=2
>
> Somewhere in the process, urls like the above are escaped to:
>
> http://www.example.com/parrot.php?x=1&amp;y=2
>
> which naturally fails to exist.
>
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

I don't believe there is a concept of 'escaping a URL' as such. How you
escape or unescape a URL depends on what context you're embedding it in
or extracting it from.
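
To make the context dependence concrete, here is a minimal sketch. It assumes Python 3, so the module names html and urllib.parse are not the ones available when this thread was written, and the value "a&b" is invented for illustration. An ampersand that belongs to a parameter value is percent-encoded at the URL level, and the finished URL is then escaped a second time when it is written into an href attribute.

```python
from html import escape
from urllib.parse import urlencode

# Layer 1: URL syntax. "&" separates parameters, so an ampersand that is
# part of a value has to be percent-encoded.
query = urlencode({"x": "a&b", "y": 2})
url = "http://www.example.com/parrot.php?" + query

# Layer 2: HTML syntax. Inside an attribute value, each remaining "&"
# is written as "&amp;".
href = escape(url, quote=True)

print(url)   # the form a download function expects
print(href)  # the form that belongs inside <a href="...">
```

Reading a URL back out of HTML means undoing only the second layer; the percent-encoding stays in place.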

In this case, it looks like you have URLs which have been escaped to go
into an html CDATA attribute value (such as <a href="...">).

I believe there is no documented function in the Python standard library
which reverses this escaping (short of putting your string into a
larger document and parsing that with a full html or xml parser).
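
That was the situation in the Python 2.5 standard library. Python 3.4 and later do document such a function, html.unescape, so a sketch for readers on a current interpreter (an assumption; it does not exist in the versions discussed in this thread) could look like this:

```python
from html import unescape

scraped = "http://www.example.com/parrot.php?x=1&amp;y=2"
fixed = unescape(scraped)
print(fixed)

# Numeric character references, decimal and hexadecimal, are handled too.
decimal_ref = unescape("x=1&#38;y=2")
hex_ref = unescape("x=1&#x26;y=2")
print(decimal_ref, hex_ref)
```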

-M-

Aug 5 '08 #8
Gabriel Genellina wrote:
> Steven D'Aprano wrote:
>> I have searched for, but been unable to find, standard library
>> functions that escapes or unescapes URLs. Are there any such
>> functions?
> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.

I don't see a cgi.unescape in the standard library.

I don't think xml.sax.saxutils.unescape will be suitable for Steven's
purpose, because it doesn't process numeric character references (which
are both legal and seen in the wild in /href/ attributes).
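
A short sketch of that limitation, assuming Python 3 (the function has the same name and location there). The sample strings are made up for illustration.

```python
from xml.sax.saxutils import unescape

# The named reference is converted...
named = unescape("parrot.php?x=1&amp;y=2")

# ...but a numeric reference for the same character is left as it is.
numeric = unescape("parrot.php?x=1&#38;y=2")

print(named)
print(numeric)
```

By default the function only knows &amp;, &lt; and &gt;; further names can be supplied through its optional entities dictionary, but numeric references are not interpreted.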

-M-
Aug 5 '08 #9
On Tue, 05 Aug 2008 12:07:39 +0000, Duncan Booth wrote:
> Whenever you put a URL into an HTML file you need to escape it, so
> naturally you will also need to unescape it when it is retrieved from
> the file. However, whatever you use to parse the HTML ought to be
> unescaping text and attributes as part of the parsing process, so you
> shouldn't need a separate function for this.
> ....
> Even Python's builtin HTMLParser class will do this for you. What parser
> are you using?

A regex.

I know, I know, now I have two problems :-)

It's a quick and dirty hack, not a production piece of code, and I have a
quick and dirty fix by just using url.replace('&amp;', '&').

Thanks to everybody who replied. I guess I really have to bite the bullet
and learn how to use a proper HTML parser.

--
Steven
Aug 5 '08 #10
Steven D'Aprano <st***@REMOVE-THIS-cybersource.com.au> writes:
> I could just do a string replace, but is there a "right" way to escape
> and unescape URLs? I've looked through the standard lib, but I can't find
> anything helpful.

xml.sax.saxutils.unescape()
Aug 6 '08 #11
Matthew Woodcraft <ma******@chiark.greenend.org.uk> wrote:
> Gabriel Genellina wrote:
>> Steven D'Aprano wrote:
>>> I have searched for, but been unable to find, standard library
>>> functions that escapes or unescapes URLs. Are there any such
>>> functions?
>> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
>
> I don't see a cgi.unescape in the standard library.
>
> I don't think xml.sax.saxutils.unescape will be suitable for Steven's
> purpose, because it doesn't process numeric character references (which
> are both legal and seen in the wild in /href/ attributes).

Here's the code I use. It handles decimal and hex entity references as well
as all html named entities.

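import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module-level table is not modified.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")  # &apos; is missing from the HTML table

# Group 1: decimal reference, group 2: hex reference, group 3: entity name.
# Names may contain digits (e.g. &sup2;), so allow them after the first letter.
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z\d]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown entity name: leave the text untouched.
        return match.group(0)

    # Byte strings are decoded first so the result is always unicode.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)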
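
The code above is Python 2 (htmlentitydefs, unichr). A rough Python 3 rendering of the same approach might look like the sketch below. The names ENTITY_PATTERN and decode_entities are made up for this example, the name branch of the pattern accepts digits so that references such as &sup2; match, and the byte-string decoding step is left out because Python 3 strings are already Unicode.

```python
import re
from html.entities import name2codepoint

# Group 1: decimal reference, group 2: hex reference, group 3: entity name.
ENTITY_PATTERN = re.compile(r"&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));")


def decode_entities(text):
    """Replace character references in text; unknown names are left alone."""

    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return chr(int(decimal, 10))
        if hexadecimal:
            return chr(int(hexadecimal, 16))
        if name == "apos":
            # Not in name2codepoint, so handled explicitly.
            return "'"
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)

    return ENTITY_PATTERN.sub(replace, text)


sample = decode_entities("x=1&amp;y=2&#38;z=3&#x26;w=4&bogus;")
print(sample)
```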

--
Duncan Booth http://kupuguy.blogspot.com
Aug 6 '08 #12
