Unescaping URLs in Python

Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

http://www.htmlhelp.com/tools/valida...blems.html#amp

What's the appropriate Python function to call to unescape a URL which might
contain things like that? Will this interfere with the usual "%" type escapes
in URLs?

What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.

John Nagle
Dec 25 '06 #1
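
The conversion asked for above (HTML-escaped form to URL-escaped form) can be sketched roughly as follows. This assumes Python 3, whose `html.unescape` and `urllib.parse.quote` postdate this thread. The helper name `html_attr_to_url` and the set of characters treated as safe are illustrative choices, not a standard library function.

```python
import html
from urllib.parse import quote

# Characters left untouched: RFC 3986 reserved characters, plus "%" so that
# percent-escapes already present in the URL are not encoded a second time.
# (Letters, digits and "_.-~" are never encoded by quote().)
_SAFE = "%/:?#[]@!$&'()*+,;="


def html_attr_to_url(attr_value):
    """Hypothetical helper: HTML-escaped attribute value -> percent-escaped URL."""
    # Step 1: undo HTML character references (&amp; becomes &, and so on).
    url = html.unescape(attr_value)
    # Step 2: percent-encode whatever is still not legal in a URL,
    # such as spaces or non-ASCII characters.
    return quote(url, safe=_SAFE)


print(html_attr_to_url("/adsk/servlet/index?siteID=123112&amp;id=1860142"))
# /adsk/servlet/index?siteID=123112&id=1860142
```

Keeping "%" in the safe set is a trade-off: existing escapes survive, but a literal "%" that was never meant as an escape is passed through unencoded.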
In message <hW******************@newssvr21.news.prodigy.net>, John Nagle
wrote:
> Here's a URL from a link on the home page of a major company.
>
> <a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>
> What's the appropriate Python function to call to unescape a URL
> which might contain things like that?

Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

> Will this interfere with the usual "%" type escapes in URLs?

No. Just think of it as an HTML attribute value; the fact that it's a URL is
a question of later interpretation, nothing to do with the fact that it
comes from an HTML attribute.

Dec 25 '06 #2
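
A minimal sketch of the HTMLParser suggestion, assuming Python 3, where the class lives in `html.parser` (in 2006 it was the `HTMLParser` module). The class name `LinkCollector` is made up for illustration. The parser hands attribute values to `handle_starttag` with character references already decoded, and it leaves "%" escapes alone.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Illustrative subclass that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive with
        # HTML character references such as &amp; already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


parser = LinkCollector()
parser.feed('<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>')
parser.close()
print(parser.hrefs)
# ['/adsk/servlet/index?siteID=123112&id=1860142']
```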
Lawrence D'Oliveiro wrote:
> In message <hW******************@newssvr21.news.prodigy.net>, John Nagle
> wrote:
>
>> Here's a URL from a link on the home page of a major company.
>>
>> <a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>>
>> What's the appropriate Python function to call to unescape a URL
>> which might contain things like that?
>
> Just use any HTML-parsing library. I think the standard Python HTMLParser
> will do the trick, provided there aren't any errors in the HTML.

I'm using BeautifulSoup, because I need to process real-world
HTML. At least by default, it doesn't unescape URLs like that.

Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.

John Nagle
Dec 25 '06 #3
John Nagle wrote:
> What's the appropriate Python function to call to unescape a URL which
> might contain things like that?

xml.sax.saxutils.unescape()

> Will this interfere with the usual "%"
> type escapes in URLs?

Nope, and urllib.unquote() can be used to translate URL escapes manually.

Jeffrey
Dec 25 '06 #4
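
A short sketch of the two calls named in the last reply. In Python 3, `urllib.unquote()` is spelled `urllib.parse.unquote()`, and that spelling is used here. The sample `href` string is made up for illustration. Note that `xml.sax.saxutils.unescape()` decodes only `&amp;`, `&lt;` and `&gt;` unless extra entities are passed in.

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

# Hypothetical attribute value with both an HTML escape and a URL escape.
href = "/search?q=a%20b&amp;lang=en"

# HTML-level unescaping leaves the percent-escape untouched.
url = unescape(href)
print(url)           # /search?q=a%20b&lang=en

# Percent-escapes are decoded only when asked for explicitly.
print(unquote(url))  # /search?q=a b&lang=en

# Entities beyond &amp;, &lt; and &gt; have to be supplied as a mapping.
print(unescape("&quot;x&quot;", {"&quot;": '"'}))  # "x"
```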
