
Unescaping URLs in Python

Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

http://www.htmlhelp.com/tools/valida...blems.html#amp

What's the appropriate Python function to call to unescape a URL which might
contain things like that? Will this interfere with the usual "%" type escapes
in URLs?

What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.

John Nagle
Dec 25 '06 #1
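
For readers on current Python, the conversion asked for above (HTML-escaped form to URL-escaped form) can be sketched with the standard library. This is only a minimal sketch: the helper name `html_attr_to_url` and the chosen set of "safe" characters are assumptions made for illustration, not something from the original post.

```python
from html import unescape
from urllib.parse import quote

# Characters left untouched when re-quoting: the URL delimiters plus "%",
# so percent-escapes already present are not encoded a second time.
# (This particular set is an assumption of the sketch.)
_SAFE = "/:?#[]@!$&'()*+,;=%"


def html_attr_to_url(attr_value):
    """Hypothetical helper: HTML attribute text -> URL-escaped string."""
    # Step 1: undo HTML escaping ("&amp;" -> "&", "&#38;" -> "&", ...).
    decoded = unescape(attr_value)
    # Step 2: percent-encode anything that is not legal in a URL as-is.
    return quote(decoded, safe=_SAFE)


href = "/adsk/servlet/index?siteID=123112&amp;id=1860142"
print(html_attr_to_url(href))
```

Keeping "%" in the safe set means an input such as "/a%20b" passes through unchanged, which is why the result stays in URL-escaped form instead of being fully unescaped.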
3 Replies


In message <hW******************@newssvr21.news.prodigy.net>, John Nagle
wrote:

> Here's a URL from a link on the home page of a major company.
>
> <a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>
> What's the appropriate Python function to call to unescape a URL
> which might contain things like that?

Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

> Will this interfere with the usual "%" type escapes in URLs?

No. Just think of it as an HTML attribute value; the fact that it's a URL is
a question of later interpretation, nothing to do with the fact that it
comes from an HTML attribute.

Dec 25 '06 #2
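
As a rough illustration of the HTMLParser suggestion in the reply above, here is a small sketch using the Python 3 module `html.parser`. The class name `LinkCollector` is made up for this example. The parser hands attribute values to `handle_starttag` with character references already replaced, and it leaves "%" escapes alone.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # decoded entities such as "&amp;" in each value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(
    '<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>'
)
collector.close()
print(collector.links)
```

The collected value is the plain attribute text, so any later URL handling (joining, percent-decoding) is a separate step.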

Lawrence D'Oliveiro wrote:
> In message <hW******************@newssvr21.news.prodigy.net>, John Nagle
> wrote:
>
>> Here's a URL from a link on the home page of a major company.
>>
>> <a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>
>>
>> What's the appropriate Python function to call to unescape a URL
>> which might contain things like that?
>
> Just use any HTML-parsing library. I think the standard Python HTMLParser
> will do the trick, provided there aren't any errors in the HTML.

I'm using BeautifulSoup, because I need to process real world
HTML. At least by default, it doesn't unescape URLs like that.

Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.

John Nagle
Dec 25 '06 #3

John Nagle wrote:
> What's the appropriate Python function to call to unescape a URL which
> might contain things like that?

xml.sax.saxutils.unescape()

> Will this interfere with the usual "%"
> type escapes in URLs?

Nope, and urllib.unquote() can be used to translate URL escapes manually.

Jeffrey
Dec 25 '06 #4
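
A short sketch of the two calls named in the last reply, written for Python 3, where `urllib.unquote()` now lives at `urllib.parse.unquote()`. The sample href is invented for illustration. Note that `xml.sax.saxutils.unescape()` only handles `&amp;`, `&lt;` and `&gt;` unless extra entities are passed in, so it may not cover everything found in real pages.

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

# Hypothetical attribute value containing both kinds of escaping.
href = "/search?q=a%20b&amp;lang=en"

# HTML/XML unescaping touches "&amp;" but leaves "%20" alone.
html_decoded = unescape(href)

# URL unquoting is a separate, explicit step.
url_decoded = unquote(html_decoded)

print(html_decoded)
print(url_decoded)

# Other entities need to be supplied explicitly via the second argument.
print(unescape("say &quot;hi&quot;", {"&quot;": '"'}))
```

Doing the two steps in this order matters: unquoting first could produce a literal "&" or ";" that the HTML step would then misread.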
