Bytes IT Community

HTML Encoded Translation

How can I translate this:

&#103;&#105;

to this:

"gi"

I've tried urllib.unencode and it doesn't work.

Thanks!

Oct 17 '06 #1
3 Replies


Dave wrote:
> How can I translate this:
>
> &#103;&#105;
>
> to this:
>
> "gi"

the easiest way is to run it through an HTML or XML parser (depending on
what the source is). or you could use something like this:

import re

def fix_charrefs(text):
    def fixup(m):
        text = m.group(0)
        try:
            if text[:3] == "&#x":
                return unichr(int(text[3:-1], 16))
            else:
                return unichr(int(text[2:-1]))
        except ValueError:
            pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

>>> fix_charrefs("&#103;&#105;")
'gi'
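
The fix_charrefs function above relies on unichr, which exists only in Python 2. On Python 3 a minimal port might look like the sketch below. The name fix_charrefs_py3 is hypothetical, and the sample input &#103;&#105; is an assumption about what the question contained.

```python
import re

def fix_charrefs_py3(text):
    """Replace numeric character references (&#103; or &#x67;) with characters."""
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3].lower() == "&#x":
                # hexadecimal reference, e.g. &#x67;
                return chr(int(ref[3:-1], 16))
            if ref[:2] == "&#":
                # decimal reference, e.g. &#103;
                return chr(int(ref[2:-1]))
        except (ValueError, OverflowError):
            # not a number, or outside the Unicode range
            pass
        return ref  # named entities and malformed references are left as is
    return re.sub(r"&#?\w+;", fixup, text)

print(fix_charrefs_py3("&#103;&#105;"))  # assumed sample input
```

Like the original, this sketch only converts numeric references; a named entity such as &amp; passes through unchanged.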

also see:

http://effbot.org/zone/re-sub.htm#strip-html

> I've tried urllib.unencode and it doesn't work.

those are HTML/XML character references, not encoded URL characters.
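
To make the distinction concrete, here is a small Python 3 sketch that runs the same string through URL decoding and through HTML unescaping, using the standard library's urllib.parse.unquote and html.unescape. The sample string is an assumption about the input in the question.

```python
import html
from urllib.parse import unquote

sample = "&#103;&#105;"  # assumed input: two numeric character references

# URL decoding only understands %XX escapes, so the references pass through untouched.
url_decoded = unquote(sample)

# HTML unescaping understands both numeric and named character references.
html_decoded = html.unescape(sample)

# For comparison, this is what URL-encoded text looks like.
url_example = unquote("%22gi%22")

print(url_decoded)
print(html_decoded)
print(url_example)
```

Here url_decoded is identical to sample, while html_decoded is the plain text gi.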

</F>

Oct 17 '06 #2

Dave enlightened us with:
> How can I translate this:
>
> &#103;&#105;
>
> to this:
>
> "gi"
>
> I've tried urllib.unencode and it doesn't work.

As you put so nicely in the subject: it is HTML encoding, not URL
encoding. Those are two very different things! Try an HTML decoder,
you'll have more luck with that...
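
One way to do that in Python 3 is to feed the markup to the standard library's html.parser.HTMLParser and collect the text it reports. This is only a sketch: the class name TextCollector and the helper html_to_text are hypothetical, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects text content; character references are decoded by the parser."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &#103;, &amp;, etc.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

print(html_to_text("<p>&#103;&#105; &amp; &quot;gi&quot;</p>"))
```

Unlike a regular expression, this also drops the tags themselves, which is usually what you want when the source really is an HTML page.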

Sybren
--
Sybren Stüvel
Stüvel IT - http://www.stuvel.eu/
Oct 17 '06 #3

Got it, great. This worked like a charm. I knew I was barking up the
wrong tree with urllib, but I didn't know which tree to bark up...

Thanks!

Fredrik Lundh wrote:
> the easiest way is to run it through an HTML or XML parser (depending on
> what the source is). or you could use something like this:
Oct 17 '06 #4
