unescaping xml escape codes

I'm working with strings that contain xml escape codes, such as '&#48;'
and need a way in python to unescape these back to their ascii
representation, such as '&' but can't seem to find a python method for
this. I tried xml.sax.saxutils.unescape(s), but while it works with
'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
suggestions on how to decode the numeric xml escape codes such as this?
Thanks.

--
To reply to me directly, please remove "_NoSpam_" from my email address
Jul 18 '05 #1
On Sun, 10 Aug 2003 10:08:46 -0700, Daniel <dl**************@yahoo.com> wrote:
> I'm working with strings that contain xml escape codes, such as '&#48;'
> and need a way in python to unescape these back to their ascii
> representation, such as '&' but can't seem to find a python method for
> this. I tried xml.sax.saxutils.unescape(s), but while it works with
> '&amp;', it doesn't work with '&#48;' and other numeric codes. Any
> suggestions on how to decode the numeric xml escape codes such as this?
> Thanks.

Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

If you want to do this properly, I think you have to parse the html a little and see
what the encoding is, and convert to unicode, and then do the conversions.
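
For readers on Python 3 (3.4 or later), where every str is already unicode, the standard library may cover most of this without a hand-written regex. The following is a minimal sketch, not part of the original reply; the sample string and variable names are made up for illustration. Note that html.unescape also resolves HTML named entities, which is broader than strict XML.

```python
import html
from xml.sax.saxutils import unescape

# Hypothetical sample input mixing a named entity with numeric references.
s = "AT&amp;T &#48; &#246; &#x3c9;"

# saxutils.unescape only handles &amp;, &lt; and &gt;
# (plus any extra entities passed in its optional dict argument).
sax_result = unescape(s)

# html.unescape also resolves decimal and hex numeric character references.
full_result = html.unescape(s)

print(sax_result)
print(full_result)
```

If the input is strict XML and HTML entity names such as &nbsp; must not be expanded, a regex restricted to numeric references is probably the safer choice.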

Very little tested!!
====< cvthtmlent.py >======================================
import re
# Python 2 code. Matches decimal (&#246;) and hex (&#xf6;) numeric references.
rxo = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')
def ent2chr(m):
    code = m.group(1)
    if code.isdigit(): code = int(code)
    else: code = int(code[1:], 16)
    # Only code points that fit in one byte are converted.
    if code<256: return chr(code)
    else: return '?' #XXX unichr(code).encode('utf-16le') ??

def cvthtmlent(s): return rxo.sub(ent2chr, s)

if __name__ == '__main__':
    import sys; args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-test':
            print cvthtmlent(
                'blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;')
        else:
            if arg == '-': fi = sys.stdin
            else: fi = file(arg)
            for line in fi:
                sys.stdout.write(cvthtmlent(line))
===========================================================
If you run this in idle, you can see the umlaut, but not the omega, which becomes a '?'

Martin can tell you the real scoop ;-)
>>> from cvthtmlent import cvthtmlent as cvt
>>> print cvt('blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;')

blah [0] blah [÷] blah [123] ?
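
The '?' for the omega comes from the code<256 limit in the script above. On Python 3, chr() accepts any Unicode code point, so that limit may not be needed. A rough sketch of the same regex-substitution idea follows; the names NUMERIC_REF, _ref_to_char and decode_numeric_refs are hypothetical, not from the original script.

```python
import re

# Decimal (&#246;) or hex (&#xf6;) numeric character references only.
NUMERIC_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|([0-9]+));')

def _ref_to_char(match):
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Not a valid code point: leave the reference as it was.
        return match.group(0)

def decode_numeric_refs(text):
    """Replace numeric character references in text with their characters."""
    return NUMERIC_REF.sub(_ref_to_char, text)

print(decode_numeric_refs('blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;'))
```

Named entities such as &amp; are deliberately left alone here, so this could be combined with xml.sax.saxutils.unescape if both kinds appear in the input.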

Regards,
Bengt Richter
Jul 18 '05 #2
On 11 Aug 2003 00:09:42 GMT, bo**@oz.net (Bengt Richter) wrote:
[...]

> Maybe just a regex sub function would do it for you? Do you just need the decimal
> forms like above or also the hex? If your coded entities are &#0; to &#255; or
> &x00; to &xff; this might work. Other entities are converted to '?'.

That should be &#x00; and &#xff; respectively. I did implement hex entities after all.
Botched reediting this commentary however ;-P

Regards,
Bengt Richter
Jul 18 '05 #3
