unescaping xml escape codes

I'm working with strings that contain xml escape codes, such as '&#48;'
and need a way in python to unescape these back to their ascii
representation, such as '&' but can't seem to find a python method for
this. I tried xml.sax.saxutils.unescape(s), but while it works with
'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
suggestions on how to decode the numeric xml escape codes such as this?
Thanks.
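
For reference, a minimal sketch of the behaviour described above. It assumes Python 3 syntax (the thread predates it) and a made-up sample string: xml.sax.saxutils.unescape decodes the predefined named entities but leaves numeric character references alone.

```python
from xml.sax.saxutils import unescape

# Hypothetical sample mixing named entities with decimal and hex
# numeric character references.
sample = "&amp; &lt; &gt; &#48; &#x30;"

# Only &amp;, &lt; and &gt; are decoded; &#48; and &#x30; pass through unchanged.
result = unescape(sample)
print(result)
```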

--
To reply to me directly, please remove "_NoSpam_" from my email address
Jul 18 '05 #1
On Sun, 10 Aug 2003 10:08:46 -0700, Daniel <dl**************@yahoo.com> wrote:
I'm working with strings that contain xml escape codes, such as '&#48;'
and need a way in python to unescape these back to their ascii
representation, such as '&' but can't seem to find a python method for
this. I tried xml.sax.saxutils.unescape(s), but while it works with
'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
suggestions on how to decode the numeric xml escape codes such as this?
Thanks.

Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

If you want to do this properly, I think you have to parse the html a little and see
what the encoding is, and convert to unicode, and then do the conversions.
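
A rough sketch of that "proper" route, assuming Python 3.4 or later (where the standard library has html.unescape) and assuming the encoding is already known rather than sniffed from the document. The helper name unescape_bytes and the utf-8 default are illustrative only, not something from this thread.

```python
import html

def unescape_bytes(raw, encoding="utf-8"):
    """Decode raw bytes to text, then resolve character references.

    `encoding` is an assumption here; a real document should be inspected
    (XML declaration, HTTP header, meta tag) to find it.
    """
    text = raw.decode(encoding)
    # html.unescape handles decimal (&#246;), hex (&#xf6;) and named
    # HTML references, returning a Unicode string.
    return html.unescape(text)

print(unescape_bytes(b"blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;"))
```

Note that html.unescape follows HTML rules, so it also decodes named references such as &ouml; that plain XML does not define.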

Very little tested!!
====< cvthtmlent.py >======================================
# Python 2 script: replaces numeric character references with 8-bit characters.
import re
# Match hex (&#xNN;) or decimal (&#NNN;) references only, so that a
# reference like &#ff; (hex digits without the x) is left untouched.
rxo = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')
def ent2chr(m):
    code = m.group(1)
    if code.isdigit(): code = int(code)
    else: code = int(code[1:], 16)   # drop the leading 'x'
    if code<256: return chr(code)
    else: return '?' #XXX unichr(code).encode('utf-16le') ??

def cvthtmlent(s): return rxo.sub(ent2chr, s)

if __name__ == '__main__':
    import sys; args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-test':
            print cvthtmlent(
                'blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;')
        else:
            # '-' reads from stdin, anything else is taken as a file name
            if arg == '-': fi = sys.stdin
            else: fi = file(arg)
            for line in fi:
                sys.stdout.write(cvthtmlent(line))
===========================================================
If you run this in idle, you can see the umlaut, but not the omega, which becomes a '?'
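
For anyone reading this later, here is a sketch of the same regex approach in Python 3 syntax (an assumption; the script above is Python 2). Strings are Unicode there, so the omega can be returned as a real character instead of '?'. The names NUMERIC_REF and unescape_numeric are made up for this example.

```python
import re

# Decimal (&#246;) or hexadecimal (&#xf6;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def _ref_to_char(match):
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Not a valid code point: leave the reference as it was.
        return match.group(0)

def unescape_numeric(text):
    """Replace numeric references; named entities like &amp; are not touched."""
    return NUMERIC_REF.sub(_ref_to_char, text)

print(unescape_numeric("blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;"))
```

Named entities still need a second pass, for example with xml.sax.saxutils.unescape, and that pass should run after this one so that an escaped '&amp;#48;' is not decoded twice.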

Martin can tell you the real scoop ;-)
>>> from cvthtmlent import cvthtmlent as cvt
>>> print cvt('blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;')
blah [0] blah [ö] blah [123] ?

Regards,
Bengt Richter
Jul 18 '05 #2
On 11 Aug 2003 00:09:42 GMT, bo**@oz.net (Bengt Richter) wrote:
[...]

Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

That should be &#x00; and &#xff; respectively. I did implement hex entities after all.
Botched reediting this commentary however ;-P

Regards,
Bengt Richter
Jul 18 '05 #3