Unescaping XML escape codes

Daniel

I'm working with strings that contain XML escape codes, such as '&#48;',
and need a way in Python to unescape these back to their ASCII
representation, such as '&', but can't seem to find a Python method for
this. I tried xml.sax.saxutils.unescape(s), but while it works with
'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
suggestions on how to decode numeric XML escape codes such as this?
Thanks.

--
To reply to me directly, please remove "_NoSpam_" from my email address
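
The behaviour described in the question can be reproduced with a short Python 3 sketch. The sample string is made up for illustration; only xml.sax.saxutils.unescape comes from the post. By default that function rewrites just &amp;, &lt; and &gt;, so a numeric reference passes through untouched.

```python
from xml.sax.saxutils import unescape

# Hypothetical sample mixing named escapes and one numeric reference.
sample = "fish &amp; chips &lt;&#48;&gt;"

# Only &amp;, &lt; and &gt; are replaced; '&#48;' is left as-is.
result = unescape(sample)
print(result)
```

The optional second argument of unescape takes a dictionary of extra literal replacements, which does not scale to arbitrary numeric codes.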

Jul 18 '05 #1


Bengt Richter

On Sun, 10 Aug 2003 10:08:46 -0700, Daniel <dl**************@yahoo.com> wrote:

> I'm working with strings that contain XML escape codes, such as '&#48;',
> and need a way in Python to unescape these back to their ASCII
> representation, such as '&', but can't seem to find a Python method for
> this. I tried xml.sax.saxutils.unescape(s), but while it works with
> '&amp;', it doesn't work with '&#48;' and other numeric codes. Any
> suggestions on how to decode numeric XML escape codes such as this?
> Thanks.

Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above, or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

If you want to do this properly, I think you have to parse the HTML a little to see
what the encoding is, convert to Unicode, and then do the conversions.
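
As a rough illustration of that order of operations in Python 3 (not the code from this post): decode the bytes to text first, then expand the references. The function name decode_and_unescape and the UTF-8 default are assumptions made for this sketch; a real page would need its declared encoding. html.unescape is a standard library function that handles both decimal and hexadecimal references, but it also expands HTML named entities that plain XML does not define.

```python
import html

def decode_and_unescape(raw_bytes, encoding="utf-8"):
    """Decode bytes to str first, then expand character references.

    The encoding default is an assumption; pass the document's real one.
    """
    text = raw_bytes.decode(encoding)
    return html.unescape(text)

result = decode_and_unescape(b"blah [&#48;] blah [&#246;] &#x3c9; &amp;")
# ascii() keeps the printout safe on consoles that cannot show every character.
print(ascii(result))
```

Because the result is a Unicode string, code points above 255 need no special casing.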

Very little tested!!
====< cvthtmlent.py >======================================
# Python 2 code
import re

# Matches decimal (&#246;) and hexadecimal (&#xf6;) character references.
rxo = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')

def ent2chr(m):
    code = m.group(1)
    if code.isdigit(): code = int(code)
    else: code = int(code[1:], 16)   # strip the leading 'x'
    if code<256: return chr(code)
    else: return '?' #XXX unichr(code).encode('utf-16le') ??

def cvthtmlent(s): return rxo.sub(ent2chr, s)

if __name__ == '__main__':
    import sys; args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-test':
            print cvthtmlent(
                'blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
        else:
            # '-' reads from stdin; anything else is treated as a file name
            if arg == '-': fi = sys.stdin
            else: fi = file(arg)
            for line in fi:
                sys.stdout.write(cvthtmlent(line))
===========================================================
If you run this in IDLE, you can see the umlaut, but not the omega, which becomes a '?'.

Martin can tell you the real scoop ;-)

>>> from cvthtmlent import cvthtmlent as cvt
>>> print cvt('blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
blah [0] blah [ö] blah [123] ?
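
The omega turns into '?' because the code above only maps code points below 256. A minimal Python 3 sketch of the same regex-substitution idea, where chr() covers the full Unicode range, might look like the following. The names NUMERIC_REF, ref_to_char and unescape_numeric are invented for this example, and leaving out-of-range references untouched is a design choice of the sketch, not something taken from the post.

```python
import re

# Decimal (&#246;) or hexadecimal (&#xf6;) numeric character references.
NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def ref_to_char(match):
    """Return the character for one numeric reference."""
    hex_digits, dec_digits = match.groups()
    if hex_digits is not None:
        code = int(hex_digits, 16)
    else:
        code = int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Not a valid code point: keep the original text unchanged.
        return match.group(0)

def unescape_numeric(text):
    """Expand numeric references only; named ones such as &amp; are kept."""
    return NUMERIC_REF.sub(ref_to_char, text)

converted = unescape_numeric('blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
# ascii() avoids depending on what the console can display.
print(ascii(converted))
```

Named escapes such as &amp; are deliberately left for xml.sax.saxutils.unescape, so the two steps can be combined as needed.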

Regards,
Bengt Richter

Jul 18 '05 #2

Bengt Richter

On 11 Aug 2003 00:09:42 GMT, bo**@oz.net (Bengt Richter) wrote:
[...]

> Maybe just a regex sub function would do it for you? Do you just need the decimal
> forms like above or also the hex? If your coded entities are &#0; to &#255; or
> &x00; to &xff; this might work. Other entities are converted to '?'.

That should be &#x00; and &#xff; respectively. I did implement hex entities after all.
Botched re-editing this commentary, however ;-P

Regards,
Bengt Richter

Jul 18 '05 #3
