Removing invalid characters from XML values when writing

I have a string value that may contain some unprintable characters. I am using xml.sax.saxutils.escape to escape the <, > and & characters, which works fine, but it seems to leave in the \n characters. Later this string gets added to some XML, which is UTF-8 encoded, and that fails since character code 8 is invalid UTF-8.

I figured it was a problem with ASCII -> UTF-8 (by the way how do you do newlines in UTF-8 encoded XML?). So I tried the following little function:

    from xml.sax import saxutils

    def convert_str_to_xml_encoded (str):
        new_str = saxutils.escape (str.rstrip ())
        return new_str.encode ("utf-8", "replace")

And now it throws an exception:
return new_str.encode ("utf-8", "replace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

So first, why would it do this when I pass 'replace'? Second, why can it not convert the string? I guess I could just strip out the \n values, but that seems to leave me open to other bad characters in the future. Any ideas on how to simply convert a string into one that can be written to an XML file?
Dec 5 '07 #1
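
For context on the exception: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, and that hidden decode is what raises UnicodeDecodeError on byte 0xff; the "replace" argument only applies to the encoding step. Separately, character code 8 (backspace) can be encoded in UTF-8 but is not a permitted character in XML 1.0, while newline is permitted and needs no special treatment. The following is a minimal sketch of one way to handle both issues, assuming Python 3 and assuming that silently dropping the disallowed characters is acceptable. The names to_xml_bytes and _INVALID_XML_CHARS are hypothetical and not part of any library.

```python
import re
from xml.sax.saxutils import escape

# Code points that XML 1.0 does not allow in a document: control characters
# other than tab, LF and CR, plus surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def to_xml_bytes(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as escaped, XML-safe bytes."""
    if isinstance(value, bytes):
        # Decode explicitly instead of relying on an implicit ASCII decode;
        # undecodable bytes become U+FFFD rather than raising.
        value = value.decode(encoding, "replace")
    # Drop characters XML 1.0 forbids (assumption: dropping is acceptable).
    cleaned = _INVALID_XML_CHARS.sub("", value.rstrip())
    # Escape <, > and &, then encode; characters the target encoding cannot
    # represent are written as numeric character references.
    return escape(cleaned).encode(encoding, "xmlcharrefreplace")


if __name__ == "__main__":
    print(to_xml_bytes("a\x08b <c> & d\n"))
    print(to_xml_bytes("line1\nline2"))
    print(to_xml_bytes(b"\xffabc"))
```

Because the filter is written against the XML 1.0 character rules rather than against one specific character, it should also cover other control characters that turn up later. If the bad characters need to stay visible, substituting a placeholder such as "?" in the sub() call is an alternative to dropping them.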