
Removing invalid characters from XML values when writing

I have a string value that may contain some unprintable characters. I am using xml.sax.saxutils.escape to escape the <, > and & characters, which works fine, but it seems to leave in the \n characters. Later this string gets added to some XML, which is UTF-8 encoded, and that fails since character code 8 is invalid UTF-8.

I figured it was a problem with ASCII -> UTF-8 (by the way how do you do newlines in UTF-8 encoded XML?). So I tried the following little function:

    from xml.sax import saxutils

    def convert_str_to_xml_encoded (str):
        new_str = saxutils.escape (str.rstrip ())
        return new_str.encode ("utf-8", "replace")
And now it throws an exception:
str_strip_newlines
return new_str.encode ("utf-8", "replace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)


So first, why would it do this since I use 'replace'? Second, why can it not convert? I guess I could just replace the \n values, but that seems to leave me open to other bad characters in the future. Any ideas how to simply convert from a string to a string that can be written to some XML file?
Dec 5 '07 #1
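
For what it's worth, the traceback looks like the usual Python 2 behaviour: calling .encode() on a byte string first decodes it implicitly with the ascii codec, and the 'replace' argument only applies to the encode step, so byte 0xff fails before 'replace' is ever consulted. Separately, \n is fine in XML as it is, while character code 8 is perfectly valid UTF-8 but is not an allowed character in XML 1.0. Below is a rough sketch of one way to handle both issues, written for Python 3. The function name to_xml_text and the source_encoding parameter are made up for illustration, and the default of "utf-8" is only an assumption, since the post does not say what encoding the input bytes are in.

```python
import re
from xml.sax.saxutils import escape

# Characters that XML 1.0 does not allow in a document:
# C0 controls other than tab, LF and CR, plus surrogates and U+FFFE/U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def to_xml_text(value, source_encoding="utf-8"):
    """Return UTF-8 bytes suitable for XML 1.0 text content.

    `source_encoding` is an assumption about the input bytes; change it
    to whatever encoding the data really uses.
    """
    if isinstance(value, bytes):
        # Decode explicitly. errors="replace" turns undecodable bytes
        # into U+FFFD instead of raising UnicodeDecodeError.
        value = value.decode(source_encoding, errors="replace")
    # Drop characters that XML 1.0 forbids (newlines and tabs are kept).
    value = _INVALID_XML_CHARS.sub("", value)
    # Escape &, < and >, then encode the result as UTF-8.
    return escape(value).encode("utf-8")
```

With this sketch, to_xml_text("x\x08y\n") drops the backspace and keeps the newline. Whether to delete forbidden characters or substitute something visible for them is a design choice; deletion is just the simplest option here.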