Bytes IT Community

decode unicode string using 'unicode_escape' codecs

I have a unicode string in which some characters are encoded using Python
notation, like '\n' for LF. I need to convert those to the actual LF
character. There is a 'unicode_escape' codec that seems to suit my purpose.
encoded = u'A\\nA'
decoded = encoded.decode('unicode_escape')
print len(decoded)

3

Note that both encoded and decoded are unicode strings. I'm trying to use
the builtin codec because I assume it performs better than a decoder I
would write in pure Python. But I'm not converting between byte strings
and unicode strings.

However, it runs into problems in some cases.

encoded = u'€\\n€'
decoded = encoded.decode('unicode_escape')

Traceback (most recent call last):
  File "g:\bin\py_repos\mindretrieve\trunk\minds\x.py", line 9, in ?
    decoded = encoded.decode('unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

Reading the documentation more carefully, I found out what happened.
decode('unicode_escape') takes a byte string as its operand and converts it
into a unicode string. Since encoded is already unicode, it is first
implicitly converted to a byte string using the 'ascii' encoding. In this
case that conversion fails because of the '€' character.
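The hidden conversion described above can be reproduced explicitly. This is only a minimal sketch, written for Python 3 as an assumption (the thread itself uses Python 2, where unicode.decode() performs the ASCII encode silently). The helper name implicit_ascii_step is hypothetical.

```python
def implicit_ascii_step(text):
    """Mimic the hidden step: encode the text to ASCII bytes before decoding."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        # Same failure as in the traceback: the euro sign has no ASCII byte.
        return exc


# Pure-ASCII input survives the hidden encode and can then be unescaped.
ok = implicit_ascii_step('A\\nA')
print(len(ok.decode('unicode_escape')))

# Input containing U+20AC fails before 'unicode_escape' is ever applied.
failed = implicit_ascii_step('\u20ac\\n\u20ac')
print(type(failed).__name__, failed.start)
```

The second call fails at position 0, which matches the position reported in the traceback.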

So I resigned myself to the fact that 'unicode_escape' doesn't do what I
want. But thinking about it more deeply, I came up with this Python source
code. It runs OK and outputs 3.

---------------------------------
# -*- coding: utf-8 -*-
print len(u'€\n€') # 3
---------------------------------

Think about what happens in the second line. First the parser decodes the
bytes into a unicode string using the UTF-8 encoding. Then it applies the
syntax rules to decode the unicode characters '\n' to LF. The second step
is what I want. There must be something available to the Python interpreter
that is not available to the user. So is there something I have overlooked?

Anyway I just want to leverage the builtin codecs for performance. I
figure this would be faster than

encoded.replace('\\n', '\n')
...and so on...
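For comparison, the pure-Python route could look roughly like the sketch below. It is written for Python 3 as an assumption, and it uses a single regular-expression pass rather than the chain of replace() calls, because a chain would misread an escaped backslash followed by 'n'. The names SIMPLE_ESCAPES, ESCAPE_RE and unescape_simple are hypothetical, and the table covers only a few escapes.

```python
import re

# Hypothetical table covering only a few simple escapes; extend as needed.
SIMPLE_ESCAPES = {
    'n': '\n',
    't': '\t',
    'r': '\r',
    '\\': '\\',
}

# A backslash followed by any single character.
ESCAPE_RE = re.compile(r'\\(.)', re.DOTALL)


def unescape_simple(text):
    """Replace backslash escapes in one left-to-right pass."""
    def substitute(match):
        char = match.group(1)
        # Escapes missing from the table are left untouched.
        return SIMPLE_ESCAPES.get(char, match.group(0))
    return ESCAPE_RE.sub(substitute, text)


print(len(unescape_simple('\u20ac\\n\u20ac')))
```

Because the text never passes through a byte string, non-ASCII characters such as the euro sign are not a problem here. The open question is only speed.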

If there are other suggestions, they would be greatly appreciated :)

wy

Jan 13 '06 #1
2 Replies


aurora wrote:
I have a unicode string in which some characters are encoded using Python
notation, like '\n' for LF. I need to convert those to the actual LF
character. There is a 'unicode_escape' codec that seems to suit my purpose.
encoded = u'A\\nA'
decoded = encoded.decode('unicode_escape')
print len(decoded)

3

Note that both encoded and decoded are unicode strings. I'm trying to
use the builtin codec because I assume it performs better than a decoder
I would write in pure Python. But I'm not converting between byte
strings and unicode strings.

However, it runs into problems in some cases.

encoded = u'€\\n€'
decoded = encoded.decode('unicode_escape')

Traceback (most recent call last):
  File "g:\bin\py_repos\mindretrieve\trunk\minds\x.py", line 9, in ?
    decoded = encoded.decode('unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)


Does this do what you want?

>>> u'€\\n€'
u'\x80\\n\x80'
>>> len(u'€\\n€')
4
>>> u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8')
u'\x80\n\x80'
>>> len(u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8'))
3

Basically, I convert the unicode string to UTF-8 bytes, unescape the bytes
using the 'string_escape' codec, and then convert the bytes back into a
unicode string.
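For readers on Python 3, where the 'string_escape' codec no longer exists, the same three steps might be sketched as follows. This assumes codecs.escape_decode, an undocumented CPython helper, as a stand-in for the codec. The function name unescape_via_utf8 is hypothetical.

```python
import codecs


def unescape_via_utf8(text):
    """Encode to UTF-8, unescape the bytes, then decode back to text."""
    raw = text.encode('utf-8')
    # escape_decode returns a (bytes, consumed_length) pair.
    unescaped, _consumed = codecs.escape_decode(raw)
    return unescaped.decode('utf-8')


result = unescape_via_utf8('\u20ac\\n\u20ac')
print(len(result))
```

One caveat of this sketch: input that contains \x escapes can produce bytes that are not valid UTF-8, in which case the final decode raises UnicodeDecodeError.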

HTH,

STeVe
Jan 13 '06 #2

Cool, it works! I have also checked that the UTF-8 encoding cannot
accidentally introduce any Python escape sequence. I have written a
recipe in the Python Cookbook:

Efficient character escapes decoding
http://aspn.activestate.com/ASPN/Coo.../Recipe/466293
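The check that UTF-8 encoding cannot introduce an escape can be illustrated with a brute-force sketch. It is written for Python 3 as an assumption and is not taken from the recipe. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so none can equal the backslash byte 0x5C. The names BACKSLASH, utf8_introduces_backslash and offenders are hypothetical.

```python
BACKSLASH = 0x5C


def utf8_introduces_backslash(code_point):
    """Return True if a non-ASCII code point encodes to a backslash byte."""
    return BACKSLASH in chr(code_point).encode('utf-8')


offenders = [
    cp for cp in range(0x80, 0x110000)
    # Surrogates cannot be encoded as UTF-8, so skip them.
    if not 0xD800 <= cp <= 0xDFFF and utf8_introduces_backslash(cp)
]
print(len(offenders))  # prints 0
```

So the only backslashes in the encoded bytes are the ones that were already present in the original text.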

wy

Jan 14 '06 #3
