
unicode wrap unicode object?

>>> import sys
>>> sys.setdefaultencoding("utf-8")
>>> s = '\xe9\xab\x98'   # this is a UTF-8 string
>>> ss = U'\xe9\xab\x98'
>>> s
'\xe9\xab\x98'
>>> ss
u'\xe9\xab\x98'

How do I get ss from s?
Is there a way to do this?
Thanks!

Apr 7 '06 #1
"ygao" <yg******@gmail.com> wrote:
> import sys
> sys.setdefaultencoding("utf-8")

hmm. what kind of bootleg python is that ?

>>> import sys
>>> sys.setdefaultencoding("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'setdefaultencoding'

(you're not supposed to change the default encoding. don't
do that; it'll only cause problems in the long run).
> s='\xe9\xab\x98' #this is a UTF-8 string
> ss=U'\xe9\xab\x98'
> s
> '\xe9\xab\x98'
> ss
> u'\xe9\xab\x98'
>
> How do I get ss from s?
> Is there a way to do this?


you have UTF-8 *bytes* in a Unicode text string? sounds like
someone's made a mistake earlier on...

anyway, iso-8859-1 is, in practice, a null transform, that simply
converts unicode characters to bytes:
>>> s = ss.encode("iso-8859-1")
>>> s
'\xe9\xab\x98'
>>> s.decode("utf-8")
u'\u9ad8'
>>> import unicodedata
>>> unicodedata.name(s.decode("utf-8"))
'CJK UNIFIED IDEOGRAPH-9AD8'

but it's probably better to fix the code that puts UTF-8 data in your
Unicode strings (look for bogus iso-8859-1 conversions)

</F>
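
For readers on Python 3, where `str` is always Unicode text and `bytes` is a separate type, the repair described above might look roughly like the sketch below. It is an illustrative translation rather than code from the thread, and the variable names (`mojibake`, `raw`, `fixed`) are made up for the example.

```python
import unicodedata

# Text that wrongly holds UTF-8 bytes as three separate code points
# (the Python 3 spelling of the thread's u'\xe9\xab\x98').
mojibake = "\xe9\xab\x98"

# ISO-8859-1 maps code points 0-255 to the identical byte values,
# so encoding with it recovers the original bytes unchanged.
raw = mojibake.encode("iso-8859-1")

# Decoding those bytes as UTF-8 yields the intended single character.
fixed = raw.decode("utf-8")

print(repr(raw))
print(ascii(fixed))
print(unicodedata.name(fixed))
```

The round trip only works when every character in the damaged string is below U+0100; anything higher makes the `iso-8859-1` encode step raise `UnicodeEncodeError`.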

Apr 8 '06 #2
Sorry for my poor English.
I got a solution from others.
I must use UTF-8 for Chinese.

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> s = '\xe9\xab\x98'   # this is a UTF-8 string
>>> ss = U'\xe9\xab\x98'
>>> ss1 = ss.encode('unicode_escape').decode('string_escape')
>>> s1 = s.decode('unicode_escape')
>>> s1 == ss
True
>>> ss1 == s
True

Apr 8 '06 #4
"ygao" wrote:
> I must use utf-8 for chinese.


yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
are designed to hold things that you've already decoded (that is, your
chinese text), not the raw UTF-8 bytes.

if you store the UTF-8 in an ordinary 8-bit string instead, you can use
the unicode constructor to convert things properly:

b = "... some utf-8 data ..."

# turn it into a unicode string
u = unicode(b, "utf-8")

# ... do something with it ...

# turn it back into a utf-8 string
s = u.encode("utf-8")

# or use some other encoding
s = u.encode("big5")

e.g.

>>> b = '\xe9\xab\x98'
>>> u = unicode(b, "utf-8")
>>> u.encode("utf-8")
'\xe9\xab\x98'
>>> u.encode("big5")
'\xb0\xaa'

</F>
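
The same "decode on the way in, encode on the way out" pattern carries over to Python 3, where the `unicode` constructor is gone and `bytes.decode` / `str.encode` do the work. The following is a minimal sketch under that assumption; `data`, `text`, `as_utf8` and `as_big5` are illustrative names, not anything from the thread.

```python
# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
data = b"\xe9\xab\x98"

# Decode once, at the input boundary.
text = data.decode("utf-8")

# ... work with `text` as ordinary Unicode text ...

# Encode again only when writing out, in whichever encoding the consumer needs.
as_utf8 = text.encode("utf-8")
as_big5 = text.encode("big5")

print(len(text), as_utf8, as_big5)
```

Because the encoding is named explicitly at each boundary, nothing here depends on the interpreter's default encoding.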

Apr 8 '06 #5
thanks for your advice.

Apr 8 '06 #6
ygao wrote:
> I must use utf-8 for chinese.

Sure. But please don't do that:

> import sys
> reload(sys)
> sys.setdefaultencoding("utf-8")

As Fredrik says, you should really avoid changing the
default encoding.

> s='\xe9\xab\x98' #this is a UTF-8 string
> ss=U'\xe9\xab\x98'
> ss1=ss.encode('unicode_escape').decode('string_escape')
> s1=s.decode('unicode_escape')
> s1==ss
> True
> ss1==s
> True


Ok. But how about that:

py> s='\xe9\xab\x98'
py> ss=u'\u9ad8'
py> s1=s.decode('utf-8')
py> s1==ss
True

Here, ss is a single character, which uses 3 bytes in UTF-8.
In your example, ss has three characters, which are not Chinese,
but European.

Regards,
Martin
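
The difference between the two strings can be made visible by listing the characters each one contains. The snippet below is a Python 3 sketch, not code from the thread; `describe` is a hypothetical helper written for this illustration, and `"<no name>"` is just a placeholder for code points that have no Unicode name.

```python
import unicodedata

decoded = b"\xe9\xab\x98".decode("utf-8")   # bytes decoded as UTF-8
undecoded = "\xe9\xab\x98"                  # same values typed as code points


def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


print(len(decoded), describe(decoded))
print(len(undecoded), describe(undecoded))
```

The first string holds one CJK ideograph; the second holds three Latin-1 range code points, which is why comparing it with the properly decoded character fails.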
Apr 8 '06 #7
