unicode wrap unicode object?

ygao
Guest
 
Posts: n/a
#1: Apr 8 '06
>>> import sys
>>> sys.setdefaultencoding("utf-8")
>>> s='\xe9\xab\x98' # this is a utf-8 string
>>> ss=U'\xe9\xab\x98'
>>> s
'\xe9\xab\x98'
>>> ss
u'\xe9\xab\x98'
>>>
How do I get ss from s?
Is there a way to do this?
Thanks!


Fredrik Lundh
Guest
 
Posts: n/a
#2: Apr 8 '06

"ygao" <ygao2004@gmail.com> wrote:
> >>> import sys
> >>> sys.setdefaultencoding("utf-8")

hmm. what kind of bootleg python is that?

>>> import sys
>>> sys.setdefaultencoding("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'setdefaultencoding'

(you're not supposed to change the default encoding. don't
do that; it'll only cause problems in the long run).

> >>> s='\xe9\xab\x98' # this is a utf-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> s
> '\xe9\xab\x98'
> >>> ss
> u'\xe9\xab\x98'
> >>>
> How do I get ss from s?
> Is there a way to do this?

you have UTF-8 *bytes* in a Unicode text string? sounds like
someone's made a mistake earlier on...

anyway, iso-8859-1 is, in practice, a null transform, that simply
converts unicode characters to bytes:

>>> s = ss.encode("iso-8859-1")
>>> s
'\xe9\xab\x98'
>>> s.decode("utf-8")
u'\u9ad8'
>>> import unicodedata
>>> unicodedata.name(s.decode("utf-8"))
'CJK UNIFIED IDEOGRAPH-9AD8'

but it's probably better to fix the code that puts UTF-8 data in your
Unicode strings (look for bogus iso-8859-1 conversions)

</F>
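
For readers on Python 3, the repair described in the post above might look roughly like the sketch below. It is not from the original thread: it assumes Python 3, where `str` is the Unicode type and `bytes` is the 8-bit type, and the names `raw` and `fixed` are made up for illustration.

```python
import unicodedata

# Assumed input: a text string whose code points are really UTF-8 byte values.
ss = "\xe9\xab\x98"

# iso-8859-1 maps code points 0-255 to the identical byte values,
# so encoding with it recovers the raw bytes unchanged.
raw = ss.encode("iso-8859-1")

# Decode the raw bytes with the encoding they were actually written in.
fixed = raw.decode("utf-8")

print(ascii(fixed))             # escaped form of the repaired character
print(unicodedata.name(fixed))  # its Unicode name
```

As the post says, this only patches the symptom; the better fix is upstream, in whatever code decoded the bytes with the wrong encoding.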



ygao
Guest
 
Posts: n/a
#3: Apr 8 '06

Sorry for my poor English.
I got a solution from others.
I must use utf-8 for Chinese.

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> s='\xe9\xab\x98' # this is a utf-8 string
>>> ss=U'\xe9\xab\x98'
>>> ss1=ss.encode('unicode_escape').decode('string_escape')
>>> s1=s.decode('unicode_escape')
>>> s1==ss
True
>>> ss1==s
True
>>>

Fredrik Lundh
Guest
 
Posts: n/a
#5: Apr 8 '06

"ygao" wrote:

> I must use utf-8 for Chinese.

yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
are designed to hold things that you've already decoded (that is, your
chinese text), not the raw UTF-8 bytes.

if you store the UTF-8 in an ordinary 8-bit string instead, you can use
the unicode constructor to convert things properly:

b = "... some utf-8 data ..."

# turn it into a unicode string
u = unicode(b, "utf-8")

# ... do something with it ...

# turn it back into a utf-8 string
s = u.encode("utf-8")

# or use some other encoding
s = u.encode("big5")

e.g.

>>> b = '\xe9\xab\x98'
>>> u = unicode(b, "utf-8")
>>> u.encode("utf-8")
'\xe9\xab\x98'
>>> u.encode("big5")
'\xb0\xaa'

</F>
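
The same round trip can be sketched for Python 3, where the `unicode` constructor no longer exists. This is an assumed translation, not code from the thread: `bytes.decode` stands in for `unicode(b, "utf-8")`, and the names `utf8_again` and `big5` are illustrative only.

```python
# UTF-8 data kept in an 8-bit (bytes) object, as the post recommends.
b = b"\xe9\xab\x98"

# Turn it into a text string; Python 3 spelling of unicode(b, "utf-8").
u = b.decode("utf-8")

# Turn it back into UTF-8 bytes.
utf8_again = u.encode("utf-8")

# Or encode the same text with another encoding.
big5 = u.encode("big5")

print(utf8_again)
print(big5)
```

The point of the pattern is that decoding happens once at the input boundary and encoding once at the output boundary, with only decoded text in between.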



ygao
Guest
 
Posts: n/a
#6: Apr 8 '06

thanks for your advice.

Martin v. Löwis
Guest
 
Posts: n/a
#7: Apr 8 '06

ygao wrote:
> I must use utf-8 for Chinese.

Sure. But please don't do that:

> >>> import sys
> >>> reload(sys)
> >>> sys.setdefaultencoding("utf-8")

As Fredrik says, you should really avoid changing the
default encoding.

> >>> s='\xe9\xab\x98' # this is a utf-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> ss1=ss.encode('unicode_escape').decode('string_escape')
> >>> s1=s.decode('unicode_escape')
> >>> s1==ss
> True
> >>> ss1==s
> True

Ok. But how about that:

py> s='\xe9\xab\x98'
py> ss=u'\u9ad8'
py> s1=s.decode('utf-8')
py> s1==ss
True

Here, ss is a single character, which uses 3 bytes in UTF-8.
In your example, ss has three characters, which are not Chinese,
but European.

Regards,
Martin
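
The difference between the two strings discussed above can be inspected with a small sketch like the following. It assumes Python 3 and is not part of the original thread; `describe` is a hypothetical helper, and `"<no name>"` is just a placeholder for code points that `unicodedata` has no name for (the third character, U+0098, is a control code and falls in that group).

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(hex(ord(c)), unicodedata.name(c, "<no name>")) for c in text]

# Three separate characters whose code points happen to equal the UTF-8 bytes.
three_chars = "\xe9\xab\x98"

# One character, obtained by decoding the same three byte values as UTF-8.
one_char = b"\xe9\xab\x98".decode("utf-8")

print(len(three_chars), describe(three_chars))
print(len(one_char), describe(one_char))
```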