Unicode question : turn "José" into u"José"

This is probably stupid and/or misguided but supposing I'm passed a byte-string value that I want to be unicode, this is what I do. I'm sure I'm missing something very important.

Short version:
>>> s = "José"  # Start with non-unicode string
>>> unicoded = eval("u'%s'" % "José")

Long version:
>>> s = "José"  # Start with non-unicode string
>>> s  # Let's look at it
'Jos\xe9'
>>> escaped = s.encode('string_escape')
>>> escaped
'Jos\\xe9'
>>> unicoded = eval("u'%s'" % escaped)
>>> unicoded
u'Jos\xe9'
>>> test = u"José"  # What they should have passed me
>>> test == unicoded  # Am I really getting the same thing?
True

Yay!


Apr 5 '06 #1
First of all, if you run this on the console, find out your console's
encoding. In my case, on English Windows XP, it is 'cp437'.

C:\>chcp
Active code page: 437

Then:
>>> s = "José"
>>> u = u"Jos\u00e9"  # same thing as a unicode escape
>>> s.decode('cp437') == u  # use the encoding that matches your console
True
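
The console encoding matters because the same four characters are stored as different bytes under different encodings. Here is a small sketch of that, assuming only the standard cp437, latin-1 and utf-8 codecs; the variable names are purely illustrative and not from the original post.

```python
# Illustrative sketch: one piece of text, three different byte sequences.
text = u"Jos\u00e9"

as_cp437 = text.encode("cp437")     # b'Jos\x82', what a cp437 console produces
as_latin1 = text.encode("latin-1")  # b'Jos\xe9'
as_utf8 = text.encode("utf-8")      # b'Jos\xc3\xa9', two bytes for the accented letter

# Decoding with the matching encoding recovers the text.
round_trip = as_cp437.decode("cp437")

# Decoding with a different encoding raises no error here, but the result is wrong.
mismatch = as_cp437.decode("latin-1")
```

So a byte string typed at a cp437 console has to be decoded as cp437, while the 'Jos\xe9' shown in the question looks like Latin-1 bytes.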

wy

Apr 5 '06 #2
maybe a bit off topic, but how does one find the console's encoding
from within python?

Apr 5 '06 #3
The most important thing that you are missing is that you need to know
the encoding used for the 8-bit-character string. Let's guess that it's
Latin1.
Then all you have to do is use the unicode() builtin function, or the
string decode method.
# >>> s = 'Jos\xe9'
# >>> s
# 'Jos\xe9'
# >>> u = unicode(s, 'latin1')
# >>> u
# u'Jos\xe9'
# >>> u2 = s.decode('latin1')
# >>> u2
# u'Jos\xe9'
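
To show why the guess matters, here is a rough sketch that decodes the same bytes with three different encodings. The helper name to_unicode is hypothetical, and the b"" literal is used so that the snippet also runs on newer Python versions.

```python
def to_unicode(raw, encoding):
    """Decode a byte string using an encoding the caller must supply."""
    return raw.decode(encoding)


raw = b"Jos\xe9"

good = to_unicode(raw, "latin-1")  # u'Jos\xe9', the intended text
wrong = to_unicode(raw, "cp437")   # no error, but byte 0xE9 is a different character in cp437

# The same bytes are not valid UTF-8 at all, so this attempt raises.
try:
    to_unicode(raw, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

A wrong encoding does not always fail loudly, which is why the encoding has to come from whoever produced the bytes rather than from trial and error.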

Other important things:
(1) Using eval() is not usually the best way to do things.
(2) If your code is not entirely in ASCII, put a coding declaration
at the top of the source file.
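
On point (1), the eval() trick in the question appears to work only because it turns each byte value into the Unicode code point with the same number, which is exactly what decoding as Latin-1 does. A minimal sketch of that equivalence (variable names are illustrative):

```python
raw = b"Jos\xe9"

# Latin-1 maps every byte value 0-255 to the code point with the same number.
text = raw.decode("latin-1")

byte_values = list(bytearray(raw))      # [74, 111, 115, 233]
code_points = [ord(ch) for ch in text]  # the same four numbers

same_numbers = byte_values == code_points
```

So if the bytes really are Latin-1, a plain decode gives the same result without evaluating any input as code; if they are in some other encoding, the eval() version silently gives the wrong characters.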

Apr 5 '06 #4
ianaré wrote:
maybe a bit off topic, but how does one find the console's encoding
from within python?

In [1]: import sys

In [3]: sys.stdout.encoding
Out[3]: 'cp437'

In [4]: sys.stdin.encoding
Out[4]: 'cp437'
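
One caveat: the encoding attribute can be None, for example under Python 2 when output is redirected to a file or pipe. A possible sketch with a fallback follows; the function name console_encoding is hypothetical, and falling back to locale.getpreferredencoding() is an assumption about what is reasonable, not something from this thread.

```python
import locale
import sys


def console_encoding(stream=None):
    """Return the stream's encoding, or the locale's preferred one if unset."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None)
    # Fall back when the stream does not report an encoding.
    return encoding or locale.getpreferredencoding()


enc = console_encoding()
```

The result can then be passed straight to decode(), e.g. s.decode(console_encoding(sys.stdin)) for text typed at the console.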

Kent
Apr 5 '06 #5
