UTF-8 to unicode or latin-1 (and yes, I read the FAQ)

NoelByron@gmx.net
#1: Oct 19 '06
Hi!

I'm struggling with the conversion of a UTF-8 string to latin-1. As far
as I know the way to go is to decode the UTF-8 string to unicode and
then encode it back again to latin-1?

So I tried:

'K\xc3\xb6ni'.decode('utf-8')  # 'K\xc3\xb6ni' should be 'König', contains a German umlaut

but failed since python assumes every string to decode to be ASCII?

How can I convert this string to latin-1?

How would you write a function like:

def encode_string(string, from_encoding, to_encoding):
    # ????

Best regards,
Noel


Fredrik Lundh
#2: Oct 19 '06


NoelByron@gmx.net wrote:
> I'm struggling with the conversion of a UTF-8 string to latin-1. As far
> as I know the way to go is to decode the UTF-8 string to unicode and
> then encode it back again to latin-1?
>
> So I tried:
>
> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',

"Köni", to be precise.

> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

that should work, and it sure works for me:

>>> s = 'K\xc3\xb6ni'.decode('utf-8')
>>> s
u'K\xf6ni'
>>> print s
Köni

what did you do, and how did it fail?

</F>

Duncan Booth
#3: Oct 19 '06


NoelByron@gmx.net wrote:
> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

No, Python would assume the string to be utf-8 encoded in this case:

>>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.
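
The decode-then-encode chain shown above can be wrapped into the kind of function asked for in the first post. This is only a sketch, written in Python 3 syntax, where the thread's byte string (`str`) is called `bytes` and `unicode` is called `str`. The `errors` parameter is an assumption added for illustration and was not part of the original question.

```python
def encode_string(data, from_encoding, to_encoding, errors="strict"):
    """Transcode a byte string from one encoding to another.

    `errors` is an assumed extra parameter; it is passed to both steps.
    """
    text = data.decode(from_encoding, errors)  # bytes -> text
    return text.encode(to_encoding, errors)    # text -> bytes


latin1 = encode_string(b"K\xc3\xb6nig", "utf-8", "latin-1")
print(repr(latin1))
```

With the default `errors="strict"`, a character that has no latin-1 representation raises `UnicodeEncodeError` instead of being silently dropped.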

NoelByron@gmx.net
#4: Oct 19 '06


>> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
>
> "Köni", to be precise.

Äh, yes.
;o)

>> contains a german 'umlaut'
>>
>> but failed since python assumes every string to decode to be ASCII?
>
> that should work, and it sure works for me:
>
> >>> s = 'K\xc3\xb6ni'.decode('utf-8')
> >>> s
> u'K\xf6ni'
> >>> print s
> Köni
>
> what did you do, and how did it fail?

First, thank you so much for answering so fast. I proposed Python for a
project and it would be very embarrassing for me if I failed to
convert a UTF-8 string to latin-1.

I realized that my problem is not the decoding from UTF-8. The exception
is raised by print if I try to print the unicode string.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

But that is not a problem at all since I can now turn my UTF-8 strings
to unicode! Once again the problem was sitting right in front of my
screen. Silly me...
;o)

Again, thank you for your reply!

Best regards,
Noel
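
The traceback quoted in this post can be reproduced without any console by doing the ASCII encoding step by hand. The snippet below is a sketch in Python 3 syntax (`bytes` and `str` in place of the thread's `str` and `unicode`). It assumes that print's failure boils down to an implicit encode to ASCII, which is what the error message suggests.

```python
text = b"K\xc3\xb6ni".decode("utf-8")  # this step succeeds

try:
    text.encode("ascii")               # this is the step that fails
    error = None
except UnicodeEncodeError as exc:
    error = exc

if error is not None:
    # The exception records which codec failed and where.
    print(error.encoding, error.start, repr(error.object[error.start]))
```

The exception class is `UnicodeEncodeError`, not `UnicodeDecodeError`, which already hints that the decode call was not the culprit.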

NoelByron@gmx.net
#5: Oct 19 '06


Duncan Booth wrote:
> NoelByron@gmx.net wrote:
>
>> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
>> contains a german 'umlaut'
>>
>> but failed since python assumes every string to decode to be ASCII?
>
> No, Python would assume the string to be utf-8 encoded in this case:
>
> >>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
> 'K\xf6ni'
>
> Your code must have failed somewhere else. Try posting actual failing code
> and actual traceback.

You are right. My test code was:

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeDecode exception. I didn't realize that
the exception was actually raised by print and thought it was the
decode. That explains the fact that an 'ignore' in the decode showed no
effect at all, too.

Thank you for helping!

Best regards,
Noel
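
Why the 'ignore' argument had no effect can be shown in a few lines. This is a sketch in Python 3 syntax. It assumes the failing step is an ASCII encode, as the traceback earlier in the thread indicates.

```python
raw = b"K\xc3\xb6ni"

# 'ignore' on decode only drops bytes that are not valid UTF-8.
# These bytes are valid, so nothing is dropped.
text = raw.decode("utf-8", "ignore")

# The non-ASCII character is still present, so an ASCII encode still fails.
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The handler only helps when it is given to the step that actually fails.
stripped = text.encode("ascii", "ignore")
print(encode_failed, stripped)
```

Note that 'ignore' silently loses the umlaut, so it is rarely what you want for real data.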

Michael Ströder
#6: Oct 19 '06


NoelByron@gmx.net wrote:
> print 'K\xc3\xb6ni'.decode('utf-8')
>
> and this line raised a UnicodeDecode exception.

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console.

Check this out for understanding it:

>>> u = 'K\xc3\xb6ni'.decode('utf-8')
>>> s = u.encode('iso-8859-1')
>>> u
u'K\xf6ni'
>>> s
'K\xf6ni'
>>>

Ciao, Michael.
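
To see which character set the console reports, and to sidestep the implicit conversion, one can encode by hand before printing. This is a sketch in Python 3 syntax. The fallback to 'ascii' when no encoding is reported is an assumption made for illustration, not something stated in the thread.

```python
import sys

text = b"K\xc3\xb6ni".decode("utf-8")

# Encoding reported for the output stream; it may be missing or None
# when output is redirected, hence the assumed 'ascii' fallback.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

# Encode explicitly; 'replace' substitutes characters the console cannot show.
encoded = text.encode(console_encoding, "replace")
print(console_encoding, repr(encoded))
```
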
NoelByron@gmx.net
#7: Oct 19 '06


Michael Ströder wrote:
> NoelByron@gmx.net wrote:
>> print 'K\xc3\xb6ni'.decode('utf-8')
>>
>> and this line raised a UnicodeDecode exception.
>
> Works for me.
>
> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
> print this is implicitly converted to string. The char set used depends
> on your console

And that was the problem. I'm developing with Eclipse (PyDev). The
console is integrated in the development environment. When I print a
unicode string, Python tries to encode it to ASCII, and since the string
contains non-ASCII characters, it fails. That is no problem if you are
aware of this.

My mistake was that I thought the exception was raised by my call to
decode('UTF-8') because print and decode were on the same line and I
thought print could never raise an exception. Seems like I've learned
something today.

Best regards,
Noel

Neil Cerutti
#8: Oct 19 '06


On 2006-10-19, Michael Ströder <michael@stroeder.com> wrote:
> NoelByron@gmx.net wrote:
>> print 'K\xc3\xb6ni'.decode('utf-8')
>>
>> and this line raised a UnicodeDecode exception.
>
> Works for me.
>
> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
> object. With print this is implicitly converted to string. The
> char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored. It's a good thing, too, since it's pretty flaky. It uses
sys.getdefaultencoding(), which is always 'ascii' as far as I
know.

--
Neil Cerutti

Marc 'BlackJack' Rintsch
#9: Oct 19 '06


In <slrnejf84b.rk.horpner@FIAD06.norwich.edu>, Neil Cerutti wrote:
>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>> object. With print this is implicitly converted to string. The
>> char set used depends on your console
>
> No, the setting of the console encoding (sys.stdout.encoding) is
> ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'


Ciao,
Marc 'BlackJack' Rintsch
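
The effect of the stream encoding can be demonstrated without a real terminal. The sketch below, in Python 3 syntax, uses in-memory streams as stand-ins for consoles with different encodings. The helper `write_to_stream` is a hypothetical name introduced here for illustration and is not anything from the thread.

```python
import io

text = b"K\xc3\xb6nig".decode("utf-8")


def write_to_stream(text, encoding):
    """Write text to an in-memory stream that encodes the way a console would."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


utf8_bytes = write_to_stream(text, "utf-8")      # works
latin1_bytes = write_to_stream(text, "latin-1")  # works, different bytes

try:
    write_to_stream(text, "ascii")               # stand-in for an ASCII-only console
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(utf8_bytes), repr(latin1_bytes), ascii_ok)
```

The same text succeeds or fails depending only on the encoding attached to the stream, which matches the behaviour described in this post.
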
Neil Cerutti
#10: Oct 19 '06


On 2006-10-19, Marc 'BlackJack' Rintsch <bj_666@gmx.net> wrote:
> In <slrnejf84b.rk.horpner@FIAD06.norwich.edu>, Neil Cerutti wrote:
>
>>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>>> object. With print this is implicitly converted to string. The
>>> char set used depends on your console
>>
>> No, the setting of the console encoding (sys.stdout.encoding) is
>> ignored.
>
> Nope, it is not ignored. This would not work then::
>
> In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
> König
>
> In [3]: import sys
>
> In [4]: sys.getdefaultencoding()
> Out[4]: 'ascii'

Interesting! Thanks for the correction.

--
Neil Cerutti
This scene has a lot of activity. It is busy like a bee dive.
--Michael Curtis

Neil Cerutti
#11: Oct 19 '06


On 2006-10-19, Marc 'BlackJack' Rintsch <bj_666@gmx.net> wrote:
> In <slrnejf84b.rk.horpner@FIAD06.norwich.edu>, Neil Cerutti wrote:
>>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>>> object. With print this is implicitly converted to string. The
>>> char set used depends on your console
>>
>> No, the setting of the console encoding (sys.stdout.encoding) is
>> ignored.
>
> Nope, it is not ignored. This would not work then::
>
> In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
> König
>
> In [3]: import sys
>
> In [4]: sys.getdefaultencoding()
> Out[4]: 'ascii'

OK, I was thinking of the behavior of file.write(s). Thanks again
for the correction.

--
Neil Cerutti
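
For the file case mentioned in this last post, one way to avoid relying on any default is to open the file with an explicit encoding. This is a sketch in Python 3 syntax. The temporary directory and the file name `koenig.txt` are assumptions made purely for illustration.

```python
import io
import os
import tempfile

text = b"K\xc3\xb6nig".decode("utf-8")

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "koenig.txt")

# The file object encodes on write, so no implicit default is involved.
with io.open(path, "w", encoding="latin-1") as handle:
    handle.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as handle:
    on_disk = handle.read()

print(repr(on_disk))
```
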