
UTF-8 to unicode or latin-1 (and yes, I read the FAQ)

Hi!

I'm struggling with the conversion of a UTF-8 string to latin-1. As far
as I know the way to go is to decode the UTF-8 string to unicode and
then encode it back again to latin-1?

So I tried:

'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
                              # contains a German 'umlaut'

but failed since python assumes every string to decode to be ASCII?

How can I convert this string to latin-1?

How would you write a function like:

def encode_string(string, from_encoding, to_encoding):
#????

Best regards,
Noel

Oct 19 '06 #1
10 Replies


No*******@gmx.net wrote:
I'm struggling with the conversion of a UTF-8 string to latin-1. As far
as I know the way to go is to decode the UTF-8 string to unicode and
then encode it back again to latin-1?

So I tried:

'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
"Köni", to be precise.
contains a german 'umlaut'

but failed since python assumes every string to decode to be ASCII?
that should work, and it sure works for me:
>>> s = 'K\xc3\xb6ni'.decode('utf-8')
>>> s
u'K\xf6ni'
>>> print s
Köni

what did you do, and how did it fail?

</F>

Oct 19 '06 #2

No*******@gmx.net wrote:
'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
contains a german 'umlaut'

but failed since python assumes every string to decode to be ASCII?
No, Python would assume the string to be utf-8 encoded in this case:
>>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.
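
For the helper asked about in the original question, the decode-then-encode chain shown above can be wrapped in a function. This is only a minimal sketch: the name encode_string and its first three parameters come from the question, the errors parameter is an added assumption, and the input is spelled as a byte string (b'...') so that the snippet also runs on Python 3. On the Python 2.x used in this thread, the plain 'K\xc3\xb6nig' literal behaves the same way.

```python
def encode_string(data, from_encoding, to_encoding, errors='strict'):
    """Re-encode a byte string from one character encoding to another."""
    # Step 1: bytes -> unicode text, interpreting the bytes as from_encoding.
    text = data.decode(from_encoding, errors)
    # Step 2: unicode text -> bytes in the target encoding.
    return text.encode(to_encoding, errors)


utf8_bytes = b'K\xc3\xb6nig'   # 'König' encoded as UTF-8
latin1_bytes = encode_string(utf8_bytes, 'utf-8', 'latin-1')
print(repr(latin1_bytes))
```

With the default errors='strict', a character that does not exist in the target encoding raises UnicodeEncodeError rather than being dropped silently.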

Oct 19 '06 #3

'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',

"Köni", to be precise.
Äh, yes.
;o)
contains a german 'umlaut'

but failed since python assumes every string to decode to be ASCII?

that should work, and it sure works for me:
>>> s = 'K\xc3\xb6ni'.decode('utf-8')
>>> s
u'K\xf6ni'
>>> print s
Köni

what did you do, and how did it fail?
First, thank you so much for answering so fast. I proposed Python for a
project and it would be very embarrassing for me if I failed to
convert a UTF-8 string to latin-1.

I realized that my problem is not the decode from UTF-8. The exception
is raised by print when I try to print the unicode string.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 1: ordinal not in range(128)

But that is not a problem at all since I can now turn my UTF-8 strings
to unicode! Once again the problem was sitting right in front of my
screen. Silly me...
;o)

Again, thank you for your reply!

Best regards,
Noel

Oct 19 '06 #4

Duncan Booth wrote:
No*******@gmx.net wrote:
'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
contains a german 'umlaut'

but failed since python assumes every string to decode to be ASCII?

No, Python would assume the string to be utf-8 encoded in this case:
>>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.
You are right. My test code was:

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeEncodeError. I didn't realize that
the exception was actually raised by print and thought it was the
decode. That also explains why an 'ignore' in the decode had no
effect at all.

Thank you for helping!

Best regards,
Noel

Oct 19 '06 #5

No*******@gmx.net wrote:
print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeDecode exception.
Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console

Check this out for understanding it:
>>> u = 'K\xc3\xb6ni'.decode('utf-8')
>>> s = u.encode('iso-8859-1')
>>> u
u'K\xf6ni'
>>> s
'K\xf6ni'

Ciao, Michael.
Oct 19 '06 #6

Michael Ströder wrote:
No*******@gmx.net wrote:

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeDecode exception.

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console
And that was the problem. I'm developing with Eclipse (PyDev), and the
console is integrated into the development environment. When I print a
unicode string, Python tries to encode it to ASCII, and since the string
contains non-ASCII characters, it fails. That is no problem if you are
aware of it.
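
One way around a console that only accepts ASCII is to encode explicitly and decide what should happen to characters that do not fit. The following is a rough sketch, not PyDev-specific advice: to_console_bytes is a made-up helper name, and falling back to 'ascii' when sys.stdout.encoding is unset is an assumption.

```python
import sys


def to_console_bytes(text, encoding=None, errors='replace'):
    """Encode unicode text for output, replacing unencodable characters."""
    if encoding is None:
        # sys.stdout.encoding can be None (e.g. when output is redirected).
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors)


text = b'K\xc3\xb6ni'.decode('utf-8')
ascii_safe = to_console_bytes(text, 'ascii')
print(repr(ascii_safe))
```

With errors='replace' the umlaut comes out as a question mark instead of raising UnicodeEncodeError; passing 'latin-1' as the encoding keeps it intact.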

My mistake was that I thought the exception was raised by my call to
decode('UTF-8') because print and decode were on the same line and I
thought print could never raise an exception. Seems like I've learned
something today.

Best regards,
Noel

Oct 19 '06 #7

On 2006-10-19, Michael Ströder <mi*****@stroeder.com> wrote:
No*******@gmx.net wrote:
print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeDecode exception.

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
object. With print this is implicitly converted to string. The
char set used depends on your console
No, the setting of the console encoding (sys.stdout.encoding) is
ignored. It's a good thing, too, since it's pretty flaky. It uses
sys.getdefaultencoding(), which is always 'ascii' as far as I
know.
--
Neil Cerutti
Oct 19 '06 #8

In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
object. With print this is implicitly converted to string. The
char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.
Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'
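
To see which of the two settings applies on a given setup, both values can be inspected side by side. A small sketch; report_encodings is a hypothetical helper, and the values it reports vary by platform, terminal and Python version.

```python
import sys


def report_encodings():
    """Return the interpreter default encoding and the stdout encoding."""
    return {
        'default': sys.getdefaultencoding(),
        # May be None when stdout is not attached to a terminal.
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


encodings = report_encodings()
print(encodings)
```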
Ciao,
Marc 'BlackJack' Rintsch
Oct 19 '06 #9

On 2006-10-19, Marc 'BlackJack' Rintsch <bj****@gmx.net> wrote:
In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
object. With print this is implicitly converted to string. The
char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'
Interesting! Thanks for the correction.

--
Neil Cerutti
This scene has a lot of activity. It is busy like a bee dive.
--Michael Curtis
Oct 19 '06 #10

On 2006-10-19, Marc 'BlackJack' Rintsch <bj****@gmx.net> wrote:
In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
object. With print this is implicitly converted to string. The
char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored.

Nope, it is not ignored. This would not work then::

In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
König

In [3]: import sys

In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'
OK, I was thinking of the behavior of file.write(s). Thanks again
for the correction.
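
When writing to a file, the implicit default-encoding conversion can be sidestepped by opening the file with an explicit encoding. A minimal sketch using io.open from the standard library (available since Python 2.6); the temporary file and the helper name write_latin1 exist only for illustration.

```python
import io
import os
import tempfile


def write_latin1(path, text):
    """Write unicode text to path, encoded as Latin-1."""
    with io.open(path, 'w', encoding='latin-1') as handle:
        handle.write(text)


text = b'K\xc3\xb6nig'.decode('utf-8')
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    write_latin1(path, text)
    # Read the raw bytes back to see what actually landed on disk.
    with io.open(path, 'rb') as handle:
        raw = handle.read()
finally:
    os.remove(path)
print(repr(raw))
```

Because the encoding is named when the file is opened, the result does not depend on sys.getdefaultencoding().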

--
Neil Cerutti
Oct 19 '06 #11
