UTF-8 to unicode or latin-1 (and yes, I read the FAQ)

Hi!

I'm struggling with the conversion of a UTF-8 string to latin-1. As far
as I know the way to go is to decode the UTF-8 string to unicode and
then encode it back again to latin-1?

So I tried:

'K\xc3\xb6ni'.decode('utf-8')  # 'K\xc3\xb6ni' should be 'König', contains a German umlaut

but it failed. Does Python assume that every string to be decoded is ASCII?

How can I convert this string to latin-1?

How would you write a function like:

def encode_string(string, from_encoding, to_encoding):
#????

Best regards,
Noel

Oct 19 '06 #1
No*******@gmx.net wrote:
> I'm struggling with the conversion of a UTF-8 string to latin-1. As far
> as I know the way to go is to decode the UTF-8 string to unicode and
> then encode it back again to latin-1?
>
> So I tried:
>
> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',

"Köni", to be precise.

> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

that should work, and it sure works for me:

>>> s = 'K\xc3\xb6ni'.decode('utf-8')
>>> s
u'K\xf6ni'
>>> print s
Köni

what did you do, and how did it fail?

</F>

Oct 19 '06 #2
No*******@gmx.net wrote:
> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
> contains a german 'umlaut'
>
> but failed since python assumes every string to decode to be ASCII?

No, Python would assume the string to be utf-8 encoded in this case:

>>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
'K\xf6ni'

Your code must have failed somewhere else. Try posting actual failing code
and actual traceback.

Oct 19 '06 #3
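The `encode_string` helper asked for in the original question could be sketched as the same decode-then-encode chain shown above. This is only a minimal sketch written for Python 3 (an assumption; the thread itself uses Python 2, where byte strings are `str` and text is `unicode`). The `errors` parameter is an addition that is not part of the original question.

```python
def encode_string(data, from_encoding, to_encoding, errors="strict"):
    """Re-encode the bytes `data` from one character encoding to another."""
    # Step 1: decode the source bytes into a Unicode text string.
    text = data.decode(from_encoding)
    # Step 2: encode the text into the target encoding.
    # `errors` (hypothetical extra parameter) controls what happens when a
    # character does not exist in the target encoding.
    return text.encode(to_encoding, errors)


utf8_bytes = b"K\xc3\xb6ni"  # UTF-8 bytes for 'Köni'
latin1_bytes = encode_string(utf8_bytes, "utf-8", "latin-1")
print(repr(latin1_bytes))
```

Going through Unicode text as the intermediate form means the helper works for any pair of codecs Python knows about, not just UTF-8 and Latin-1.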
>> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
>
> "Köni", to be precise.

Äh, yes.
;o)

>> contains a german 'umlaut'
>>
>> but failed since python assumes every string to decode to be ASCII?
>
> that should work, and it sure works for me:
>
> >>> s = 'K\xc3\xb6ni'.decode('utf-8')
> >>> s
> u'K\xf6ni'
> >>> print s
> Köni
>
> what did you do, and how did it fail?

First, thank you so much for answering so fast. I proposed Python for a
project and it would be very embarrassing for me if I failed to convert
a UTF-8 string to latin-1.

I realized that my problem is not the decoding from UTF-8. The exception
is raised by print when I try to print the unicode string:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

But that is not a problem at all since I can now turn my UTF-8 strings
into unicode! Once again the problem was sitting right in front of my
screen. Silly me...
;o)

Again, thank you for your reply!

Best regards,
Noel

Oct 19 '06 #4
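The failure described above comes from the encoding step that `print` performs, not from `decode`. A minimal sketch in Python 3 (an assumption; the thread uses Python 2) that makes the hidden step explicit by encoding to ASCII by hand:

```python
text = b"K\xc3\xb6ni".decode("utf-8")  # decoding succeeds and yields 'Köni'

try:
    # Roughly what print has to do when the output only accepts ASCII.
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # The error is an *encode* error: 'ö' has no ASCII representation.
    error_message = str(exc)

print(error_message)
```

Splitting the work into separate statements like this shows which of the two steps actually raises.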
Duncan Booth wrote:
> No*******@gmx.net wrote:
>> 'K\xc3\xb6ni'.decode('utf-8') # 'K\xc3\xb6ni' should be 'König',
>> contains a german 'umlaut'
>>
>> but failed since python assumes every string to decode to be ASCII?
>
> No, Python would assume the string to be utf-8 encoded in this case:
>
> >>> 'K\xc3\xb6ni'.decode('utf-8').encode('latin1')
> 'K\xf6ni'
>
> Your code must have failed somewhere else. Try posting actual failing code
> and actual traceback.

You are right. My test code was:

print 'K\xc3\xb6ni'.decode('utf-8')

and this line raised a UnicodeDecode exception. I didn't realize that
the exception was actually raised by print and thought it was the
decode. That also explains why an 'ignore' in the decode had no effect
at all.

Thank you for helping!

Best regards,
Noel

Oct 19 '06 #5
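Why `'ignore'` in the `decode` call changed nothing: the bytes are valid UTF-8, so the decode error handler is never consulted. An error handler only helps on the step that actually fails. A small sketch in Python 3 (an assumption; the thread uses Python 2), using the standard `errors` argument of `bytes.decode` and `str.encode`:

```python
raw = b"K\xc3\xb6ni"

# Valid UTF-8: the handler passed to decode is never triggered.
text = raw.decode("utf-8", "ignore")

# The handler matters on the encode side, where 'ö' has no ASCII form.
dropped = text.encode("ascii", "ignore")    # the character is silently dropped
replaced = text.encode("ascii", "replace")  # the character becomes '?'

print(repr(text), dropped, replaced)
```

Both handlers lose information, so they are usually a last resort compared with choosing an output encoding that can represent the text.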
No*******@gmx.net wrote:
> print 'K\xc3\xb6ni'.decode('utf-8')
>
> and this line raised a UnicodeDecode exception.

Works for me.

Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
print this is implicitly converted to string. The char set used depends
on your console.

Check this out for understanding it:

>>> u = 'K\xc3\xb6ni'.decode('utf-8')
>>> s = u.encode('iso-8859-1')
>>> u
u'K\xf6ni'
>>> s
'K\xf6ni'
>>>

Ciao, Michael.
Oct 19 '06 #6
Michael Ströder wrote:
> No*******@gmx.net wrote:
>> print 'K\xc3\xb6ni'.decode('utf-8')
>>
>> and this line raised a UnicodeDecode exception.
>
> Works for me.
>
> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode object. With
> print this is implicitly converted to string. The char set used depends
> on your console

And that was the problem. I'm developing with Eclipse (PyDev). The
console is integrated in the development environment. When I print a
unicode string, Python tries to encode it to ASCII, and since the string
contains non-ASCII characters, it fails. That is no problem if you are
aware of it.

My mistake was that I thought the exception was raised by my call to
decode('UTF-8') because print and decode were on the same line and I
thought print could never raise an exception. Seems like I've learned
something today.

Best regards,
Noel

Oct 19 '06 #7
On 2006-10-19, Michael Ströder <mi*****@stroeder.com> wrote:
> No*******@gmx.net wrote:
>> print 'K\xc3\xb6ni'.decode('utf-8')
>>
>> and this line raised a UnicodeDecode exception.
>
> Works for me.
>
> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
> object. With print this is implicitly converted to string. The
> char set used depends on your console

No, the setting of the console encoding (sys.stdout.encoding) is
ignored. It's a good thing, too, since it's pretty flaky. It uses
sys.getdefaultencoding(), which is always 'ascii' as far as I
know.
--
Neil Cerutti
Oct 19 '06 #8
In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>> object. With print this is implicitly converted to string. The
>> char set used depends on your console
>
> No, the setting of the console encoding (sys.stdout.encoding) is
> ignored.

Nope, it is not ignored. This would not work then::

    In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
    König

    In [3]: import sys

    In [4]: sys.getdefaultencoding()
    Out[4]: 'ascii'

Ciao,
Marc 'BlackJack' Rintsch
Oct 19 '06 #9
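The point made here, that `print` encodes with the output stream's encoding rather than the interpreter default, can be illustrated by printing to streams with different encodings. This is a sketch in Python 3 (an assumption; the thread uses Python 2), using in-memory `io.TextIOWrapper` streams as stand-ins for consoles. `printed_bytes` is a hypothetical helper introduced only for this illustration.

```python
import io


def printed_bytes(text, encoding):
    """Return the bytes print() emits to a stream that uses `encoding`."""
    buffer = io.BytesIO()
    # newline="\n" keeps the line ending identical on every platform.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()


text = b"K\xc3\xb6nig".decode("utf-8")

# A Latin-1 "console" can represent 'ö', so printing succeeds.
latin1_output = printed_bytes(text, "latin-1")

try:
    # An ASCII-only "console" cannot, so print raises.
    printed_bytes(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(latin1_output, ascii_failed)
```

The same text prints fine or fails depending only on the stream it is sent to, which matches the difference between a terminal and the PyDev console described earlier in the thread.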
On 2006-10-19, Marc 'BlackJack' Rintsch <bj****@gmx.net> wrote:
> In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
>>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>>> object. With print this is implicitly converted to string. The
>>> char set used depends on your console
>>
>> No, the setting of the console encoding (sys.stdout.encoding) is
>> ignored.
>
> Nope, it is not ignored. This would not work then::
>
>     In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
>     König
>
>     In [3]: import sys
>
>     In [4]: sys.getdefaultencoding()
>     Out[4]: 'ascii'

Interesting! Thanks for the correction.

--
Neil Cerutti
This scene has a lot of activity. It is busy like a bee dive.
--Michael Curtis
Oct 19 '06 #10
On 2006-10-19, Marc 'BlackJack' Rintsch <bj****@gmx.net> wrote:
> In <sl*******************@FIAD06.norwich.edu>, Neil Cerutti wrote:
>>> Note that 'K\xc3\xb6ni'.decode('utf-8') returns a Unicode
>>> object. With print this is implicitly converted to string. The
>>> char set used depends on your console
>>
>> No, the setting of the console encoding (sys.stdout.encoding) is
>> ignored.
>
> Nope, it is not ignored. This would not work then::
>
>     In [2]: print 'K\xc3\xb6nig'.decode('utf-8')
>     König
>
>     In [3]: import sys
>
>     In [4]: sys.getdefaultencoding()
>     Out[4]: 'ascii'

OK, I was thinking of the behavior of file.write(s). Thanks again
for the correction.

--
Neil Cerutti
Oct 19 '06 #11
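For file output the encoding is not taken from the console, so it is safest to state it explicitly. This is a sketch in Python 3 (an assumption; in the Python 2 of this thread, `codecs.open` served a similar purpose) that writes the text as Latin-1 to a temporary file and reads the raw bytes back:

```python
import os
import tempfile

text = b"K\xc3\xb6nig".decode("utf-8")

# Create a scratch file for the demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Explicit encoding: the text is stored as Latin-1 bytes.
    with open(path, "w", encoding="latin-1") as handle:
        handle.write(text)
    # Read back in binary mode to see exactly what was written.
    with open(path, "rb") as handle:
        stored = handle.read()
finally:
    os.remove(path)

print(repr(stored))
```

Naming the encoding at the point where the file is opened keeps the result independent of the environment the script happens to run in.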
