q: how to output a unicode string?

A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)

What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

hÃ©llÃ´ wÃ³rld

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried

import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

So I seem to be stuck.

I have checked several online python+unicode pages, including

http://boodebr.org/main/python/all-a...ode#WHYNOPRINT
http://evanjones.ca/python-utf8.html
http://www.reportlab.com/i18n/python..._tutorial.html
http://www.amk.ca/python/howto/unicode
http://www.example-code.com/python/python-charset.asp
http://docs.python.org/lib/csv-examples.html

but none of them was sufficient to make me understand how to deal with
this simple problem. I'm sure it's easy, maybe too easy to be worth
explaining in a tutorial...

Help gratefully received.
Apr 24 '07 #1
Frank Stajano wrote:
> A simple unicode question. How do I print?
>
> Sample code:
>
> # -*- coding: utf-8 -*-
> s1 = u"héllô wórld"
> print s1
> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
> # u'\xe9' in position 1: ordinal not in range(128)
>
> What I actually want to do is slightly more elaborate: read from a text
> file which is in utf-8, do some manipulations of the text and print the
> result on stdout. I understand I must open the file with
>
> f = codecs.open("input.txt", "r", "utf-8")
>
> but then I get stuck as above.
>
> I tried
>
> s2 = s1.encode("utf-8")
> print s2
>
> but got
>
> hÃ©llÃ´ wÃ³rld

Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.
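The garbling described here can be reproduced in isolation. This is only a minimal sketch, assuming Python 3 (where the unicode object of this thread is `str` and a bytestring is `bytes`); the names `utf8_bytes` and `misread` are illustrative, and the accented characters are written as escapes so the snippet does not depend on the encoding of its own source file:

```python
# The text from the question: "hello world" with e-acute, o-circumflex, o-acute.
text = u"h\u00e9ll\u00f4 w\u00f3rld"

# What the program sends to the terminal after .encode("utf-8").
utf8_bytes = text.encode("utf-8")

# What a terminal shows if it interprets those bytes as Latin-1:
# every two-byte UTF-8 sequence turns into two separate characters.
misread = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(misread))  # 11 14 14
print(ascii(misread))

# The bytes themselves are intact; decoding them as UTF-8 recovers the text.
assert utf8_bytes.decode("utf-8") == text
```

The encoded bytes are correct in both cases; only the terminal's interpretation of them differs.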

> Then, in the hope of being able to write the string to a file if not to
> stdout, I also tried
>
> import codecs
> f = codecs.open("out.txt", "w", "utf-8")
> f.write(s2)
>
> but got
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

The error you get stems from f.write wanting a unicode-object, but s2 is a
bytestring (you explicitly converted it before), so python first tries to
decode the bytestring with the default encoding - ascii - to a unicode string.
This of course fails.
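A rough sketch of both halves of this explanation, assuming Python 3 and using `io.open` in place of the `codecs.open` call from the question (a substitution made for this example only). The temporary path is illustrative. Python 3 no longer performs the hidden ASCII decode; it rejects bytes passed to a text file with a `TypeError` instead, so the decode step is spelled out by hand here:

```python
import io
import os
import tempfile

s1 = u"h\u00e9ll\u00f4 w\u00f3rld"  # text: a sequence of code points
s2 = s1.encode("utf-8")              # bytes: that text encoded as UTF-8

# The hidden step: turning the bytestring back into text with the ASCII
# codec. Byte 0xc3 (position 1) is outside ASCII, so this fails.
try:
    s2.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)
print(error)

# Writing the text itself lets the file object do the one and only encoding.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(s1)

with io.open(path, "rb") as raw:
    on_disk = raw.read()
print(on_disk == s2)  # True: the file holds exactly the UTF-8 bytes
```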

Diez
Apr 24 '07 #2
Diez B. Roggisch wrote:
> Frank Stajano wrote:
>> A simple unicode question. How do I print?
>>
>> Sample code:
>>
>> # -*- coding: utf-8 -*-
>> s1 = u"héllô wórld"
>> print s1
>> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
>> # u'\xe9' in position 1: ordinal not in range(128)
>>
>> What I actually want to do is slightly more elaborate: read from a text
>> file which is in utf-8, do some manipulations of the text and print the
>> result on stdout. I understand I must open the file with
>>
>> f = codecs.open("input.txt", "r", "utf-8")
>>
>> but then I get stuck as above.
>>
>> I tried
>>
>> s2 = s1.encode("utf-8")
>> print s2
>>
>> but got
>>
>> hÃ©llÃ´ wÃ³rld
>
> Which is perfectly alright - it's just that your terminal isn't prepared to
> decode UTF-8, but some other encoding, like latin1.

Aha! Thanks for spotting this. You are right about the terminal
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with a

cat t2.py

(t2.py being the program above) which displays the source code garbled
in the same way.

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

>> Then, in the hope of being able to write the string to a file if not to
>> stdout, I also tried
>>
>> import codecs
>> f = codecs.open("out.txt", "w", "utf-8")
>> f.write(s2)
>>
>> but got
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> ordinal not in range(128)
>
> Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!

> The error you get stems from f.write wanting a unicode-object, but s2 is a
> bytestring (you explicitly converted it before), so python first tries to
> decode the bytestring with the default encoding - ascii - to a unicode string.
> This of course fails.

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Thanks again
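
Putting the pieces of this exchange together, the task from the original question (read a utf-8 file, manipulate the text, write the result out) might look roughly like the sketch below. It assumes Python 3 and `io.open`; `transform` and `process` are hypothetical helpers invented for this example, and uppercasing merely stands in for whatever manipulation is actually needed:

```python
import io
import os
import tempfile


def transform(text):
    """Hypothetical manipulation; stands in for whatever the real job is."""
    return text.upper()


def process(in_path, out_stream, encoding="utf-8"):
    """Decode on input, work on text, encode once on output."""
    with io.open(in_path, "r", encoding=encoding) as src:
        for line in src:  # each line is already decoded text
            out_stream.write(transform(line).encode(encoding))


# Demo with a throwaway input file and an in-memory byte stream.
in_path = os.path.join(tempfile.mkdtemp(), "input.txt")
with io.open(in_path, "w", encoding="utf-8") as f:
    f.write(u"h\u00e9ll\u00f4 w\u00f3rld\n")

captured = io.BytesIO()
process(in_path, captured)
result = captured.getvalue().decode("utf-8")
print(ascii(result))
```

To send the bytes to the real standard output on Python 3, `sys.stdout.buffer` can be passed as `out_stream`; whether they then display correctly still depends on the terminal expecting UTF-8.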
Apr 25 '07 #3
> So why is it that in the first case I got UnicodeEncodeError: 'ascii'
> codec can't encode? Seems as if, within Idle, a utf-8 codec is being
> selected automagically... why should that be so there and not in the
> first case?

I'm a bit confused about what you did when... The error appears if you try to
output a unicode-object without prior encoding - then the default encoding
(ascii) is used.
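
The failing step can be shown on its own, together with a few ways of encoding explicitly so that nothing depends on a default. This is a sketch assuming Python 3; the variable names are illustrative:

```python
s1 = u"h\u00e9ll\u00f4 w\u00f3rld"

# The failing step in isolation: ASCII has no byte for U+00E9.
try:
    s1.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)

# Encoding explicitly, so the outcome no longer depends on a default.
as_utf8 = s1.encode("utf-8")                      # lossless
replaced = s1.encode("ascii", "replace")          # lossy: b'h?ll? w?rld'
escaped = s1.encode("ascii", "backslashreplace")  # b'h\\xe9ll\\xf4 w\\xf3rld'
print(as_utf8, replaced, escaped)
```

The `replace` and `backslashreplace` handlers are only useful when the destination really cannot take anything beyond ASCII; otherwise encoding to the destination's actual encoding keeps the text intact.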

>>> Then, in the hope of being able to write the string to a file if not to
>>> stdout, I also tried
>>>
>>> import codecs
>>> f = codecs.open("out.txt", "w", "utf-8")
>>> f.write(s2)
>>>
>>> but got
>>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>>> ordinal not in range(128)
>>
>> Instead of writing s2 (which is a byte-string!!!), write s1. It will
>> work.
>
> OK, many thanks, I got this to work!
>
>> The error you get stems from f.write wanting a unicode-object, but s2 is
>> a bytestring (you explicitly converted it before), so python first tries
>> to decode the bytestring with the default encoding - ascii - to a unicode
>> string. This of course fails.
>
> I think I have a better understanding of it now. If the terminal hadn't
> fooled me, I probably wouldn't have assumed that the code I originally
> wrote (following the first examples I found) was wrong! I assume that
> when you say "bytestring" you mean "a string of bytes in a certain
> encoding (here utf-8) that can be used as an external representation for
> the unicode string which is instead a sequence of code points".

Yes. That is exactly the difference.

Diez
Apr 25 '07 #4
Diez B. Roggisch wrote:
>> So why is it that in the first case I got UnicodeEncodeError: 'ascii'
>> codec can't encode? Seems as if, within Idle, a utf-8 codec is being
>> selected automagically... why should that be so there and not in the
>> first case?
>
> I'm a bit confused about what you did when... The error appears if you try to
> output a unicode-object without prior encoding - then the default encoding
> (ascii) is used.

Here's a minimal example for you.
I put these four lines into a utf-8 file.

# -*- coding: utf-8 -*-
# this file is called t3.py
s1 = u"héllô wórld"
print s1

If I invoke "python t3.py" at the cygwin/rxvt/bash prompt, I get:

Traceback (most recent call last):
  File "t3.py", line 4, in <module>
    print s1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

If I load the exact same file in Idle and press F5 (for Run), I get:

héllô wórld

So obviously "the system" is not behaving in the same way in the two
cases. Maybe Python senses that it can do utf-8 when it's inside Idle
and sets the default to utf-8 without me asking for it, and senses that
it can't do (or more precisely output) utf-8 when it's in
cygwin/rxvt/bash so there it sets the default codec to ascii. That's my
best guess so far...
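
One way to test this guess is to print the encoding attached to standard output in each environment, and to wrap the output in an explicit UTF-8 writer where the detected value is unsuitable. The sketch below assumes Python 3; `stdout_encoding` and `utf8_writer` are hypothetical helper names, and the demo writes to an in-memory stream rather than a real terminal. As far as I understand it, Python 2's print uses the `encoding` attribute of `sys.stdout` when one is set and falls back to ascii otherwise, which would match the difference seen between Idle and rxvt:

```python
import io
import sys


def stdout_encoding():
    """Return the encoding attached to sys.stdout, or None if there is none."""
    return getattr(sys.stdout, "encoding", None)


def utf8_writer(binary_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


print("stdout encoding:", stdout_encoding())

# Demo against an in-memory byte stream instead of the real terminal.
sink = io.BytesIO()
writer = utf8_writer(sink)
writer.write(u"h\u00e9ll\u00f4 w\u00f3rld")
writer.flush()
written = sink.getvalue()
print(written)
```

On Python 3 the real terminal could be wrapped the same way with `utf8_writer(sys.stdout.buffer)`, although a terminal that expects Latin-1 will still display those bytes garbled.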

I find the encode/decode terminology somewhat confusing, because
arguably both sides are "encoded". For example, a unicode-encoded string
(I mean a sequence of unicode code points) should count as "decoded" in
the terminology of this framework, right?

Anyway, thanks again for your help, for deepening my modest
understanding of the issue and for solving my original problem!
Apr 25 '07 #5

"Frank Stajano" <us*************@neverbox.com> wrote in message
news:f0**********@gemini.csx.cam.ac.uk...
> I find the encode/decode terminology somewhat confusing, because arguably both sides are
> "encoded". For example, a unicode-encoded string (I mean a sequence of unicode code
> points) should count as "decoded" in the terminology of this framework, right?

Yes. Unicode is the one true Universal Character Set, and everything else
(including ASCII and UTF-8) is a mere encoding. Once you've got your head
round that, things may make more sense.
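
A small illustration of that distinction, assuming Python 3 (where the unicode object is `str`): the same sequence of code points comes out as different bytes under different encodings, and decoding with the matching codec always returns the same text. The choice of sample string and encodings is arbitrary:

```python
text = u"h\u00e9llo"

# The "decoded" side: abstract code points, no bytes involved yet.
code_points = [ord(ch) for ch in text]
print(code_points)  # [104, 233, 108, 108, 111]

# The "encoded" side: the same code points as bytes under three encodings.
for name in ("utf-8", "latin-1", "utf-16-le"):
    data = text.encode(name)
    print(name, len(data), data)
    assert data.decode(name) == text  # the matching codec round-trips
```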
Apr 25 '07 #6
