How to output a unicode string?

Frank Stajano

A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)

What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

hÃ©llÃ´ wÃ³rld

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried

import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

So I seem to be stuck.

I have checked several online python+unicode pages, including

http://boodebr.org/main/python/all-a...ode#WHYNOPRINT
http://evanjones.ca/python-utf8.html
http://www.reportlab.com/i18n/python..._tutorial.html
http://www.amk.ca/python/howto/unicode
http://www.example-code.com/python/python-charset.asp
http://docs.python.org/lib/csv-examples.html

but none of them was sufficient to make me understand how to deal with
this simple problem. I'm sure it's easy, maybe too easy to be worth
explaining in a tutorial...

Help gratefully received.

Apr 24 '07 #1


Diez B. Roggisch

Frank Stajano wrote:

> A simple unicode question. How do I print?
>
> Sample code:
>
> # -*- coding: utf-8 -*-
> s1 = u"héllô wórld"
> print s1
> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
> # u'\xe9' in position 1: ordinal not in range(128)
>
> What I actually want to do is slightly more elaborate: read from a text
> file which is in utf-8, do some manipulations of the text and print the
> result on stdout. I understand I must open the file with
>
> f = codecs.open("input.txt", "r", "utf-8")
>
> but then I get stuck as above.
>
> I tried
>
> s2 = s1.encode("utf-8")
> print s2
>
> but got
>
> hÃ©llÃ´ wÃ³rld

Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.
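
The garbling can be reproduced without a terminal. The following is a minimal sketch, written for Python 3 (the thread itself uses Python 2), and it assumes the terminal interprets bytes as Latin-1, as suggested above. The variable names are only illustrative.

```python
# Text as a sequence of code points: "héllô wórld".
text = "h\u00e9ll\u00f4 w\u00f3rld"

# These are the bytes the program actually sends to the terminal.
utf8_bytes = text.encode("utf-8")

# A terminal expecting Latin-1 turns every byte into one character,
# so each two-byte UTF-8 sequence shows up as two wrong characters.
as_seen_by_latin1_terminal = utf8_bytes.decode("latin-1")

# Decoding with the matching codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# ascii() keeps the demonstration printable on any terminal.
print(ascii(as_seen_by_latin1_terminal))
print(round_trip == text)
```

The bytes are correct in both cases; only the interpretation on the receiving side differs.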

> Then, in the hope of being able to write the string to a file if not to
> stdout, I also tried
>
> import codecs
> f = codecs.open("out.txt", "w", "utf-8")
> f.write(s2)
>
> but got
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

The error you get stems from f.write wanting a unicode-object, but s2 is a
bytestring (you explicitly converted it before), so python tries to decode
the bytestring with the default encoding - ascii - to a unicode string.
This of course fails.

Diez
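
The advice above (hand the file object text and let it do the encoding) can be sketched as follows. This is a Python 3 sketch under stated assumptions: it uses `io.open` with an `encoding` argument in place of the `codecs.open` call from the thread, the scratch path is hypothetical, and `upper()` merely stands in for "some manipulations of the text". In Python 3, passing bytes to such a file raises a `TypeError` instead of the `UnicodeDecodeError` seen here.

```python
import io
import os
import tempfile

text = "h\u00e9ll\u00f4 w\u00f3rld"  # "héllô wórld" as code points

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Write text, not bytes: the file object encodes to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading through the same encoding yields text (code points) again.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Manipulate while the data is still text, before any encoding happens.
manipulated = read_back.upper()

# In binary mode the file shows the encoded bytes that are stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()
```

The pattern is to decode on input, work on text, and encode only at the output boundary.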

Apr 24 '07 #2

Frank Stajano

Diez B. Roggisch wrote:

> Frank Stajano wrote:
>
>> A simple unicode question. How do I print?
>>
>> Sample code:
>>
>> # -*- coding: utf-8 -*-
>> s1 = u"héllô wórld"
>> print s1
>> # Gives UnicodeEncodeError: 'ascii' codec can't encode character
>> # u'\xe9' in position 1: ordinal not in range(128)
>>
>> What I actually want to do is slightly more elaborate: read from a text
>> file which is in utf-8, do some manipulations of the text and print the
>> result on stdout. I understand I must open the file with
>>
>> f = codecs.open("input.txt", "r", "utf-8")
>>
>> but then I get stuck as above.
>>
>> I tried
>>
>> s2 = s1.encode("utf-8")
>> print s2
>>
>> but got
>>
>> hÃ©llÃ´ wÃ³rld
>
> Which is perfectly alright - it's just that your terminal isn't prepared to
> decode UTF-8, but some other encoding, like latin1.

Aha! Thanks for spotting this. You are right about the terminal
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with a

cat t2.py

(t2.py being the program above) which displays the source code garbled
in the same way.

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

>> Then, in the hope of being able to write the string to a file if not to
>> stdout, I also tried
>>
>> import codecs
>> f = codecs.open("out.txt", "w", "utf-8")
>> f.write(s2)
>>
>> but got
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>> ordinal not in range(128)
>
> Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!

> The error you get stems from f.write wanting a unicode-object, but s2 is a
> bytestring (you explicitly converted it before), so python tries to decode
> the bytestring with the default encoding - ascii - to a unicode string.
> This of course fails.

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Thanks again

Apr 25 '07 #3

Diez B. Roggisch

> So why is it that in the first case I got UnicodeEncodeError: 'ascii'
> codec can't encode? Seems as if, within Idle, a utf-8 codec is being
> selected automagically... why should that be so there and not in the
> first case?

I'm a bit confused about what you did when... The error appears if you try to
output a unicode-object without prior encoding - then the default encoding
(ascii) is used.

>>> Then, in the hope of being able to write the string to a file if not to
>>> stdout, I also tried
>>>
>>> import codecs
>>> f = codecs.open("out.txt", "w", "utf-8")
>>> f.write(s2)
>>>
>>> but got
>>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
>>> ordinal not in range(128)
>>
>> Instead of writing s2 (which is a byte-string!!!), write s1. It will
>> work.
>
> OK, many thanks, I got this to work!
>
>> The error you get stems from f.write wanting a unicode-object, but s2 is
>> a bytestring (you explicitly converted it before), so python tries to
>> decode the bytestring with the default encoding - ascii - to a unicode
>> string. This of course fails.
>
> I think I have a better understanding of it now. If the terminal hadn't
> fooled me, I probably wouldn't have assumed that the code I originally
> wrote (following the first examples I found) was wrong! I assume that
> when you say "bytestring" you mean "a string of bytes in a certain
> encoding (here utf-8) that can be used as an external representation for
> the unicode string which is instead a sequence of code points".

Yes. That is exactly the difference.

Diez

Apr 25 '07 #4

Frank Stajano

Diez B. Roggisch wrote:

>> So why is it that in the first case I got UnicodeEncodeError: 'ascii'
>> codec can't encode? Seems as if, within Idle, a utf-8 codec is being
>> selected automagically... why should that be so there and not in the
>> first case?
>
> I'm a bit confused on what you did when.... the error appears if you try to
> output a unicode-object without prior encoding - then the default encoding
> (ascii) is used.

Here's a minimal example for you.
I put these four lines into a utf-8 file.

# -*- coding: utf-8 -*-
# this file is called t3.py
s1 = u"héllô wórld"
print s1

If I invoke "python t3.py" at the cygwin/rxvt/bash prompt, I get:

Traceback (most recent call last):
  File "t3.py", line 4, in <module>
    print s1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

If I load the exact same file in Idle and press F5 (for Run), I get:

héllô wórld

So obviously "the system" is not behaving in the same way in the two
cases. Maybe Python senses that it can do utf-8 when it's inside Idle
and sets the default to utf-8 without me asking for it, and senses that
it can't do (or more precisely output) utf-8 when it's in
cygwin/rxvt/bash so there it sets the default codec to ascii. That's my
best guess so far...
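
That guess is close to how Python 2 behaves: `print` encodes a unicode object with `sys.stdout.encoding` when that attribute is set, and falls back to the ASCII default when it is not, while IDLE installs its own stdout object. Which of these applies in a given cygwin/rxvt setup is an assumption here, not something the thread confirms. A way to avoid depending on that detection is to encode explicitly for the stream. The following Python 3 sketch illustrates the idea; `encode_for_stream` and its `fallback` parameter are hypothetical names, not a library API.

```python
import io
import sys


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's declared encoding, if it has one.

    Hypothetical helper for illustration only.
    """
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes '?' for characters the codec cannot represent,
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


text = "h\u00e9ll\u00f4 w\u00f3rld"

# Two in-memory streams standing in for terminals with different encodings.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

as_ascii = encode_for_stream(text, ascii_stream)
as_utf8 = encode_for_stream(text, utf8_stream)

# A stream object without an encoding attribute gets the fallback codec.
as_fallback = encode_for_stream(text, object())

# Shows what the current interpreter detected for its own stdout.
print("stdout declares:", sys.stdout.encoding)
```

Inspecting `sys.stdout.encoding` in both environments is a quick way to check whether the two runs really differ in the detected encoding.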

I find the encode/decode terminology somewhat confusing, because
arguably both sides are "encoded". For example, a unicode-encoded string
(I mean a sequence of unicode code points) should count as "decoded" in
the terminology of this framework, right?

Anyway, thanks again for your help, for deepening my modest
understanding of the issue and for solving my original problem!

Apr 25 '07 #5

Richard Brodie

"Frank Stajano" wrote:

> I find the encode/decode terminology somewhat confusing, because arguably both sides are
> "encoded". For example, a unicode-encoded string (I mean a sequence of unicode code
> points) should count as "decoded" in the terminology of this framework, right?

Yes. Unicode is the one true Universal Character Set, and everything else
(including ASCII and UTF-8) is a mere encoding. Once you've got your head
round that, things may make more sense.
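
One way to make this concrete is to encode the same text with several codecs and compare the results. This is a small Python 3 sketch (in Python 3, `str` plays the role of the unicode object and `bytes` the role of the bytestring); the variable names are illustrative.

```python
text = "h\u00e9ll\u00f4 w\u00f3rld"  # one text, 11 code points

# The same text has different byte representations in different encodings.
utf8_bytes = text.encode("utf-8")      # accented letters take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

lengths = (len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding each byte string with its own codec gives back the same text.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1") == text

# ASCII cannot represent the accented characters at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(lengths, same_text, ascii_ok)
```

"Decode" always goes from bytes to text, and "encode" from text to bytes.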

Apr 25 '07 #6
