unicode - Python

7stud

Based on this example and the error:

-----
u_str = u"abc\u9999"
print u_str

UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
position 3: ordinal not in range(128)
------

it looks like when I try to display the string, the ascii decoder
parses each character in the string and fails when it can't convert a
numerical code that is higher than 127 to a character, i.e. the
character \u9999.

In the following example, I use encode() to convert a unicode string
to a regular string:

-----
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print repr(reg_str)
-----

and the output is:

'abc\xe9\xa6\x99'

1) Why aren't the characters 'a', 'b', and 'c' in hex notation? It
looks like python must be using the ascii decoder to parse the
characters in the string again--with the result being python converts
only the 1 byte numerical codes to characters. 2) Why didn't that
cause an error like above for the 3 byte character?

Then if I try this:

---
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print reg_str
---

I get the output:

abc<some chinese character>

Here it looks like python isn't using the ascii decoder anymore. 3)
What determines which decoder python uses?

Jul 1 '07 #1

Erik Max Francis

7stud wrote:

Based on this example and the error:

-----
u_str = u"abc\u9999"
print u_str

UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
position 3: ordinal not in range(128)
------

it looks like when I try to display the string, the ascii decoder
parses each character in the string and fails when it can't convert a
numerical code that is higher than 127 to a character, i.e. the
character \u9999.

If you try to print a Unicode string, then Python will attempt to first
encode it using the default encoding for that file. Here, it's apparent
the default encoding is 'ascii', so it attempts to encode it into ASCII,
which it can't do, hence the exception. The error is no different from
this:

>>> u_str = u'abc\u9999'
>>> u_str.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
position 3: ordinal not in range(128)
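
To make the equivalence a bit more concrete, here is a minimal sketch that catches the exception and looks at what it records. It is written to run on Python 2.6+ and Python 3 (the thread itself uses Python 2 print statements), and the variable names are only illustrative.

```python
from __future__ import print_function

u_str = u"abc\u9999"

try:
    u_str.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception records which codec failed and where in the string.
    failed_codec = exc.encoding
    bad_start = exc.start
    bad_slice = exc.object[exc.start:exc.end]  # the offending character(s)
    print(failed_codec, bad_start, repr(bad_slice))
```

The position reported is an index into the unicode string, which is why the traceback above says "position 3".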

In the following example, I use encode() to convert a unicode string
to a regular string:

-----
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print repr(reg_str)
-----

and the output is:

'abc\xe9\xa6\x99'

1) Why aren't the characters 'a', 'b', and 'c' in hex notation? It
looks like python must be using the ascii decoder to parse the
characters in the string again--with the result being python converts
only the 1 byte numerical codes to characters. 2) Why didn't that
cause an error like above for the 3 byte character?

Since you've already encoded the Unicode object as a normal string,
Python isn't trying to do any implicit encoding. As for why 'abc'
appears in plain text, that's just the way repr works:

>>> s = 'a'
>>> print repr(s)
'a'

>>> t = '\x99'
>>> print repr(t)
'\x99'

repr is attempting to show the string in the most readable fashion. If
the character is printable, then it just shows it as itself. If it's
unprintable, then it shows it in hex string escape notation.
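
One way to see which bytes repr would show as themselves is to look at the numeric byte values. The following is only a rough sketch (Python 2.6+ and Python 3): the "printable" test used here, values 32 through 126, is a simplification of what repr really does, since repr also escapes the backslash and the quote character it picks.

```python
from __future__ import print_function

encoded = u"abc\u9999".encode("utf-8")

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(encoded))

def show(value):
    """Approximate how repr displays a single byte value."""
    if 32 <= value < 127:
        return chr(value)          # printable ASCII: shown as itself
    return "\\x%02x" % value       # anything else: hex escape

rendered = "".join(show(v) for v in byte_values)
print(byte_values)
print(rendered)
```

The three bytes above 127 are the UTF-8 encoding of the single character \u9999, which is why they come out as three separate hex escapes.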

Then if I try this:

---
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print reg_str
---

I get the output:

abc<some chinese character>

Here it looks like python isn't using the ascii decoder anymore. 3)
What determines which decoder python uses?

Again, that's because by already encoding it as a string, Python isn't
doing any implicit encoding. So it prints the raw string, which happens
to be UTF-8, and which your terminal obviously supports, so you see the
proper character.
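
To illustrate that the bytes only turn into the proper character once something decodes them as UTF-8, here is a small sketch (Python 2.6+ and Python 3). Decoding with Latin-1 is used purely as an assumed stand-in for a terminal that is set to some other encoding.

```python
from __future__ import print_function

encoded = u"abc\u9999".encode("utf-8")

as_utf8 = encoded.decode("utf-8")      # roughly what a UTF-8 terminal does
as_latin1 = encoded.decode("latin-1")  # a terminal assuming Latin-1 instead

print(len(as_utf8))    # 4 characters
print(len(as_latin1))  # 6 characters, one per byte
```

With the wrong assumption about the encoding, the same bytes would show up as three unrelated characters after "abc" instead of one Chinese character.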

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Let us not seek the Republican answer or the Democratic answer but
the right answer. -- John F. Kennedy

Jul 1 '07 #2

Sander Steffann

Hi,

"Erik Max Francis" <ma*@alcyone.com> wrote in message
news:Qp******************************@speakeasy.net...

7stud wrote:

Based on this example and the error:

-----
u_str = u"abc\u9999"
print u_str

UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
position 3: ordinal not in range(128)
------

it looks like when I try to display the string, the ascii decoder
parses each character in the string and fails when it can't convert a
numerical code that is higher than 127 to a character, i.e. the
character \u9999.

If you try to print a Unicode string, then Python will attempt to first
encode it using the default encoding for that file. Here, it's apparent
the default encoding is 'ascii', so it attempts to encode it into ASCII,
which it can't do, hence the exception.

If you want to change the default encoding of your stdout and stderr, you
can do something like this:

import codecs, sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

After doing this, print u_str will work as expected (when using a UTF-8
terminal).
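
To see what the wrapper does without touching the real sys.stdout, the same codecs.getwriter call can be pointed at an in-memory byte buffer. This is just a sketch, assuming Python 2.6+ or Python 3; io.BytesIO is only a stand-in for the terminal stream.

```python
from __future__ import print_function

import codecs
import io

buf = io.BytesIO()                        # stand-in for a byte-oriented stdout
writer = codecs.getwriter("utf-8")(buf)   # same kind of wrapping as above

writer.write(u"abc\u9999")                # accepts unicode, encodes on the way out
written = buf.getvalue()
print(repr(written))
```

Note that on Python 3, sys.stdout is already a text stream with its own encoding, so the wrapping recipe above is specific to Python 2.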

- Sander

Jul 1 '07 #3

7stud

So let me see if I have this right:

Here is some code:
-----
print "print unicode string:"
#print u"abc\u9999" #error
print repr(u'abc\u9999')
print

print "print regular string containing chars in unicode syntax:"
print 'abc\u9999'
print repr('abc\u9999')
print

print "print regular string containing chars in utf-8 syntax:"
#encode() converts unicode strings to regular strings
print u'abc\u9999'.encode("utf-8")
print repr(u'abc\u9999'.encode("utf-8") )
-----

Here is the output:
-------
print unicode string:
u'abc\u9999'

print regular string containing chars in unicode syntax:
abc\u9999
'abc\\u9999'

print regular string containing chars in utf-8 syntax:
abc<chinese character>
'abc\xe9\xa6\x99'
------

1) If you print a unicode string:

*print implicitly calls str()*

a) str() calls encode(), and encode() tries to convert the unicode
string to a regular string. encode() uses the default encoding, which
is ascii. If encode() can't convert a character, then encode() raises
an exception.

b) repr() calls encode(), but if encode() raises an exception for a
character, repr() catches the exception and skips over the character
leaving the character unchanged.

2) If you print a regular string containing characters in unicode
syntax:

a) str() calls encode(), but if encode() raises an exception for a
character, str() catches the exception and skips over the character
leaving the character unchanged. Same as 1b.

b) repr() similar to a), but repr() then escapes the escapes in the
string.

3) If you print a regular string containing characters in utf-8
syntax:

a) str() outputs the string to your terminal, and if your terminal can
convert the utf-8 numerical codes to characters it does so.

b) repr() blocks your terminal from interpreting the characters by
escaping the escapes in your string. Why don't I see two slashes like
in the output for 2b?

Jul 1 '07 #4

Martin v. Löwis

1) If you print a unicode string:

*print implicitly calls str()*

No. print does not call str() if the object is already a string or
unicode object; it calls str() only otherwise.

a) str() calls encode(), and encode() tries to convert the unicode
string to a regular string. encode() uses the default encoding, which
is ascii. If encode() can't convert a character, then encode() raises
an exception.

Yes and no. This is what str() does, but str() isn't called. Instead,
print inspects sys.stdout.encoding, and uses that encoding to encode
the string. That, in turn, may raise an exception (in particular if
sys.stdout.encoding is "ascii" or not set).

b) repr() calls encode(), but if encode() raises an exception for a
character, repr() catches the exception and skips over the character
leaving the character unchanged.

No. repr() never calls encode. Instead, each type, including unicode,
may have its own __repr__ which is called. unicode.__repr__ escapes
all non-ASCII characters.
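
The escaping can be approximated explicitly with the unicode_escape codec, which behaves the same on Python 2 and Python 3. This is an approximation of what unicode.__repr__ shows, not its implementation; the codec is used here instead of repr because on Python 3 repr no longer escapes printable non-ASCII characters.

```python
from __future__ import print_function

# Turn the non-ASCII character into its \uXXXX escape as plain ASCII bytes.
escaped = u"abc\u9999".encode("unicode_escape")

print(repr(escaped))
print(len(escaped))   # 9: a, b, c, backslash, u, 9, 9, 9, 9
```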

2) If you print a regular string containing characters in unicode
syntax:

No. There is no such thing:

py> len("\u")
2
py> "\u"[0]
'\\'
py> "\u"[1]
'u'

In a regular string, \u has no meaning, so \ stands just for itself.
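
The same point as a small script (Python 2.6+ and Python 3). The backslash is written out explicitly as \\ so the literal means the same thing on every version; that spelling is a choice made for this sketch, not something from the example above.

```python
from __future__ import print_function

# A byte string holding a literal backslash, a 'u', and four digits.
plain = b"abc\\u9999"

print(len(plain))                      # 9 bytes
backslash_value = bytearray(plain)[3]  # numeric value of the 4th byte

# Only an explicit decode with the unicode_escape codec gives \u a meaning.
decoded = plain.decode("unicode_escape")
print(len(decoded))                    # 4 characters
```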

a) str() calls encode(), but if encode() raises an exception for a
character, str() catches the exception and skips over the character
leaving the character unchanged. Same as 1b.

No. Printing a string never invokes .encode(), and no exception occurs
at all. Instead, the \ just gets printed as is.

b) repr() similar to a), but repr() then escapes the escapes in the
string.

str.__repr__ escapes the backslash just in case, so that it won't have
to check for the next character; in that sense, it generates a normal
form.

3) If you print a regular string containing characters in utf-8
syntax:

a) str() outputs the string to your terminal, and if your terminal can
convert the utf-8 numerical codes to characters it does so.

Correct. In general, you should always use the terminal's encoding
when printing to the terminal. That way, you can print everything
the terminal can display just fine, and get an exception if
you try to print something that the terminal would be unable to
display.

b) repr() blocks your terminal from interpreting the characters by
escaping the escapes in your string. Why don't I see two slashes like
in the output for 2b?

str.__repr__ produces output that is legal Python syntax for a string
literal. len(u'\u9999'.encode('utf-8')) is 3, so this Chinese character
really encodes as three separate bytes. As these are non-ASCII bytes,
__repr__ chooses a representation that is legal Python syntax. For those
bytes, escapes such as \xe9, \xa6 and \x99 are valid Python syntax (each
representing a single byte). For a backslash, Python could have
generated \x5c or \134 as well, which are all different spellings
of "backslash in a string literal". Python chose the most legible
one, which is the double-backslash.
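
Both claims are easy to check. A minimal sketch (Python 2.6+ and Python 3), with illustrative variable names:

```python
from __future__ import print_function

utf8_bytes = u"\u9999".encode("utf-8")
print(len(utf8_bytes))   # 3

# Three spellings of the same one-byte string: hex, octal, doubled backslash.
same_backslash = b"\x5c" == b"\134" == b"\\"
print(same_backslash)

# The hex escapes from the repr output spell the same three bytes.
same_bytes = utf8_bytes == b"\xe9\xa6\x99"
print(same_bytes)
```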

HTH,
Martin

Jul 1 '07 #5

7stud

Hi,

Thanks for the detailed response.

On Jul 1, 2:14 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

1) If you print a unicode string:

a) str() calls encode(), and encode() tries to convert the unicode
string to a regular string. encode() uses the default encoding, which
is ascii. If encode() can't convert a character, then encode() raises
an exception.

Yes and no. This is what str() does, but str() isn't called. Instead,
print inspects sys.stdout.encoding, and uses that encoding to encode
the string. That, in turn, may raise an exception (in particular if
sys.stdout.encoding is "ascii" or not set).

Is that the same as print calling encode(u_str, sys.stdout.encoding)?

Jul 2 '07 #6

7stud

ooops. I mean is that the same as print calling
u_str.encode(sys.stdout.encoding)?

Jul 2 '07 #7

Martin v. Löwis

ooops. I mean is that the same as print calling

u_str.encode(sys.stdout.encoding)?

Almost. It's rather

u_str.encode(sys.stdout.encoding or sys.getdefaultencoding())

(in case sys.stdout.encoding isn't set)
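
As a rough sketch, that expression can be wrapped in a small helper and tried against fake streams. The names encode_for_stream and FakeStream are hypothetical and exist only for this illustration; they are not part of Python. The code runs on Python 2.6+ and Python 3.

```python
from __future__ import print_function

import sys

def encode_for_stream(text, stream):
    """Encode text with the stream's encoding, else the default encoding."""
    # Hypothetical helper mirroring the expression above; not a Python API.
    encoding = getattr(stream, "encoding", None) or sys.getdefaultencoding()
    return text.encode(encoding)

class FakeStream(object):
    """Stand-in for sys.stdout with a chosen encoding attribute."""
    def __init__(self, encoding):
        self.encoding = encoding

utf8_result = encode_for_stream(u"abc\u9999", FakeStream("utf-8"))
print(repr(utf8_result))

try:
    encode_for_stream(u"abc\u9999", FakeStream("ascii"))
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True   # same error as in the first post
print(ascii_failed)
```

When the stream's encoding is None, the fallback applies: sys.getdefaultencoding() is normally 'ascii' on Python 2 and returns 'utf-8' on Python 3.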

Regards,
Martin

Jul 2 '07 #8
