Python strings outside the 128 range

Sébastien Boisgérault

Hi,

Could anyone explain to me how the Python string "é" is mapped to
the binary code "\xe9" in my Python interpreter?

"é" is not present in the 7-bit ASCII table that is the default
encoding, right? So is the mapping "é" -> "\xe9" portable?
Is it (site-)configuration dependent? Can anyone get something
different from "é" when 'print "\xe9"' is executed? If the process
is config-dependent, what kind of config info is used?

Regards,

SB

Jul 13 '06 #1


Diez B. Roggisch

Sébastien Boisgérault wrote:

> Hi,
>
> Could anyone explain me how the python string "é" is mapped to
> the binary code "\xe9" in my python interpreter ?
>
> "é" is not present in the 7-bit ASCII table that is the default
> encoding, right ? So is the mapping "é" -> "\xe9" portable ?
> (site-)configuration dependent ? Can anyone have something
> different of "é" when 'print "\xe9"' is executed ? If the process
> is config-dependent, what kind of config info is used ?

The default encoding has nothing to do with this. "\xe9" is just a byte.
You can write it into a file (which is basically what the terminal is), and
no default encoding whatsoever is in the mix.

The default encoding comes into play when you write unicode(!) strings
to a file. Then the unicode string is converted to a byte string using
the default encoding, which will fail miserably if the default encoding
is ascii (as it is supposed to be) and your unicode string contains any
"funny" characters.

But even if you encode the unicode string explicitly with an encoding
like latin1 or utf-8, the resulting byte strings will just be written to
the file. And it is a totally different question (and actually not
controllable by you/python) whether the terminal will interpret the bytes
correctly or not.
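
To make the text-versus-bytes distinction above concrete, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on the editor's part: the thread itself uses Python 2, where the same calls would be made on u"..." literals. The helper name encode_or_none is made up for illustration.

```python
# Python 3 sketch: one character, several possible byte representations.
text = "\u00e9"  # the single character é, written as an escape to stay ASCII-safe

latin1_bytes = text.encode("latin-1")  # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9


def encode_or_none(s, encoding):
    """Hypothetical helper: return the encoded bytes, or None if the
    encoding cannot represent some character in s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


# é is outside 7-bit ASCII, so an ASCII encode fails and the helper returns None.
ascii_bytes = encode_or_none(text, "ascii")

print(latin1_bytes, utf8_bytes, ascii_bytes)
```

Which of these byte sequences a terminal then displays as "é" depends on the terminal, not on Python.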

Diez

Jul 13 '06 #2

Fredrik Lundh

Sébastien Boisgérault wrote:

> Could anyone explain me how the python string "é" is mapped to
> the binary code "\xe9" in my python interpreter ?

in the iso-8859-1 character set, the character é is represented by the code
0xE9 (233 in decimal). there's no mapping going on here; there's only one
character in the string. how it appears on your screen depends on how you
print it, and what encoding your terminal is using.

>>> s = "é"
>>> len(s)
1
>>> ord(s)
233
>>> hex(ord(s))
'0xe9'
>>> s
'\xe9'
>>> print repr(s)
'\xe9'
>>> print s
é
>>> print chr(233)
é
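
The point that the byte 0xE9 only looks like "é" under a particular encoding can be sketched as follows. This is an illustrative Python 3 example (the session above is Python 2), and the choice of cp850 as the contrasting encoding is an assumption picked because it appears later in the thread.

```python
# Python 3 sketch: the byte 0xE9 only becomes "é" once an encoding is chosen.
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")  # é (U+00E9) in ISO-8859-1
as_cp850 = raw.decode("cp850")     # some other character in the DOS code page

# ord() of the decoded Latin-1 character matches the byte value.
code_point = ord(as_latin1)

# On its own, the same byte is not even valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_latin1), repr(as_cp850), code_point, valid_utf8)
```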

</F>

Jul 13 '06 #3

Sébastien Boisgérault

Fredrik Lundh wrote:

> in the iso-8859-1 character set, the character é is represented by the code
> 0xE9 (233 in decimal). there's no mapping going on here; there's only one
> character in the string. how it appears on your screen depends on how you
> print it, and what encoding your terminal is using.

Crystal clear. Thanks !

SB

Jul 13 '06 #4

Piet van Oostrum

>>>>> "Sébastien Boisgérault" <Se*******************@gmail.com> (SB) wrote:

>SB> Hi,

>SB> Could anyone explain me how the python string "é" is mapped to
>SB> the binary code "\xe9" in my python interpreter ?

That is not done in the python interpreter. It is done in the editor in
which you prepare your python source.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org

Jul 13 '06 #5

Gerhard Fiedler

On 2006-07-13 07:42:51, Fredrik Lundh wrote:

>> Could anyone explain me how the python string "é" is mapped to
>> the binary code "\xe9" in my python interpreter ?
>
> in the iso-8859-1 character set, the character é is represented by the code
> 0xE9 (233 in decimal). there's no mapping going on here; there's only one
> character in the string. how it appears on your screen depends on how you
> print it, and what encoding your terminal is using.

If I understand you correctly, you are saying that if I distribute a file
with the following lines:

s = "é"
print s

I basically also need to distribute the information about how the file is
encoded, and every user needs to use the same (or a compatible) encoding for
reading this file?

Is there a standard way to do this?

Gerhard

Jul 13 '06 #6

Fredrik Lundh

Gerhard Fiedler wrote:

> If I understand you correctly, you are saying that if I distribute a file
> with the following lines:
>
> s = "é"
> print s
>
> I basically need to distribute also the information how the file is encoded
> and every user needs to use the same (or a compatible) encoding for reading
> this file?

if you put a, say, chr(233) in an 8-bit string literal in your source code, whoever runs
your program will get a chr(233) byte (unless someone's recoded the file on the way;
ordinary file copies and installation tools usually don't do that). how your program is
treating that chr(233) is up to your program.

to write robust and future-proof code,

- use Unicode literals if you want to put non-ASCII *text* in Python string literals,
and use a PEP 263-style coding directive to tell the parser what encoding your file
is using:

http://www.python.org/dev/peps/pep-0263/

- avoid putting non-ASCII characters in 8-bit literal strings; use escape sequences if
you need to embed binary data in a string literal.

also see the "lexical analysis" section in the language reference:

http://pyref.infogami.com/lexical-analysis
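
The two recommendations above can be sketched roughly as follows. This is an editorial illustration in Python 3 syntax (an assumption; in the Python 2 of this thread the text literal would carry a u prefix), and the variable names are invented for the example.

```python
# -*- coding: utf-8 -*-
# The line above is a PEP 263 coding declaration. When it sits on the first
# or second line of a source file, it tells the parser how the bytes of the
# file map to characters.

# Text: a Unicode string. Every str is Unicode in Python 3; Python 2 would
# need the u"..." prefix here.
greeting = "café"

# Binary data: a bytes literal using an escape sequence, so no raw
# non-ASCII byte appears in the source.
payload = b"caf\xe9"

# Interpret the binary payload explicitly when it is known to be Latin-1.
decoded = payload.decode("iso-8859-1")

print(greeting == decoded)
```

The file must of course be saved in the encoding that the declaration names.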

</F>

Jul 13 '06 #7

Richard Brodie

"Gerhard Fiedler" <ge*****@gmail.com> wrote in message
news:ma***************************************@python.org...

> If I understand you correctly, you are saying that if I distribute a file
> with the following lines:
>
> s = "é"
> print s
>
> I basically need to distribute also the information how the file is encoded
> and every user needs to use the same (or a compatible) encoding for reading
> this file?
>
> Is there a standard way to do this?

Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
See: http://www.amk.ca/python/howto/unicode particularly the
"Unicode Literals in Python Source Code" section.

Jul 13 '06 #8

Gerhard Fiedler

On 2006-07-13 12:04:58, Richard Brodie wrote:

>> s = "é"
>> print s
>>
>> Is there a standard way to do this?
>
> Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
> See: http://www.amk.ca/python/howto/unicode particularly the
> "Unicode Literals in Python Source Code" section.

So ...

# coding: utf-8
s = u'é'
print s

(Of course stored with an editor that writes the file in utf-8 encoding.)

Is this the proper way?

Will print take care of encoding translation according to the encoding used
in the target console?

Thanks,
Gerhard

Jul 13 '06 #9

Diez B. Roggisch

Gerhard Fiedler wrote:

> On 2006-07-13 12:04:58, Richard Brodie wrote:
>
>>> s = "é"
>>> print s
>>>
>>> Is there a standard way to do this?
>>
>> Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
>> See: http://www.amk.ca/python/howto/unicode particularly the
>> "Unicode Literals in Python Source Code" section.
>
> So ...
>
> # coding: utf-8
> s = u'é'
> print s
>
> (Of course stored with an editor that writes the file in utf-8 encoding.)
>
> Is this the proper way?
>
> Will print take care of encoding translation according to the encoding used
> in the target console?

Of course not. AFAIK there is no way of figuring out which encoding the
target console supports. The best you can do is to offer an option that
allows selection of the output encoding.

And when using print, don't forget to wrap sys.stdout with a
codecs.EncodedFile to properly convert the unicode strings.
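
A rough sketch of the wrapping idea, under several stated assumptions: it uses codecs.getwriter, a swapped-in alternative to the codecs.EncodedFile named above; it is written for Python 3; and it uses an in-memory io.BytesIO object as a stand-in for the console. The output_encoding value is a hypothetical user-selected option.

```python
import codecs
import io

# Stand-in for a byte-oriented console stream (an assumption for the demo;
# in a real Python 3 program this might be sys.stdout.buffer).
fake_console = io.BytesIO()

# Hypothetical user-selected output encoding, e.g. from a command-line option.
output_encoding = "latin-1"

# codecs.getwriter returns the StreamWriter class for the codec; wrapping a
# byte stream with it gives an object whose write() accepts text.
writer = codecs.getwriter(output_encoding)(fake_console, errors="replace")
writer.write("\u00e9\n")

written = fake_console.getvalue()
print(written)
```

With errors="replace", characters that the chosen encoding cannot represent are written as "?" instead of raising an exception.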
Diez

Jul 14 '06 #10

Diez B. Roggisch

Sybren Stuvel wrote:

> Diez B. Roggisch enlightened us with:
>> Of course not. AFAIK there is no way figuring out which encoding the
>> target console supports. The best you can do is to offer an option
>> that allows selection of the output encoding.
>
> You can use the LANG environment variable on many systems. On mine,
> it's set to en_GB.UTF-8, which causes a lot of software to
> automatically choose the right encoding.

That might be a good heuristic - but on my Mac no LANG is set. So I
should rephrase my statement as "There is no reliable and
cross-platform way of figuring out which encoding the console uses".

Diez

Jul 14 '06 #11

Gerhard Fiedler

On 2006-07-14 10:52:22, Diez B. Roggisch wrote:

>>>> Will print take care of encoding translation according to the encoding
>>>> used in the target console?
>>>
>>> Of course not. AFAIK there is no way figuring out which encoding the
>>> target console supports. The best you can do is to offer an option
>>> that allows selection of the output encoding.
>>
>> You can use the LANG environment variable on many systems. On mine,
>> it's set to en_GB.UTF-8, which causes a lot of software to
>> automatically choose the right encoding.
>
> That might be a good heuristic - but on my Mac no LANG is set. So I
> should paraphrase my statement to "There is no reliable and
> cross-platform way figuring out which encoding the console uses".

Right... without being a cross-platform specialist, I figured that much :)

I just thought that maybe the Python runtime had platform-specific
implementations for retrieving the platform-specific information about the
encoding used in the runtime environment (which is probably there on many
platforms) -- similar to maybe the platform-specific implementations of
file access, process and thread handling etc.

Anyway, it seems that anything non-ASCII is a bit problematic and needs
"manual" handling of the runtime environment encoding. Seems a bit odd,
given the worldwide distribution of Python... I would have thought that
such a rather basic task like printing an accented character on a console
had been solved in a standard way, rather than relying on individual
(wheel-reinventing) custom coding. Isn't that something that pretty much
everybody (outside the USA, at least) needs?

Thanks for sharing your thoughts,
Gerhard

Jul 14 '06 #12

Fredrik Lundh

Gerhard Fiedler wrote:

> Anyway, it seems that anything non-ASCII is a bit problematic and needs
> "manual" handling of the runtime environment encoding. Seems a bit odd,
> given the worldwide distribution of Python... I would have thought that
> such a rather basic task like printing an accented character on a console
> had been solved in a standard way, rather than relying on individual
> (wheel-reinventing) custom coding. Isn't that something that pretty much
> everybody (outside the USA, at least) needs?

umm. what are we talking about here, really ?

$ python
>>> import sys
>>> sys.platform
'linux2'
>>> sys.stdout.encoding
'UTF-8'
>>> print unichr(233)
é

python
>>> import sys
>>> sys.platform
'win32'
>>> sys.stdout.encoding
'cp850'
>>> print unichr(233)
é
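
What the two sessions show can be imitated by hand: print looks at the stream's encoding attribute and encodes accordingly. The following is a minimal Python 3 sketch of that idea, not the interpreter's actual implementation; encode_for_stream and FakeStream are hypothetical names introduced for the example.

```python
import sys


def encode_for_stream(text, stream=None, fallback="ascii"):
    """Hypothetical helper: encode text the way the given stream expects.

    Uses the stream's `encoding` attribute when it is set, otherwise the
    fallback; characters the encoding lacks are replaced with "?".
    """
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, errors="replace")


class FakeStream:
    """Minimal stand-in for a console stream with a known encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


# The same character produces different bytes for different consoles.
print(encode_for_stream("\u00e9", FakeStream("UTF-8")))
print(encode_for_stream("\u00e9", FakeStream("cp850")))
print(encode_for_stream("\u00e9", FakeStream(None)))
```

The encoding attribute may be unset when output is redirected to a file or pipe, which is why the sketch carries a fallback.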

</F>

Jul 14 '06 #13

Gerhard Fiedler

On 2006-07-14 12:07:12, Fredrik Lundh wrote:

> umm. what are we talking about here, really ?

Aha! You took a big load off my chest -- this is pretty much what I thought
should be there :)

What I was talking about is that Diez responded with a clear "no" to my
question whether print would do the automatic encoding conversion
(according to the runtime environment) you showed so succinctly. Which I
found surprising...

Thanks,
Gerhard

Jul 14 '06 #14

Michael Piotrowski

On 2006-07-14 "Diez B. Roggisch" <de***@nospam.web.de> wrote:

> Sybren Stuvel wrote:
>> Diez B. Roggisch enlightened us with:
>>> Of course not. AFAIK there is no way figuring out which encoding the
>>> target console supports. The best you can do is to offer an option
>>> that allows selection of the output encoding.
>>
>> You can use the LANG environment variable on many systems. On mine,
>> it's set to en_GB.UTF-8, which causes a lot of software to
>> automatically choose the right encoding.
>
> That might be a good heuristic - but on my Mac no LANG is set. So I
> should paraphrase my statement to "There is no reliable and
> cross-platform way figuring out which encoding the console uses".

If LANG is not set, it's equivalent to setting it to "C". However,
you shouldn't look directly at these variables (LANG and LC_*) but
rather use the functions from the locale module, e.g.:

    import locale
    locale.setlocale(locale.LC_ALL, '')  # use the current locale settings
    encoding = locale.nl_langinfo(locale.CODESET)
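
A slightly more defensive variant of the snippet above might look like this. It is a sketch under assumptions: the helper name console_encoding_guess and the "ascii" default are invented, and the fallback to locale.getpreferredencoding is an editorial addition for platforms where nl_langinfo is unavailable.

```python
import codecs
import locale


def console_encoding_guess(default="ascii"):
    """Hypothetical helper: best-effort guess at the locale's encoding."""
    try:
        # Adopt the user's locale settings (LANG, LC_*), as in the post above.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Unknown or unsupported locale: keep whatever locale is active.
        pass

    if hasattr(locale, "nl_langinfo"):
        # Available on many Unix systems, not on Windows.
        name = locale.nl_langinfo(locale.CODESET)
    else:
        name = locale.getpreferredencoding(False)

    try:
        # Normalise the name and check that Python has a codec for it.
        return codecs.lookup(name).name
    except LookupError:
        return default


print(console_encoding_guess())
```

The result is still only a guess about the console: it reflects the locale settings, which the terminal may or may not honour.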

--
Michael Piotrowski, M.A. <mx*@dynalabs.de>
Public key at <http://www.dynalabs.de/mxp/pubkey.txt>

Jul 17 '06 #15

Piet van Oostrum

>>>>> Michael Piotrowski <mx*@dynalabs.de> (MP) wrote:

>MP> On 2006-07-14 "Diez B. Roggisch" <de***@nospam.web.de> wrote:

>>> Sybren Stuvel wrote:
>>>> Diez B. Roggisch enlightened us with:
>>>>> Of course not. AFAIK there is no way figuring out which encoding the
>>>>> target console supports. The best you can do is to offer an option
>>>>> that allows selection of the output encoding.

>>>> You can use the LANG environment variable on many systems. On mine,
>>>> it's set to en_GB.UTF-8, which causes a lot of software to
>>>> automatically choose the right encoding.

>>> That might be a good heuristic - but on my Mac no LANG is set. So I
>>> should paraphrase my statement to "There is no reliable and
>>> cross-platform way figuring out which encoding the console uses".

>MP> If LANG is not set, it's equivalent to setting it to "C". However,
>MP> you shouldn't look directly at these variables (LANG and LC_*) but
>MP> rather use the functions from the locale module, e.g.:

>MP>   import locale
>MP>   locale.setlocale(locale.LC_ALL, '') # use the current locale settings
>MP>   encoding = locale.nl_langinfo(locale.CODESET)

But if LANG isn't set (like on Mac OS X) this doesn't give you the proper
encoding.
On my system I have added LANG to .profile.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org

Jul 17 '06 #16

Michael Piotrowski

On 2006-07-17 Piet van Oostrum <pi**@cs.uu.nl> wrote:

>>> That might be a good heuristic - but on my Mac no LANG is set. So I
>>> should paraphrase my statement to "There is no reliable and
>>> cross-platform way figuring out which encoding the console uses".
>>
>> If LANG is not set, it's equivalent to setting it to "C". However,
>> you shouldn't look directly at these variables (LANG and LC_*) but
>> rather use the functions from the locale module, e.g.:
>>
>>   import locale
>>   locale.setlocale(locale.LC_ALL, '') # use the current locale settings
>>   encoding = locale.nl_langinfo(locale.CODESET)
>
> But if LANG isn't set (like on Mac OS X) this doesn't give you the proper
> encoding.

Well, yes, but it gives you something "safe" and you can advise the
user to set the locale.

> On my system I have added LANG to .profile.

That's certainly the right thing to do.

--
Michael Piotrowski, M.A. <mx*@dynalabs.de>
Public key at <http://www.dynalabs.de/mxp/pubkey.txt>

Jul 17 '06 #17
