Python strings outside the 128 range

Hi,

Could anyone explain to me how the Python string "é" is mapped to
the binary code "\xe9" in my Python interpreter?

"é" is not present in the 7-bit ASCII table that is the default
encoding, right? So is the mapping "é" -> "\xe9" portable? Is it
(site-)configuration dependent? Could anyone get something
different from "é" when 'print "\xe9"' is executed? If the process
is config-dependent, what kind of config info is used?

Regards,

SB

Jul 13 '06 #1
16 Replies


Sébastien Boisgérault wrote:
> Hi,
>
> Could anyone explain to me how the Python string "é" is mapped to
> the binary code "\xe9" in my Python interpreter?
>
> "é" is not present in the 7-bit ASCII table that is the default
> encoding, right? So is the mapping "é" -> "\xe9" portable? Is it
> (site-)configuration dependent? Could anyone get something
> different from "é" when 'print "\xe9"' is executed? If the process
> is config-dependent, what kind of config info is used?

The default encoding has nothing to do with this. "\xe9" is just a byte.
You can write it to a file (which is basically what the terminal is), and
no default encoding whatsoever is involved.

The default encoding comes into play when you write unicode(!) strings
to a file. Then the unicode string is converted to a byte string using
the default encoding, which will fail miserably if the default encoding
is ascii (as it is supposed to be) and your unicode string contains any
"funny" characters.

But even if you encode the unicode string explicitly with an encoding
like latin1 or utf-8, the resulting byte string will just be written to
the file. And it is a totally different question (and actually not
controllable by you/Python) whether the terminal will interpret the bytes
correctly or not.
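
A minimal sketch of the distinction described above, written in Python 3 syntax as an assumption (the thread itself uses Python 2, where the text type is `unicode` and the implicit conversion uses the default encoding). It only illustrates that one character becomes different byte strings under different encodings, and that ASCII cannot encode it at all:

```python
text = "\u00e9"  # the character é as a text (Unicode) string

latin1_bytes = text.encode("latin-1")  # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9

# ASCII has no code for é, so encoding fails instead of guessing.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_bytes, utf8_bytes, ascii_ok)
```

Whichever byte string is written, the terminal still decides how to display it.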

Diez
Jul 13 '06 #2

Sébastien Boisgérault wrote:
> Could anyone explain to me how the Python string "é" is mapped to
> the binary code "\xe9" in my Python interpreter?

in the iso-8859-1 character set, the character é is represented by the code
0xE9 (233 in decimal). there's no mapping going on here; there's only one
character in the string. how it appears on your screen depends on how you
print it, and what encoding your terminal is using.
>>> s = "é"
>>> len(s)
1
>>> ord(s)
233
>>> hex(ord(s))
'0xe9'
>>> s
'\xe9'
>>> print repr(s)
'\xe9'
>>> print s
é
>>> print chr(233)
é
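
To see why the display depends on the terminal's encoding, here is a small sketch in Python 3 syntax (an assumption; the session above is Python 2). It decodes the same byte the way three differently configured terminals might; the choice of the three encodings is only illustrative:

```python
byte = b"\xe9"  # the single byte from the session above

as_latin1 = byte.decode("latin-1")                # é (U+00E9)
as_cp850 = byte.decode("cp850")                   # some other character in this code page
as_utf8 = byte.decode("utf-8", errors="replace")  # U+FFFD: 0xE9 alone is not valid UTF-8

# ascii() keeps the output readable whatever the console encoding is.
print(ascii(as_latin1), ascii(as_cp850), ascii(as_utf8))
```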

</F>

Jul 13 '06 #3

Fredrik Lundh wrote:
> in the iso-8859-1 character set, the character é is represented by the code
> 0xE9 (233 in decimal). there's no mapping going on here; there's only one
> character in the string. how it appears on your screen depends on how you
> print it, and what encoding your terminal is using.

Crystal clear. Thanks !

SB

Jul 13 '06 #4

>>>>> "Sébastien Boisgérault" <Se*******************@gmail.com> (SB) wrote:

> SB> Hi,
> SB> Could anyone explain to me how the Python string "é" is mapped to
> SB> the binary code "\xe9" in my Python interpreter?

That is not done in the python interpreter. It is done in the editor in
which you prepare your python source.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org
Jul 13 '06 #5

On 2006-07-13 07:42:51, Fredrik Lundh wrote:
>> Could anyone explain to me how the Python string "é" is mapped to
>> the binary code "\xe9" in my Python interpreter?
>
> in the iso-8859-1 character set, the character é is represented by the code
> 0xE9 (233 in decimal). there's no mapping going on here; there's only one
> character in the string. how it appears on your screen depends on how you
> print it, and what encoding your terminal is using.

If I understand you correctly, you are saying that if I distribute a file
with the following lines:

s = "é"
print s

I basically also need to distribute the information about how the file is
encoded, and every user needs to use the same (or a compatible) encoding for
reading this file?

Is there a standard way to do this?

Gerhard

Jul 13 '06 #6

Gerhard Fiedler wrote:
> If I understand you correctly, you are saying that if I distribute a file
> with the following lines:
>
> s = "é"
> print s
>
> I basically also need to distribute the information about how the file is
> encoded, and every user needs to use the same (or a compatible) encoding for
> reading this file?

if you put a, say, chr(233) in an 8-bit string literal in your source code, whoever runs
your program will get a chr(233) byte (unless someone's recoded the file on the way;
ordinary file copies and installation tools usually don't do that). how your program is
treating that chr(233) is up to your program.

to write robust and future-proof code,

- use Unicode literals if you want to put non-ASCII *text* in Python string literals,
and use a PEP 263-style coding directive to tell the parser what encoding your file
is using:

http://www.python.org/dev/peps/pep-0263/

- avoid putting non-ASCII characters in 8-bit literal strings; use escape sequences if
you need to embed binary data in a string literal.

also see the "lexical analysis" section in the language reference:

http://pyref.infogami.com/lexical-analysis
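
A minimal sketch of those two recommendations, assuming a file saved as UTF-8 and written in Python 3 syntax (where plain string literals are already Unicode and UTF-8 is the default source encoding, so the directive is optional; in Python 2 the text literal would be `u"é"`). The variable names are purely illustrative:

```python
# -*- coding: utf-8 -*-
# PEP 263 coding directive: in a real source file it goes on line 1 or 2.

text = "é"            # non-ASCII *text* in a Unicode literal
escaped = "\u00e9"    # the same character written with an ASCII-only escape
binary = b"\xe9\x00"  # binary data: byte literal using escape sequences only

print(text == escaped, len(text), len(binary))
```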

</F>

Jul 13 '06 #7

"Gerhard Fiedler" <ge*****@gmail.com> wrote in message
news:ma***************************************@python.org...
> If I understand you correctly, you are saying that if I distribute a file
> with the following lines:
>
> s = "é"
> print s
>
> I basically also need to distribute the information about how the file is
> encoded, and every user needs to use the same (or a compatible) encoding for
> reading this file?
>
> Is there a standard way to do this?

Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
See: http://www.amk.ca/python/howto/unicode particularly the
"Unicode Literals in Python Source Code" section.
Jul 13 '06 #8

On 2006-07-13 12:04:58, Richard Brodie wrote:
>> s = "é"
>> print s
>>
>> Is there a standard way to do this?
>
> Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
> See: http://www.amk.ca/python/howto/unicode particularly the
> "Unicode Literals in Python Source Code" section.

So ...

# coding: utf-8
s = u'é'
print s

(Of course stored with an editor that writes the file in utf-8 encoding.)

Is this the proper way?

Will print take care of encoding translation according to the encoding used
in the target console?

Thanks,
Gerhard

Jul 13 '06 #9

Gerhard Fiedler wrote:
> On 2006-07-13 12:04:58, Richard Brodie wrote:
>>> s = "é"
>>> print s
>>>
>>> Is there a standard way to do this?
>>
>> Use Unicode strings, with an explicit encoding. Say no to ISO-8859-1 centrism.
>> See: http://www.amk.ca/python/howto/unicode particularly the
>> "Unicode Literals in Python Source Code" section.
>
> So ...
>
> # coding: utf-8
> s = u'é'
> print s
>
> (Of course stored with an editor that writes the file in utf-8 encoding.)
>
> Is this the proper way?
>
> Will print take care of encoding translation according to the encoding used
> in the target console?

Of course not. AFAIK there is no way of figuring out which encoding the
target console supports. The best you can do is to offer an option that
allows selection of the output encoding.

And when using print, don't forget to wrap sys.stdout with a
codecs.EncodedFile to properly convert the unicode strings.
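
A hedged sketch of the wrapping idea in Python 3 syntax. Note the swap: it uses `codecs.getwriter` rather than the `codecs.EncodedFile` named above, because `getwriter` takes text and emits encoded bytes, which is the conversion being discussed. The `io.BytesIO` object is only a stand-in for a byte-oriented stdout, and "utf-8" stands for whatever output encoding the user selected:

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for a byte-oriented stdout
writer = codecs.getwriter("utf-8")(raw)  # StreamWriter that encodes on write

writer.write("\u00e9\n")                 # text goes in, encoded bytes come out
print(raw.getvalue())
```

In a real Python 3 program the wrapped stream would typically be `sys.stdout.buffer`; in Python 2, `sys.stdout` itself.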
Diez
Jul 14 '06 #10

Sybren Stuvel wrote:
> Diez B. Roggisch enlightened us with:
>> Of course not. AFAIK there is no way of figuring out which encoding the
>> target console supports. The best you can do is to offer an option
>> that allows selection of the output encoding.
>
> You can use the LANG environment variable on many systems. On mine,
> it's set to en_GB.UTF-8, which causes a lot of software to
> automatically choose the right encoding.

That might be a good heuristic - but on my Mac no LANG is set. So I
should rephrase my statement as "There is no reliable and
cross-platform way of figuring out which encoding the console uses".

Diez
Jul 14 '06 #11

On 2006-07-14 10:52:22, Diez B. Roggisch wrote:
>>>> Will print take care of encoding translation according to the encoding
>>>> used in the target console?
>>>
>>> Of course not. AFAIK there is no way of figuring out which encoding the
>>> target console supports. The best you can do is to offer an option
>>> that allows selection of the output encoding.
>>
>> You can use the LANG environment variable on many systems. On mine,
>> it's set to en_GB.UTF-8, which causes a lot of software to
>> automatically choose the right encoding.
>
> That might be a good heuristic - but on my Mac no LANG is set. So I
> should rephrase my statement as "There is no reliable and
> cross-platform way of figuring out which encoding the console uses".

Right... without being a cross-platform specialist, I figured that much :)

I just thought that maybe the Python runtime had platform-specific
implementations for retrieving the platform-specific information about the
encoding used in the runtime environment (which is probably there on many
platforms) -- similar to maybe the platform-specific implementations of
file access, process and thread handling etc.

Anyway, it seems that anything non-ASCII is a bit problematic and needs
"manual" handling of the runtime environment encoding. Seems a bit odd,
given the worldwide distribution of Python... I would have thought that
such a rather basic task like printing an accented character on a console
had been solved in a standard way, rather than relying on individual
(wheel-reinventing) custom coding. Isn't that something that pretty much
everybody (outside the USA, at least) needs?

Thanks for sharing your thoughts,
Gerhard

Jul 14 '06 #12

Gerhard Fiedler wrote:
> Anyway, it seems that anything non-ASCII is a bit problematic and needs
> "manual" handling of the runtime environment encoding. Seems a bit odd,
> given the worldwide distribution of Python... I would have thought that
> such a rather basic task like printing an accented character on a console
> had been solved in a standard way, rather than relying on individual
> (wheel-reinventing) custom coding. Isn't that something that pretty much
> everybody (outside the USA, at least) needs?

umm. what are we talking about here, really ?

$ python
>>> import sys
>>> sys.platform
'linux2'
>>> sys.stdout.encoding
'UTF-8'
>>> print unichr(233)
é

python
>>> import sys
>>> sys.platform
'win32'
>>> sys.stdout.encoding
'cp850'
>>> print unichr(233)
é
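
For characters the console encoding cannot represent, a program may want to degrade gracefully rather than raise. A minimal sketch in Python 3 syntax, where `printable` is a hypothetical helper name and the fallback to "ascii" is an assumption for the case where `sys.stdout.encoding` is unavailable (for example, some redirected streams):

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


on_cp850 = printable("\u00e9", "cp850")  # cp850 has é, so it survives
on_ascii = printable("\u00e9", "ascii")  # ASCII does not, so it becomes '?'

print(ascii(on_cp850), ascii(on_ascii))
```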

</F>

Jul 14 '06 #13

On 2006-07-14 12:07:12, Fredrik Lundh wrote:
> umm. what are we talking about here, really ?

Aha! You took a big load off my chest -- this is pretty much what I thought
should be there :)

What I was talking about is that Diez responded with a clear "no" to my
question whether print would do the automatic encoding conversion
(according to the runtime environment) you showed so succinctly. Which I
found surprising...

Thanks,
Gerhard

Jul 14 '06 #14

On 2006-07-14 "Diez B. Roggisch" <de***@nospam.web.de> wrote:
> Sybren Stuvel wrote:
>> Diez B. Roggisch enlightened us with:
>>> Of course not. AFAIK there is no way of figuring out which encoding the
>>> target console supports. The best you can do is to offer an option
>>> that allows selection of the output encoding.
>>
>> You can use the LANG environment variable on many systems. On mine,
>> it's set to en_GB.UTF-8, which causes a lot of software to
>> automatically choose the right encoding.
>
> That might be a good heuristic - but on my Mac no LANG is set. So I
> should rephrase my statement as "There is no reliable and
> cross-platform way of figuring out which encoding the console uses".

If LANG is not set, it's equivalent to setting it to "C". However,
you shouldn't look directly at these variables (LANG and LC_*) but
rather use the functions from the locale module, e.g.:

import locale
locale.setlocale(locale.LC_ALL, '') # use the current locale settings
encoding = locale.nl_langinfo(locale.CODESET)
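
Since `locale.nl_langinfo` is not available on every platform (Windows lacks it), a more portable variant might look like the sketch below. It uses `locale.getpreferredencoding`, which is in the standard library; the function name `guess_console_encoding` and the "ascii" fallback are assumptions made for illustration:

```python
import codecs
import locale


def guess_console_encoding():
    """Best-effort guess of the user's preferred encoding, falling back to ASCII."""
    try:
        locale.setlocale(locale.LC_ALL, "")  # use the current locale settings
    except locale.Error:
        pass  # unsupported locale: keep whatever is already active
    name = locale.getpreferredencoding(False) or "ascii"
    try:
        return codecs.lookup(name).name  # normalise to a codec Python knows
    except LookupError:
        return "ascii"


print(guess_console_encoding())
```

This is still a heuristic: it reports what the locale claims, not what the terminal actually renders.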

--
Michael Piotrowski, M.A. <mx*@dynalabs.de>
Public key at <http://www.dynalabs.de/mxp/pubkey.txt>
Jul 17 '06 #15

>>>>> Michael Piotrowski <mx*@dynalabs.de> (MP) wrote:

> MP> On 2006-07-14 "Diez B. Roggisch" <de***@nospam.web.de> wrote:
>> Sybren Stuvel wrote:
>>> Diez B. Roggisch enlightened us with:
>>>> Of course not. AFAIK there is no way of figuring out which encoding the
>>>> target console supports. The best you can do is to offer an option
>>>> that allows selection of the output encoding.
>>>
>>> You can use the LANG environment variable on many systems. On mine,
>>> it's set to en_GB.UTF-8, which causes a lot of software to
>>> automatically choose the right encoding.
>>
>> That might be a good heuristic - but on my Mac no LANG is set. So I
>> should rephrase my statement as "There is no reliable and
>> cross-platform way of figuring out which encoding the console uses".

> MP> If LANG is not set, it's equivalent to setting it to "C". However,
> MP> you shouldn't look directly at these variables (LANG and LC_*) but
> MP> rather use the functions from the locale module, e.g.:

> MP>   import locale
> MP>   locale.setlocale(locale.LC_ALL, '') # use the current locale settings
> MP>   encoding = locale.nl_langinfo(locale.CODESET)

But if LANG isn't set (like on Mac OS X) this doesn't give you the proper
encoding.
On my system I have added LANG to .profile.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org
Jul 17 '06 #16

On 2006-07-17 Piet van Oostrum <pi**@cs.uu.nl> wrote:
>>> That might be a good heuristic - but on my Mac no LANG is set. So I
>>> should rephrase my statement as "There is no reliable and
>>> cross-platform way of figuring out which encoding the console uses".
>>
>> If LANG is not set, it's equivalent to setting it to "C". However,
>> you shouldn't look directly at these variables (LANG and LC_*) but
>> rather use the functions from the locale module, e.g.:
>>
>>   import locale
>>   locale.setlocale(locale.LC_ALL, '') # use the current locale settings
>>   encoding = locale.nl_langinfo(locale.CODESET)
>
> But if LANG isn't set (like on Mac OS X) this doesn't give you the proper
> encoding.

Well, yes, but it gives you something "safe" and you can advise the
user to set the locale.

> On my system I have added LANG to .profile.

That's certainly the right thing to do.

--
Michael Piotrowski, M.A. <mx*@dynalabs.de>
Public key at <http://www.dynalabs.de/mxp/pubkey.txt>
Jul 17 '06 #17
