
Umlauts, encodings, sitecustomize.py

I'm on WinXP, Python 2.3.

I don't have problems with umlauts (ä, ö, ü and their uppercase instances)
in my wxPython GUIs when they are displayed as static text. But when filling
controls with text containing umlauts, in the Python console, or when
writing to files, umlauts are escaped:

Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "ä"
'\x84'


I have defined a sitecustomize.py with these lines in it

import sys
sys.setdefaultencoding("iso-8859-1")

What else do I have to adjust?

Kind regards
Franz GEIGER


Jul 18 '05 #1
5 Replies


F. GEIGER wrote:
I'm on WinXP, Python 2.3.

I don't have problems with umlauts (ä, ö, ü and their uppercase instances)
in my wxPython GUIs when they are displayed as static text. But when filling
controls with text containing umlauts, in the Python console, or when
writing to files, umlauts are escaped:

Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "ä"
'\x84'

I have defined a sitecustomize.py with these lines in it

import sys
sys.setdefaultencoding("iso-8859-1")

What else do I have to adjust?


Try putting the line
# _*_ coding: latin1 _*_

at the very beginning of the script (or directly after the #! line on Unix).
This works under Linux, at least.
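
For reference, here is a small sketch of how such a declaration is picked up. It is written for Python 3 (an assumption; this thread is about Python 2.3) and uses the standard library's tokenize.detect_encoding. The sample sources and the declared_codec helper are made up for illustration. It suggests that the _*_ spelling above and the -*- spelling from PEP 263 resolve to the same codec, because only the "coding: name" part of the comment is matched.

```python
import codecs
import io
import tokenize


def declared_codec(source_bytes):
    """Return the canonical codec name declared in the first two lines."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return codecs.lookup(encoding).name


# Hypothetical script sources, given as raw bytes the way they sit on disk.
underscore_style = b"# _*_ coding: latin1 _*_\nprint('\xe4')\n"
pep263_style = b"#!/usr/bin/env python\n# -*- coding: iso-8859-1 -*-\nprint('\xe4')\n"
undeclared = b"print('hello')\n"

print(declared_codec(underscore_style))
print(declared_codec(pep263_style))
print(declared_codec(undeclared))  # Python 3 falls back to UTF-8 here
```

Note that the declaration only tells Python how to read the source file. It does not change what repr() shows or which encoding the console uses.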

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
Jul 18 '05 #2


No matter what you do, you won't change this behavior:
>>> chr(0x84)
'\x84'

str.__repr__ always escapes characters in the range 0..31 and 127..255,
no matter what the locale is.
>>> print chr(0x84)

will behave differently (it will write that byte to standard output,
followed by a newline).
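
The difference between the escaped repr and the raw byte can be sketched as follows. This is a Python 3 illustration (an assumption; the thread uses Python 2.3, where str plays the role that bytes plays here), and an in-memory io.BytesIO stream stands in for standard output so the written bytes can be inspected.

```python
import io

raw = bytes([0x84])

# repr() gives an escaped, ASCII-only view of the byte.
escaped = repr(raw)

# Writing to a binary stream emits the byte itself, unescaped,
# which is roughly what the Python 2 print statement does with a str.
stream = io.BytesIO()
stream.write(raw + b"\n")
written = stream.getvalue()

print(escaped)
print(list(written))
```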

You should note that chr(0x84) is *not* a-umlaut in iso-8859-1. That's
chr(0xe4). You may be using one of these Windows-specific encodings:
cp437.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp775.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp850.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp852.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp857.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp861.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp865.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
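
A quick way to check the table above is to decode the two bytes explicitly. The sketch below is Python 3 (an assumption; the thread is about Python 2.3), and the DOS_CODE_PAGES list simply repeats the codec names from the listing.

```python
DOS_CODE_PAGES = ["cp437", "cp775", "cp850", "cp852", "cp857", "cp861", "cp865"]

# In each of these code pages the byte 0x84 decodes to U+00E4 (a-umlaut).
for name in DOS_CODE_PAGES:
    print(name, repr(b"\x84".decode(name)))

# In iso-8859-1, 0x84 is a non-printable control character,
# and a-umlaut is the byte 0xe4 instead.
print("iso-8859-1", repr(b"\x84".decode("iso-8859-1")))
print("iso-8859-1", repr(b"\xe4".decode("iso-8859-1")))
```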

Jeff


Jul 18 '05 #3

"Jeff Epler" <je****@unpythonic.net> wrote in message
news:ma**************************************@python.org...
You should note that chr(0x84) is *not* a-umlaut in iso-8859-1. That's
chr(0xe4). You may be using one of these Windows-specific encodings:
cp437.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp775.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp850.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp852.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp857.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp861.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp865.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS


I'm not sure what you mean by this. Do you mean I'm using one of these
accidentally? Or should I switch to one of these in my sitecustomize.py?

I'm a bit confused. When I let Python print an ä (umlaut a) by simply
entering the 1-char string "ä", it prints '\x84'. When I let a tiny script
print the umlauts, I get:

sys:1: DeprecationWarning: Non-ASCII character '\xe4' in file
D:\Project\SchAG\Programme.Python\test.py on line 1, but no encoding
declared;
see http://www.python.org/peps/pep-0263.html for details
These are Umlauts: and ??.
These are Umlauts: ?? and .
Press any key to exit...

There's the '\xe4' you are missing.
Thanks and kind regards
Franz GEIGER

P.S.: Do you know a site where this whole matter is explained?

P.P.S.: The script:

print "These are Umlauts: ä and ö. "
s = "These are Umlauts: ä and ö. "
print s
raw_input("Press any key to exit...")

Jul 18 '05 #4

"Helmut Jarausch" <ja******@skynet.be> wrote in message
news:41**************@skynet.be...
Try the line
# _*_ coding: latin1 _*_

at the very beginning (or at least after a #! line on Unix)
This works under Linux, at least.


Thank you, Helmut, I had already added
# -*- coding: iso-8859-1 -*-
to the scripts in question.

Kind regards
Franz GEIGER
Jul 18 '05 #5

On Tue, Nov 09, 2004 at 07:52:58PM +0100, F. GEIGER wrote:
"Jeff Epler" <je****@unpythonic.net> wrote in message
news:ma**************************************@python.org...
You should note that chr(0x84) is *not* a-umlaut in iso-8859-1. That's
chr(0xe4). You may be using one of these Windows-specific encodings:
cp437.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp775.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp850.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp852.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp857.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp861.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
cp865.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS


I'm not sure what you mean by this. Do you mean I'm using one of these
accidentally? Or should I switch to one of these in my sitecustomize.py?

I'm a bit confused. When I let Python print an ä (umlaut a) by simply
entering the 1-char string "ä", it prints '\x84'.


In the encoding iso-8859-1, the character chr(0xe4) is LATIN SMALL
LETTER A WITH DIAERESIS. chr(0x84) is not a printable character.

In the encodings I named above, chr(0x84) is LATIN SMALL LETTER A WITH
DIAERESIS.

Now, consider this program that creates a program:
def maker(filename, encoding, ch):
    f = open(filename, "w")
    f.write("# -*- coding: %s -*-\n" % encoding)
    f.write("print '%s'\n" % ch)
    f.close()  # make sure the generated script is flushed to disk
If you call
maker("coded.py", "iso-8859-1", "\xe4")
the created script will contain a byte string literal with the byte
'\xe4' in it. When you run the script, it will print that byte followed
by the byte '\n'. *In fact, this behavior (sequence of bytes written to
sys.stdout) doesn't depend on encoding, as long as
'\xe4'.decode(encoding).encode(encoding) == '\xe4'
which should hold true in almost all single-byte encodings.*
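
The round-trip condition can be tried out directly. The sketch below is Python 3 (an assumption; in Python 2.3 the same test is written on str, as in the expression above), and round_trips is a hypothetical helper name. Encodings that cannot decode the byte at all, such as UTF-8 or ASCII, are reported as not round-tripping.

```python
def round_trips(raw, encoding):
    """True if decoding and then re-encoding raw with encoding gives raw back."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # The byte is not valid in this encoding, so there is no round trip.
        return False


for encoding in ["iso-8859-1", "cp850", "cp1252", "utf-8", "ascii"]:
    print(encoding, round_trips(b"\xe4", encoding))
```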

What you *see* when you run the script depends on the meaning your
terminal window ("DOS box") assigns to the byte sequence '\xe4\n'. On
mine, which expects output in UTF-8, I get a mark which indicates an
incomplete multi-byte character and then a newline. On yours, you
apparently get some other character, possibly LATIN SMALL LETTER O WITH
TILDE if your terminal uses cp775, cp850, or cp857.
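
To see what different terminals would make of the same two bytes, one can decode them by hand. This is a Python 3 sketch (an assumption; the thread uses Python 2.3). The as_displayed helper is hypothetical, and the list of terminal encodings is only a sample.

```python
import unicodedata


def as_displayed(raw, terminal_encoding):
    """Decode bytes the way a terminal using terminal_encoding would read them."""
    return raw.decode(terminal_encoding, errors="replace")


script_output = b"\xe4\n"  # the bytes the generated script writes

for terminal_encoding in ["iso-8859-1", "cp437", "cp850", "cp857", "utf-8"]:
    first_char = as_displayed(script_output, terminal_encoding)[0]
    print(terminal_encoding, unicodedata.name(first_char))
```

With errors="replace", the UTF-8 case yields the Unicode replacement character, which plays the role of the "incomplete multi-byte character" mark mentioned above.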

Now, consider this program with a u''-string literal:
def umaker(filename, encoding, ch):
    f = open(filename, "w")
    f.write("# -*- coding: %s -*-\n" % encoding)
    f.write("print u'%s'\n" % ch)
    f.close()  # make sure the generated script is flushed to disk
If you call
umaker("ucoded.py", "iso-8859-1", "\xe4")
the created script will again contain the literal byte "\xe4". When you
run the script, you may get an error that says
UnicodeError: ASCII encoding error: ordinal not in range(128)
This is because the string to be printed is a unicode string containing
the letter LATIN SMALL LETTER A WITH DIAERESIS, but Python believes the
terminal can only accept ASCII-encoded strings for display. In my
Python 2.3 on Unix, sys.stdout.encoding is "UTF-8", and running
ucoded.py outputs the 3-byte sequence "\303\244\n", which in UTF-8 is a
LATIN SMALL LETTER A WITH DIAERESIS followed by a newline.
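
Both outcomes can be reproduced by encoding the unicode string by hand. The sketch below is Python 3 (an assumption; there every str is what Python 2 calls a unicode string), and the variable names are illustrative only.

```python
text = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# A UTF-8 terminal receives two bytes for the letter, then the newline.
utf8_bytes = (text + "\n").encode("utf-8")
print(list(utf8_bytes))

# An ASCII-only output stream cannot represent the letter at all.
try:
    text.encode("ascii")
    ascii_ok = True
    reason = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    reason = exc.reason
print(ascii_ok, reason)
```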

I suspect that wxPython is like Tkinter: It is designed so that
u''-strings (unicode strings) can be given as arguments anywhere strings
can, and that internally the necessary steps are taken to find the
proper glyphs in the font to display that string. Otherwise, there may
be a particular encoding assumed for all byte strings, which will have
no relationship to the -*- coding -*- of your scripts.

Jeff


Jul 18 '05 #6
