Unicode entries on sys.path

Thomas Heller

I was trying to track down a bug in py2exe where the executable does
not work when it is in a directory containing Japanese characters.

Then, I discovered that part of the problem is in the zipimporter that
py2exe uses, and finally I found that it didn't even work in Python
itself.

If the entry in sys.path contains normal Western characters, umlauts for
example, it works fine. But when I copied some Japanese characters from
a random web page and named a directory after them, it didn't work any
longer.

The Windows command prompt is not able to print these characters,
although Windows Explorer has no problems showing them.

Here's the script. The subdirectory contains the file 'somemodule.py',
but importing it fails:

import sys
sys.path = [u'\u5b66\u6821\u30c7xx']
print sys.path

import somemodule

It seems that Python itself converts unicode entries in sys.path to
normal strings using windows default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

Hm, maybe more a windows question than a python question...

Thanks,
Thomas

Jul 18 '05 #1
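
The symptom in the post above (umlauts work, Japanese characters do not) can be reproduced without touching the file system. What follows is a minimal sketch, assuming Python 3 and using cp1252 and cp932 as stand-ins for a Western and a Japanese Windows ANSI code page. The helper name `encodable` is made up for illustration and is not part of Python or py2exe.

```python
def encodable(path, codepage):
    """Return True if every character of `path` exists in `codepage`."""
    try:
        path.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


western = "K\u00f6ln"                 # contains an umlaut
japanese = "\u5b66\u6821\u30c7xx"     # the directory name from the script above

print(encodable(western, "cp1252"))   # True
print(encodable(japanese, "cp1252"))  # False
print(encodable(japanese, "cp932"))   # True
```

A path that cannot be encoded in the active code page has no narrow-string form at all, which would explain why the import fails instead of merely printing badly.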


Martin v. Löwis

Thomas Heller wrote:

It seems that Python itself converts unicode entries in sys.path to
normal strings using windows default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.

Hm, maybe more a windows question than a python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Regards,
Martin

Jul 18 '05 #2

Just

In article <41**************@v.loewis.de>,
"Martin v. Lowis" <ma****@v.loewis.de> wrote:

Hm, maybe more a windows question than a python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.
I'm not sure how well it works together with zipimport, though.

Just

Jul 18 '05 #3
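
To see which encoding a given interpreter would use for that conversion, the Python-level view of the file system default encoding can be queried directly. A small sketch, assuming Python 3; the function name `describe_path_encodings` is hypothetical, and the values it reports depend on platform and Python version (recent Python 3 releases may report 'utf-8' even on Windows, where the thread expects "mbcs").

```python
import locale
import sys


def describe_path_encodings():
    """Collect the encodings relevant to converting path names."""
    return {
        # Encoding Python uses to convert between str and bytes file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding suggested by the current locale settings.
        "locale": locale.getpreferredencoding(False),
    }


for label, encoding in describe_path_encodings().items():
    print(label, "->", encoding)
```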

vincent wehren

Just wrote:

In article <41**************@v.loewis.de>,
"Martin v. Lowis" <ma****@v.loewis.de> wrote:

Hm, maybe more a windows question than a python question...
The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.

For this conversion "mbcs" will be used on Windows machines, implying
that such conversions are made using the current system Ansi codepage.
(As a matter of interest: What is this on OSX?). This conversion is
likely to be useless for unicode directory names containing characters
that do not have a mapping to a character in this particular codepage.

The technique described by Martin may solve the problem for what in this
case are Japanese characters, but what if I have directory names from
another language group, such as simplified Chinese, as well?

The only way to get around this is to allow - as Martin suggests -
arbitrary unicode strings in sys.path on those platforms that may have
unicode file names.

--
Vincent Wehren

Jul 18 '05 #4
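
The point about mixed language groups can be checked mechanically: given a set of directory names, ask which candidate code pages can represent all of them. This is only a sketch, assuming Python 3. The function name `covering_codepages` is invented here, and Arabic is used as the second script purely because its absence from cp932 is easy to verify; it stands in for "another language group".

```python
def covering_codepages(paths, candidates):
    """Return the candidate encodings that can represent every path."""
    covering = []
    for codepage in candidates:
        try:
            for path in paths:
                path.encode(codepage)
        except UnicodeEncodeError:
            continue  # at least one path has no mapping in this code page
        covering.append(codepage)
    return covering


paths = ["\u5b66\u6821\u30c7xx", "\u0645\u0644\u0641"]  # Japanese, Arabic

print(covering_codepages(paths, ["cp1252", "cp932", "cp1256"]))  # []
print(covering_codepages(paths, ["utf-8"]))                      # ['utf-8']
```

No single ANSI code page in the list covers both names, so changing the system locale cannot help in that situation; only a Unicode-capable representation does.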

Just

In article <cq**********@news6.zwoll1.ov.home.nl>,
vincent wehren <vi*****@visualtrans.de> wrote:

Just wrote:
In article <41**************@v.loewis.de>,
"Martin v. Lowis" <ma****@v.loewis.de> wrote:

Hm, maybe more a windows question than a python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.

For this conversion "mbcs" will be used on Windows machines, implying
that such conversions are made using the current system Ansi codepage.
(As a matter of interest: What is this on OSX?).

UTF-8.

Just

Jul 18 '05 #5

Martin v. Löwis

Just wrote:

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

Works for me on OSX 10.3.6, as it should: prior to using the sys.path
entry, a unicode string is encoded with Py_FileSystemDefaultEncoding.
I'm not sure how well it works together with zipimport, though.

As Vincent's message already implies, I'm asking for Windows patches.
In a Windows system, there are path names which just *don't have*
a representation in the file system default encoding. So you just
can't use the standard file system API (open, read, write) to access
those files - instead, you have to use specific Unicode variants
of the file system API.

The only operating system in active use that can reliably represent
all file names in the standard API is OS X. Unix can do that as
long as the locale is UTF-8; for all other systems, there are
restrictions when you try to use the file system API to access
files with "funny" characters.

Regards,
Martin

Jul 18 '05 #6

Bengt Richter

On Thu, 23 Dec 2004 19:24:58 +0100, "Martin v. Löwis" <ma****@v.loewis.de> wrote:

Thomas Heller wrote:
It seems that Python itself converts unicode entries in sys.path to
normal strings using windows default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.

Hm, maybe more a windows question than a python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

What about removable drives? And mountable multiple file system types?
Maybe some collections of potentially homogenous file system references
such as sys.path need to be virtualized to carry relevant file system
encoding and protocol info etc. That could cover synthetic or compressed
info sources too, IWT. Homogeneous package representation could be a similar
problem, I guess.

Regards,
Bengt Richter

Jul 18 '05 #7

Thomas Heller

"Martin v. Löwis" <ma****@v.loewis.de> writes:

Thomas Heller wrote:
It seems that Python itself converts unicode entries in sys.path to
normal strings using windows default conversion rules - is this a
problem that I can fix by changing some regional setting on my machine?

You can set the system code page on the third tab on the XP
regional settings (character set for non-unicode applications).
This, of course, assumes that there is a character set that supports
all directories in sys.path. If you have Japanese characters on
sys.path, you certainly need to set the system locale to Japanese
(is that CP932?).

Changing this setting requires a reboot.

Hm, maybe more a windows question than a python question...

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

How should these patches be approached? On windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen. Is there a similar thing in the
non-windows world?

Thomas

Jul 18 '05 #8

Martin v. Löwis

Bengt Richter wrote:

The real question here is: why does Python not support arbitrary
Unicode strings on sys.path? It could, in principle, at least on
Windows NT+ (and also on OSX). Patches are welcome.

What about removable drives? And mountable multiple file system types?

I'm not sure I understand the question. What about them?

On Windows, a removable drive will typically have its file names encoded
in UCS-2LE (i.e. "Unicode proper"), through the vfat, ntfs, or joliet
file systems. So if a Unicode file name in sys.path refers to them, and
a proper patch to use wide APIs is incorporated in Python, Python will
transparently find the files on these media.

Maybe some collections of potentially homogenous file system references
such as sys.path need to be virtualized to carry relevant file system
encoding and protocol info etc.

No no no. sys.path contains path names on the local system, nothing
virtualized (unless one of the existing hook mechanisms is used, which
would be OT for this thread).

Regards,
Martin

Jul 18 '05 #9
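
For comparison, the behaviour asked for in this thread (a Unicode entry on sys.path that is found transparently) can be tried on a current interpreter. The sketch below assumes a recent Python 3, where path names are Unicode strings and Windows builds use the wide file APIs, and a file system that accepts the directory name. The function name `import_from_unicode_dir` and the module name `somemodule_demo` are made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile


def import_from_unicode_dir(dirname, module_name):
    """Create <tmp>/<dirname>/<module_name>.py and import it via sys.path."""
    with tempfile.TemporaryDirectory() as base:
        target = os.path.join(base, dirname)
        os.mkdir(target)
        source = os.path.join(target, module_name + ".py")
        with open(source, "w", encoding="utf-8") as handle:
            handle.write("VALUE = 42\n")
        sys.path.insert(0, target)
        try:
            # Make sure the import system notices the new directory.
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        finally:
            sys.path.remove(target)


module = import_from_unicode_dir("\u5b66\u6821\u30c7xx", "somemodule_demo")
print(module.VALUE)
```

The directory name is the one from the original script, so this is the same experiment with no change to the system locale.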

Martin v. Löwis

Thomas Heller wrote:

How should these patches be approached?

Please have a look as to how posixmodule.c and fileobject.c deal with
this issue.

On windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen.

No. This fails for two reasons:
1. We don't compile Python with _UNICODE, and never will do so. This
   macro is only a mechanism to simplify porting code from ANSI APIs
   to Unicode APIs, so you don't have to reformulate all the API calls.
   For new code, it is better to use the Unicode APIs directly if you
   plan to use them.
2. On Win9x, the Unicode APIs don't work (*). So you need to choose at
   run-time whether you want to use wide or narrow API. Unless
   a) we ship two binaries in the future, one for W9x, one for NT+
      (I hope this won't happen), or
   b) we drop support for W9x. I'm in favour of doing so sooner or
      later, but perhaps not for Python 2.5.

Regards,
Martin

(*) Can somebody please report whether the *W file APIs fail on W9x
because the entry points are not there (so you can't even run the
binary), or because they fail with an error when called?

Jul 18 '05 #10

Thomas Heller

"Martin v. Löwis" <ma****@v.loewis.de> writes:

Thomas Heller wrote:
How should these patches be approached?
Please have a look as to how posixmodule.c and fileobject.c deal with
this issue.
On windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen.

No. This fails for two reasons:
1. We don't compile Python with _UNICODE, and never will do so. This
   macro is only a mechanism to simplify porting code from ANSI APIs
   to Unicode APIs, so you don't have to reformulate all the API calls.
   For new code, it is better to use the Unicode APIs directly if you
   plan to use them.
2. On Win9x, the Unicode APIs don't work (*). So you need to choose at
   run-time whether you want to use wide or narrow API. Unless
   a) we ship two binaries in the future, one for W9x, one for NT+
      (I hope this won't happen), or
   b) we drop support for W9x. I'm in favour of doing so sooner or
      later, but perhaps not for Python 2.5.

I wasn't asking about the *W functions, I'm asking about string/unicode
handling in Python source files. Looking into Python/import.c, wouldn't
it be required to change the signature of a lot of functions to receive
PyObject* arguments, instead of char* ?
For example, find_module should change from
static struct filedescr *find_module(char *, char *, PyObject *,
char *, size_t, FILE **, PyObject **);

to

static struct filedescr *find_module(char *, char *, PyObject *,
PyObject **, FILE **, PyObject **);

where the fourth argument would now be either a PyString or PyUnicode
object pointer?

(*) Can somebody please report whether the *W file APIs fail on W9x
because the entry points are not there (so you can't even run the
binary), or because they fail with an error when called?

I always thought that the *W apis would not be there in win98, but it
seems that is wrong. Fortunately, how could Python, which links to the
FindFirstFileW exported function for example, run on win98 otherwise...

Thomas

Jul 18 '05 #11
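
The signature change proposed above, a path argument that may be either a byte string or a Unicode object, has a simple Python-level analogue. The sketch below is not the C change to Python/import.c; it only illustrates, assuming Python 3, how one function can accept both representations by normalising them first. The name `find_module_file` is borrowed loosely from the thread and is hypothetical here.

```python
import os
import tempfile


def find_module_file(name, directory):
    """Look for <directory>/<name>.py; `directory` may be str or bytes."""
    # os.fsdecode returns str unchanged and decodes bytes with the
    # file system encoding, so both argument types share one code path.
    directory = os.fsdecode(directory)
    candidate = os.path.join(directory, name + ".py")
    return candidate if os.path.isfile(candidate) else None


with tempfile.TemporaryDirectory() as base:
    open(os.path.join(base, "somemodule.py"), "w").close()
    as_text = find_module_file("somemodule", base)
    as_bytes = find_module_file("somemodule", os.fsencode(base))
    print(as_text == as_bytes)  # True
```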

Martin v. Löwis

Thomas Heller wrote:

I wasn't asking about the *W functions, I'm asking about string/unicode
handling in Python source files. Looking into Python/import.c, wouldn't
it be required to change the signature of a lot of functions to receive
PyObject* arguments, instead of char* ?

Yes, that would be one solution. Another solution would be to provide an
additional Py_UNICODE*, and to allow that pointer to be NULL. Most
systems would ignore that pointer (and it would be NULL most of the
time), except on NT+, which would use the Py_UNICODE* if available,
and the char* otherwise.

I always thought that the *W apis would not be there in win98, but it
seems that is wrong. Fortunately, how could Python, which links to the
FindFirstFileW exported function for example, run on win98 otherwise...

Thanks, that is convincing.

Regards,
Martin

Jul 18 '05 #12
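
The alternative described in the post above, carrying an optional wide string next to the narrow one, comes down to a small selection rule. The sketch below models it in Python 3 with `bytes` standing in for `char*`, `str` for `Py_UNICODE*`, and `None` for a NULL pointer. The function name `select_path` and the `has_wide_api` flag are assumptions made for illustration, not existing Python APIs.

```python
from typing import Optional, Union


def select_path(narrow: bytes, wide: Optional[str],
                has_wide_api: bool) -> Union[bytes, str]:
    """Pick the path representation to hand to the file system API."""
    # Only platforms with working wide APIs (NT+ in the thread) look at
    # the wide string, and only when one was actually supplied.
    if has_wide_api and wide is not None:
        return wide
    return narrow


print(select_path(b"lib", "\u5b66\u6821", True))   # wide string is used
print(select_path(b"lib", None, True))             # falls back to narrow
print(select_path(b"lib", "\u5b66\u6821", False))  # platform ignores wide
```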

vincent wehren

Thomas Heller wrote:

"Martin v. Löwis" <ma****@v.loewis.de> writes:

Thomas Heller wrote:
How should these patches be approached?
Please have a look as to how posixmodule.c and fileobject.c deal with
this issue.

On windows, it would probably
be easiest to use the MS generic text routines: _tcslen instead of
strlen, for example, and to rely on the _UNICODE preprocessor symbol to
map this function to strlen or wcslen.

No. This fails for two reasons:
1. We don't compile Python with _UNICODE, and never will do so. This
   macro is only a mechanism to simplify porting code from ANSI APIs
   to Unicode APIs, so you don't have to reformulate all the API calls.
   For new code, it is better to use the Unicode APIs directly if you
   plan to use them.
2. On Win9x, the Unicode APIs don't work (*). So you need to choose at
   run-time whether you want to use wide or narrow API. Unless
   a) we ship two binaries in the future, one for W9x, one for NT+
      (I hope this won't happen), or
   b) we drop support for W9x. I'm in favour of doing so sooner or
      later, but perhaps not for Python 2.5.

I wasn't asking about the *W functions, I'm asking about string/unicode
handling in Python source files. Looking into Python/import.c, wouldn't
it be required to change the signature of a lot of functions to receive
PyObject* arguments, instead of char* ?
For example, find_module should change from
static struct filedescr *find_module(char *, char *, PyObject *,
char *, size_t, FILE **, PyObject **);

to

static struct filedescr *find_module(char *, char *, PyObject *,
PyObject **, FILE **, PyObject **);

where the fourth argument would now be either a PyString or PyUnicode
object pointer?

(*) Can somebody please report whether the *W file APIs fail on W9x
because the entry points are not there (so you can't even run the
binary), or because they fail with an error when called?

I always thought that the *W apis would not be there in win98, but it
seems that is wrong. Fortunately, how could Python, which links to the
FindFirstFileW exported function for example, run on win98 otherwise...

Normally I would have thought this would require using the Microsoft
Layer for Unicode (unicows.dll).

According to MSDN 9x already does have a handful of unicode APIs.

FindFirstFile does not seem to be one of them - unless the list on

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

is bogus (?).

--

Vincent Wehren

Jul 18 '05 #13

Martin v. Löwis

vincent wehren wrote:

FindFirstFile does not seem to be one of them - unless the list on

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

is bogus (?).

It might perhaps be misleading: I think the entry points are there, but
calling the functions will always fail.

Regards,
Martin

Jul 18 '05 #14

JanC

vincent wehren schreef:

Normally I would have thought this would require using the Microsoft
Layer for Unicode (unicows.dll).

If Python is going to use unicows.dll, it might want to use libunicows for
compatibility with mingw etc.: <http://libunicows.sourceforge.net/>
--
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9

Jul 18 '05 #15
