LC_ALL and os.listdir()

I have some confusion regarding the relationship between locale,
os.listdir() and unicode pathnames. I'm running Python 2.3.5 on a
Debian system. If it matters, all of the files I'm dealing with are on
an ext3 filesystem.

The real code this problem comes from takes a configured set of
directories to deal with and walks through each of those directories
using os.listdir().

Today, I accidentally ran across a directory containing three "normal"
files (with ASCII filenames) and one file with a two-character unicode
filename. My code, which was doing something like this:

for entry in os.listdir(path):  # path is <type 'unicode'>
    entrypath = os.path.join(path, entry)

suddenly started blowing up with the dreaded unicode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
position 1: ordinal not in range(128)

To add insult to injury, it only happened for one of my test users, not
the others.

I ultimately traced the difference in behavior to the LC_ALL setting in
the environment. One user had LC_ALL set to en_US, and the other didn't
have it set at all.

For the user with LC_ALL set, the os.listdir() call returned this, and
the os.path.join() call succeeded:

[u'README.strange-name', u'\xe2\x99\xaa\xe2\x99\xac',
u'utflist.long.gz', u'utflist.cp437.gz', u'utflist.short.gz']

For the other user without LC_ALL set, the os.listdir() call returned
this, and the os.path.join() call failed with the UnicodeDecodeError
exception:

[u'README.strange-name', '\xe2\x99\xaa\xe2\x99\xac',
u'utflist.long.gz', u'utflist.cp437.gz', u'utflist.short.gz']

Note that in this second result, element [1] is not a unicode string
while the other elements are.

Can anyone explain:

1) Why LC_ALL has any effect on the os.listdir() result?
2) Why only 3 of the 4 files come back as unicode strings?
3) The proper "general" way to deal with this situation?

My goal is to build generalized code that consistently works with all
kinds of filenames. Ultimately, all I'm trying to do is copy some files
around. I'd really prefer to find a programmatic way to make this work
that was independent of the user's configured locale, if possible.

Thanks for the help,

KEN

--
Kenneth J. Pronovici <pr******@ieee.org>
Jul 18 '05 #1
Kenneth Pronovici wrote:
> 1) Why LC_ALL has any effect on the os.listdir() result?

The operating system (POSIX) does not have the inherent notion
that file names are character strings. Instead, in POSIX, file
names are primarily byte strings. There are some bytes which
are interpreted as characters (e.g. '\x2e', which is '.',
or '\x2f', which is '/'), but apart from that, most OS layers
think these are just bytes.

Now, most *people* think that file names are character strings.
To interpret a file name as a character string, you need to know
the encoding that maps the bytes of the file name to characters.

There is, unfortunately, no operating system API to carry
the notion of a file system encoding. By convention, the locale
settings should be used to establish this encoding, in particular
the LC_CTYPE facet of the locale. This is defined in the
environment variables LC_ALL, LC_CTYPE, and LANG (searched
in this order).
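
The lookup can be sketched in a few lines. POSIX gives LC_ALL precedence over LC_CTYPE, with LANG consulted last, and treats an empty value like an unset one. The function name ctype_locale is hypothetical, and this is only an illustration of the search order, not how Python or the C library implements it:

```python
def ctype_locale(environ):
    """Return the locale name that governs LC_CTYPE, or "C" if none is set."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # unset and empty both count as "not specified"
            return value
    return "C"


# One user had LC_ALL=en_US, the other had nothing set at all.
with_lc_all = ctype_locale({"LC_ALL": "en_US"})
without_any = ctype_locale({})
```

With nothing set, the result is the "C" locale, which is what the second user in the original report was running under.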

> 2) Why only 3 of the 4 files come back as unicode strings?

If none of these variables is set, the "C" locale is assumed, which
uses ASCII as its file system encoding. In this locale,
'\xe2\x99\xaa\xe2\x99\xac' is not a valid file name (at least
it cannot be interpreted as characters, and hence cannot
be converted to Unicode).

Now, your Python script has requested that all file names
*should* be returned as character (i.e. Unicode) strings, but
Python cannot comply, since there is no way to find out what
this byte string means, in terms of characters.

So we have three options:
1. skip this string, only return the ones that can be
   converted to Unicode. Give the user the impression
   the file does not exist.
2. return the string as a byte string
3. refuse to listdir altogether, raising an exception
   (i.e. return nothing)

Python has chosen alternative 2, allowing the application
to implement 1 or 3 on top of that if it wants to (or
come up with other strategies, such as user feedback).
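
To make the three options concrete, here is a minimal sketch of how an application could layer 1 or 3 on top of 2. It is written for Python 3 (this thread is about Python 2.3), it works on a plain list of byte strings rather than calling os.listdir, and the function name decode_names and its policy argument are hypothetical:

```python
def decode_names(raw_names, encoding, policy="keep"):
    """Decode directory entries that were obtained as byte strings.

    policy "skip"  -> option 1: drop names that cannot be decoded
    policy "keep"  -> option 2: return such names unchanged, as bytes
    policy "raise" -> option 3: let the UnicodeDecodeError propagate
    """
    if policy not in ("skip", "keep", "raise"):
        raise ValueError("unknown policy: %r" % (policy,))
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if policy == "raise":
                raise
            if policy == "keep":
                result.append(raw)
            # policy == "skip": omit the name entirely
    return result


names = [b"README.strange-name", b"\xe2\x99\xaa\xe2\x99\xac"]
mixed = decode_names(names, "ascii", policy="keep")
```

With the ASCII codec and the "keep" policy, the result mixes text and byte strings in the same way as the second listing in the original question.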

> 3) The proper "general" way to deal with this situation?

You can choose option 1 or 3; you could tell the user
about it and then ignore the file; or you could try to
guess the encoding (UTF-8 would be a reasonable guess).
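
Guessing could look roughly like the sketch below (Python 3). Trying UTF-8 first follows the suggestion above; the Latin-1 fallback is purely an assumption for illustration, chosen because it can decode any byte sequence, and the name guess_decode is hypothetical:

```python
def guess_decode(raw, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Returns (None, None) if no candidate works. With Latin-1 in the list
    that cannot happen, since every byte maps to a character in Latin-1.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


text, used = guess_decode(b"\xe2\x99\xaa\xe2\x99\xac")
```

A guess is only a guess: a name that happens to be valid UTF-8 may still have been written in some other encoding, so the original bytes should be kept for any actual file operation.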

> My goal is to build generalized code that consistently works with all
> kinds of filenames.

Then it is best to drop the notion that file names are
character strings (because some file names aren't). You
do so by converting your path variable into a byte
string. To do that, you could try

path = path.encode(sys.getfilesystemencoding())

This should work in most cases; Python will try to
determine the file system encoding from the environment,
and try to encode the path. Notice, however:

- on some systems, getfilesystemencoding may return None
  if the encoding could not be determined. Fall back
  to sys.getdefaultencoding in this case.
- depending on where you got path from, this may
  raise a UnicodeError if the user has entered a
  path name which cannot be encoded in the file system
  encoding (the user may well believe that she has
  such a file on disk).

So your code would read

import sys

try:
    path = path.encode(sys.getfilesystemencoding() or
                       sys.getdefaultencoding())
except UnicodeError:
    print >>sys.stderr, "Invalid path name", repr(path)
    sys.exit(1)

> Ultimately, all I'm trying to do is copy some files
> around. I'd really prefer to find a programmatic way to make this work
> that was independent of the user's configured locale, if possible.


As long as you manage to get a byte string from the path
entered, all should be fine.
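
For readers on Python 3, the same "bytes all the way" idea can be sketched as follows. os.fsencode plays the role of path.encode(sys.getfilesystemencoding()) here; it does not exist in Python 2.3, so this is an adaptation rather than the code discussed above, and the function name copy_files is hypothetical:

```python
import os
import shutil


def copy_files(src_dir, dst_dir):
    """Copy the regular files in src_dir to dst_dir, treating names as bytes."""
    src = os.fsencode(src_dir)  # text -> bytes via the file system encoding
    dst = os.fsencode(dst_dir)
    copied = []
    for entry in os.listdir(src):  # byte path in, byte names out
        source = os.path.join(src, entry)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(dst, entry))
            copied.append(entry)
    return copied
```

Because no name is ever decoded, the copy does not depend on whether the locale can interpret the names as characters.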

Regards,
Martin
Jul 18 '05 #2
"Martin v. Löwis" wrote:
>> My goal is to build generalized code that consistently works with all
>> kinds of filenames.
>
> Then it is best to drop the notion that file names are
> character strings (because some file names aren't). You
> do so by converting your path variable into a byte
> string. To do that, you could try
>
> path = path.encode(sys.getfilesystemencoding())

Shouldn't os.path.join do that? If you pass a unicode string
and a byte string it currently tries to convert bytes to characters
but it makes more sense to convert the unicode string to bytes
and return two byte strings concatenated.
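
The two directions can be compared in isolation. The sketch below is Python 3, where the decode step must be written out explicitly (in Python 2 it happened implicitly inside os.path.join); the directory /tmp/data and the use of UTF-8 to encode it are assumptions for illustration:

```python
import posixpath

directory = "/tmp/data"              # hypothetical text path
entry = b"\xe2\x99\xaa\xe2\x99\xac"  # the raw name from the listing

# Direction 1: bytes -> characters with the ASCII codec. This is what the
# implicit coercion amounted to, and it fails on the first non-ASCII byte.
try:
    entry.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Direction 2: characters -> bytes, then join two byte strings.
joined = posixpath.join(directory.encode("utf-8"), entry)
```

The second direction succeeds here because the directory name can be encoded, although that conversion can fail too for other inputs.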

Serge.
Jul 18 '05 #3
Serge Orlov wrote:
> Shouldn't os.path.join do that? If you pass a unicode string
> and a byte string it currently tries to convert bytes to characters
> but it makes more sense to convert the unicode string to bytes
> and return two byte strings concatenated.

Sounds reasonable. OTOH, this would be the only (one of a very
few?) occasion where Python combines byte+unicode => byte.
Furthermore, it might be that the conversion of the Unicode
string to a file name fails as well.

That said, I still think it is a good idea, so contributions
are welcome.

Regards,
Martin
Jul 18 '05 #4
On Wed, Feb 23, 2005 at 10:07:19PM +0100, "Martin v. Löwis" wrote:
> So we have three options:
> 1. skip this string, only return the ones that can be
>    converted to Unicode. Give the user the impression
>    the file does not exist.
> 2. return the string as a byte string
> 3. refuse to listdir altogether, raising an exception
>    (i.e. return nothing)
>
> Python has chosen alternative 2, allowing the application
> to implement 1 or 3 on top of that if it wants to (or
> come up with other strategies, such as user feedback).

Understood. This appears to be the most flexible solution among the
three.

>> 3) The proper "general" way to deal with this situation?
>
> You can choose option 1 or 3; you could tell the user
> about it and then ignore the file; or you could try to
> guess the encoding (UTF-8 would be a reasonable guess).

Ok.

>> My goal is to build generalized code that consistently works with all
>> kinds of filenames.
>
> Then it is best to drop the notion that file names are
> character strings (because some file names aren't). You
> do so by converting your path variable into a byte
> string. To do that, you could try
>
> [snip] So your code would read
>
>     try:
>         path = path.encode(sys.getfilesystemencoding() or
>                            sys.getdefaultencoding())
>     except UnicodeError:
>         print >>sys.stderr, "Invalid path name", repr(path)
>         sys.exit(1)

This makes sense to me. I'll work on implementing it that way.

Thanks for the in-depth explanation!

KEN

--
Kenneth J. Pronovici <pr******@ieee.org>
Personal Homepage: http://www.skyjammer.com/~pronovic/
"They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
- Benjamin Franklin, Historical Review of Pennsylvania, 1759
Jul 18 '05 #5
Martin v. Löwis wrote:
> Serge Orlov wrote:
>> Shouldn't os.path.join do that? If you pass a unicode string
>> and a byte string it currently tries to convert bytes to characters
>> but it makes more sense to convert the unicode string to bytes
>> and return two byte strings concatenated.
>
> Sounds reasonable. OTOH, this would be the only (one of a very
> few?) occasion where Python combines byte+unicode => byte.
> Furthermore, it might be that the conversion of the Unicode
> string to a file name fails as well.
>
> That said, I still think it is a good idea, so contributions
> are welcome.

It would probably mess up those systems where filenames really are unicode
strings and not byte sequences.

Windows (when using NTFS) stores all the filenames in unicode, and Python
uses the unicode api to implement listdir (when given a unicode path). This
means that the filename never gets encoded to a byte string either by the
OS or Python. If you use a byte string path, then the filename gets encoded
by Windows and Python just returns what it is given.
Jul 18 '05 #6
Duncan Booth wrote:
> Martin v. Löwis wrote:
>> Serge Orlov wrote:
>>> Shouldn't os.path.join do that? If you pass a unicode string
>>> and a byte string it currently tries to convert bytes to
>>> characters but it makes more sense to convert the unicode
>>> string to bytes and return two byte strings concatenated.
>>
>> Sounds reasonable. OTOH, this would be the only (one of a very
>> few?) occasion where Python combines byte+unicode => byte.
>> Furthermore, it might be that the conversion of the Unicode
>> string to a file name fails as well.
>>
>> That said, I still think it is a good idea, so contributions
>> are welcome.
>
> It would probably mess up those systems where filenames really are
> unicode strings and not byte sequences.
>
> Windows (when using NTFS) stores all the filenames in unicode, and
> Python uses the unicode api to implement listdir (when given a
> unicode path). This means that the filename never gets encoded to
> a byte string either by the OS or Python. If you use a byte string
> path, then the filename gets encoded by Windows and Python just
> returns what it is given.

Sorry for not being clear, but I meant posixpath.join since the whole
discussion is about posix systems.

Serge.

Jul 18 '05 #7
Duncan Booth wrote:
> Windows (when using NTFS) stores all the filenames in unicode, and Python
> uses the unicode api to implement listdir (when given a unicode path). This
> means that the filename never gets encoded to a byte string either by the
> OS or Python. If you use a byte string path, then the filename gets encoded
> by Windows and Python just returns what it is given.

Serge's answer is good: you might only want to apply this algorithm to
posixpath. OTOH, in the specific case, it would not have caused problems
if it were applied to ntpath as well: the path was a Unicode string, so
listdir would have returned only Unicode strings (on Windows), and the
code in path.join dealing with mixed string types would not have been
triggered.

Again, I think the algorithm should be this:
- if both are the same kind of string, just concatenate them
- if not, try to coerce the byte string to a Unicode string, using
  sys.getfilesystemencoding()
- if that fails, try the other way 'round
- if that fails, let join fail.
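
The four steps above can be sketched as follows. This is a Python 3 illustration, not a patch against posixpath: the name mixed_join is hypothetical, each argument is assumed to be either str or bytes, and the encoding parameter exists only so the behaviour can be tried without depending on the environment:

```python
import posixpath
import sys


def mixed_join(a, b, encoding=None):
    """Join two path components that may be text, bytes, or one of each."""
    encoding = encoding or sys.getfilesystemencoding()
    # Step 1: both are the same kind of string, so a plain join works.
    if isinstance(a, bytes) == isinstance(b, bytes):
        return posixpath.join(a, b)
    # Step 2: try to coerce the byte string to text.
    try:
        if isinstance(a, bytes):
            return posixpath.join(a.decode(encoding), b)
        return posixpath.join(a, b.decode(encoding))
    except UnicodeDecodeError:
        pass
    # Step 3: try the other way round and coerce the text to bytes.
    # Step 4: if that raises UnicodeEncodeError, the join fails with it.
    if isinstance(a, bytes):
        return posixpath.join(a, b.encode(encoding))
    return posixpath.join(a.encode(encoding), b)


result = mixed_join("/tmp", b"\xe2\x99\xaa\xe2\x99\xac", encoding="ascii")
```

With the ASCII codec, the example falls through to step 3 and returns a byte string, which is the case that failed in the original report.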

The only drawback I can see with that approach is that it would "break"
environments where the system encoding is "undefined", i.e. implicit
string/unicode coercions are turned off. In such an environment, it
is probably desirable that os.path.join performs no coercion as well,
so this might need to get special-cased.

Regards,
Martin
Jul 18 '05 #8
