os.lisdir, gets unicode, returns unicode... USUALLY?!?!?

gabor

hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

i'm on Unix. (linux, ubuntu edgy)

so it seems that it does not always return unicode filenames.

it seems that it tries to interpret the filenames using the filesystem's
encoding, and if that fails, it simply returns the filename as byte-string.

so you get back let's say an array of 21 filenames, from which 3 are
byte-strings, and the rest unicode strings.

after digging around, i found this in the source code:

#ifdef Py_USING_UNICODE
if (arg_is_unicode) {
PyObject *w;

w = PyUnicode_FromEncodedObject(v,
Py_FileSystemDefaultEncoding,
"strict");
if (w != NULL) {
Py_DECREF(v);
v = w;
}
else {
/* fall back to the original byte string, as
discussed in patch #683592 */
PyErr_Clear();
}
}
#endif

so if the to-unicode-conversion fails, it falls back to the original
byte-string. i went and have read the patch-discussion.

and now i'm not sure what to do.
i know that:

1. the documentation is completely wrong. it does not always return
unicode filenames
2. it's true that the documentation does not specify what happens if the
filename is not in the filesystem-encoding, but i simply expected that i
get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them. but this is just plain wrong. from now on,
EVERYWHERE where i use os.listdir, i will have to go through all the
filenames in it, and check if they are unicode-strings or not.

so basically i'd like to ask here: am i reading something incorrectly?
or am i using os.listdir the "wrong way"? how do other people deal with
this?

p.s: one additional note. if you code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail, because the auto-convert only
happens using 'ascii' as the encoding, and if it was not possible to
decode the filename inside listdir, it's quite probable that it also
will not work using 'ascii' as the charset.
gabor

Nov 16 '06 #1

Subscribe Post Reply

2942

Terry Reedy

"gabor" <ga***@nekomancer.netwrote in message
news:ed**************************@news.flashnewsgr oups.com...

so if the to-unicode-conversion fails, it falls back to the original
byte-string. i went and have read the patch-discussion.

and now i'm not sure what to do.
i know that:

1. the documentation is completely wrong. it does not always return
unicode filenames

Unless someone says otherwise, report the discrepancy between doc and code
as a bug on the SF tracker. I have no idea of what the resolution should
be ;-).

tjr

Nov 16 '06 #2

Martin v. Löwis

gabor schrieb:

so basically i'd like to ask here: am i reading something incorrectly?

You are reading it correctly. This is how it behaves.

or am i using os.listdir the "wrong way"? how do other people deal with
this?

You didn't say why the behavior causes a problem for you - you only
explained what the behavior is.

Most people use os.listdir in a way like this:

for name in os.listdir(path):
full = os.path.join(path, name)
attrib = os.stat(full)
if some-condition:
f = open(full)
...

All this code will typically work just fine with the current behavior,
so people typically don't see any problem.

Regards,
Martin

Nov 16 '06 #3

gabor

Martin v. Löwis wrote:

gabor schrieb:

>or am i using os.listdir the "wrong way"? how do other people deal with
this?

You didn't say why the behavior causes a problem for you - you only
explained what the behavior is.

Most people use os.listdir in a way like this:

for name in os.listdir(path):
full = os.path.join(path, name)
attrib = os.stat(full)
if some-condition:
f = open(full)
...

All this code will typically work just fine with the current behavior,
so people typically don't see any problem.

i am sorry, but it will not work. actually this is exactly what i did,
and it did not work. it dies in the os.path.join call, where file_name
is converted into unicode. and python uses 'ascii' as the charset in
such cases. but, because listdir already failed to decode the file_name
with the filesystem-encoding, it usually also fails when tried with 'ascii'.

example:

>>dir_name = u'something'
unicode_file_name = u'\u732b.txt' # the japanese cat-symbol
bytestring_file_name = unicode_file_name.encode('utf-8')
import os.path

os.path.join(dir_name,unicode_file_name)

u'something/\u732b.txt'

>>>

os.path.join(dir_name,bytestring_file_name)

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/posixpath.py", line 65, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1:
ordinal not in range(128)

>>>

gabor

Nov 16 '06 #4

Martin v. Löwis

gabor schrieb:

>All this code will typically work just fine with the current behavior,
so people typically don't see any problem.

i am sorry, but it will not work. actually this is exactly what i did,
and it did not work. it dies in the os.path.join call, where file_name
is converted into unicode. and python uses 'ascii' as the charset in
such cases. but, because listdir already failed to decode the file_name
with the filesystem-encoding, it usually also fails when tried with
'ascii'.

Ah, right. So yes, it will typically fail immediately - just as you
wanted it to do, anyway; the advantage with this failure is that you
can also find out what specific file name is causing the problem
(whereas when listdir failed completely, you could not easily find
out the cause of the failure).

How would you propose listdir should behave?

Regards,
Martin

Nov 16 '06 #5

Fredrik Lundh

gabor wrote:

get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them.

p.s: one additional note. if you code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail

it will raise an exception, most likely. didn't you just say that
exceptions were ok?

</F>

Nov 17 '06 #6

Laurent Pointal

gabor a écrit :

hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

Maybe, for each filename, you can test if it is an unicode string, and
if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().

Have a try.

A+

Laurent.

Nov 17 '06 #7

gabor

Laurent Pointal wrote:
Laurent Pointal wrote:

gabor a écrit :
>hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

Maybe, for each filename, you can test if it is an unicode string, and
if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().

Have a try.

A+

Laurent.

gabor a écrit :
>hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

Maybe, for each filename, you can test if it is an unicode string, and
if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().

i don't think it would work. because os.listdir already tried, and
failed (that's why we got a byte-string and not an unicode-string)

gabor

Nov 17 '06 #8

Leo Kislov

Martin v. Löwis wrote:

gabor schrieb:

All this code will typically work just fine with the current behavior,
so people typically don't see any problem.

i am sorry, but it will not work. actually this is exactly what i did,
and it did not work. it dies in the os.path.join call, where file_name
is converted into unicode. and python uses 'ascii' as the charset in
such cases. but, because listdir already failed to decode the file_name
with the filesystem-encoding, it usually also fails when tried with
'ascii'.

Ah, right. So yes, it will typically fail immediately - just as you
wanted it to do, anyway; the advantage with this failure is that you
can also find out what specific file name is causing the problem
(whereas when listdir failed completely, you could not easily find
out the cause of the failure).

How would you propose listdir should behave?

How about returning two lists, first list contains unicode names, the
second list contains undecodable names:

files, troublesome = os.listdir(separate_errors=True)

and make separate_errors=True by default in python 3.0 ?

-- Leo

Nov 17 '06 #9

gabor

Fredrik Lundh wrote:

gabor wrote:

>get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them.

>p.s: one additional note. if you code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail

it will raise an exception, most likely. didn't you just say that
exceptions were ok?

yes, but it's raised at the wrong place imho :)

(just to clarify: simply pointing out this behavior in the documentation
is also one of the possible solutions)

for me the current behavior seems as if file-reading would work like this:

a = open('foo.txt')
data = a.read()
a.close()

print data

>>TheFileFromWhichYouHaveReadDidNotExistExceptio n

gabor

Nov 17 '06 #10

Johan von Boisman

Laurent Pointal wrote:

gabor a écrit :
>hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

Maybe, for each filename, you can test if it is an unicode string, and
if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().

Have a try.

A+

Laurent.

Strange coincident, as I was wrestling with this problem only yesterday.

I found this most illuminating discussion on the topic with
contributions from Mr Lövis and others:

http://www.thescripts.com/forum/thread41954.html

/johan

Nov 17 '06 #11

Martin v. Löwis

Leo Kislov schrieb:

How about returning two lists, first list contains unicode names, the
second list contains undecodable names:

files, troublesome = os.listdir(separate_errors=True)

and make separate_errors=True by default in python 3.0 ?

That would be quite an incompatible change, no?

Regards,
Martin

Nov 18 '06 #12

Fredrik Lundh

Martin v. Löwis wrote:

>How about returning two lists, first list contains unicode names, the
second list contains undecodable names:

files, troublesome = os.listdir(separate_errors=True)

and make separate_errors=True by default in python 3.0 ?

That would be quite an incompatible change, no?

it also violates a fundamental design rule for the standard library.

</F>

Nov 18 '06 #13

Leo Kislov

Martin v. Löwis wrote:

Leo Kislov schrieb:
How about returning two lists, first list contains unicode names, the
second list contains undecodable names:

files, troublesome = os.listdir(separate_errors=True)

and make separate_errors=True by default in python 3.0 ?

That would be quite an incompatible change, no?

Yeah, that was idea-dump. Actually it is possible to make this idea
mostly backward compatible by making os.listdir() return only unicode
names and os.binlistdir() return only binary directory entries.
Unfortunately the same trick will not work for getcwd.

Another idea is to map all 256 bytes to unicode private code points.
When a file name cannot be fully decoded the undecoded bytes will be
mapped to specially allocated code points. Unfortunately this idea
seems to leak if the program later wants to write such unicode string
to a file. Python will have to throw an exception since we don't know
if it is ok to write broken string to a file. So we are back to square
one, programs need to deal with filesystem garbage :(

-- Leo

Nov 18 '06 #14

Similar topics

Unicode in C++

by: Michael Davis | last post by:

Hi, I've known C/C++ for years, but only ever used ascii strings. I have a client who wants to know how gcc handles unicode. I've found the functions utf8_mbtowc, utf8_mbstowcs, utf8_wctomb and...

C / C++

minidom xml & non ascii / unicode & files

by: webdev | last post by:

lo all, some of the questions i'll ask below have most certainly been discussed already, i just hope someone's kind enough to answer them again to help me out.. so i started a python 2.3...

Python

why isn't Unicode the default encoding?

by: John Salerno | last post by:

Forgive my newbieness, but I don't quite understand why Unicode is still something that needs special treatment in Python (and perhaps elsewhere). I'm reading Dive Into Python right now, and it...

Python

unicode mess in c++

by: damjan | last post by:

This may look like a silly question to someone, but the more I try to understand Unicode the more lost I feel. To say that I am not a beginner C++ programmer, only had no need to delve into...

C / C++

Why GCC does warn me when I using gets() function for accessing file

by: Cuthbert | last post by:

After compiling the source code with gcc v.4.1.1, I got a warning message: "/tmp/ccixzSIL.o: In function 'main';ex.c: (.text+0x9a): warning: the 'gets' function is dangerous and should not be...

C / C++

unicode mystery/problem

by: Petr Jakes | last post by:

Hi, I am using Python 2.4.3 on Fedora Core4 and "Eric3" Python IDE .. Below mentioned code works fine in the Eric3 environment. While trying to start it from the command line, it returns: ...

Python

does raw_input() return unicode?

by: Stuart McGraw | last post by:

In the announcement for Python-2.3 http://groups.google.com/group/comp.lang.python/msg/287e94d9fe25388d?hl=en it says "raw_input(): can now return Unicode objects". But I didn't see anything...

Python

Python nuube needs Unicode help

by: gheissenberger | last post by:

HELP! Guy who was here before me wrote a script to parse files in Python. Includes line: print u where u is a line from a file we are parsing. However, we have started recieving data from...

Python

iterparse and unicode

by: George Sakkis | last post by:

It seems xml.etree.cElementTree.iterparse() is not unicode aware: .... print elem.text .... Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<string>", line 64,...

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General