Encoding of file names

Here is my situation:

I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

Help me, before my thin veneer of genius is torn from my boss's eyes!
;-)

Dec 8 '05 #1
utabintarbo wrote:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.


I'm not sure of the answer, but note that .isfile() is not just checking
whether the filename is valid, it's checking that something *exists*
with that name, and that it is a file. Big difference... at least in
telling you where to look for the solution. In this case, checking
which of the two tests in ntpath.isfile() is actually failing might be a
first step if you don't have some other lead. (ntpath is what os.path
translates into on Windows, so look for ntpath.py in the Python lib folder.)

If you're really seeing what you're seeing, I suspect a bug, since if
os.listdir() can find it (and it's really a file), os.path.isfile() should
report it as a file, I would think.

-Peter

Dec 8 '05 #2
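
For anyone who wants to try Peter's suggestion without reading ntpath.py, here is a minimal sketch that runs the two checks separately: first whether os.stat() can resolve the name at all, then whether the result is a regular file. It is written for Python 3 (the thread itself uses Python 2), and explain_isfile is a made-up helper name, not part of the standard library.

```python
import os
import stat


def explain_isfile(path):
    """Report which half of the isfile() check fails for `path`."""
    try:
        st = os.stat(path)
    except OSError as exc:
        # The name could not be resolved: missing, or mangled on the way.
        return "stat failed: %s" % exc.__class__.__name__
    if not stat.S_ISREG(st.st_mode):
        return "exists, but is not a regular file"
    return "regular file"
```

If the first branch is the one that triggers for a name that os.listdir() just returned, the name was probably altered somewhere between the listing and the lookup, which is where the rest of this thread ends up.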
utabintarbo wrote:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.


Does the problem persist if you feed os.listdir() a unicode path?
This will cause listdir() to return unicode filenames which are less prone
to encoding confusion.

Peter
Dec 8 '05 #3
utabintarbo wrote:
Here is my situation:

I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.


Just to eliminate the obvious, you are calling os.path.join() with the
parent name before calling isfile(), yes? Something like

    for f in os.listdir(someDir):
        fp = os.path.join(someDir, f)
        if os.path.isfile(fp):
            ...

Kent
Dec 8 '05 #4
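
Kent's point is easy to reproduce. The sketch below (Python 3, with a hypothetical helper name check_entries) records, for every entry in a directory, what os.path.isfile() says about the bare name versus the joined path. The bare name is resolved against the current working directory, so it only works by accident.

```python
import os


def check_entries(some_dir):
    """Return (name, isfile(bare name), isfile(joined path)) per entry."""
    results = []
    for entry in os.listdir(some_dir):
        full = os.path.join(some_dir, entry)
        results.append((entry, os.path.isfile(entry), os.path.isfile(full)))
    return results
```

Run against any directory other than the current one, the middle value should normally be False for files and the last one True.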
"utabintarbo" wrote:
I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')


how did you print that name? "\xa6" is a "broken vertical bar", which, as
far as I know, is a valid filename character under both Unix and Windows.

if DIR is a variable that points to the remote directory, what does this
print:

import os
files = os.listdir(DIR)
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

(if necessary, replace [0] with an index that corresponds to one of
the problematic filenames)

when you've tried that, try this variation (only the listdir line has
changed):

import os
files = os.listdir(unicode(DIR)) # <-- this line has changed
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

</F>

Dec 8 '05 #5
Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

I believe that may do the trick. Here are the results of running your
code:
>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> file
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> print file
L07JS41C.04389525AA.QTRªINR.EªC-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
False
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR¦INR.E¦C-P.D11.081305.P2.KPF.model
>>> print repr(file)
u'L07JS41C.04389525AA.QTR\u2592INR.E\u2524C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- Success!
>>> print os.path.isdir(fullname)
False

Thanks to all who posted. :-)

Dec 8 '05 #6
utabintarbo wrote:
Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

I believe that may do the trick. Here is the results of running your
code:


For all those who followed this thread, here is some more explanation:

Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT,
a vertical line in the middle, plus a line from that going left) into
a file name. How he managed to do that, I can only guess: most likely,
the Samba installation assumes that the file system encoding on
the Solaris box is some IBM code page (say, CP 437 or CP 850). If so,
the byte on disk would be \xb4. Where this came from, I have to guess
further: perhaps it is ACUTE ACCENT from ISO-8859-*.

Anyway, when he used listdir() to get the contents of the directory,
Windows applied the CP_ACP encoding (known as "mbcs" in Python).
For reasons unknown to me, the US and several European versions
of XP map both characters to \xa6, BROKEN BAR (I can somewhat see that
as meaningful for U+2524, but not for U+2592).

So when he then applies isfile to that file name, \xa6 is mapped
to U+00A6, which then isn't found on the Samba side.

So while Unicode here is the solution, the problem is elsewhere;
most likely in a misconfiguration of the Samba server (which assumes
some encoding for the files on disk, yet the AIX application
uses a different encoding).

Regards,
Martin
Dec 8 '05 #7
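
Martin's guess about the code pages can be checked with Python's own codecs. The sketch below (Python 3) decodes the two bytes in question, \xb4 from his post plus \xb1, which is the other byte reported from the Linux mount later in the thread, under ISO-8859-1, CP 437 and CP 850, and looks up the Unicode names. It says nothing about how Samba or Windows are actually configured. It only shows that the byte values and the characters seen in the thread are consistent with his explanation.

```python
import unicodedata

# The two suspicious bytes from the file name.
raw = b"\xb1\xb4"


def describe(text):
    """Return (code point, Unicode name) for each character in `text`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


as_latin1 = describe(raw.decode("latin-1"))  # what the Linux console showed
as_cp437 = describe(raw.decode("cp437"))     # what Samba apparently assumed
as_cp850 = describe(raw.decode("cp850"))     # same result as CP 437 here
broken_bar = describe(b"\xa6".decode("cp1252"))  # what the byte API handed back

for label, value in [("latin-1", as_latin1), ("cp437", as_cp437),
                     ("cp850", as_cp850), ("cp1252 \\xa6", broken_bar)]:
    print(label, value)
```

Under CP 437 and CP 850 the bytes come out as U+2592 and U+2524, the same code points as in the unicode listing above. Under ISO-8859-1 they are PLUS-MINUS SIGN and ACUTE ACCENT.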
On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:
> utabintarbo wrote:
>> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>
> For all those who followed this thread, here is some more explanation:
>
> Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
> 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a
> vertical line in the middle, plus a line from that going left) into a
> file name. How he managed to do that, I can only guess: most likely, the
> Samba installation assumes that the file system encoding on the Solaris
> box is some IBM code page (say, CP 437 or CP 850). If so, the byte on
> disk would be \xb4. Where this came from, I have to guess further:
> perhaps it is ACUTE ACCENT from ISO-8859-*.
>
> Anyway, when he used listdir() to get the contents of the directory,
> Windows applied the CP_ACP encoding (known as "mbcs" in Python). For
> reasons unknown to me, the US and several European versions of XP map
> both characters to \xa6, BROKEN BAR (I can somewhat see that as
> meaningful for U+2524, but not for U+2592).
>
> So when he then applies isfile to that file name, \xa6 is mapped to
> U+00A6, which then isn't found on the Samba side.
>
> So while Unicode here is the solution, the problem is elsewhere; most
> likely in a misconfiguration of the Samba server (which assumes some
> encoding for the files on disk, yet the AIX application uses a different
> encoding).



Isn't the key thing that Windows is applying a non-roundtrippable
character encoding? If i've understood this right, Samba and Windows are
talking in unicode, with these (probably quite spurious, but never mind)
U+25xx characters, and Samba is presenting a quite consistent view of the
world: there's a file called "double bucky backslash grey box" in the
directory listing, and if you ask for a file called "double bucky backslash
grey box", you get it. Windows, however, maps that name to the 8-bit
string "double bucky backslash vertical bar", but when you pass *that*
back to it, it gets encoded as the unicode string "double bucky backslash
vertical bar", which Samba then doesn't recognise.

I don't know what Windows *should* do here. I know it shouldn't do this -
this leads to breaking of some very basic invariants about files and
directories, and so the kind of confusion utabintarbo suffered. The
solution is either to apply an information-preserving encoding (UTF-8,
say), or to refuse to do it at all (ie, raise an error if there are
unencodable characters), neither of which are particularly beautiful
solutions. I think Windows is in a bit of a rock/hard place situation
here, poor thing.

Incidentally, for those who haven't come across CP_ACP before, it's not
yet another character encoding, it's a pseudovalue which means 'the
system's current default character set'.

tom

--
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies
Dec 9 '05 #8
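
Tom's three options (lossy conversion, an information-preserving encoding, or refusing outright) can be illustrated with plain string encoding in Python 3. This is only an analogy: Python's "replace" error handler substitutes "?", whereas the Windows conversion described in this thread produced \xa6. The roundtrip helper and the shortened name are made up for the example.

```python
def roundtrip(name, encoding, errors="strict"):
    """Encode `name` to bytes and decode it back, as a byte-string API would."""
    return name.encode(encoding, errors).decode(encoding)


# Shortened stand-in for the problematic part of the file name.
name = "QTR\u2592INR.E\u2524C-P"

# Lossy: cp1252 has no box-drawing characters, so "replace" substitutes "?".
lossy = roundtrip(name, "cp1252", errors="replace")

# Information-preserving: UTF-8 can represent every code point in the name.
preserved = roundtrip(name, "utf-8")

# Refusing to guess: strict encoding raises instead of changing the name.
try:
    roundtrip(name, "cp1252")
    refused = False
except UnicodeEncodeError:
    refused = True

print(repr(lossy), lossy == name)
print(repr(preserved), preserved == name)
print("strict encode refused:", refused)
```

Only the lossy variant hands back a name that differs from the one it was given, which is the broken invariant Tom describes.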
Part of the reason (I think) is that our CAD/Data Management system
(which produces the aforementioned .MODEL files) substitutes (stupidly,
IMNSHO) non-printable characters for embedded spaces in file names.
This is part of what leads to my consternation here.

And yeah, Windows isn't helping matters much. No surprise there. :-P

Just for s&g's, I ran this on python 2.3 on knoppix:
>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- It works fine here
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- It works fine here too!
>>> print os.path.isdir(fullname)
False


This is when mounting the same samba share in Linux. This tends to
support Tom's point re:the "non-roundtrippability" thing.

Thanks again to all.

Dec 9 '05 #9
Tom Anderson wrote:
> Isn't the key thing that Windows is applying a non-roundtrippable
> character encoding?

This is a fact, but it is not a key thing. Of course Windows is
applying a non-roundtrippable character encoding. What else could it
do?

> Windows, however, maps that name to the
> 8-bit string "double bucky backslash vertical bar"

Only if you ask it to. There are two sets of APIs: one to apply
if you ask for byte strings (FindFirstFileA), and one to apply when you
ask for Unicode strings (FindFirstFileW).

In one case it has to convert; in the other, it doesn't.

> I don't know what Windows *should* do here. I know it shouldn't do this
> - this leads to breaking of some very basic invariants about files and
> directories, and so the kind of confusion utabintarbo suffered.

It always did this, and always will. Applications should stop using the
*A versions of the API. If they continue to do so, they will continue
to get bogus results in border cases.

The real issue here is that there was a border case, when there
shouldn't be one.

Regards,
Martin
Dec 9 '05 #10
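
The closest Python 3 counterpart to choosing between the two API sets is the type of the path passed to os.listdir(): a str path returns str names and a bytes path returns bytes names. The sketch below creates a throwaway directory with a hypothetical file name containing U+2592 and lists it both ways. It assumes a file system encoding that can represent that character (UTF-8 on most current systems). How CPython maps each form onto the Windows API has changed between versions, so treat this as an illustration of the str/bytes split rather than of FindFirstFileA/W themselves.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return (str names, bytes names) for the same directory."""
    as_text = sorted(os.listdir(directory))                # str in, str out
    as_bytes = sorted(os.listdir(os.fsencode(directory)))  # bytes in, bytes out
    return as_text, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    name = "QTR\u2592INR.model"  # hypothetical name, for the demonstration only
    open(os.path.join(tmp, name), "w").close()
    text_names, byte_names = list_both_ways(tmp)
    # A str name from listdir() can be joined and looked up again unchanged.
    text_is_file = os.path.isfile(os.path.join(tmp, text_names[0]))

print(text_names, byte_names, text_is_file)
```

Staying with str paths throughout is the Python 3 version of the advice above: names are not pushed through a lossy 8-bit conversion on the way out and back in.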
On Fri, 9 Dec 2005, "Martin v. Löwis" wrote:
> Tom Anderson wrote:
>> Isn't the key thing that Windows is applying a non-roundtrippable
>> character encoding?
>
> This is a fact, but it is not a key thing. Of course Windows is applying
> a non-roundtrippable character encoding. What else could it do?

Well, i'm no great thinker, but i'd say that errors should never pass
silently, and that in the face of ambiguity, one should refuse the
temptation to guess. So, as i said in my post, if the name couldn't be
translated losslessly, an error should be raised.

>> I don't know what Windows *should* do here. I know it shouldn't do this
>> - this leads to breaking of some very basic invariants about files and
>> directories, and so the kind of confusion utabintarbo suffered.
>
> It always did this, and always will. Applications should stop using the
> *A versions of the API.

Absolutely true.

> If they continue to do so, they will continue to get bogus results in
> border cases.

No. The availability of a better alternative is not an excuse for
gratuitous breakage of the worse alternative.

tom

--
Whose house? Run's house!
Dec 10 '05 #11
Tom Anderson wrote:
>> This is a fact, but it is not a key thing. Of course Windows is
>> applying a non-roundtrippable character encoding. What else could it do?
>
> Well, i'm no great thinker, but i'd say that errors should never pass
> silently, and that in the face of ambiguity, one should refuse the
> temptation to guess. So, as i said in my post, if the name couldn't be
> translated losslessly, an error should be raised.

I believe this would not work, the way the API is structured. You
first call FindFirstFile, getting a file name and a handle. Then you call
FindNextFile repeatedly, passing the handle. An error of FindFirstFile
is indicated by returning an invalid handle.

So if you wanted FindFirstFile to return an error for unencodable file
names, it would not be possible to get a listing of the other files
in the directory.

FindFirstFile also gives the 8.3 file name (if present), and that is
valid without problems.

Regards,
Martin
Dec 10 '05 #12
