Re: Unicode File Names

> Step 4: Either wait for Python 2.7 or apply the patch to your own copy
> of zipfile ...

Actually, this is released in Python 2.6, see r62724.

Regards,
Martin
Oct 17 '08 #1
On Oct 17, 6:32 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>> Step 4: Either wait for Python 2.7 or apply the patch to your own copy
>> of zipfile ...
>
> Actually, this is released in Python 2.6, see r62724.

Hi Martin,

That's good. I was led astray by the fact that the 2.6 docs still
contain the note that the OP asked about: "There is no official file
name encoding for ZIP files. If you have unicode file names, you must
convert them to byte strings in your desired encoding before passing
them to write(). WinZip interprets all file names as encoded in CP437,
also known as DOS Latin."

The first sentence was and is bafflegab, the second didn't mention the
portability issues arising from its suggestion (and is now not true),
and the third needs explanation or omission. I believe that WinZip has
supported utf8 since v11.2.

Should the note be removed, or should it say something like "Unicode
file names are supported. New in Python 2.6."? Is there anything else
that should be mentioned?

More on cp437: I see where you mentioned to the patch author that a
unicode string should be encoded in cp437 if possible, but this was
not done -- it first tries ascii. What are your views on what encoding
should be assumed if the utf8 flag is not set?

Cheers,
John
Oct 17 '08 #2
> Should the note be removed, or should it say something like "Unicode
> file names are supported. New in Python 2.6."? Is there anything else
> that should be mentioned?

The note should be corrected, documenting the behaviour implemented.

> More on cp437: I see where you mentioned to the patch author that a
> unicode string should be encoded in cp437 if possible, but this was
> not done -- it first tries ascii. What are your views on what encoding
> should be assumed if the utf8 flag is not set?

There isn't any standard that is widely followed (just as the note that
you declared bafflegab says). While APPNOTE.TXT specifies it as cp437,
implementations often ignore that, because a) they didn't know, and b)
cp437 was too limited for what they want to do. So we see all kinds of
alternative implementations - often involving the locale's code page
(and on Windows, both OEMCP and ACP get used - often just as a side
effect of whatever internal representation the applications use).
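
To illustrate the ambiguity described above, here is a minimal sketch that decodes one and the same byte under three code pages. The byte value and the list of candidate code pages are assumptions chosen for illustration; they are not taken from any particular zip tool.

```python
# One raw byte, as it might sit in a ZIP entry name with no UTF-8 flag set.
raw = b"\xe9"

# Code pages a writer might plausibly have used (illustrative choice only):
# the spec's cp437, a Western European OEM page, and a Western European ANSI page.
candidates = ["cp437", "cp850", "cp1252"]

# Decode the same byte under each candidate and compare the results.
decodings = {codec: raw.decode(codec) for codec in candidates}
for codec, text in decodings.items():
    print(f"{codec}: {text!r}")
```

A reader that does not know which code page the writer used cannot tell these results apart from the bytes alone.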

In 2.x, Python doesn't need to decide, so when opening a zip file, the
file names get reported as byte strings unless they have the UTF-8
bit set (in which case they get decoded). In 3.x, file names (in the
zipfile module) uniformly use the (unicode) character string type, hence
that version implements the spec, by decoding as 437.
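
As a rough sketch of the 3.x reading behaviour described above (not the module's actual source), the decision can be written as a small helper. The names `decode_zip_name` and `UTF8_FLAG` are made up for this example; 0x800 is bit 11 of the general purpose flags, which marks a UTF-8 file name. The round trip uses the real `zipfile` module under Python 3.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8 (hypothetical constant name)

def decode_zip_name(raw, flag_bits):
    """Decode a raw entry name: UTF-8 if the flag is set, CP437 otherwise."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")

# Write one ASCII name and one non-ASCII name into an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr("caf\u00e9.txt", b"b")

# Read the archive back and record whether each entry carries the UTF-8 flag.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}
```

Here `flags` shows which entries were stored with the UTF-8 bit; only the non-ASCII name needs it.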

Upon encoding, choosing between ASCII and CP437 has trade-offs. Note
that both formally comply with the spec, as ASCII is a subset of
CP437 (i.e. even though it uses the ASCII codec, it *still* encodes
as CP437). The trade-offs can be studied by looking at three groups
of file names:
- pure ASCII; choice does not matter (both ascii and cp437 can
  encode the file name, and both get the same result)
- arbitrary string containing non-CP437 characters; choice does
  not matter (neither ascii nor cp437 can encode, so the UTF-8
  bit must be used)
- others; here are the trade-offs. Pro ASCII: receiver can unambiguously
  reproduce the original file name, as the UTF-8 bit will be set.
  Pro CP437: old software (unaware of the UTF-8 bit) has a chance
  of correctly guessing the file name (if it followed APPNOTE.TXT).

I (now) prefer the trade-off that was taken (trying ASCII first), as
it's the one that produces more reliable results in the long run (i.e.
as more and more zip readers support UTF-8).
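
The two writing strategies compared above can be sketched side by side. This is an illustration under stated assumptions, not the zipfile source: the function names `encode_ascii_first` and `encode_cp437_first` and the constant `UTF8_FLAG` are hypothetical, and each function returns the encoded name together with the flag bits it would set.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8 (hypothetical constant name)

def encode_ascii_first(name):
    """Strategy that was taken: plain ASCII if possible, else UTF-8 plus the flag."""
    try:
        return name.encode("ascii"), 0
    except UnicodeEncodeError:
        return name.encode("utf-8"), UTF8_FLAG

def encode_cp437_first(name):
    """Alternative strategy: CP437 if possible, else UTF-8 plus the flag."""
    try:
        return name.encode("cp437"), 0
    except UnicodeEncodeError:
        return name.encode("utf-8"), UTF8_FLAG

# One sample name per group from the list above (sample names are illustrative).
samples = {
    "pure ASCII": "readme.txt",
    "not encodable in CP437": "\u65e5\u672c\u8a9e.txt",
    "others": "caf\u00e9.txt",
}
results = {group: (encode_ascii_first(name), encode_cp437_first(name))
           for group, name in samples.items()}
```

Only the third group differs: ASCII-first stores UTF-8 bytes with the flag set, while CP437-first stores a single CP437 byte for the accented letter and leaves the flag clear.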

Regards,
Martin
Oct 18 '08 #3
On Oct 18, 5:57 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>> Should the note be removed, or should it say something like "Unicode
>> file names are supported. New in Python 2.6."? Is there anything else
>> that should be mentioned?
>
> The note should be corrected, documenting the behaviour implemented.
>
>> More on cp437: I see where you mentioned to the patch author that a
>> unicode string should be encoded in cp437 if possible, but this was
>> not done -- it first tries ascii. What are your views on what encoding
>> should be assumed if the utf8 flag is not set?
>
[lots of enlightenment snipped]

Thanks heaps, Martin.
Cheers,
John
Oct 18 '08 #4
