zipfile module: problems with filename having non ascii characters


I have a simple Python script that reads a directory and puts the files into a
zip file.

I'm using os.walk to get the directory contents, and
I'm creating ZipInfo objects and setting "filename", ... to what os.walk gives
me.
....
And it works!!!!
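
A minimal sketch of the kind of script described above, assuming Python 3. It uses ZipFile.write rather than hand-built ZipInfo objects, and the function name zip_directory and the demo tree are made up for illustration:

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir, zip_path):
    """Write every file under src_dir into zip_path; return the stored names."""
    stored = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Archive names are relative to src_dir and always use "/".
                rel_path = os.path.relpath(full_path, src_dir)
                arcname = rel_path.replace(os.sep, "/")
                zf.write(full_path, arcname)
                stored.append(arcname)
    return sorted(stored)


# Hypothetical demo tree, built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(src, rel), "w") as fh:
            fh.write("demo")
    names = zip_directory(src, os.path.join(tmp, "out.zip"))
    print(names)
```

The archive is written outside the source directory so that the walk does not pick up the zip file itself.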

BUT!!

When I open the created zip file with WinZip (or any other zip tool),
filenames are not always what they should be.
In fact, filenames with non-ASCII characters are not stored correctly
in the zip file.

Does anyone know what must be done?
Is this a Unicode problem?
Is this a known bug in the zipfile module?
????

Thanks

Vincent
Jul 18 '05 #1
4 Replies
Zip files don't have a way to define the encoding of filenames---names
are just byte strings, and different utilities may interpret them in
different ways. The only thing that seems to be defined is that '/' is
the directory separator, and possibly that the filename can't contain
'\0'.
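
For readers on a current Python: later revisions of the ZIP specification added a flag (general purpose bit 11) that marks a name as UTF-8, and Python 3's zipfile module sets it when a name is not pure ASCII. Whether a given unzip tool honours the flag still varies. A small check, assuming Python 3; the file names are made up and UTF8_FLAG is only a local name for the mask 0x800:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: "this name is UTF-8"

# Build a small archive in memory with one ASCII and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "ascii name")
    zf.writestr("café.txt", "non-ASCII name")

# Read it back and record which entries carry the UTF-8 flag.
with zipfile.ZipFile(buf) as zf:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }

print(flags)
```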

You can probably find the encoding that winzip uses with a little
trial-and-error, and convert your filenames in your encoding to
filenames in that encoding. This may depend on the language or region
of the installed Windows, though.
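
The trial-and-error can be sketched in a few lines, assuming Python 3: take the bytes one tool would store for a name and decode them under several candidate code pages to see which reading looks right. The sample name and the list of codecs are arbitrary choices for the demo:

```python
name = "café.txt"

# Bytes a tool that writes CP437 names would store for this name.
stored = name.encode("cp437")

# The same bytes, read back under a few candidate encodings.
for codec in ("cp437", "cp850", "cp1252"):
    print(codec, stored.decode(codec, errors="replace"))

# The reverse mistake: bytes written as Latin-1, then displayed as CP437.
mojibake = name.encode("latin-1").decode("cp437")
print(mojibake)
```

Only the decodings that match the writer's code page give back the original name; the others show a different character in place of the accented letter.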

Jeff

Jul 18 '05 #2
Jeff Epler wrote:
> Zip files don't have a way to define the encoding of filenames---names
> are just byte strings, and different utilities may interpret them in
> different ways. The only thing that seems to be defined is that '/' is
> the directory separator, and possibly that the filename can't contain
> '\0'.

Thanks, I ran into that problem and replaced every "\" with "/".

> You can probably find the encoding that winzip uses with a little
> trial-and-error, and convert your filenames in your encoding to
> filenames in that encoding. This may depend on the language or region
> of the installed Windows, though.


Thanks for the explanation.

Does that limitation apply only to zip files?
Is there another "compression tool" that doesn't have this limitation
(tgz? bz2? ...)?
Jul 18 '05 #3
vi***********@yahoo.com wrote:
> Does that limitation apply only to zip files?

It appears that WinZip and other tools interpret the file names in a
zipfile in CP437. So to properly put non-ASCII file names into a
zipfile, you need to convert them into CP437. If the file name
contains a character which is not available in CP437, you cannot
save the file in a zipfile (without renaming it).
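
A minimal sketch of that check, assuming Python 3: try to encode each name as CP437 and report the ones that cannot be represented. The helper name to_cp437 and the sample names are hypothetical:

```python
def to_cp437(filename):
    """Return filename as CP437 bytes, or None if a character has no CP437 form."""
    try:
        return filename.encode("cp437")
    except UnicodeEncodeError:
        return None


for candidate in ("résumé.txt", "日本語.txt", "prix_€.txt"):
    encoded = to_cp437(candidate)
    if encoded is None:
        print(candidate, "-> not representable in CP437, rename it first")
    else:
        print(candidate, "->", encoded)
```

This only covers the conversion step. How the resulting bytes are handed to the zip writer depends on the Python version in use.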

Not really a Unicode problem, but rather a problem that Unicode
tries to solve.

> Is there another "compression tool" that doesn't have this limitation
> (tgz? bz2? ...)?

tar, traditionally, is also unaware of character sets. Single Unix 3
(and, I believe, earlier versions too) ended the tar wars with the
introduction of the pax utility, which does allow a character set to be
specified in a pax file; the supported character sets include ISO-8859-n
and UTF-8.
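
As an illustration, Python's tarfile module can write the pax format, which keeps non-ASCII names as UTF-8 in extended headers. A minimal round trip, assuming Python 3; the member name and payload are invented for the demo:

```python
import io
import tarfile

name = "日本語/café.txt"
payload = b"demo"

# Write a pax-format archive into memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                  encoding="utf-8") as tar:
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Read it back and list the member names.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r", encoding="utf-8") as tar:
    names = tar.getnames()

print(names)
```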

Jörg Schilling's star(1) also uses UTF-8 for file names.

On the non-tar side of the world, WinRAR supports Unicode in archives.
For compatibility, they also put a non-Unicode name into the archive,
but the Unicode name, if present, is meant to take precedence.

Regards,
Martin
Jul 18 '05 #4
"Martin v. Löwis" wrote:
> vi***********@yahoo.com wrote:
>> Does that limitation apply only to zip files?
> It appears that WinZip and other tools interpret the file names in a
> zipfile in CP437. So to properly put non-ASCII file names into a
> zipfile, you need to convert them into CP437. If the file name
> contains a character which is not available in CP437, you cannot
> save the file in a zipfile (without renaming it).


Thanks, with cp437 it rocks!!!!

> Not really a Unicode problem, but rather a problem that Unicode
> tries to solve.
>> Is there another "compression tool" that doesn't have this limitation
>> (tgz? bz2? ...)?
> tar, traditionally, is also unaware of character sets. Single Unix 3
> (and, I believe, earlier versions too) ended the tar wars with the
> introduction of the pax utility, which does allow a character set to be
> specified in a pax file; the supported character sets include ISO-8859-n
> and UTF-8.


Thanks for the info.

> Jörg Schilling's star(1) also uses UTF-8 for file names.
>
> On the non-tar side of the world, WinRAR supports Unicode in archives.
> For compatibility, they also put a non-Unicode name into the archive,
> but the Unicode name, if present, is meant to take precedence.


So that makes it the most "portable" compression tool.

Thanks for those valuable remarks.

Vincent
Jul 18 '05 #5
