Wanted: safe codec for filenames

Hi there!

I'd like to map general Unicode strings to safe filenames. I tried
punycode, but it is case-sensitive, which Windows is not: "Hallo"
and "hallo" are mapped to "Hallo-" and "hallo-", which Windows
treats as the same name. I need uppercase Latin letters to be
encoded, too, and the encoding must contain only lowercase Latin
letters, digits, underscores, and maybe a few more characters. The
result should be more legible than base64, though.

Has anybody created such a codec already?

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)
Sep 5 '07 #1
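
The collision described in the question can be reproduced with Python's built-in punycode codec. The following is only a sketch in Python 3 syntax (the thread itself predates it), and the helper name `collides_case_insensitively` is hypothetical. It illustrates the problem and is not a solution.

```python
def collides_case_insensitively(a, b, encoding="punycode"):
    """Return True if two strings encode to different names that differ only in case."""
    encoded_a = a.encode(encoding)
    encoded_b = b.encode(encoding)
    return encoded_a != encoded_b and encoded_a.lower() == encoded_b.lower()


print("Hallo".encode("punycode"))  # b'Hallo-'
print("hallo".encode("punycode"))  # b'hallo-'
print(collides_case_insensitively("Hallo", "hallo"))  # True
```

On a case-insensitive filesystem the two encoded names would refer to the same file, which is why the uppercase letters themselves have to be escaped.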
Hi there!

Torsten Bronger writes:
> I'd like to map general Unicode strings to safe filenames. I tried
> punycode, but it is case-sensitive, which Windows is not: "Hallo"
> and "hallo" are mapped to "Hallo-" and "hallo-", which Windows
> treats as the same name. I need uppercase Latin letters to be
> encoded, too, and the encoding must contain only lowercase Latin
> letters, digits, underscores, and maybe a few more characters. The
> result should be more legible than base64, though.
>
> Has anybody created such a codec already?

Okay, the following works fine for me:
--8<---------------cut here---------------start------------->8---
# Python 2 code: it relies on unichr(), u"" literals and str.decode().
import codecs


class Codec(codecs.Codec):
    """Codec class for safe filenames.  Safe filenames work on all important
    filesystems, i.e., they don't contain special or dangerous characters, and
    they don't assume that filenames are treated case-sensitively.

        >>> u"hallo".encode("safefilename")
        'hallo'
        >>> u"Hallo".encode("safefilename")
        '(h)allo'
        >>> u"MIT Thesis".encode("safefilename")
        '(mit)_(t)hesis'
        >>> u"Gesch\\u00e4ftsbrief".encode("safefilename")
        '(g)esch{e4}ftsbrief'

    Of course, the mapping works in both directions as expected:

        >>> "(g)esch{e4}ftsbrief".decode("safefilename")
        u'Gesch\\xe4ftsbrief'
        >>> "(mit)_(t)hesis".decode("safefilename")
        u'MIT Thesis'

    """
    lowercase_letters = "abcdefghijklmnopqrstuvwxyz"
    safe_characters = lowercase_letters + "0123456789-+!$%&`'@~#.,^"
    uppercase_letters = lowercase_letters.upper()

    def encode(self, input, errors='strict'):
        """Convert Unicode strings to safe filenames."""
        output = ""
        i = 0
        input_length = len(input)
        while i < input_length:
            c = input[i]
            if c in self.safe_characters:
                output += str(c)
            elif c == " ":
                output += "_"
            elif c in self.uppercase_letters:
                # Wrap a whole run of uppercase letters in one pair of parentheses.
                output += "("
                while i < input_length and input[i] in self.uppercase_letters:
                    output += str(input[i]).lower()
                    i += 1
                output += ")"
                continue
            else:
                # Everything else becomes its hexadecimal code point in braces.
                output += "{" + hex(ord(c))[2:] + "}"
            i += 1
        return output, input_length

    def handle_problematic_characters(self, errors, input, start, end, message):
        """Return the replacement text for invalid input, or raise in strict mode."""
        if errors == 'ignore':
            return u""
        elif errors == 'replace':
            return u"?"
        else:
            raise UnicodeDecodeError("safefilename", input, start, end, message)

    def decode(self, input, errors='strict'):
        """Convert safe filenames to Unicode strings."""
        input = str(input)
        input_length = len(input)
        output = u""
        i = 0
        while i < input_length:
            c = input[i]
            if c in self.safe_characters:
                output += c
            elif c == "_":
                output += " "
            elif c == "(":
                i += 1
                while i < input_length and input[i] in self.lowercase_letters:
                    output += input[i].upper()
                    i += 1
                if i == input_length:
                    # Keep the replacement text instead of discarding it.
                    output += self.handle_problematic_characters(
                        errors, input, i-1, i, "open parenthesis was never closed")
                    continue
                if input[i] != ')':
                    output += self.handle_problematic_characters(
                        errors, input, i, i+1, "invalid character '%s' in parentheses sequence" % input[i])
                    # The offending character is re-examined in the next iteration.
                    continue
            elif c == "{":
                end_position = input.find("}", i)
                if end_position == -1:
                    # Skip the hex digits that follow the unclosed bracket.
                    end_position = i+1
                    while end_position < input_length and input[end_position] in "0123456789abcdef" and \
                            end_position - i <= 8:
                        end_position += 1
                    output += self.handle_problematic_characters(errors, input, i, end_position,
                                                                 "open bracket was never closed")
                    i = end_position
                    continue
                else:
                    try:
                        output += unichr(int(input[i+1:end_position], 16))
                    except (ValueError, OverflowError):
                        output += self.handle_problematic_characters(errors, input, i, end_position+1,
                                                                     "invalid data between brackets")
                    i = end_position
            else:
                output += self.handle_problematic_characters(errors, input, i, i+1, "invalid character '%s'" % c)
            i += 1
        return output, input_length


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def _registry(encoding):
    """Codec search function: answer only for the name "safefilename"."""
    if encoding == "safefilename":
        return (Codec().encode, Codec().decode, StreamReader, StreamWriter)
    else:
        return None


codecs.register(_registry)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
--8<---------------cut here---------------end--------------->8---
--
Torsten Bronger, aquisgrana, europa vetus
Sep 5 '07 #2
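
The codec above targets Python 2, where `unicode.encode` may return a byte string and `str.decode` exists. As a rough sketch only, the same mapping could be written for Python 3 as two plain functions. The names `encode_safe_filename` and `decode_safe_filename` are hypothetical, the decoder only supports strict error handling, and it is stricter than the original (for example, it rejects an empty `()` pair).

```python
import re

LOWERCASE = "abcdefghijklmnopqrstuvwxyz"
SAFE = set(LOWERCASE + "0123456789-+!$%&`'@~#.,^")

# One token is an uppercase run in parentheses, a hex escape in braces,
# or a single plain character.
_TOKEN = re.compile(
    r"\((?P<upper>[a-z]+)\)|\{(?P<hex>[0-9a-f]{1,6})\}|(?P<plain>.)", re.DOTALL)


def encode_safe_filename(text):
    """Encode text: (..) for uppercase ASCII runs, _ for spaces, {hex} otherwise."""
    def replace(match):
        chunk = match.group(0)
        if match.lastgroup == "upper":
            return "(" + chunk.lower() + ")"
        if chunk == " ":
            return "_"
        if chunk in SAFE:
            return chunk
        return "{%x}" % ord(chunk)

    return re.sub(r"(?P<upper>[A-Z]+)|(?P<other>.)", replace, text, flags=re.DOTALL)


def decode_safe_filename(name):
    """Invert encode_safe_filename; raise ValueError on invalid input."""
    parts = []
    for match in _TOKEN.finditer(name):
        if match.group("upper") is not None:
            parts.append(match.group("upper").upper())
        elif match.group("hex") is not None:
            parts.append(chr(int(match.group("hex"), 16)))
        else:
            c = match.group("plain")
            if c == "_":
                parts.append(" ")
            elif c in SAFE:
                parts.append(c)
            else:
                raise ValueError("invalid character %r at position %d" % (c, match.start()))
    return "".join(parts)


print(encode_safe_filename("MIT Thesis"))           # (mit)_(t)hesis
print(decode_safe_filename("(g)esch{e4}ftsbrief"))  # Geschäftsbrief
```

Plain functions sidestep the codec registry, because in Python 3 `str.encode` only accepts codecs that produce bytes.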
On Wed, 05 Sep 2007 19:20:45 -0300, Torsten Bronger
<br*****@physik.rwth-aachen.de> wrote:

> Torsten Bronger writes:
>> I'd like to map general Unicode strings to safe filenames. I tried
>> punycode, but it is case-sensitive, which Windows is not: "Hallo"
>> and "hallo" are mapped to "Hallo-" and "hallo-", which Windows
>> treats as the same name. I need uppercase Latin letters to be
>> encoded, too, and the encoding must contain only lowercase Latin
>> letters, digits, underscores, and maybe a few more characters. The
>> result should be more legible than base64, though.
>
> Okay, the following works fine for me:

Nice codec. Although if one is looking for really portable file names,
there are additional rules, collected here:
http://www.boost.org/libs/filesystem...lity_guide.htm
It is hard to comply with all the character set rules *and* keep all
name lengths below the limits.

--
Gabriel Genellina

Sep 6 '07 #3
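
To make the length concern concrete: with the `{hex}` escaping used by the codec in this thread, every non-ASCII character grows to at least four characters, so long names can reach a filesystem limit quickly. The sketch below (Python 3) assumes a limit of 255 characters, which is a common value and is not taken from the linked guide. The helper names `encoded_length` and `fits` are hypothetical.

```python
MAX_NAME_LENGTH = 255  # assumption: a common per-name limit; check your target filesystems

SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789-+!$%&`'@~#.,^")


def encoded_length(text):
    """Length of text after the thread's encoding, without building the string."""
    total = 0
    previous_was_upper = False
    for c in text:
        is_upper = "A" <= c <= "Z"
        if is_upper:
            # A run of uppercase letters costs one character per letter,
            # plus "(" and ")" once per run.
            total += 1 if previous_was_upper else 3
        elif c in SAFE or c == " ":
            total += 1
        else:
            total += len("%x" % ord(c)) + 2  # "{", hex digits, "}"
        previous_was_upper = is_upper
    return total


def fits(text, limit=MAX_NAME_LENGTH):
    """Return True if the encoded name stays within the assumed limit."""
    return encoded_length(text) <= limit


print(encoded_length("MIT Thesis"))    # 14
print(encoded_length("\u00e4" * 100))  # 400
print(fits("\u00e4" * 100))            # False
```

A name of 100 umlauts already needs 400 characters once encoded, which shows why a more conservative character set makes the length problem worse.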
Hi there!

Gabriel Genellina writes:
> On Wed, 05 Sep 2007 19:20:45 -0300, Torsten Bronger
> <br*****@physik.rwth-aachen.de> wrote:
>> Torsten Bronger writes:
>>> I'd like to map general Unicode strings to safe filenames. I
>>> tried punycode, but it is case-sensitive, which Windows is not.
>>> [...]
>>
>> Okay, the following works fine for me:
>
> Nice codec. Although if one is looking for really portable file
> names, there are additional rules, collected here:
> http://www.boost.org/libs/filesystem...lity_guide.htm
> It is hard to comply with all the character set rules *and* keep all
> name lengths below the limits.

Yes, and therefore a *very* careful encoding was not an option.
For my own application, for example, I need long filenames. So I
used a Wikipedia table to find a sensible compromise.

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Sep 6 '07 #4
