Wanted: safe codec for filenames

Torsten Bronger

Hallöchen!

I'd like to map general Unicode strings to safe filenames. I tried
punycode, but its output is case-sensitive, whereas Windows filenames
are not: "Hallo" and "hallo" are mapped to "Hallo-" and "hallo-".
I need uppercase Latin letters to be encoded, too, so the encoding
must contain only lowercase Latin letters, numbers, underscores, and
maybe a little bit more. The result should be more legible than
base64, though.
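
A minimal sketch of the punycode behaviour described above, written for Python 3 (where `str.encode` returns bytes, so the result is decoded to ASCII for display). The variable names are only illustrative.

```python
# Encode two names that differ only in case with the built-in punycode codec.
upper = "Hallo".encode("punycode").decode("ascii")
lower = "hallo".encode("punycode").decode("ascii")

print(upper)  # Hallo-
print(lower)  # hallo-

# The results are distinct strings, but they are equal once case is ignored,
# so they would name the same file on a case-insensitive filesystem.
collides = upper != lower and upper.lower() == lower.lower()
print(collides)  # True
```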

Has anybody created such a codec already?

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)

Sep 5 '07 #1

Torsten Bronger

Hallöchen!

Torsten Bronger writes:

> I'd like to map general unicode strings to safe filename. I tried
> punycode but it is case-sensitive, which Windows is not. Thus,
> "Hallo" and "hallo" are mapped to "Hallo-" and "hallo-", however,
> I need uppercase Latin letters being encoded, too, and the
> encoding must contain only lowercase Latin letters, numbers,
> underscores, and maybe a little bit more. The result should be
> more legible than base64, though.
>
> Has anybody created such a codec already?

Okay, the following works fine for me:
--8<---------------cut here---------------start------------->8---
import codecs


class Codec(codecs.Codec):
    """Codec class for safe filenames.  Safe filenames work on all important
    filesystems, i.e., they don't contain special or dangerous characters, and
    they don't assume that filenames are treated case-sensitively.

    >>> u"hallo".encode("safefilename")
    'hallo'
    >>> u"Hallo".encode("safefilename")
    '(h)allo'
    >>> u"MIT Thesis".encode("safefilename")
    '(mit)_(t)hesis'
    >>> u"Gesch\\u00e4ftsbrief".encode("safefilename")
    '(g)esch{e4}ftsbrief'

    Of course, the mapping works in both directions as expected:

    >>> "(g)esch{e4}ftsbrief".decode("safefilename")
    u'Gesch\\xe4ftsbrief'
    >>> "(mit)_(t)hesis".decode("safefilename")
    u'MIT Thesis'

    """
    lowercase_letters = "abcdefghijklmnopqrstuvwxyz"
    safe_characters = lowercase_letters + "0123456789-+!$%&`'@~#.,^"
    uppercase_letters = lowercase_letters.upper()

    def encode(self, input, errors='strict'):
        """Convert Unicode strings to safe filenames."""
        output = ""
        i = 0
        input_length = len(input)
        while i < input_length:
            c = input[i]
            if c in self.safe_characters:
                output += str(c)
            elif c == " ":
                output += "_"
            elif c in self.uppercase_letters:
                # A run of uppercase letters becomes one "(...)" group.
                output += "("
                while i < input_length and input[i] in self.uppercase_letters:
                    output += str(input[i]).lower()
                    i += 1
                output += ")"
                continue
            else:
                # Anything else is written as its hex code point in braces.
                output += "{" + hex(ord(c))[2:] + "}"
            i += 1
        return output, input_length

    def handle_problematic_characters(self, errors, input, start, end, message):
        """Return the replacement text for bad input, or raise in strict mode."""
        if errors == 'ignore':
            return u""
        elif errors == 'replace':
            return u"?"
        else:
            raise UnicodeDecodeError("safefilename", input, start, end, message)

    def decode(self, input, errors='strict'):
        """Convert safe filenames to Unicode strings."""
        input = str(input)
        input_length = len(input)
        output = u""
        i = 0
        while i < input_length:
            c = input[i]
            if c in self.safe_characters:
                output += c
            elif c == "_":
                output += " "
            elif c == "(":
                i += 1
                while i < input_length and input[i] in self.lowercase_letters:
                    output += input[i].upper()
                    i += 1
                if i == input_length:
                    # The handler's return value was discarded here originally.
                    output += self.handle_problematic_characters(
                        errors, input, i-1, i, "open parenthesis was never closed")
                    continue
                if input[i] != ')':
                    output += self.handle_problematic_characters(
                        errors, input, i, i+1,
                        "invalid character '%s' in parentheses sequence" % input[i])
                    continue
            elif c == "{":
                end_position = input.find("}", i)
                if end_position == -1:
                    # No closing bracket: skip over what looks like hex digits.
                    end_position = i+1
                    while end_position < input_length and \
                            input[end_position] in "0123456789abcdef" and \
                            end_position - i <= 8:
                        end_position += 1
                    output += self.handle_problematic_characters(
                        errors, input, i, end_position, "open bracket was never closed")
                    i = end_position
                    continue
                else:
                    try:
                        output += unichr(int(input[i+1:end_position], 16))
                    except (ValueError, OverflowError):
                        output += self.handle_problematic_characters(
                            errors, input, i, end_position+1, "invalid data between brackets")
                    i = end_position
            else:
                output += self.handle_problematic_characters(
                    errors, input, i, i+1, "invalid character '%s'" % c)
            i += 1
        return output, input_length


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def _registry(encoding):
    """Codec search function: only answers for the name "safefilename"."""
    if encoding == "safefilename":
        return (Codec().encode, Codec().decode, StreamReader, StreamWriter)
    else:
        return None

codecs.register(_registry)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
--8<---------------cut here---------------end--------------->8---
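
The posted codec targets Python 2 (`unichr`, `u"..."` literals, `str.decode`). For readers on Python 3, here is a rough sketch of the same mapping as two plain functions. It is not the author's code: the names `encode_name` and `decode_name` are hypothetical, nothing is registered with `codecs`, and malformed input always raises `ValueError` instead of honouring an `errors` argument.

```python
import string

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase
SAFE = LOWER + string.digits + "-+!$%&`'@~#.,^"
HEX_DIGITS = "0123456789abcdef"


def encode_name(text):
    """Map an arbitrary string to the case-insensitive safe form."""
    out = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in SAFE:
            out.append(c)
            i += 1
        elif c == " ":
            out.append("_")
            i += 1
        elif c in UPPER:
            # Collect the whole run of uppercase letters into one "(...)" group.
            j = i
            while j < len(text) and text[j] in UPPER:
                j += 1
            out.append("(" + text[i:j].lower() + ")")
            i = j
        else:
            # Everything else becomes its hexadecimal code point in braces.
            out.append("{%x}" % ord(c))
            i += 1
    return "".join(out)


def decode_name(name):
    """Invert encode_name; raise ValueError on malformed input."""
    out = []
    i = 0
    while i < len(name):
        c = name[i]
        if c in SAFE:
            out.append(c)
            i += 1
        elif c == "_":
            out.append(" ")
            i += 1
        elif c == "(":
            end = name.find(")", i)
            group = name[i + 1:end] if end != -1 else ""
            if not group or any(ch not in LOWER for ch in group):
                raise ValueError("bad parenthesised group at position %d" % i)
            out.append(group.upper())
            i = end + 1
        elif c == "{":
            end = name.find("}", i)
            digits = name[i + 1:end] if end != -1 else ""
            if not digits or any(ch not in HEX_DIGITS for ch in digits):
                raise ValueError("bad brace sequence at position %d" % i)
            try:
                out.append(chr(int(digits, 16)))
            except (ValueError, OverflowError):
                raise ValueError("code point out of range at position %d" % i)
            i = end + 1
        else:
            raise ValueError("invalid character %r at position %d" % (c, i))
    return "".join(out)


print(encode_name("MIT Thesis"))           # (mit)_(t)hesis
print(decode_name("(g)esch{e4}ftsbrief"))  # Geschäftsbrief
```

The examples from the docstring round-trip through these functions. Collecting each uppercase run into a single group keeps names such as "MIT" readable as "(mit)" rather than "(m)(i)(t)".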

Sep 5 '07 #2

Gabriel Genellina

On Wed, 05 Sep 2007 19:20:45 -0300, Torsten Bronger
<br*****@physik.rwth-aachen.de> wrote:

> Torsten Bronger writes:
>
>> I'd like to map general unicode strings to safe filename. I tried
>> punycode but it is case-sensitive, which Windows is not. Thus,
>> "Hallo" and "hallo" are mapped to "Hallo-" and "hallo-", however,
>> I need uppercase Latin letters being encoded, too, and the
>> encoding must contain only lowercase Latin letters, numbers,
>> underscores, and maybe a little bit more. The result should be
>> more legible than base64, though.
>
> Okay, the following works fine for me:

Nice codec. Although, if one is looking for really portable file names,
there are additional rules, collected here:
http://www.boost.org/libs/filesystem...lity_guide.htm
It is hard to comply with all the character set rules *and* keep all name
lengths below the limits.
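
To illustrate the kind of extra checks meant here, a small Python 3 sketch follows. The rules in it are assumptions based on commonly cited Windows restrictions (reserved device names, a trailing dot) and a length limit; they are not taken from the linked guide. The function name `portability_problems` and the default `max_length=255` are hypothetical.

```python
# Assumed rule set: commonly cited Windows restrictions, NOT the Boost guide.
WINDOWS_RESERVED = (
    {"con", "prn", "aux", "nul"}
    | {"com%d" % n for n in range(1, 10)}
    | {"lpt%d" % n for n in range(1, 10)}
)


def portability_problems(name, max_length=255):
    """Return a list of human-readable problems found in an encoded name."""
    problems = []
    if len(name) > max_length:
        problems.append("longer than %d characters" % max_length)
    # Reserved names also apply with an extension, so compare the part
    # before the first dot, ignoring case.
    stem = name.split(".", 1)[0].lower()
    if stem in WINDOWS_RESERVED:
        problems.append("reserved device name %r" % stem)
    if name.endswith("."):
        problems.append("ends with a dot")
    return problems


print(portability_problems("(mit)_(t)hesis"))  # []
print(portability_problems("con.txt"))         # ["reserved device name 'con'"]
```

Both issues can survive the codec: "con" consists only of safe characters and is passed through unchanged, and every escaped character grows to a sequence such as "{e4}", which pushes long names toward the limit.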

--
Gabriel Genellina

Sep 6 '07 #3

Torsten Bronger

Hallöchen!

Gabriel Genellina writes:

> On Wed, 05 Sep 2007 19:20:45 -0300, Torsten Bronger
> <br*****@physik.rwth-aachen.de> wrote:
>
>> Torsten Bronger writes:
>>
>>> I'd like to map general unicode strings to safe filename. I
>>> tried punycode but it is case-sensitive, which Windows is not.
>>> [...]
>>
>> Okay, the following works fine for me:
>
> Nice codec. Although if one is looking for really portable file
> names, there are additional rules, collected here
> http://www.boost.org/libs/filesystem...lity_guide.htm
> Hard to comply with all the character set rules *and* keep all name
> lengths below the limits.

Yes, and therefore a *very* careful encoding was not an option.
For my own application, for example, I need long filenames. So I
used a Wikipedia table to find a sensible compromise.

Tschö,
Torsten.

Sep 6 '07 #4
