Bytes IT Community

making a valid file name...

P: n/a
Hi, I'm writing a Python script that creates directories from user
input.
Sometimes the user inputs characters that aren't valid characters for a
file or directory name.
Here are the characters that I consider to be valid characters...

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

If I have a string called fname, I want to go through each character in
the filename and, if it is not a valid character, replace it with a
space.

This is what I have:

def fixfilename(fname):
    valid = ':.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
    for i in range(len(fname)):
        if valid.find(fname[i]) < 0:
            fname[i] = ' '
    return fname

Anyone think of a simpler solution?
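
Note that, as written, the loop above cannot run: Python strings are immutable, so `fname[i] = ' '` raises a TypeError. A minimal working sketch of the same idea, assuming Python 3, copies the characters into a list first. The names `fix_filename` and `VALID` are illustrative only, and the whitelist is the one listed at the top of the post.

```python
# Illustrative sketch (Python 3). Names are assumptions, not from the thread.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def fix_filename(fname, valid=VALID):
    """Return fname with every character outside `valid` replaced by a space."""
    chars = list(fname)              # str is immutable, so edit a list copy
    for i, c in enumerate(chars):
        if c not in valid:
            chars[i] = ' '
    return ''.join(chars)            # rebuild a string from the edited list
```

For example, `fix_filename("a&b.txt")` gives `'a b.txt'`.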

Oct 17 '06 #1
10 Replies


P: n/a
I would suggest something like string.maketrans
(http://docs.python.org/lib/node41.html). I don't remember exactly how
it works, but I think it's something like
>>> import string
>>> invalid_chars = "abc"
>>> replace_chars = "123"
>>> char_map = string.maketrans(invalid_chars, replace_chars)
>>> filename = "abc123.txt"
>>> filename.translate(char_map)
'123123.txt'
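
The session above uses the Python 2 `string.maketrans` function. In Python 3, `str.translate` accepts a dict that maps code points to replacements, which suits a whitelist. The following is only a sketch under that assumption; `fix_filename_translate` and `VALID` are hypothetical names, and the table is built from whatever invalid characters the given name actually contains.

```python
# Illustrative sketch (Python 3). Names are assumptions, not from the thread.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def fix_filename_translate(fname, valid=VALID):
    """Replace characters outside `valid` with spaces via str.translate."""
    # Map the code point of each invalid character present in fname to a space.
    table = {ord(c): ' ' for c in set(fname) - set(valid)}
    return fname.translate(table)
```

Characters with no entry in the table pass through unchanged, so only the offending ones are touched.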

--
Jerry

Oct 17 '06 #2

P: n/a

SpreadTooThin wrote:
> Hi I'm writing a python script that creates directories from user
> input.
> Sometimes the user inputs characters that aren't valid characters for a
> file or directory name.
> Here are the characters that I consider to be valid characters...
>
> valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
>
> if I have a string called fname I want to go through each character in
> the filename and if it is not a valid character, then I want to replace
> it with a space.
>
> This is what I have:
>
> def fixfilename(fname):
>     valid = ':.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
>     for i in range(len(fname)):
>         if valid.find(fname[i]) < 0:
>             fname[i] = ' '
>     return fname
>
> Anyone think of a simpler solution?

If you want to strip 'em:
>>> valid=':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
>>> filename = '!"£!£$"$££$%$£%$£lasfjalsfjdlasfjasfd()()()somethingelse.dat'
>>> stripped = ''.join(c for c in filename if c in valid)
>>> stripped
'lasfjalsfjdlasfjasfdsomethingelse.dat'

If you want to replace them with something, be careful of the regex
string being built (i.e., a space character).
>>> import re
>>> re.sub(r'[^%s]' % valid,' ',filename)
' lasfjalsfjdlasfjasfd somethingelse.dat'

Jon.

Oct 17 '06 #3

P: n/a
> Sometimes the user inputs characters that aren't valid
> characters for a file or directory name. Here are the
> characters that I consider to be valid characters...
>
> valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Just a caveat: colons and slashes can cause grief on various
operating systems, and combined with periods they may be able to
cause trouble too...

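
To make that caveat concrete: '/' separates path components on Unix, '\' and ':' have path meaning on Windows, and the names '.' and '..' refer to the current and parent directory. A rough sketch of a check for such names, assuming Python 3, might look like this; `is_risky_component` is a hypothetical helper, and the set of separators is an assumption rather than a complete rule for any particular OS.

```python
import os

def is_risky_component(name):
    """Flag names that a plain character whitelist would still let through."""
    # Separators treated as risky on any platform (an assumption of this sketch).
    separators = {os.sep, '/', '\\', ':'}
    if os.altsep:                       # e.g. '/' on Windows
        separators.add(os.altsep)
    if name.strip() in ('', '.', '..'): # empty, current dir, parent dir
        return True
    return any(c in separators for c in name)
```

A caller could reject such names outright or strip the flagged characters before creating the directory.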
> This is what I have:
>
> def fixfilename(fname):
>     valid = ':.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
>     for i in range(len(fname)):
>         if valid.find(fname[i]) < 0:
>             fname[i] = ' '
>     return fname
>
> Anyone think of a simpler solution?

I don't know if it's simpler, but you can use
>>> fname = "this is a test & it ain't expen$ive.py"
>>> ''.join(c in valid and c or ' ' for c in fname)
'this is a test   it ain t expen ive.py'

It does use the "it's almost a ternary operator, but not quite"
method concurrently being discussed/lambasted in another thread.
Treat accordingly, with all that may entail. Should be good in
this case though.
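
For reference, the `and`/`or` idiom falls through to the `or` branch whenever the middle value is falsy. It is safe here because a single character is never an empty string. Python 2.5 and later also offer a conditional expression that avoids the pitfall. A small sketch, assuming Python 3 and the same whitelist (the uppercase name `VALID` is just for illustration):

```python
# Illustrative sketch (Python 3). VALID mirrors the whitelist from the thread.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

fname = "this is a test & it ain't expen$ive.py"

# Conditional expression: keep c if it is whitelisted, else use a space.
cleaned = ''.join(c if c in VALID else ' ' for c in fname)

# The pitfall of the and/or idiom: a falsy middle value is skipped.
pitfall = True and '' or 'fallback'   # evaluates to 'fallback', not ''
```

Both spellings give the same result for this filename.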

If you're doing it on a time-critical basis, it might help to
make "valid" a set, which should have O(1) membership testing,
rather than using the "in" test with a string. I don't know how
well the find() method of a string performs in relationship to
"in" testing of a set. Test and see, if it's important.

-tkc

Oct 17 '06 #4

P: n/a
Hi,

On 10/17/2006 06:22:45 PM, SpreadTooThin wrote:
> valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Without the OS platform being specified, these are not all the characters
that may occur in a filename: '[]{}-=", etc. And '/' is NOT valid
on a Unix platform. It should be easy to scan the filename and
check every character against the 'valid-string'.

HTH, cu l8r, Edgar.
--
\|||/
(o o) Just curious...
----ooO-(_)-Ooo---------------------------------------------------------
Oct 17 '06 #5

P: n/a
On 2006-10-17, Tim Chase <py*********@tim.thechases.com> wrote:
> If you're doing it on a time-critical basis, it might help to
> make "valid" a set, which should have O(1) membership testing,
> rather than using the "in" test with a string. I don't know
> how well the find() method of a string performs in relationship
> to "in" testing of a set. Test and see, if it's important.

The find method of (8-bit) strings is really, really fast. My
guess is that set can't beat it. I tried to beat it recently with
a binary search function. Even after applying psyco, find was
still faster (though I could beat the bisect functions by a
little bit by replacing a divide with a shift).

--
Neil Cerutti
This is not a book to be put down lightly. It should be thrown
with great force. --Dorothy Parker
Oct 17 '06 #6

P: n/a
>> If you're doing it on a time-critical basis, it might help to
>> make "valid" a set, which should have O(1) membership testing,
>> rather than using the "in" test with a string. I don't know
>> how well the find() method of a string performs in relationship
>> to "in" testing of a set. Test and see, if it's important.
>
> The find method of (8-bit) strings is really, really fast. My
> guess is that set can't beat it. I tried to beat it recently with
> a binary search function. Even after applying psyco find was
> still faster (though I could beat the bisect functions by a
> little bit by replacing a divide with a shift).

In "theory" (you know...that little town in west Texas where
everything goes right), a set-membership test should be O(1). A
binary search function would be O(log N). A linear search of a
string for a member should be O(N).

In practice, however, for such small strings as the given
whitelist, the underlying find() operation likely doesn't put a
blip on the radar. If your whitelist were some huge document
that you were searching repeatedly, it could have worse
performance. Additionally, the find() in the underlying C code
is likely about as bare-metal as it gets, whereas the set
membership aspect of things may go through some more convoluted
setup/teardown/hashing and spend a lot more time further from the
processor's op-codes.

And I know that a number of folks have done some hefty
optimization of Python's string-handling abilities. There's
likely a tradeoff point where it's better to use one over the
other depending on the size of the whitelist. YMMV
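
The three strategies compared above can be put side by side. This is only a sketch, assuming Python 3; the function names are hypothetical, and it shows that the approaches agree on results, not which one is faster.

```python
from bisect import bisect_left

# Illustrative sketch (Python 3). Names are assumptions, not from the thread.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
VALID_SET = frozenset(VALID)     # hash-based lookup, O(1) on average
VALID_SORTED = sorted(VALID)     # sorted list for binary search, O(log N)

def in_string(c):
    """Linear scan of the whitelist string, O(N)."""
    return c in VALID

def in_set(c):
    """Hash lookup in a frozenset."""
    return c in VALID_SET

def in_sorted(c):
    """Binary search with bisect_left over the sorted characters."""
    i = bisect_left(VALID_SORTED, c)
    return i < len(VALID_SORTED) and VALID_SORTED[i] == c
```

With a whitelist this small, constant factors are likely to matter more than the asymptotic class, which is the point made above.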

-tkc



Oct 17 '06 #7

P: n/a
On 2006-10-17, Edgar Matzinger <ed***@edgar-matzinger.nl> wrote:
> Hi,
>
> On 10/17/2006 06:22:45 PM, SpreadTooThin wrote:
>> valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
>
> not specifying the OS platform, these are not all the
> characters that may occur in a filename: '[]{}-=", etc. And '/'
> is NOT valid. On a unix platform. And it should be easy to
> scan the filename and check every character against the
> 'valid-string'.

In the interactive fiction world where I come from, a portable
filename is only 8 chars long and matches the regex
[A-Z][A-Z0-9]*, i.e., capital letters and numbers, with no
extension. That way it'll work on old DOS machines and on
Risc-OS. Wait... is there Python for Risc-OS?
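
As a rough illustration of that rule, assuming Python 3 and reading "only 8 chars long" as "at most 8 characters", the check could be written with a bounded repetition. The name `is_portable` is hypothetical.

```python
import re

# One capital letter followed by up to seven capitals or digits (8 chars max).
# The 8-character cap is this sketch's reading of the rule described above.
PORTABLE = re.compile(r'[A-Z][A-Z0-9]{0,7}')

def is_portable(name):
    """Return True if the whole name matches the portable-filename pattern."""
    return PORTABLE.fullmatch(name) is not None
```

`fullmatch` is used so that a valid prefix followed by other characters is rejected.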
> HTH, cu l8r, Edgar.

--
Neil Cerutti
Oct 17 '06 #8

P: n/a
Matthew Warren wrote:
> >>> import re
> >>> badfilename='£"%^"£^"£$^ihgeroighroeig3645^£$^"knovin98u4#346#1461461'
> >>> valid=':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
> >>> goodfilename=re.sub('[^'+valid+']',' ',badfilename)

To create arbitrary character sets, it's usually best to run the character string through
re.escape() before passing it to the RE engine.
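
A short sketch of that advice, assuming Python 3: the whitelist is escaped before being placed inside the negated character class, so characters such as ']', '-' or '\' in a future whitelist cannot change the meaning of the pattern. `INVALID_RE` and `fix_filename_re` are illustrative names.

```python
import re

# Illustrative sketch (Python 3). Names are assumptions, not from the thread.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

# re.escape() neutralises regex metacharacters before the class is built.
INVALID_RE = re.compile('[^' + re.escape(VALID) + ']')

def fix_filename_re(fname):
    """Replace every character outside the whitelist with a space."""
    return INVALID_RE.sub(' ', fname)
```

Compiling the pattern once also avoids rebuilding it on every call.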

</F>

Oct 18 '06 #9

P: n/a
Tim Chase:
> In practice, however, for such small strings as the given
> whitelist, the underlying find() operation likely doesn't put a
> blip on the radar. If your whitelist were some huge document
> that you were searching repeatedly, it could have worse
> performance. Additionally, the find() in the underlying C code
> is likely about as bare-metal as it gets, whereas the set
> membership aspect of things may go through some more convoluted
> setup/teardown/hashing and spend a lot more time further from the
> processor's op-codes.

With this specific test (half good, half bad), on Py2.5, on my PC, sets
start to be faster than the string search when the string "good" is
about 5-6 chars long (this means sets are quite fast, I presume).

from random import choice, seed
from time import clock

def main(choice=choice):
    seed(1)
    n = 100000

    for good in ("ab", "abc", "abcdef", "abcdefgh",
                 "abcdefghijklmnopqrstuvwxyz"):
        poss = good + good.upper()
        data = [choice(poss) for _ in xrange(n)] * 10
        print "len(good) = ", len(good)

        t = clock()
        for c in data:
            c in good
        print round(clock()-t, 2)

        t = clock()
        sgood = set(good)
        for c in data:
            c in sgood
        print round(clock()-t, 2), "\n"

main()
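
The benchmark above is Python 2 code (print statement, `xrange`, `time.clock`); `time.clock` was removed in Python 3.8. A rough Python 3 port of the same idea, using `time.perf_counter`, could look like the sketch below. The function name `time_membership` and its smaller default `n` are assumptions of this sketch, and the timings it returns will vary by machine and Python version.

```python
from random import choice, seed
from time import perf_counter

def time_membership(good, n=100_000):
    """Return (string_seconds, set_seconds) for n membership tests."""
    seed(1)
    poss = good + good.upper()      # about half the test characters are misses
    data = [choice(poss) for _ in range(n)]

    t = perf_counter()
    for c in data:
        c in good                   # linear search in the string
    t_str = perf_counter() - t

    t = perf_counter()
    sgood = set(good)               # set construction is timed, as in the original
    for c in data:
        c in sgood                  # hash lookup in the set
    t_set = perf_counter() - t
    return t_str, t_set

if __name__ == "__main__":
    for good in ("ab", "abc", "abcdef", "abcdefgh",
                 "abcdefghijklmnopqrstuvwxyz"):
        t_str, t_set = time_membership(good)
        print("len(good) =", len(good), round(t_str, 3), round(t_set, 3))
```

For more careful measurements, the standard `timeit` module repeats the timed statement and reduces noise.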
Bye,
bearophile

Oct 18 '06 #10

P: n/a
On 2006-10-18, be************@lycos.com <be************@lycos.com> wrote:
> Tim Chase:
>> In practice, however, for such small strings as the given
>> whitelist, the underlying find() operation likely doesn't put a
>> blip on the radar. If your whitelist were some huge document
>> that you were searching repeatedly, it could have worse
>> performance. Additionally, the find() in the underlying C code
>> is likely about as bare-metal as it gets, whereas the set
>> membership aspect of things may go through some more convoluted
>> setup/teardown/hashing and spend a lot more time further from the
>> processor's op-codes.
>
> With this specific test (half good half bad), on Py2.5, on my PC, sets
> start to be faster than the string search when the string "good" is
> about 5-6 chars long (this means set are quite fast, I presume).
>
> from random import choice, seed
> from time import clock
>
> def main(choice=choice):
>     seed(1)
>     n = 100000
>
>     for good in ("ab", "abc", "abcdef", "abcdefgh",
>                  "abcdefghijklmnopqrstuvwxyz"):
>         poss = good + good.upper()
>         data = [choice(poss) for _ in xrange(n)] * 10
>         print "len(good) = ", len(good)
>
>         t = clock()
>         for c in data:
>             c in good
>         print round(clock()-t, 2)
>
>         t = clock()
>         sgood = set(good)
>         for c in data:
>             c in sgood
>         print round(clock()-t, 2), "\n"
>
> main()

On my Python2.4 for Windows, they are often still neck-and-neck
for len(good) = 26. set's disadvantage of having to be
constructed is heavily amortized over 100,000 membership
tests. Without knowing the usage pattern, it'd be hard to choose
between them.

--
Neil Cerutti
Oct 19 '06 #11
