stripping unwanted chars from string

Edward Elliott

I'm looking for the "best" way to strip a large set of chars from a filename
string (my definition of best usually means succinct and readable). I
only want to allow alphanumeric chars, dashes, and periods. This is what I
would write in Perl (bless me father, for I have sinned...):

$filename =~ tr/a-zA-Z0-9_.-//cd, or equivalently
$filename =~ s/[^\w.-]//g

I could just use re.sub like the second example, but that's a bit overkill.
I'm trying to figure out if there's a good way to do the same thing with
string methods. string.translate seems to do what I want, the problem is
specifying the set of chars to remove. Obviously hardcoding them all is a
non-starter.

Working with chars seems to be a bit of a pain. There's no equivalent of
the range function, one has to do something like this:

>>> [chr(x) for x in range(ord('a'), ord('z')+1)]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Do that twice for letters, once for numbers, add in a few others, and I get
the chars I want to keep. Then I'd invert the set and call translate.
It's a mess and not worth the trouble. Unless there's some way to expand a
compact representation of a char list and obtain its complement, it looks
like I'll have to use a regex.

Ideally, there would be a mythical charset module that works like this:
keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'
toss = charset.invert (keep)

Sadly I can find no such beast. Anyone have any insight? As of now,
regexes look like the best solution.
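
For what it's worth, the wished-for helper is small enough to sketch by hand. What follows is a minimal Python 3 sketch, not an existing module: `expand` and `invert` are hypothetical names borrowed from the post, only simple `a-z` style ranges are handled (no `\w` classes or escapes), and the complement is taken relative to the first 256 code points purely as an assumption.

```python
def expand(spec):
    """Expand a compact spec such as 'a-zA-Z0-9_.-' into a set of characters.

    Hypothetical helper: a '-' between two characters denotes an inclusive
    range; a '-' at the very start or end of the spec is taken literally.
    """
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == '-':
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars


def invert(chars, universe=None):
    """Return the complement of chars (assumed universe: code points 0-255)."""
    if universe is None:
        universe = {chr(c) for c in range(256)}
    return set(universe) - set(chars)


keep = expand('a-zA-Z0-9_.-')
toss = invert(keep)
cleaned = ''.join(c for c in 'qwe!@#456.--Howzat?' if c in keep)
print(cleaned)  # qwe456.--Howzat
```

With a set of characters to keep in hand, the complement is only needed for APIs that want an explicit list of characters to delete; a plain membership test does not need it.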

May 4 '06 #1

John Machin

On 4/05/2006 1:36 PM, Edward Elliott wrote:

I'm looking for the "best" way to strip a large set of chars from a filename
string (my definition of best usually means succinct and readable). I
only want to allow alphanumeric chars, dashes, and periods. This is what I
would write in **** (bless me father, for I have sinned...):

[expletives deleted] and it was wrong anyway (according to your
requirements); using \w would keep '_' which is *NOT* alphanumeric.

I could just use re.sub like the second example, but that's a bit overkill.
I'm trying to figure out if there's a good way to do the same thing with
string methods. string.translate seems to do what I want, the problem is
specifying the set of chars to remove. Obviously hardcoding them all is a
non-starter.

Working with chars seems to be a bit of a pain. There's no equivalent of
the range function, one has to do something like this:
>>> [chr(x) for x in range(ord('a'), ord('z')+1)]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

>>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.
>>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
>>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
>>> fixer('qwe!@#456.--Howzat?')
'qwe456.--Howzat'

Do that twice for letters, once for numbers, add in a few others, and I get
the chars I want to keep. Then I'd invert the set and call translate.
It's a mess and not worth the trouble. Unless there's some way to expand a
compact representation of a char list and obtain its complement, it looks
like I'll have to use a regex.

Ideally, there would be a mythical charset module that works like this:
keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'

Where'd that '_' come from?

toss = charset.invert (keep)

Sadly I can find no such beast. Anyone have any insight? As of now,
regexes look like the best solution.

I'll leave it to somebody else to dredge up the standard riposte to your
last sentence :-)

One point on your requirements: replacing unwanted characters instead of
deleting them may be better -- theoretically possible problems with
deleting are: (1) duplicates (foo and foo_ become the same) (2) '_'
becomes '' which is not a valid filename. And a legibility problem: if
you hate '_' and ' ' so much, why not change them to '-'?
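
The difference between deleting and replacing can be made concrete. This is a minimal Python 3 sketch under stated assumptions: the function names and the `'-'` filler are made up for illustration, and underscore is deliberately left out of the allowed set so that the `foo` / `foo_` collision shows up.

```python
import re

# Allowed here: ASCII letters, digits, '-' and '.'; everything else is unwanted.
UNWANTED = re.compile(r'[^A-Za-z0-9.-]')


def delete_unwanted(name):
    """Drop every unwanted character."""
    return UNWANTED.sub('', name)


def replace_unwanted(name, filler='-'):
    """Swap every unwanted character for a filler instead of dropping it."""
    return UNWANTED.sub(filler, name)


# Deleting maps two distinct names onto one and can produce an empty name.
print(delete_unwanted('foo'), delete_unwanted('foo_'))    # foo foo
print(repr(delete_unwanted('_')))                         # ''

# Replacing keeps the names distinct and non-empty.
print(replace_unwanted('foo'), replace_unwanted('foo_'))  # foo foo-
print(repr(replace_unwanted('_')))                        # '-'
```

Replacement does not rule out collisions altogether (`foo_` and `foo!` both become `foo-`), so a uniqueness check may still be wanted either way.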

Oh and just in case the fix was accidentally applied to a path:

import os
keepchars.update(os.sep)
if os.altsep:
    keepchars.update(os.altsep)

HTH,
John

May 4 '06 #2

Edward Elliott

John Machin wrote:

[expletives deleted] and it was wrong anyway (according to your
requirements);
using \w would keep '_' which is *NOT* alphanumeric.

Actually the Perl is correct; the explanation was the faulty part. When in
doubt, trust the code. Plus I explicitly allowed _ further down, so the
mistake should have been fairly obvious.

>>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.

I won't dignify that with a response. The code, that is; I couldn't give a
toss about the comments. If you enjoy using such verbose, error-prone
representations in your code, God help anyone maintaining it, including
you six months later. Quick, find the difference between these sets at a
glance:

'qwertyuiopasdfghjklzxcvbnm'
'abcdefghijklmnopqrstuvwxyz'
'abcdefghijklmnopprstuvwxyz'
'abcdefghijk1mnopqrstuvwxyz'
'qwertyuopasdfghjklzxcvbnm' # no fair peeking

And I won't even bring up locales.

>>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
>>> fixer = lambda x: ''.join(c for c in x if c in keepchars)

Those darn monkeys, always think they're so clever! ;)
if "you can" == "you should": do(it)
else: do(not)

Sadly I can find no such beast. Anyone have any insight? As of now,
regexes look like the best solution.

I'll leave it to somebody else to dredge up the standard riposte to your
last sentence :-)

If the monstrosity above is the best you've got, regexes are clearly the
better solution. Readable trumps inscrutable any day.
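
For reference, the regex version under discussion fits in a couple of lines. A minimal sketch in Python 3: the name `strip_filename` is made up here, and passing `re.ASCII` is an assumption that restricts `\w` to `[a-zA-Z0-9_]` (without it, Python 3's `\w` also matches non-ASCII letters and digits).

```python
import re

# Anything that is not a word character, a period, or a dash.
_UNWANTED = re.compile(r'[^\w.-]', re.ASCII)


def strip_filename(name):
    """Delete every character outside [a-zA-Z0-9_.-]."""
    return _UNWANTED.sub('', name)


print(strip_filename('qwe!@#456.--Howzat?'))  # qwe456.--Howzat
print(strip_filename('my file_v2 (final).txt'))  # myfile_v2final.txt
```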

One point on your requirements: replacing unwanted characters instead of
deleting them may be better -- theoretically possible problems with
deleting are: (1) duplicates (foo and foo_ become the same) (2) '_'
becomes '' which is not a valid filename.

Which is why I perform checks for emptiness and uniqueness after the strip.
I decided long ago that stripping is preferable to replacement here.
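
The post does not show those checks, so the following is only a guess at their shape. In this Python 3 sketch, `safe_name`, the `taken` set, and the choice to raise `ValueError` are all assumptions made for illustration.

```python
import re


def safe_name(name, taken):
    """Strip disallowed characters, then reject empty or duplicate results.

    Hypothetical helper: `taken` is a set of names already in use and is
    updated in place when a new name is accepted.
    """
    cleaned = re.sub(r'[^\w.-]', '', name, flags=re.ASCII)
    if not cleaned:
        raise ValueError('nothing left after stripping %r' % name)
    if cleaned in taken:
        raise ValueError('%r collides with an existing name' % cleaned)
    taken.add(cleaned)
    return cleaned


taken = set()
print(safe_name('foo!.txt', taken))  # foo.txt
```

A second call with `'foo?.txt'` would raise here, because it strips down to the same `foo.txt`.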

And a legibility problem: if
you hate '_' and ' ' so much, why not change them to '-'?

_ is allowed. And I do prefer -, but not for legibility. It doesn't
require me to hit Shift.

Oh and just in case the fix was accidentally applied to a path:

keepchars.update(os.sep)
if os.altsep: keepchars.update(os.altsep)

Nope, like I said this is strictly a filename. Stripping out path
components is the first thing I do. But thanks for pointing out these
common pitfalls for members of our studio audience. Tell him what he's
won, Johnny! ;)
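
Stripping the path first and the unwanted characters second might look like the sketch below (Python 3). `clean_basename` is a hypothetical name, and using `os.path.basename` for the first step is an assumption; the post does not say how the path components are removed.

```python
import os.path
import re


def clean_basename(path):
    """Drop directory components, then strip unwanted filename characters."""
    name = os.path.basename(path)
    return re.sub(r'[^\w.-]', '', name, flags=re.ASCII)


print(clean_basename('/tmp/some dir/my file!.txt'))  # myfile.txt
```

Note that `os.path.basename` returns an empty string for a path ending in a separator, so the emptiness check still matters.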

May 4 '06 #3

Bryan

>>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
or
keepchars = set(string.letters + string.digits + '-.')

bryan

May 4 '06 #4

Edward Elliott

Bryan wrote:

>>> keepchars = set(string.letters + string.digits + '-.')

Now that looks a lot better. Just don't forget the underscore. :)
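
With the underscore added, the whole thing comes to a few lines. This sketch is written for Python 3, where `string.letters` no longer exists; `string.ascii_letters` is used instead and is not affected by the locale. The name `fixer` is kept from the earlier post.

```python
import string

# ASCII letters, digits, underscore, dash, and period.
keepchars = set(string.ascii_letters + string.digits + '_-.')


def fixer(name):
    """Keep only the characters listed in keepchars."""
    return ''.join(c for c in name if c in keepchars)


print(fixer('qwe!@#456.--Howzat?_x'))  # qwe456.--Howzat_x
```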

May 4 '06 #5

bruno at modulix

Edward Elliott wrote:

Bryan wrote:
>>> keepchars = set(string.letters + string.digits + '-.')

Now that looks a lot better. Just don't forget the underscore. :)

You may also want to have a look at string.translate() and
string.maketrans()

--
bruno desthuilliers
python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
p in 'o****@xiludom.gro'.split('@')])"
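
One way the translate() suggestion could be applied without listing every character to delete is sketched below for Python 3 `str` objects. `KeepTable` is a made-up name. The sketch relies on `str.translate` looking characters up by code point and deleting any that map to `None`, and on `dict` subclasses being able to supply a default through `__missing__`.

```python
import string


class KeepTable(dict):
    """Translation table for str.translate that deletes unlisted characters."""

    def __init__(self, keep):
        # Kept characters map to themselves.
        super().__init__((ord(c), c) for c in keep)

    def __missing__(self, codepoint):
        # Anything not listed maps to None, which translate() treats as "delete".
        return None


table = KeepTable(string.ascii_letters + string.digits + '_-.')
print('qwe!@#456.--Howzat?'.translate(table))  # qwe456.--Howzat
```

Because the default is computed on lookup, the table stays small even though it covers every possible code point.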

May 4 '06 #6

John Machin

On 4/05/2006 4:30 PM, Edward Elliott wrote:

Bryan wrote:
>>> keepchars = set(string.letters + string.digits + '-.')
Now that looks a lot better. Just don't forget the underscore. :)

*Looks* better than the monkey business. Perhaps I should point out to
those of the studio audience who are huddled in an ASCII bunker (if any)
that string.letters provides the characters considered to be alphabetic
in whatever the locale is currently set to. There is no guarantee that
the operating system won't permit filenames containing other characters,
ones that the file's creator would quite reasonably consider to be
alphabetic. And of course there are languages that have characters that
one would not want to strip but can scarcely be described as alphanumeric.

>>> import os
>>> os.listdir(u'.')
[u'\xc9t\xe9_et_hiver.doc', u'\u041c\u043e\u0441\u043a\u0432\u0430.txt',
u'\u5f20\u654f.txt']
>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Doing
import locale; locale.setlocale(locale.LC_ALL, '')
would make string.letters work (for me) with the first file above, but
that's all.
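
If non-ASCII names like the three above should survive, one option is to test each character's Unicode properties instead of comparing against a fixed alphabet. This is a minimal Python 3 sketch with an assumed function name; `str.isalnum` accepts a broad range of letters and digits, which may be more permissive than wanted.

```python
EXTRA = set('_-.')


def strip_unicode_aware(name):
    """Keep Unicode letters and digits plus underscore, dash, and period."""
    return ''.join(c for c in name if c.isalnum() or c in EXTRA)


names = [
    '\xc9t\xe9_et_hiver.doc',
    '\u041c\u043e\u0441\u043a\u0432\u0430.txt',
    '\u5f20\u654f.txt',
]
# All three names pass through unchanged.
print(all(strip_unicode_aware(n) == n for n in names))  # True
```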

May 4 '06 #7

Alex Martelli

Edward Elliott <no****@127.0.0.1> wrote:

I'm looking for the "best" way to strip a large set of chars from a filename
string (my definition of best usually means succinct and readable). I
only want to allow alphanumeric chars, dashes, and periods. This is what I
would write in Perl (bless me father, for I have sinned...):

$filename =~ tr/a-zA-Z0-9_.-//cd, or equivalently
$filename =~ s/[^\w.-]//g

I could just use re.sub like the second example, but that's a bit overkill.
I'm trying to figure out if there's a good way to do the same thing with
string methods. string.translate seems to do what I want, the problem is
specifying the set of chars to remove. Obviously hardcoding them all is a
non-starter.

(untested code, but the general idea should be correct):

import string

class KeepOnly(object):
    # All 256 single-byte characters, and the identity translation table.
    allchars = ''.join(chr(i) for i in xrange(256))
    identity = string.maketrans('', '')

    def __init__(self, chars_to_keep):
        # Deleting the kept chars from allchars leaves the chars to delete.
        self.chars_to_delete = self.allchars.translate(
            self.identity, chars_to_keep)

    def __call__(self, some_string):
        return some_string.translate(self.identity,
                                     self.chars_to_delete)

Alex
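
The class above relies on Python 2 names (`xrange`, `string.maketrans`, and the two-argument `str.translate`). A rough Python 3 equivalent, offered as a sketch rather than as the original author's code, works on `bytes`, where `translate(None, delete)` removes the listed bytes without needing an identity table.

```python
import string


class KeepOnly:
    """Callable that deletes every byte not listed in bytes_to_keep."""

    allbytes = bytes(range(256))

    def __init__(self, bytes_to_keep):
        # Removing the kept bytes from the full set leaves the complement.
        self.bytes_to_delete = self.allbytes.translate(None, bytes_to_keep)

    def __call__(self, some_bytes):
        return some_bytes.translate(None, self.bytes_to_delete)


allowed = (string.ascii_letters + string.digits + '_-.').encode('ascii')
keep_filename_chars = KeepOnly(allowed)
print(keep_filename_chars(b'qwe!@#456.--Howzat?'))  # b'qwe456.--Howzat'
```

The complement is computed once per instance, so each call is a single `translate`.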

May 4 '06 #8
