Bytes IT Community

Delete all not allowed characters..

P: n/a
Hi,
I want to delete all characters that are not allowed in my text.
I use this function:

    def clear(s1=""):
        if s1:
            allowed = [
                u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
                u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
                'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
                'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
                'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
                's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']
            s1 = "".join(ch for ch in s1 if ch in allowed)
        return s1

My problem is that this function replaces each disallowed character with "",
but I want it replaced with " ".
For example:
input: Exam%^^ple
output: Exam ple
That is the output I want, but my code outputs "Example".
How can I do this quickly? The text is very long.

Oct 25 '07 #1
9 Replies


P: n/a
Something like:

    import re

    def clear(s, allowed=[], case_sensitive=True):
        flags = ''
        if not case_sensitive:
            flags = '(?i)'
        return re.sub(flags + '[^%s]' % ''.join(allowed), ' ', s)

And call:

clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ] )
clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ], False )

And so forth. Or just use re directly!

(This implementation is imperfect in that it's possible to hack the
regular expression, and it may break with mismatched '[]' characters,
but the idea is there.)
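
One way to address the caveat above is to escape each allowed character before building the character class. The following is only a sketch, assuming Python 3; the empty-list handling and the use of `re.IGNORECASE` instead of an inline `(?i)` prefix are my own choices, not part of the reply above.

```python
import re

def clear(s, allowed, case_sensitive=True):
    """Replace every character of s that is not in allowed with a space."""
    if not allowed:
        # An empty class "[^]" is not valid regex syntax; nothing is allowed.
        return " " * len(s)
    # re.escape() neutralises characters such as ']', '^' and '-' that
    # would otherwise change the meaning of the character class.
    char_class = "[^" + "".join(re.escape(ch) for ch in allowed) + "]"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.sub(char_class, " ", s, flags=flags)

print(clear("Exam%^^ple", "Eexampl"))
```

With escaping in place, a caller can pass `]` or `-` as allowed characters without breaking the pattern.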

Adam

Oct 25 '07 #2

P: n/a
The list comprehension does not allow "else",
but it can be used in a similar form:

    s2 = ""
    for ch in s1:
        s2 += ch if ch in allowed else " "

(maybe this could be written more nicely)
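
For what it is worth, only the trailing `if` filter of a comprehension lacks an `else`; a conditional expression (`x if cond else y`) is allowed in the expression part. A minimal sketch, assuming Python 3; the module-level `ALLOWED` set is an illustrative name, filled with the characters from the original post.

```python
# Characters taken from the original post; a set gives fast membership tests.
ALLOWED = set("+0123456789 ŞşÖöÜüÇçİıĞğ"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

def clear(s, allowed=ALLOWED):
    # Keep allowed characters, replace everything else with one space each.
    return "".join(ch if ch in allowed else " " for ch in s)

print(clear("Exam%^^ple"))
```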
Oct 25 '07 #3

P: n/a
On Thu, 25 Oct 2007 17:42:36 +0200, Michal Bozon wrote:
> the list comprehension does not allow "else", but it can be used in a
> similar form:
>
>     s2 = ""
>     for ch in s1:
>         s2 += ch if ch in allowed else " "
>
> (maybe this could be written more nicely)

Repeatedly adding strings together in this way is about the most
inefficient, slow way of building up a long string. (Although I'm sure
somebody can come up with a worse way if they try hard enough.)

Even though recent versions of CPython have a local optimization that
improves the performance hit of string concatenation somewhat, it is
better to use ''.join() rather than add many strings together:

    s2 = []
    for ch in s1:
        s2.append(ch if (ch in allowed) else " ")
    s2 = ''.join(s2)

Although even that doesn't come close to the efficiency and speed of
string.translate() and string.maketrans(). Try to find a way to use them.

Here is one way, for ASCII characters.

    import string

    allowed = "abcdef"
    all_chars = string.maketrans('', '')  # all 256 byte values, in order
    not_allowed = ''.join(c for c in all_chars if c not in allowed)
    table = string.maketrans(not_allowed, ' ' * len(not_allowed))
    new_string = string.translate(old_string, table)
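
The `string.maketrans()` and `string.translate()` functions used above are Python 2 only. As a rough Python 3 counterpart for byte strings, the same idea might look like the sketch below; `clear_bytes` and the module-level names are hypothetical, chosen for illustration.

```python
ALLOWED = b"abcdef"

# Every byte value that is not in ALLOWED (iterating over range gives ints,
# and "int in bytes" tests for that byte value).
NOT_ALLOWED = bytes(c for c in range(256) if c not in ALLOWED)

# Map each disallowed byte to a space; allowed bytes map to themselves.
TABLE = bytes.maketrans(NOT_ALLOWED, b" " * len(NOT_ALLOWED))

def clear_bytes(data):
    return data.translate(TABLE)

print(clear_bytes(b"fa%^ce!"))
```

This only covers single-byte data; the Turkish letters from the original post need a text (Unicode) table instead.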
--
Steven.
Oct 25 '07 #4

P: n/a
On Thu, 25 Oct 2007 07:52:36 -0700, Abandoned wrote:
> Hi..
> I want to delete all not allowed characters in my text. I use this
> function:
>
>     def clear(s1=""):
>         if s1:
>             allowed = [
>                 u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
>                 u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
>                 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
>                 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
>                 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
>                 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']
>             s1 = "".join(ch for ch in s1 if ch in allowed)
>         return s1

You don't need to make allowed a list. Make it a string, it is easier to
read.

    allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
              u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'

> ....And my problem this function replace the character to "" but i want
> to " "
> for example:
> input: Exam%^^ple
> output: Exam ple

I think the most obvious way is this:

    def clear(s):
        allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
                  u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
        L = []
        for ch in s:
            if ch in allowed:
                L.append(ch)
            else:
                L.append(" ")
        return ''.join(L)  # join the filtered list L, not the input s

Perhaps a better way is to use a translation table:

    def clear(s):
        allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
                  u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
        not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]
        table = dict(zip(not_allowed, u" "*len(not_allowed)))
        return s.translate(table)

Even better is to pre-calculate the translation table, so it is
calculated only when needed:

    TABLE = None

    def build_table():
        global TABLE
        if TABLE is None:
            allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
                      u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
            not_allowed = \
                [i for i in range(0x110000) if unichr(i) not in allowed]
            TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))
        return TABLE

    def clear(s):
        return s.translate(build_table())

The first time you call clear(), it will take a second or so to build the
translation table, but then it will be very fast.

--
Steven.
Oct 25 '07 #5

P: n/a
>> the list comprehension does not allow "else", but it can be used in a
>> similar form:

(I was wrong, as Tim Chase has shown.)

>>     s2 = ""
>>     for ch in s1:
>>         s2 += ch if ch in allowed else " "
>>
>> (maybe this could be written more nicely)
>
> Repeatedly adding strings together in this way is about the most
> inefficient, slow way of building up a long string. (Although I'm sure
> somebody can come up with a worse way if they try hard enough.)
>
> Even though recent versions of CPython have a local optimization that
> improves the performance hit of string concatenation somewhat, it is
> better to use ''.join() rather than add many strings together:

String appending is not tragically slower. For strings tens of MB long,
the difference in speed for me is a few tens of percent, so it is not
several times slower.

>     s2 = []
>     for ch in s1:
>         s2.append(ch if (ch in allowed) else " ")
>     s2 = ''.join(s2)
>
> Although even that doesn't come close to the efficiency and speed of
> string.translate() and string.maketrans(). Try to find a way to use them.
>
> Here is one way, for ASCII characters.
>
>     allowed = "abcdef"
>     all = string.maketrans('', '')
>     not_allowed = ''.join(c for c in all if c not in allowed)
>     table = string.maketrans(not_allowed, ' '*len(not_allowed))
>     new_string = string.translate(old_string, table)

Nice, I did not know that string translation exists, but Abandoned has
defined the allowed characters, so making a translation table for the
disallowed characters, which would cover nearly the complete Unicode
character table, would be inefficient.

Oct 25 '07 #6

P: n/a
> allowed = [
>     u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
>     u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
>     'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
>     'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
>     'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
>     's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']

Using ord() may speed things up. If you want to include A through Z,
for example, you can use

    ord_chr = ord(chr)  ## convert once
    if (ord_chr > 64) and (ord_chr < 91):  ## 'A' is 65, 'Z' is 90 (on a U.S. English system)

and won't have to check every letter in an 'include it' string or
list. Lower case "a" through "z" would be a range also, and u'0'
through u'9' should be as well. That would leave a few remaining
characters that may have to be searched if they are not contiguous
decimal numbers.
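
The range idea above could be sketched as follows, assuming Python 3. `is_allowed` and `EXTRA` are hypothetical names, and whether this is actually faster than a set lookup is not something I have measured.

```python
# Characters outside the contiguous ranges, taken from the original post.
EXTRA = frozenset("+ ŞşÖöÜüÇçİıĞğ")

def is_allowed(ch):
    code = ord(ch)  # convert once
    return (
        65 <= code <= 90        # 'A'..'Z'
        or 97 <= code <= 122    # 'a'..'z'
        or 48 <= code <= 57     # '0'..'9'
        or ch in EXTRA          # the few non-contiguous characters
    )

def clear(s):
    return "".join(ch if is_allowed(ch) else " " for ch in s)

print(clear("Exam%^^ple"))
```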

Oct 25 '07 #7

P: n/a
On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:
>> Repeatedly adding strings together in this way is about the most
>> inefficient, slow way of building up a long string. (Although I'm sure
>> somebody can come up with a worse way if they try hard enough.)
>>
>> Even though recent versions of CPython have a local optimization that
>> improves the performance hit of string concatenation somewhat, it is
>> better to use ''.join() rather than add many strings together:
>
> String appending is not tragically slower, for strings long tens of MB,
> the speed makes me a difference in few tens of percents, so it is not
> several times slower, or so

That is a half-truth.

Because strings are immutable, when you concat two strings Python has to
duplicate both of them. This leads to quadratic-time behaviour, where the
time taken is proportional to the SQUARE of the number of characters.
This rapidly becomes very slow.

*However*, as of Python 2.4, CPython has an optimization that can detect
some types of string concatenation and do them in-place, giving (almost)
linear-time performance. But that depends on:

(1) the implementation: it only works for CPython, not Jython or
IronPython or other Python implementations;

(2) the version: it is an implementation detail introduced in Python 2.4,
and is not guaranteed to remain in future versions;

(3) the specific details of how you concat strings: s=s+t will get the
optimization, but s=t+s or s=s+t1+t2 will not.
In other words: while having that optimization in place is a win, you
cannot rely on it. If you care about portable code, the advice to use
join() still stands.
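
Anyone who wants to check this on their own interpreter could time the two approaches with something like the sketch below. It assumes Python 3; the function names and the test text are made up for illustration, and no timings are claimed here because they depend on the implementation and version.

```python
import timeit

def build_with_concat(s, allowed):
    # Repeated += on a string; fast only where the in-place optimization applies.
    out = ""
    for ch in s:
        out += ch if ch in allowed else " "
    return out

def build_with_join(s, allowed):
    # Collect the pieces first, then join them once.
    return "".join([ch if ch in allowed else " " for ch in s])

if __name__ == "__main__":
    allowed = set("abcdefghijklmnopqrstuvwxyz")
    text = "Exam%^^ple " * 10000
    for func in (build_with_concat, build_with_join):
        seconds = timeit.timeit(lambda: func(text, allowed), number=10)
        print(func.__name__, round(seconds, 3))
```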
[snip]
> Nice, I did not know that string translation exists, but Abandoned have
> defined allowed characters, so making a translation table for the
> unallowed characters, which would take nearly complete unicode character
> table would be inefficient.

The cost of building the unicode translation table is minimal: about 1.5
seconds ONCE, and it is only a couple of megabytes of data:

    >>> import timeit
    >>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
    ...     u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
    >>> timer = timeit.Timer(
    ...     'not_allowed = [i for i in range(0x110000) if '
    ...     'unichr(i) not in allowed]; '
    ...     'TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))',
    ...     'from __main__ import allowed')
    >>> timer.repeat(3, 10)
    [18.267689228057861, 16.495684862136841, 16.785034894943237]

The translate method runs about ten times faster than anything you can
write in pure Python. If Abandoned has got as much data as he keeps
saying he has, he will save a lot more than 1.5 seconds by using
translate compared to relatively slow Python code.

On the other hand, if he is translating only small strings, with
different sets of allowed chars each time, then there is no advantage to
using the translate method.

And on the third hand... I can't help but feel that the *right* solution
to Abandoned's problem is to use encode/decode with the appropriate codec.

--
Steven.
Oct 25 '07 #8

P: n/a
On Oct 26, 12:05 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> The translate method runs about ten times faster than anything you can
> write in pure Python. If Abandoned has got as much data as he keeps
> saying he has, he will save a lot more than 1.5 seconds by using
> translate compared to relatively slow Python code.

String (str) translate runs 10 times faster than pure Python, but unicode
translate isn't anywhere near as fast, because it has to look up each
character in the mapping dict.

    import timeit

    timer = timeit.Timer("a.translate(m)",
        setup="a = u'abc' * 1000; m = dict((x, x) for x in range(256))")
    print timer.repeat(3, 10000)

    [2.4009871482849121, 2.4191598892211914, 2.3641388416290283]

    timer = timeit.Timer("a.translate(m)",
        setup="a = 'abc' * 1000; m = ''.join(chr(x) for x in range(256))")
    print timer.repeat(3, 10000)

    [0.12261486053466797, 0.12225103378295898, 0.12217879295349121]

Also, the unicode translation dict as given doesn't work on
characters that aren't allowed: it should map ints to ints rather
than ints to strings.

Anyway, there's no need to pay the cost of building a full mapping
dict when most of the entries are the same. Something like this can
work:

    from collections import defaultdict

    def clear(message):
        allowed = u'abc...'
        clear_translate = defaultdict(lambda: ord(u' '))
        clear_translate.update((c, c) for c in map(ord, allowed))
        return message.translate(clear_translate)
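
A possible Python 3 variant of the same idea is sketched below. It builds the table once at import time and uses a small dict subclass with `__missing__` instead of `defaultdict`, so that looking up a disallowed character does not add a new entry to the table. `SpaceByDefault`, `ALLOWED` and `TABLE` are illustrative names, and the allowed characters are the ones from the original post.

```python
ALLOWED = ("+0123456789 ŞşÖöÜüÇçİıĞğ"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

class SpaceByDefault(dict):
    """Translation table in which any unlisted code point maps to a space."""

    def __missing__(self, key):
        # Called by dict lookup for keys that are absent; nothing is stored.
        return " "

# Allowed code points map to themselves; built once, reused on every call.
TABLE = SpaceByDefault((ord(c), c) for c in ALLOWED)

def clear(message):
    return message.translate(TABLE)

print(clear("Exam%^^ple"))
```

Because only the allowed characters are stored, the table stays small no matter how many distinct disallowed characters appear in the input.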

--
Paul Hankin

Oct 26 '07 #9

P: n/a
> ....And my problem this function replace the character to "" but i
> want to " "
> for example:
> input: Exam%^^ple
> output: Exam ple
> I want to this output but in my code output "Example"

I don't think anyone has addressed this yet. It would be

    if chr in allowed_set:
        output_string += chr
    else:
        output_string += " "

This is just a general example of the code to use. You probably would not
use 'output_string +=', but whatever form the implementation takes, you
would use an if/else.

> Nice, I did not know that string translation exists, but Abandoned have
> defined allowed characters, so making a translation table for the
> unallowed characters, which would take nearly complete unicode character
> table would be inefficient.

And this is also bad logic. If you use a 'not allowed', then
everything else will be included by default. Any errors in the 'not
allowed' or deviations that use an additional unicode character will
be included by default. You want to have the program include only
what it is told to include IMHO.

Oct 26 '07 #10
