Delete all not allowed characters..

Hi,
I want to delete all characters that are not allowed from my text.
I use this function:

def clear(s1=""):
    if s1:
        allowed = [
            u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
            u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
            'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
            'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
            'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
            's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']
        s1 = "".join(ch for ch in s1 if ch in allowed)
    return s1

My problem is that this function replaces each disallowed character
with "", but I want " " instead.
For example:
input: Exam%^^ple
output: Exam ple
That is the output I want, but my code outputs "Example".
How can I do this quickly? The text is very long.

Oct 25 '07 #1
On Oct 25, 10:52 am, Abandoned <best...@gmail.com> wrote:
> [snip]
> My problem is that this function replaces each disallowed character
> with "", but I want " " instead.
> For example:
> input: Exam%^^ple
> output: Exam ple
> That is the output I want, but my code outputs "Example".
> How can I do this quickly? The text is very long.

Something like:

import re

def clear(s, allowed=[], case_sensitive=True):
    flags = ''
    if not case_sensitive:
        flags = '(?i)'
    return re.sub(flags + '[^%s]' % ''.join(allowed), ' ', s)

And call:

clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ] )
clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ], False )

And so forth. Or just use re directly!

(This implementation is imperfect in that it's possible to hack the
regular expression, and it may break with mismatched '[]' characters,
but the idea is there.)
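
The hole mentioned above can probably be closed with re.escape(). A minimal sketch in Python 3 syntax, not a drop-in replacement: the name clear_escaped is made up for illustration, and it assumes the allowed set is not empty.

```python
import re

def clear_escaped(s, allowed, case_sensitive=True):
    """Replace every character of s that is not in allowed with a space."""
    # re.escape() backslash-escapes ']', '^', '-' and '\', so the allowed
    # characters cannot break out of the character class.
    # Assumption: allowed is non-empty ('[^]' alone is not a valid pattern).
    pattern = '[^%s]' % re.escape(''.join(allowed))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.sub(pattern, ' ', s, flags=flags)

print(repr(clear_escaped('Exam%^^ple', 'Examples')))
print(repr(clear_escaped('a]b^c-d', ']^-')))
```

Passing the flag as an argument instead of prefixing '(?i)' keeps the pattern itself made of nothing but the escaped characters.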

Adam

Oct 25 '07 #2
On Thu, 25 Oct 2007 07:52:36 -0700, Abandoned wrote:
> [snip]
> My problem is that this function replaces each disallowed character
> with "", but I want " " instead.
> For example:
> input: Exam%^^ple
> output: Exam ple
> That is the output I want, but my code outputs "Example".
> How can I do this quickly? The text is very long.

The list comprehension does not allow "else",
but it can be used in a similar form:

s2 = ""
for ch in s1:
    s2 += ch if ch in allowed else " "

(Maybe this could be written more nicely.)
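
For what it's worth, a conditional expression is legal in the output part of a comprehension or generator expression (only the trailing filter clause cannot take an else), so the loop can be folded into a single join() call. A sketch, using a shortened allowed set purely for illustration:

```python
# Assumption: a shortened allowed set, just for the demonstration.
allowed = set('+0123456789 '
              'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

def clear(s1):
    # "ch if ... else ' '" is the expression being joined, not a filter,
    # so every input character yields exactly one output character.
    return ''.join(ch if ch in allowed else ' ' for ch in s1)

print(repr(clear('Exam%^^ple')))   # 'Exam   ple'
```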
Oct 25 '07 #3
On Thu, 25 Oct 2007 17:42:36 +0200, Michal Bozon wrote:
> the list comprehension does not allow "else", but it can be used in a
> similar form:
>
> s2 = ""
> for ch in s1:
>     s2 += ch if ch in allowed else " "
>
> (maybe this could be written more nicely)

Repeatedly adding strings together in this way is about the most
inefficient, slow way of building up a long string. (Although I'm sure
somebody can come up with a worse way if they try hard enough.)

Even though recent versions of CPython have a local optimization that
reduces the performance hit of string concatenation somewhat, it is
better to use ''.join() rather than add many strings together:

s2 = []
for ch in s1:
    s2.append(ch if (ch in allowed) else " ")
s2 = ''.join(s2)

Although even that doesn't come close to the efficiency and speed of
string.translate() and string.maketrans(). Try to find a way to use them.

Here is one way, for ASCII characters.

import string

allowed = "abcdef"
all_chars = string.maketrans('', '')
not_allowed = ''.join(c for c in all_chars if c not in allowed)
table = string.maketrans(not_allowed, ' '*len(not_allowed))
new_string = string.translate(old_string, table)
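
The snippet above is Python 2, where string.maketrans() works on byte strings. As a rough Python 3 counterpart, assuming the text really is ASCII bytes, bytes.maketrans() and bytes.translate() should do the same job. The sample value old_bytes is made up for illustration.

```python
allowed = b"abcdef"

# Every possible byte value that is not in the allowed set.
not_allowed = bytes(b for b in range(256) if b not in allowed)

# Map each disallowed byte to a space; allowed bytes map to themselves.
table = bytes.maketrans(not_allowed, b" " * len(not_allowed))

old_bytes = b"abc%^^def xyz"      # sample input, made up for illustration
new_bytes = old_bytes.translate(table)
print(new_bytes)
```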
--
Steven.
Oct 25 '07 #4
On Thu, 25 Oct 2007 07:52:36 -0700, Abandoned wrote:
> Hi..
> I want to delete all not allowed characters in my text. I use this
> function:
>
> def clear(s1=""):
>     if s1:
>         allowed = [
>             u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
>             u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
>             'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
>             'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
>             'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
>             's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']
>         s1 = "".join(ch for ch in s1 if ch in allowed)
>     return s1

You don't need to make allowed a list. Make it a string; it is easier to
read.

allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
          u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'

> My problem is that this function replaces each disallowed character
> with "", but I want " " instead.
> For example:
> input: Exam%^^ple
> output: Exam ple

I think the most obvious way is this:

def clear(s):
    allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
              u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
    L = []
    for ch in s:
        if ch in allowed: L.append(ch)
        else: L.append(" ")
    return ''.join(L)

Perhaps a better way is to use a translation table:

def clear(s):
    allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
              u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
    not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]
    table = dict(zip(not_allowed, u" "*len(not_allowed)))
    return s.translate(table)

Even better is to cache the translation table, so it is built only
once, the first time it is needed:

TABLE = None

def build_table():
    global TABLE
    if TABLE is None:
        allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
                  u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
        not_allowed = \
            [i for i in range(0x110000) if unichr(i) not in allowed]
        TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))
    return TABLE

def clear(s):
    return s.translate(build_table())

The first time you call clear(), it will take a second or so to build the
translation table, but then it will be very fast.
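
The code in this post is Python 2 (unichr, u'' literals). A rough Python 3 rendering of the same lazily built table might look like the sketch below; the names ALLOWED and _TABLE are chosen here for illustration, and the table maps code points to the code point of a space.

```python
import sys

ALLOWED = frozenset('+0123456789 ŞşÖöÜüÇçİıĞğ'
                    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
_TABLE = None   # built lazily, on the first call to build_table()

def build_table():
    """Return the cached {disallowed code point: space} mapping."""
    global _TABLE
    if _TABLE is None:
        space = ord(' ')
        # One entry per Unicode code point that is not allowed.
        _TABLE = {i: space for i in range(sys.maxunicode + 1)
                  if chr(i) not in ALLOWED}
    return _TABLE

def clear(s):
    return s.translate(build_table())

print(repr(clear('Exam%^^ple')))
```

The dict holds more than a million entries, so it is worth measuring its memory use on your own machine before relying on it.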

--
Steven.
Oct 25 '07 #5
> > the list comprehension does not allow "else", but it can be used in a
> > similar form:

(I was wrong, as Tim Chase has shown.)

> > s2 = ""
> > for ch in s1:
> >     s2 += ch if ch in allowed else " "
> >
> > (maybe this could be written more nicely)
>
> Repeatedly adding strings together in this way is about the most
> inefficient, slow way of building up a long string. (Although I'm sure
> somebody can come up with a worse way if they try hard enough.)
>
> Even though recent versions of CPython have a local optimization that
> improves the performance hit of string concatenation somewhat, it is
> better to use ''.join() rather than add many strings together:

String appending is not tragically slower. For strings tens of MB long,
the difference in speed for me is a few tens of percent,
so it is not several times slower.

> s2 = []
> for ch in s1:
>     s2.append(ch if (ch in allowed) else " ")
> s2 = ''.join(s2)
>
> Although even that doesn't come close to the efficiency and speed of
> string.translate() and string.maketrans(). Try to find a way to use them.
>
> Here is one way, for ASCII characters.
>
> allowed = "abcdef"
> all = string.maketrans('', '')
> not_allowed = ''.join(c for c in all if c not in allowed)
> table = string.maketrans(not_allowed, ' '*len(not_allowed))
> new_string = string.translate(old_string, table)

Nice, I did not know that string translation exists, but
Abandoned has defined the allowed characters, so making
a translation table for the disallowed characters,
which would cover nearly the complete Unicode character table,
would be inefficient.

Oct 25 '07 #6
> allowed =
> [u'+',u'0',u'1',u'2',u'3',u'4',u'5',u'6',u'7',u'8', u'9',u' ', u'Ş',
> u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ', 'A',
> 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
> 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
> 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
> 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']

Using ord() may speed things up. If you want to include A through Z,
for example, you can use

ord_chr = ord(ch)   ## convert once
if (ord_chr > 64) and (ord_chr < 91):   ## on a U.S. English system

and you won't have to check every letter against an 'include it' string
or list. Lower case "a" through "z" would be a range also, and u'0'
through u'9' should be as well. That would leave a few remaining
characters that may have to be searched for if their code points are
not contiguous.
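
The range idea could be sketched roughly as follows (Python 3 syntax). The helper name is_allowed and the extras set are assumptions made here for illustration, and whether this actually beats a plain set lookup is not established in this thread, so it is worth timing both.

```python
def is_allowed(ch, extras=frozenset('+ ŞşÖöÜüÇçİıĞğ')):
    code = ord(ch)                    # convert once
    return (48 <= code <= 57          # '0'..'9'
            or 65 <= code <= 90       # 'A'..'Z'
            or 97 <= code <= 122      # 'a'..'z'
            or ch in extras)          # scattered characters still need a lookup

def clear(text):
    # Keep allowed characters, put a space in place of everything else.
    return ''.join(ch if is_allowed(ch) else ' ' for ch in text)

print(repr(clear('Exam%^^ple')))
```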

Oct 25 '07 #7
On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:
> > Repeatedly adding strings together in this way is about the most
> > inefficient, slow way of building up a long string. (Although I'm sure
> > somebody can come up with a worse way if they try hard enough.)
> >
> > Even though recent versions of CPython have a local optimization that
> > improves the performance hit of string concatenation somewhat, it is
> > better to use ''.join() rather than add many strings together:
>
> String appending is not tragically slower, for strings long tens of MB,
> the speed makes me a difference in few tens of percents, so it is not
> several times slower, or so

That is a half-truth.

Because strings are immutable, when you concat two strings Python has to
duplicate both of them. This leads to quadratic-time behaviour, where the
time taken is proportional to the SQUARE of the number of characters.
This rapidly becomes very slow.

*However*, as of Python 2.4, CPython has an optimization that can detect
some types of string concatenation and do them in-place, giving (almost)
linear-time performance. But that depends on:

(1) the implementation: it only works for CPython, not Jython or
IronPython or other Python implementations;

(2) the version: it is an implementation detail introduced in Python 2.4,
and is not guaranteed to remain in future versions;

(3) the specific details of how you concat strings: s=s+t will get the
optimization, but s=t+s or s=s+t1+t2 will not.
In other words: while having that optimization in place is a win, you
cannot rely on it. If you care about portable code, the advice to use
join() still stands.
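
Anyone who wants to check this on their own interpreter could time the two approaches with something like the sketch below (Python 3 syntax; the function names and the input size are arbitrary choices made here). The numbers will vary with the implementation and version, which is rather the point.

```python
import timeit

def with_concat(chars):
    out = ''
    for ch in chars:
        out += ch              # may or may not be optimised in place
    return out

def with_join(chars):
    parts = []
    for ch in chars:
        parts.append(ch)
    return ''.join(parts)      # one final copy into the result string

data = 'x' * 100000            # arbitrary test input

for func in (with_concat, with_join):
    # Best of three runs, five calls per run.
    seconds = min(timeit.repeat(lambda: func(data), number=5, repeat=3))
    print(func.__name__, round(seconds, 4))
```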
[snip]

> Nice, I did not know that string translation exists, but Abandoned have
> defined allowed characters, so making a translation table for the
> unallowed characters, which would take nearly complete unicode character
> table would be inefficient.

The cost of building the unicode translation table is minimal: about 1.5
seconds ONCE, and it is only a couple of megabytes of data:

>>> import timeit
>>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
... u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
>>> timer = timeit.Timer(
...     'not_allowed = [i for i in range(0x110000) if '
...     'unichr(i) not in allowed]; '
...     'TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))',
...     'from __main__ import allowed')
>>> timer.repeat(3, 10)
[18.267689228057861, 16.495684862136841, 16.785034894943237]

Each figure is the total for ten builds, so a single build takes a
little under two seconds.

The translate method runs about ten times faster than anything you can
write in pure Python. If Abandoned has got as much data as he keeps
saying he has, he will save a lot more than 1.5 seconds by using
translate compared to relatively slow Python code.

On the other hand, if he is translating only small strings, with
different sets of allowed chars each time, then there is no advantage to
using the translate method.

And on the third hand... I can't help but feel that the *right* solution
to Abandoned's problem is to use encode/decode with the appropriate codec.

--
Steven.
Oct 25 '07 #8
On Oct 26, 12:05 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> [snip]
> The translate method runs about ten times faster than anything you can
> write in pure Python. If Abandoned has got as much data as he keeps
> saying he has, he will save a lot more than 1.5 seconds by using
> translate compared to relatively slow Python code.

String translate runs 10 times faster than pure Python, but unicode
translate isn't anywhere near as fast, because it has to look up each
character in the mapping dict.

import timeit

timer = timeit.Timer("a.translate(m)",
    setup="a = u'abc' * 1000; m = dict((x, x) for x in range(256))")
print timer.repeat(3, 10000)

[2.4009871482849121, 2.4191598892211914, 2.3641388416290283]

timer = timeit.Timer("a.translate(m)",
    setup="a = 'abc' * 1000; m = ''.join(chr(x) for x in range(256))")
print timer.repeat(3, 10000)

[0.12261486053466797, 0.12225103378295898, 0.12217879295349121]

Also, the unicode translation dict as given doesn't work on
characters that aren't allowed: it should map ints to ints rather
than ints to strings.

Anyway, there's no need to pay the cost of building a full mapping
dict when most of the entries are the same. Something like this can
work:

from collections import defaultdict

def clear(message):
    allowed = u'abc...'
    clear_translate = defaultdict(lambda: ord(u' '))
    clear_translate.update((c, c) for c in map(ord, allowed))
    return message.translate(clear_translate)
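
Filled in with the allowed set from earlier in the thread, and written in Python 3 syntax, the same idea might look like the sketch below. Building the table once at module level rather than on every call is a choice made here, not part of the snippet above, and the names ALLOWED and CLEAR_TABLE are illustrative.

```python
from collections import defaultdict

ALLOWED = ('+0123456789 ŞşÖöÜüÇçİıĞğ'
           'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')

# Any code point missing from the table falls back to the space character.
CLEAR_TABLE = defaultdict(lambda: ord(' '))
# Allowed characters map to themselves.
CLEAR_TABLE.update((ord(c), ord(c)) for c in ALLOWED)

def clear(message):
    return message.translate(CLEAR_TABLE)

print(repr(clear('Exam%^^ple')))   # 'Exam   ple'
```

One side effect to be aware of: a defaultdict stores an entry for every missing key it is asked about, so the table grows by one entry per distinct disallowed character seen.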

--
Paul Hankin

Oct 26 '07 #9
> ....And my problem this function replace the character to "" but i
> want to " "
> for example:
> input: Exam%^^ple
> output: Exam ple
> I want to this output but in my code output "Example"

I don't think anyone has addressed this yet. It would be

if ch in allowed_set:
    output_string += ch
else:
    output_string += " "

This is just a general example of the code to use. You probably would not
use 'output_string +=', but whatever form the implementation takes, you
would use an if/else.

> Nice, I did not know that string translation exists, but Abandoned
> have defined allowed characters, so making a translation table for the
> unallowed characters, which would take nearly complete unicode character
> table would be inefficient.

And this is also bad logic. If you use a 'not allowed' list, then
everything else will be included by default. Any errors in the 'not
allowed' list, or input that uses an additional unicode character, will
be let through by default. IMHO you want the program to include only
what it is told to include.

Oct 26 '07 #10
