468,514 Members | 1,627 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,514 developers. It's quick & easy.

Faster 'if char in string' test

I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:

'''
Translate Speed Test

This code looks for invalid characters in a string,
and raises an exception when it finds one.
I'm testing 2 methods: the 'if char in string' method,
and one based on using the 'translate' function and
comparing string lengths afterwards.
Wow, what a difference! Translate is over 10x faster!

Function Loops Seconds Loops/sec
***********************************************
In 10000 0.171 58479
Translate 10000 0.016 624998

'''

import mytime
import string
_allchars = None
_deletechars = None
_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
def init():
global _allchars, _deletechars
l = []
a = []
for i in range(256):
a.append(chr(i))
if not chr(i) in _validchars:
l.append(chr(i))
_deletechars = ''.join(l)
_allchars = ''.join(a)
def test():
max = 10000
tmr = mytime.Timer()
r = range(max)
s = "This is a string to test for invalid characters."
print tmr.heading

tmr.startit()
for i in r:
for c in s:
if c in _deletechars:
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('In')

tmr.startit()
for i in r:
s2 = s.translate(_allchars, _deletechars)
if len(s2) != len(s):
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('Translate')
init()

if __name__ == "__main__":
test()
Jul 18 '05 #1
12 3741
Kamilche wrote:
was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:


For a fair comparison you should use a dictionary instead of a string for
the containment test.

Peter

Jul 18 '05 #2
Kamilche wrote:
I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:


Perhaps it would be a good idea to post the snippets at the Python cookbook.

Reinhold

--
Wenn eine Linuxdistribution so wenig brauchbare Software wie Windows
mitbrächte, wäre das bedauerlich. Was bei Windows der Umfang eines
"kompletten Betriebssystems" ist, nennt man bei Linux eine Rescuedisk.
-- David Kastrup in de.comp.os.unix.linux.misc
Jul 18 '05 #3
Kamilche,

have you tried module sets?

tmp_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_validchars=sets.Set(tmp_validchars)

....

the test becomes:

testset=sets.Set(s)
if testset.difference(_validchars):
raise Exception("Invalid character found!")
it may be even faster...
best wishes

Harald
Jul 18 '05 #4


why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()

hehe...

I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
Jul 18 '05 #5
Very good!

One remark:

translate method is much slower then "c in string" method in the case
when the test string is very long (say 5mb) and the invalid character
is located close to the beginning of the string. (10000 times slower!)
kl*******@home.com (Kamilche) wrote in message news:<88**************************@posting.google. com>...
I was looking for a way to speed up detecting invalid characters in my
TCP string, and thought of yet another use for the translate function!
If you were to 'translate out' the bad characters, and compare string
lengths afterwards, you would know whether or not the line contained
invalid characters. The new method is more than 10x faster than the
standard 'if char in string' test! So - here's the code plus sample
timings:

'''
Translate Speed Test

This code looks for invalid characters in a string,
and raises an exception when it finds one.
I'm testing 2 methods: the 'if char in string' method,
and one based on using the 'translate' function and
comparing string lengths afterwards.
Wow, what a difference! Translate is over 10x faster!

Function Loops Seconds Loops/sec
***********************************************
In 10000 0.171 58479
Translate 10000 0.016 624998

'''

import mytime
import string
_allchars = None
_deletechars = None
_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
def init():
global _allchars, _deletechars
l = []
a = []
for i in range(256):
a.append(chr(i))
if not chr(i) in _validchars:
l.append(chr(i))
_deletechars = ''.join(l)
_allchars = ''.join(a)
def test():
max = 10000
tmr = mytime.Timer()
r = range(max)
s = "This is a string to test for invalid characters."
print tmr.heading

tmr.startit()
for i in r:
for c in s:
if c in _deletechars:
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('In')

tmr.startit()
for i in r:
s2 = s.translate(_allchars, _deletechars)
if len(s2) != len(s):
raise Exception("Invalid character found!")
tmr.stopit(max)
print tmr.results('Translate')
init()

if __name__ == "__main__":
test()

Jul 18 '05 #6
Pierre-Frédéric Caillaud <pe****@free.fr> wrote in message news:<opr929m40e1v4ijd@musicbox>...
why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()... heh heh


You're right, Pierre, that just might be more efficient! But I doubt
those two 'len' calls would show up on a profiler, somehow. ;-) But
you have to time it again, because the question becomes 'how efficient
is the translate function at removing invalid characters from a
string?' It might be much more efficient at removing one char, than at
removing a thousand. So the timings would still be necessary, to make
sure you're not optimizing the wrong part.

The 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?
Jul 18 '05 #7
Pierre-Frédéric Caillaud <pe****@free.fr> wrote in message news:<opr929m40e1v4ijd@musicbox>...
why don't you use translate to delete all the valid characters and test
if the resulting string is not not empty ? this will save you two calls to
len()... heh heh


You're right, Pierre, that just might be more efficient! But I doubt
those two 'len' calls would show up on a profiler, somehow. ;-) But
you have to time it again, because the question becomes 'how efficient
is the translate function at removing invalid characters from a
string?' It might be much more efficient at removing one char, than at
removing a thousand. So the timings would still be necessary, to make
sure you're not optimizing the wrong part.

The 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?
Jul 18 '05 #8
Kamilche wrote:
he 'dict' solution proposed wouldn't work because the data I'm
testing is in string form, and the overhead of translating the string
to a dict before testing would swamp the results. So would using a
set, because timings show a set is 4x slower than a dict.

Unless I'm misunderstanding Peter's suggestion. Did you mean
translating the string into a dict, then using a 'if boguschar in
dict' test?


Look for validate_dict() to see what I meant. I've since found a faster
alternative using regular expressions.

<code>
import itertools, re, string

_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)

_invaliddict = dict(itertools.izip(_invalidchars, itertools.repeat(None)))

valid = "This is a string to test for invalid characters."
invalid = valid + _invalidchars + valid
massaged = (_invalidchars + valid) * 100

rInvalid = re.compile("[%s]" % re.escape(_invalidchars))

def validate_list(s, invalid=_invalidchars):
for c in s:
if c in invalid:
return False
return True

def validate_dict(s, invalid=_invaliddict):
for c in s:
if c in invalid:
return False
return True

def validate_translate(s):
return len(s.translate(_allchars, _invalidchars)) == len(s)

def validate_regex(s, rInvalid=rInvalid):
return not rInvalid.search(s)

def validate_noop(s, valid=valid):
return s is valid
if __name__ == "__main__":
import timeit
tests = [f for k, f in globals().items() if k.startswith("validate_")]
for f in tests:
assert f(valid)
assert not f(invalid)

for f in tests:
name = f.__name__
print name
for data in ["valid", "invalid", "massaged"]:
print " ", data,
timeit.main([
"-sfrom findcharspeed import %s as f, %s as d" % (name, data),
"f(d)"])
</code>

Here are my timings:

validate_regex
valid 100000 loops, best of 3: 2.24 usec per loop
invalid 100000 loops, best of 3: 2.37 usec per loop
massaged 1000000 loops, best of 3: 1.61 usec per loop
validate_dict
valid 100000 loops, best of 3: 10.8 usec per loop
invalid 100000 loops, best of 3: 10.7 usec per loop
massaged 1000000 loops, best of 3: 0.99 usec per loop
validate_translate
valid 100000 loops, best of 3: 2.62 usec per loop
invalid 100000 loops, best of 3: 3.68 usec per loop
massaged 10000 loops, best of 3: 57.6 usec per loop
validate_list
valid 100000 loops, best of 3: 14.2 usec per loop
invalid 100000 loops, best of 3: 14 usec per loop
massaged 1000000 loops, best of 3: 1.02 usec per loop
validate_noop
valid 1000000 loops, best of 3: 0.582 usec per loop
invalid 1000000 loops, best of 3: 0.622 usec per loop
massaged 1000000 loops, best of 3: 0.634 usec per loop

Note how badly your solution using str.translate() may perform for certain
data due to its failure to shortcut.

Peter
Jul 18 '05 #9
el*******@hotmail.com (Elbert Lev) wrote in message news:<94*************************@posting.google.c om>...

translate method is much slower then "c in string" method in the case
when the test string is very long (say 5mb) and the invalid character
is located close to the beginning of the string. (10000 times slower!)


Good point! I was surprised that allocating a fresh string and
comparing lengths was speedier, but upon reflection, I could see how
that could be so. And I also see how if it was a HUMONGOUS string, a
book more like, this would no longer be the case.

Thanks for pointing that out!
Jul 18 '05 #10
Peter Otten <__*******@web.de> wrote in message news:<cb*************@news.t-online.com>...
Note how badly your solution using str.translate() may perform for certain
data due to its failure to shortcut.

Peter


Heheh, I'm a Python newbie. I can't even understand the code you have
written, but I'll try.

You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C', like I had originally thought. :-D It's better
in many ways, but it has its dark alleyways where a novice can get
stabbed in the back, too.

{looks suspiciously down the dark alleyway where Peter's code is
standing, and tries to glide by undetected.}

--Kamilche
Jul 18 '05 #11

You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C'


In the case of variable number of args, C gives you neither the number of
arguments, not their types, nor their names, so you have to send the
number as a parameter or append a null at the end which... sucks !
Jul 18 '05 #12
Kamilche wrote:
Heheh, I'm a Python newbie. I can't even understand the code you have
written, but I'll try.
If I don't understand Python code, I tend to blame the code. I'll try to
either explain or rewrite the "dark" pieces below.
You know, as I was writing stuff like fn(**args.__dict__) yesterday
(which for me is pretty advanced!), it occurred to me that Python is
not 'simpler than C', like I had originally thought. :-D It's better
in many ways, but it has its dark alleyways where a novice can get
stabbed in the back, too.


Python and C are both simple. Python has friendlier error messages and less
bloat, i. e. manual garbage collection in programs that actually work.
Recent additions to Python and the newstyle/classic class schism, while all
reasonable in themselves, make it harder to "know it all". If decorators
take off, this will become even worse.
Enough pessimism - Python is still much more compact than C++, and if anyone
has the power to say no, then it's Guido van Rossum. If a way to integrate
type inferencing smoothly is actually found, why should anyone want to
program in C at all?

Peter

Now the revised code:

import itertools, re, string

_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)

# repeat(v) gives an "endless" sequence of v, so
# zip("abc", repeat(None)) == [("a", None), ("b", None), ("c", None)]
# In Python 2.4 I would use a set but in 2.3.x dicts are faster.
# _invalidset = set(_invalidcars)
_invaliddict = dict(zip(_invalidchars, itertools.repeat(None)))

# contains only valid characters
valid = "This is a string to test for invalid characters."

# contains both valid and invalid characters
invalid = valid + _invalidchars + valid

# A LONG string containing MANY invalid characters to make
# the translate() solution look bad
massaged = (_invalidchars + valid) * 100

# supposing _invalidchars == "abc" we get the regex re.compile("[abc]").
# rInvalid.search(s) would then return a non-None result whenever s
# contains an a, b, or c. The nice thing for our purpose is that it stops
# at the first match, i. e. if the first char is invalid, we don't care
# about the rest
rInvalid = re.compile("[%s]" % re.escape(_invalidchars))

def validate_list(s, invalid=_invalidchars):
# since the set of invalid characters is used many times,
# it is faster to have it as a local variable.
# I don't know why, but achieving this via a default parameter
# is even faster than a
# invalid = _invalidchars
# assignment at the beginning of the function
for c in s:
if c in invalid:
return False
return True

def validate_dict(s, invalid=_invaliddict):
# exactly the same as validate_list, but the
# c in invalid
# is faster because invalid is a dictionary
for c in s:
if c in invalid:
return False
return True

def validate_translate(s):
return len(s.translate(_allchars, _invalidchars)) == len(s)

def validate_regex(s):
return not rInvalid.search(s)

def validate_noop(s, valid=valid):
""" Not a valid implementation. Only to estimate the function
call overhead.
"""
return s is valid
def measure(func, data):
""" Run the speed test for a func/data pair and print the result.

func: name of the function to be tested
data: name of the variable containing the sample data
"""
print " ", data,
setup = "from findcharspeed import %s, %s" % (func, data)
run = "%s(%s)" % (func, data)
# example of the above:
# from findcharspeed import validate_list, valid
# validate_list(valid) # this statement will be timed

# timeit.main() expects parameters from the commandline, so we have
# to add the appropriate switches when invoking it programmatically.
timeit.main(["-s" + setup, run])

if __name__ == "__main__":
import timeit

# functions I want to time
tests = [validate_list, validate_dict, validate_translate,
validate_regex, validate_noop]

# make sure the functions return the correct result
for f in tests:
assert f(valid)
assert not f(invalid)

# calculate the timings
for f in tests:
print f.__name__
# test each candidate function with 3 different sample strings
for data in ["valid", "invalid", "massaged"]:
measure(f.__name__, data)
Jul 18 '05 #13

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

7 posts views Thread by Yang Song | last post: by
1 post views Thread by pawel | last post: by
23 posts views Thread by YinTat | last post: by
11 posts views Thread by akarui.tomodachi | last post: by
12 posts views Thread by karthikbalaguru | last post: by
23 posts views Thread by LilacSkin | last post: by
1 post views Thread by fmendoza | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.