By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
464,761 Members | 1,199 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 464,761 IT Pros & Developers. It's quick & easy.

Inexplicable timings

P: n/a
I'm looking at the implementation of regular expressions in Python and
wrote a script to test my changes. This is the script:
import re
import time

base = "abc"
final = "d"

for n in [100, 1000, 10000]:
for f in [final, ""]:
for r in ["+", "+?"]:
pattern = "(?:%s)%s%s" % (base, r, final)
text_format = "(?:%s){%d}%s" % (base, n, f)
regex = re.compile(pattern)
text = base * n + f
start_time = time.clock()
found = regex.search(text)
finish_time = time.clock()
duration = finish_time - start_time
print "%s\t:: %s\t:: %.3f" % (text_format, pattern,
duration)
I've found some puzzling behaviour that I can't explain. If I delete
sre_parse.pyc or sre_compile.pyc before running the script I get
timings like this:

(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.022
(?:abc){1000} :: (?:abc)+?d :: 0.020
(?:abc){10000}d :: (?:abc)+d :: 0.001
(?:abc){10000}d :: (?:abc)+?d :: 0.000
(?:abc){10000} :: (?:abc)+d :: 1.961
(?:abc){10000} :: (?:abc)+?d :: 1.756

Subsequent runs give timings like this:

(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.020
(?:abc){1000} :: (?:abc)+?d :: 0.020
(?:abc){10000}d :: (?:abc)+d :: 0.001
(?:abc){10000}d :: (?:abc)+?d :: 0.000
(?:abc){10000} :: (?:abc)+d :: 3.953
(?:abc){10000} :: (?:abc)+?d :: 1.763

The second-from-last takes twice as long!

This is repeatable.

Python 2.5.2, on the other hand, gives timings like this:

(?:abc){100}d :: (?:abc)+d :: 0.000
(?:abc){100}d :: (?:abc)+?d :: 0.000
(?:abc){100} :: (?:abc)+d :: 0.000
(?:abc){100} :: (?:abc)+?d :: 0.000
(?:abc){1000}d :: (?:abc)+d :: 0.000
(?:abc){1000}d :: (?:abc)+?d :: 0.000
(?:abc){1000} :: (?:abc)+d :: 0.044
(?:abc){1000} :: (?:abc)+?d :: 0.027
(?:abc){10000}d :: (?:abc)+d :: 0.002
(?:abc){10000}d :: (?:abc)+?d :: 0.002
(?:abc){10000} :: (?:abc)+d :: 4.547
(?:abc){10000} :: (?:abc)+?d :: 2.858

repeatably, irrespective of whether I delete the .pyc files first.

Why is only one timing affected, and why by so much? Python quits
after each run, so it can't be anything cumulative, surely. I've
considered things like caching, for example compiling the .pyc files
on the first run, but can't see why (or how) that would cause what I
see.

Can anyone suggest a factor that I might be missing? Any suggestions
welcome!

Matthew
Aug 1 '08 #1
Share this Question
Share on Google+
1 Reply

P: n/a
On Aug 2, 12:43*am, MRAB <goo...@mrabarnett.plus.comwrote:
I'm looking at the implementation of regular expressions in Python and
wrote a script to test my changes. This is the script:

import re
import time

base = "abc"
final = "d"

for n in [100, 1000, 10000]:
* * for f in [final, ""]:
* * * * for r in ["+", "+?"]:
* * * * * * pattern = "(?:%s)%s%s" % (base, r, final)
* * * * * * text_format = "(?:%s){%d}%s" % (base, n, f)
* * * * * * regex = re.compile(pattern)
* * * * * * text = base * n + f
* * * * * * start_time = time.clock()
* * * * * * found = regex.search(text)
* * * * * * finish_time = time.clock()
* * * * * * duration = finish_time - start_time
* * * * * * print "%s\t:: %s\t:: %.3f" % (text_format, pattern,
duration)

I've found some puzzling behaviour that I can't explain. If I delete
sre_parse.pyc or sre_compile.pyc before running the script I get
timings like this:

(?:abc){100}d * *:: (?:abc)+d * *:: 0.000
(?:abc){100}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){100} * *:: (?:abc)+d * *:: 0.000
(?:abc){100} * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000} * *:: (?:abc)+d * *:: 0.022
(?:abc){1000} * *:: (?:abc)+?d * *:: 0.020
(?:abc){10000}d * *:: (?:abc)+d * *:: 0.001
(?:abc){10000}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){10000} * *:: (?:abc)+d * *:: 1.961
(?:abc){10000} * *:: (?:abc)+?d * *:: 1.756

Subsequent runs give timings like this:

(?:abc){100}d * *:: (?:abc)+d * *:: 0.000
(?:abc){100}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){100} * *:: (?:abc)+d * *:: 0.000
(?:abc){100} * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000} * *:: (?:abc)+d * *:: 0.020
(?:abc){1000} * *:: (?:abc)+?d * *:: 0.020
(?:abc){10000}d * *:: (?:abc)+d * *:: 0.001
(?:abc){10000}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){10000} * *:: (?:abc)+d * *:: 3.953
(?:abc){10000} * *:: (?:abc)+?d * *:: 1.763

The second-from-last takes twice as long!

This is repeatable.

Python 2.5.2, on the other hand, gives timings like this:

(?:abc){100}d * *:: (?:abc)+d * *:: 0.000
(?:abc){100}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){100} * *:: (?:abc)+d * *:: 0.000
(?:abc){100} * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+d * *:: 0.000
(?:abc){1000}d * *:: (?:abc)+?d * *:: 0.000
(?:abc){1000} * *:: (?:abc)+d * *:: 0.044
(?:abc){1000} * *:: (?:abc)+?d * *:: 0.027
(?:abc){10000}d * *:: (?:abc)+d * *:: 0.002
(?:abc){10000}d * *:: (?:abc)+?d * *:: 0.002
(?:abc){10000} * *:: (?:abc)+d * *:: 4.547
(?:abc){10000} * *:: (?:abc)+?d * *:: 2.858

repeatably, irrespective of whether I delete the .pyc files first.

Why is only one timing affected, and why by so much? Python quits
after each run, so it can't be anything cumulative, surely. I've
considered things like caching, for example compiling the .pyc files
on the first run, but can't see why (or how) that would cause what I
see.

Can anyone suggest a factor that I might be missing? Any suggestions
welcome!
I've found the cause. It's the memory allocation.

For future reference, I think what's happening is that when Python
generates the .pyc files it's using more memory (grabbing it from the
system) so that when my code needs to allocate Python can do it more
quickly; if Python didn't have to generate the .pyc files then it has
to grab more from the system, which takes longer. Or something like
that.

Anyway, it's fixed. :-)
Aug 2 '08 #2

This discussion thread is closed

Replies have been disabled for this discussion.