
re.search much slower than grep on some regular expressions

What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad
Jul 4 '08 #1
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?
Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string objects?

Jul 4 '08 #2
Bruno Desthuilliers wrote:
Henning_Thornblad wrote:
>What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string
objects?
Mmm... That aside, after testing it (building the string in a
somewhat more efficient way), the call to re.search does indeed take
ages to return. Please forget my previous post.
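(For reference, a minimal sketch of a more direct way to build that test string; building it takes no measurable time, so the slow part really is the re.search call:)

import re

row = "a" * 156000                   # build the 156000-character row in one step
print(re.search('[^ "=]*/', row))    # this call is what takes minutes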
Jul 4 '08 #3
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.
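For illustration, a rough timing sketch of that guard, reusing the 156000-character row from the original post (exact numbers will of course vary by machine):

import re
import timeit

row = "a" * 156000   # same worst-case input as in the original post

def guarded(row=row):
    # Cheap substring test first; only fall back to the regex when a '/'
    # is actually present in the row.
    return re.search('[^ "=]*/', row) if "/" in row else None

# The guarded call returns almost immediately, because "in" is a plain
# linear scan; the unguarded re.search on the same row is the slow case
# discussed in this thread.
print(timeit.timeit(guarded, number=100))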

Peter
Jul 4 '08 #4
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

- Paddy.
Jul 4 '08 #5
On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__*******@web.de> wrote:
Henning_Thornblad wrote:
>What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
>This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.
Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.
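A quick way to see that asymmetry (a rough sketch; the row is kept shorter than 156000 characters so the failing case finishes in a few seconds, and the timings are only illustrative):

import re
import time

def time_search(row):
    start = time.time()
    re.search('[^ "=]*/', row)
    return time.time() - start

no_slash = "a" * 10000          # the search fails and backtracks from every offset
with_slash = "a" * 10000 + "/"  # the search succeeds on the very first attempt

print("without '/': %.4f s" % time_search(no_slash))
print("with '/':    %.6f s" % time_search(with_slash))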

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is), but why would you say this particular
regex isn't common enough in real code?

filipe
Jul 4 '08 #6
On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@gmail.com> wrote:
On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.
This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I don't think it's possible to avoid quadratic behavior in regexps in
general, but clearly there are ways to optimize in some cases.

I'm guessing that grep builds a table of locations of individual
characters as it scans and, when the regexp exhausts the input, it
tries to backtrack to the last slash it saw, except there wasn't one
so it knows the regexp cannot be satisfied and it exits early.

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is),
I'm pretty sure it isn't: the regexp documentation makes no performance
claims that I'm aware of.

but why would you say this particular
regex isn't common enough in real code?
When re.search regexps start with things like [^...]* or .*, the
excluded characters are typically found frequently in the input. For
example, the pattern .*hello.* could be used to find a line with hello
in it, with the expectation that there are lots of newlines. But if
there aren't any newlines, the regexp wouldn't be very useful.

Carl Banks
Jul 5 '08 #7
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad
You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.

John Nagle
Jul 5 '08 #8
On Fri, 4 Jul 2008 20:34:03 -0700 (PDT), Carl Banks wrote:
On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@gmail.com> wrote:
>On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
>This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
>row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
[snip]
>Is this a bug in python?

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.
I did this:

$ python -m timeit -s "import re" "re.search( '[^13]*x', 900*'a' )"
100 loops, best of 3: 16.7 msec per loop

for values of 900 ranging from 300 to 1000, and the time taken
per loop was indeed quadratic.
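A small sweep along the same lines (a sketch; the absolute numbers will differ, but the time should roughly quadruple each time the size doubles if the behaviour is quadratic):

import re
import timeit

for n in (300, 600, 1200, 2400):
    t = timeit.timeit(lambda n=n: re.search('[^13]*x', n * 'a'), number=10)
    print("n=%5d  %.4f s for 10 searches" % (n, t))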

--
To email me, substitute nowhere->spamcop, invalid->net.
Jul 5 '08 #9
John Nagle wrote:
Henning_Thornblad wrote:
>What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.
Now that's premature optimization :-)

Apart from the fact that re.search() is executed only once in the above
script, the re library uses a caching scheme, so even if the re.search()
call were in a loop the overhead would be a few microseconds for the cache
lookup.
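A rough way to see that cache at work (a sketch; the two timings should come out close, with the precompiled pattern saving only the per-call cache lookup):

import re
import timeit

setup = "import re; pat = re.compile('ab*c'); s = 'xyz' * 1000 + 'abbbc'"

# Module-level function: looks the pattern up in re's internal cache on each call.
print(timeit.timeit("re.search('ab*c', s)", setup=setup, number=100000))

# Precompiled pattern object: skips that lookup.
print(timeit.timeit("pat.search(s)", setup=setup, number=100000))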

Peter
Jul 5 '08 #10
Paddy wrote:
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
So you're saying the Python algo is alternatively gifted...

Peter
Jul 5 '08 #11
Filipe Fernandes wrote:
but why would you say this particular
regex isn't common enough in real code?
As Carl says, it's not just the regex, it's the combination with a long
line that exposes the re library's weakness.

Peter

Jul 5 '08 #12
On Jul 5, 7:01 am, Peter Otten <__pete...@web.de> wrote:
Paddy wrote:
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

So you're saying the Python algo is alternatively gifted...

Peter
The following isn't the article I read on regexp types and their speed
differences but it does give info on regexp engine types:
http://books.google.co.uk/books?id=G...sult#PPA145,M1

- Paddy.
Jul 5 '08 #13
Paddy <pa*******@googlemail.com>:
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
>Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 5 '08 #14
On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@gmx.net> wrote:
Paddy <paddy3...@googlemail.com>:
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.
This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

I'm not sure of the specifics and maybe that is the case, but it could
also be a case of a different codebase which is optimized differently.
Carl Banks
Jul 5 '08 #15
Carl Banks <pa************@gmail.com>:
On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@gmx.net> wrote:
>Paddy <paddy3...@googlemail.com>:
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
>grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
>You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
>re.search('[^ "=]*/', row) if "/" in row else None
>might be good enough.
>Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.
My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like lookaheads or non-greedy
quantifiers.

I just wanted to illustrate that the speed of the given search is somehow
related to the complexity of the engine.

Btw, other pcre implementations are as slow as Python or "grep -P". I tried
a sample C++ program using pcre++ (a wrapper around libpcre) and saw it
running for just as long.
--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 5 '08 #16
On Jul 5, 6:44 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@gmx.net> wrote:
Carl Banks <pavlovevide...@gmail.com>:
On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@gmx.net> wrote:
Paddy <paddy3...@googlemail.com>:
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:
# grep -P '[^ "=]*/' input
-P tells grep to use real perl-compatible regular expressions.
This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look aheads or non-greedy
qualifiers.

I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.
I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)
Carl Banks
Jul 5 '08 #17
Paddy:
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....
Many Python installations include a TCL interpreter. TCL REs may be
better than Python ones, so it can be interesting to benchmark the
same RE with TCL, to see how much time it needs. If someone here knows
TCL, it may require writing just a few lines of TCL code (used through
tkinter, by direct call).

Bye,
bearophile
Jul 5 '08 #18
On Jul 5, 1:54 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)
It seems like an appropriate moment to point out *this* paper:

http://swtch.com/~rsc/regexp/regexp1.html

Apparently, grep and Tcl convert a regex to a finite state machine.
Matching is then *very* fast: essentially linear time in the
length of the string being matched, even in the worst case.
Though it is possible for the size of the finite state machine
to grow exponentially with the size of the regex.

But not all PCREs can be converted to a finite state machine, so
Perl, Python, etc. use a backtracking approach, which has
exponential running time in the worst case. In particular,
it's not possible to use a finite state machine to represent
a regular expression that contains backreferences.
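Two tiny illustrations of that split (a sketch; the pathological case is deliberately kept small, because its running time grows roughly exponentially with each extra 'a'):

import re

# Backreferences step outside regular languages entirely: \1 must match
# whatever group 1 happened to capture, which no finite state machine can do.
print(re.search(r'(a+)b\1', 'aaabaaa').group(0))    # prints 'aaabaaa'

# A classic worst case for backtracking engines: nested repetition that
# fails only at the very end of the string.
print(re.match(r'(a+)+b', 'a' * 22 + 'c'))          # None, after a noticeable pause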

Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression

Mark
Jul 5 '08 #19
On Jul 5, 4:13 pm, Mark Dickinson <dicki...@gmail.com> wrote:
It seems like an appropriate moment to point out *this* paper:

http://swtch.com/~rsc/regexp/regexp1.html
That's the one!

Thanks Mark.

- Paddy.

Jul 5 '08 #20


Mark Dickinson wrote:
On Jul 5, 1:54 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
Part of the problem is a lack of agreement on what
'regular expression' means.
Twenty years ago, there was. Calling an extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression
Jul 5 '08 #21
Terry Reedy <tj*****@udel.edu>:
Mark Dickinson wrote:
>On Jul 5, 1:54 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
>Part of the problem is a lack of agreement on what
'regular expression' means.

Twenty years ago, there was. Calling a extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
Yet, to many programmers the Hummer is not just a car, it's _the_ car.
Everything else is just silly, bloody old posix crap ;)

(meaning, that pcres are often seen as true regular expressions, and
everything else is referred to as "being somehow restricted")

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 5 '08 #22
On Jul 5, 11:13 am, Mark Dickinson <dicki...@gmail.com> wrote:
Apparently, grep and Tcl convert a regex to a finite state machine.
....
But not all PCREs can be converted to a finite state machine
....
Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression
Formally, there's no lack of agreement that I'm aware of. Anyone
formally defining it will use something along the lines of what
wikipedia specifies (with "Regular expressions in this sense can
express the regular languages, exactly the class of languages accepted
by finite state automata."). Colloquially it's used to mean "any text-
matching scheme that looks vaguely like sed/grep regexes".

PCREs are certainly not "real" regular expressions, but they are
colloquially.
Jul 6 '08 #23
Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:
I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.

Btw, other pcre implementation are as slow as Python or "grep -P". I tried
a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
running equally long.
So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.

-- [mdw]
Jul 6 '08 #24
Mark Wooding <md*@distorted.org.uk>:
Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:
>I just wanted to illustrate, that the speed of the given search is
somehow related to the complexity of the engine.

Btw, other pcre implementation are as slow as Python or "grep -P". I
tried a sample C++-code using pcre++ (a wrapper around libpcre) and saw
it running equally long.

So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.
I don't know about the latter, but perl doesn't use any smarter algorithm;
it just relies heavily on memoization to gain speed.

This makes perl perform poorly in other situations, as mentioned in the
paper cited by Mark Dickinson:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

In Python, on the other hand, this pattern works fine, at the cost of
performance losses on other patterns.

It'd be interesting to know how CL-PPCRE performs here (I don't know this
library).
Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.
I completely agree. I'd just believe that the combination of some finite
state machine for "classic" expressions with some backtracking code is
terribly hard to implement. But I'm not an expert in this; probably some
libraries out there already do this. In that case, it'd be time to give
Python's re engine a more sophisticated implementation ;)

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 6 '08 #25


Sebastian "lunar" Wiesner wrote:
I completely agree. I'd just believe, that the combination of some finite
state machine for "classic" expressions with some backtracking code is
terribly hard to implement. But I'm not an expert in this, probably some
libraries out there already do this. In this case, it'd be time to give
pythons re engine a more sophisticated implementation ;)
One thing to remember in making comparisons is that Python moved to its
own implementation to support unicode as well as extended ascii (bytes
in 3.0). Does grep do that? Has pcre, which Python once used, been
upgraded to do that?

Jul 6 '08 #26
Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:
# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'
(Did you really run that as root?)
It'd be interesting to know, how CL-PPCRE performs here (I don't know this
library).
Stack overflow condition. :-(

-- [mdw]
Jul 6 '08 #27
Mark Wooding <md*@distorted.org.uk>:
Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:
># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)
How come you think so?

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 7 '08 #28
On Mon, 07 Jul 2008 16:44:22 +0200, Sebastian \"lunar\" Wiesner wrote:
Mark Wooding <md*@distorted.org.uk>:
>Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:
>># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)

How come, that you think so?
The '#' is usually the prompt of root accounts while ordinary users get a '$'.

Ciao,
Marc 'BlackJack' Rintsch
Jul 7 '08 #29
When trying to find an alternative way of solving my problem I found
that running this script:

#!/usr/bin/env python

import re

row=""
for a in range(156000):
    row+="a"
print "How many, dude?"
print re.search('/[^ "=]*',row) (the / has moved)

wouldn't take even a second (The re.search part of course)

This is very strange to me, since this,
in theory, is the same problem as before.

/Henning Thornblad
Jul 7 '08 #30
Paddy wrote:
On Jul 4, 1:36 pm, Peter Otten <__pete...@web.de> wrote:
>Henning_Thornblad wrote:
>>What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
>>This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....
I can and do :-)

It's a major problem that regular expression matching in Python has
exponential worst-case complexity when polynomial algorithms (for a subset
of regexp expressions, e.g. excluding back-references) are well-known.

It rules out using Python for entire classes of applications where
regexp matching is on the critical path.

Kris
Jul 7 '08 #31
On Jul 8, 2:51 am, Henning Thornblad <Henning.Thornb...@gmail.com>
wrote:
When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row=""
for a in range(156000):
row+="a"
print "How many, dude?"
print re.search('/[^ "=]*',row) (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.
In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).
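A quick sketch of that asymmetry, timing both orderings of the same two pieces against an all-'a' row (sizes kept modest; the X*Y form grows roughly quadratically while the YX* form stays linear):

import re
import timeit

for n in (2000, 4000, 8000):
    row = n * 'a'
    slow = timeit.timeit(lambda r=row: re.search('[^ "=]*/', r), number=1)
    fast = timeit.timeit(lambda r=row: re.search('/[^ "=]*', r), number=1)
    print("n=%6d   X*Y: %9.4f s   YX*: %9.6f s" % (n, slow, fast))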

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
Jul 8 '08 #33
On Jul 4, 6:43 am, Henning_Thornblad <Henning.Thornb...@gmail.com>
wrote:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?
You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *
lexicon = Lexicon([
    (Rep(AnyBut(' "='))+Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break
Jul 8 '08 #34
samwyse wrote:
On Jul 4, 6:43 am, Henning_Thornblad <Henning.Thornb...@gmail.com>
wrote:
>What can be the cause of the large difference between re.search and
grep?
>While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
Very interesting! Thanks very much for the pointer.

Kris

Jul 8 '08 #35
samwyse wrote:
On Jul 4, 6:43 am, Henning_Thornblad <Henning.Thornb...@gmail.com>
wrote:
>What can be the cause of the large difference between re.search and
grep?
>While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *
lexicon = Lexicon([
    (Rep(AnyBut(' "='))+Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Kris
Jul 8 '08 #36
On Jul 8, 2:48 am, John Machin <sjmac...@lexicon.net> wrote:
On Jul 8, 2:51 am, Henning Thornblad <Henning.Thornb...@gmail.com>
wrote:
When trying to find an alternative way of solving my problem i found
that running this script:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
print "How many, dude?"
print re.search('/[^ "=]*',row) (the / has moved)
wouldn't take even a second (The re.search part of course)
This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
What we had was a regex that searched textfiles, row for row,
for an absolute path starting/ending with ", = or blank space.

The path has a known signature, which is pretty unique, and the regex
looked like this:
[^ ="]*/[^ ="]*signature[^ =]*/

So we didn't really see any performance loss until we got a textfile
with some really long rows, one reaching 156,000 characters. We
removed the cause of these long rows,
but we also decided to tune the regex to make it more scalable.

I have thought of two possible solutions:
1. Invert [^ ="] so that it becomes [a lot of characters you could find
in a path]
2. Add [ ="]: [ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="] and just trim
them later
They both perform well, but I can't figure out which one is less ugly.
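For what it's worth, a rough sketch of option 2 (the literal 'signature' here is just a stand-in for the real path signature; because every candidate match must now start at a delimiter character, an all-'a' row is rejected in linear time):

import re

# Option 2: require a delimiter on both sides of the path, then trim it off.
pat = re.compile(r'[ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="]')

def find_path(line):
    m = pat.search(' %s ' % line)    # pad so delimiters also exist at the line ends
    return m.group(0).strip(' ="') if m else None

print(find_path('path="/opt/tools/signature-2.1/" rest'))   # -> /opt/tools/signature-2.1/
print(find_path(156000 * 'a'))                               # -> None, and quickly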

Anyway, I'm glad I asked the question. I have learned more about regex
than I ever thought I'd need :)

Many Thanks
Henning Thornblad
Jul 9 '08 #37
On Jul 9, 2:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
samwyse wrote:
On Jul 4, 6:43 am, Henning_Thornblad <Henning.Thornb...@gmail.com>
wrote:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
I haven't tested this, but I think it would do what you want:
from Plex import *
lexicon = Lexicon([
    (Rep(AnyBut(' "='))+Str('/'),  TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).
Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re? How large is "large"?
What kind of text?

Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's home-
page.

Cheers,
John
Jul 9 '08 #38
John Machin wrote:
>Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re?
Trivial stuff like:

(Str('error in pkg_delete'), ('mtree', 'mtree')),
(Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
(Str('list of extra files and directories'), ('mtree', 'mtree')),
(Str('list of files present before this port was installed'), ('mtree', 'mtree')),
(Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

(re('Configuration .* not supported'), ('arch', 'arch')),

(re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'), ('configure_error', 'configure')),
....

There are about 150 of them and I want to find which is the first match
in a text file that ranges from a few KB up to 512MB in size.
How large is "large"?
What kind of text?
It's compiler/build output.
Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's home-
page.
Thanks, looks interesting but I don't think it is the best fit here. I
would like to avoid spawning hundreds of processes to process each file
(since I have tens of thousands of them to process).

Kris

Jul 9 '08 #39
-On [20080709 14:08], Kris Kennaway (kr**@FreeBSD.org) wrote:
>It's compiler/build output.
Sounds like the FreeBSD ports build cluster. :)

Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.

Also, a while ago I did tests with various GCC compilers and their effect on
Python running time as well as Intel's cc. Intel won on (nearly) all
accounts, meaning it was faster overall.
Off the top of my head: GCC 4.1.x was faster than GCC 4.2.x.
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Beware of the fury of the patient man...
Jul 9 '08 #40
Jeroen Ruigrok van der Werven wrote:
-On [20080709 14:08], Kris Kennaway (kr**@FreeBSD.org) wrote:
>It's compiler/build output.

Sounds like the FreeBSD ports build cluster. :)
Yes indeed!
Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.
I am pretty sure the problem is algorithmic, not bad byte code :) If it
was a matter of a few % then that is in the scope of compiler tweaks,
but we're talking orders of magnitude.

Kris
Also, a while ago I did tests with various GCC compilers and their effect on
Python running time as well as Intel's cc. Intel won on (nearly) all
accounts, meaning it was faster overall.

From the top of my mind: GCC 4.1.x was faster than GCC 4.2.x.
Jul 9 '08 #41
On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
samwyse wrote:
You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).
That's funny, the compilation is almost instantaneous for me.
However, I just tested it against several files, the first containing
4875*'a', the rest each twice the size of the previous. And you're
right, for each doubling of the file size, the match takes four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs

And here's my (probably over-engineered) testbed:

from __future__ import with_statement
from os.path import exists
from timeit import Timer

from Plex import *

filename = "file-%d.txt"

def create_files(n):
    for x in range(0,n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875*2**x)*'a',

def compile_lexicon():
    global lexicon
    lexicon = Lexicon([
        (Rep(AnyBut(' "='))+Str('/'),  TEXT),
        (AnyBut('\n'), IGNORE),
    ])

def test(fname):
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:
                break

def my_timed_test(func_name, *args):
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import "+func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0,4):
        my_timed_test('test', filename%x)
Jul 9 '08 #42
samwyse wrote:
On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
>samwyse wrote:
>>You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
>Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

That's funny, the compilation is almost instantaneous for me.
My lexicon was quite a bit bigger, containing about 150 strings and regexps.
However, I just tested it to several files, the first containing
4875*'a', the rest each twice the size of the previous. And you're
right, for each doubling of the file size, the match take four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:
The docs say it is supposed to be linear in the file size ;-) ;-(

Kris

Jul 9 '08 #43
On Jul 9, 10:06 pm, Kris Kennaway <k...@FreeBSD.org> wrote:
John Machin wrote:
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).
Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re?

Trivial stuff like:

(Str('error in pkg_delete'), ('mtree', 'mtree')),
(Str('filesystem was touched prior to .make install'),
('mtree', 'mtree')),
(Str('list of extra files and directories'), ('mtree', 'mtree')),
(Str('list of files present before this port was installed'),
('mtree', 'mtree')),
(Str('list of filesystem changes from before and after'),
('mtree', 'mtree')),

(re('Configuration .* not supported'), ('arch', 'arch')),

(re('(configure: error:|Script.*configure.*failed
unexpectedly|script.*failed: here are the contents of)'),
('configure_error', 'configure')),
...

There are about 150 of them and I want to find which is the first match
in a text file that ranges from a few KB up to 512MB in size.
How large is "large"?
What kind of text?

It's compiler/build output.
Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's home-
page.

Thanks, looks interesting but I don't think it is the best fit here. I
would like to avoid spawning hundreds of processes to process each file
(since I have tens of thousands of them to process).
Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.
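A minimal sketch of that two-stage idea, using plain substring tests as a stand-in for a real Aho-Corasick automaton (the rules and labels below are placeholders, not the actual build-cluster patterns):

import re

# First stage: one cheap literal per rule.
prefilters = [
    ('error in pkg_delete', 'mtree'),
    ('Configuration',       'arch'),    # literal prefix of the re rule below
]

# Second stage: the full regex, only run when its literal was seen.
refilters = {
    'arch': re.compile(r'Configuration .* not supported'),
}

def classify(line):
    for literal, label in prefilters:
        if literal in line:
            rx = refilters.get(label)
            if rx is None or rx.search(line):
                return label
    return None

print(classify('Configuration amd64-unknown not supported'))   # -> 'arch'
print(classify('nothing interesting here'))                     # -> None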

Cheers,
John
Jul 10 '08 #44
John Machin wrote:
Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.
If it was just strings, then sure... with regexps it might be possible to
make it work, but it doesn't sound particularly maintainable. I will
stick with my shell script until Python gets a regexp engine of
equivalent performance.

Kris
Jul 10 '08 #45

On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:
On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
samwyse wrote:
You might want to look at Plex.
>http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

That's funny, the compilation is almost instantaneous for me.
However, I just tested it to several files, the first containing
4875*'a', the rest each twice the size of the previous. And you're
right, for each doubling of the file size, the match take four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs
Sounds like a good strategy would be to find the smallest chunk of the
file that matches can't cross, and iterate your search on units of those
chunks. For example, if none of your regexes cross line boundaries,
search each line of the file individually. That may help turn around
the speed degradation you're seeing.
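A short sketch of that chunking idea (the file name is hypothetical, and it assumes no pattern ever needs to cross a line boundary):

import re

pat = re.compile('[^ "=]*/')

def first_match(path):
    # Scan line by line so a pathological line only costs time proportional
    # to its own length (squared, in the worst case), not the whole file's.
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            m = pat.search(line)
            if m:
                return lineno, m.group(0)
    return None

# print(first_match('build.log'))    # hypothetical log file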

Cheers,
Cliff

Jul 10 '08 #46
Marc 'BlackJack' Rintsch <bj****@gmx.net>:
On Mon, 07 Jul 2008 16:44:22 +0200, Sebastian \"lunar\" Wiesner wrote:
>Mark Wooding <md*@distorted.org.uk>:
>>Sebastian "lunar" Wiesner <ba***********@gmx.net> wrote:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)

How come, that you think so?

The '#' is usually the prompt of root accounts while ordinary users get a
'$'.
Oh ... I just used "#" as a general placeholder, because I didn't want to
copy my whole two-line prompt ;) I can assure you that I did not try this
as root ;) and will use "$" in future ;)

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jul 10 '08 #47
J. Cliff Dyer wrote:
On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:
>On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
>>samwyse wrote:
You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).
That's funny, the compilation is almost instantaneous for me.
However, I just tested it to several files, the first containing
4875*'a', the rest each twice the size of the previous. And you're
right, for each doubling of the file size, the match take four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs

Sounds like a good strategy would be to find the smallest chunk of the
file that matches can't cross, and iterate your search on units of those
chunks. For example, if none of your regexes cross line boundaries,
search each line of the file individually. That may help turn around
the speed degradation you're seeing.
That's what I'm doing. I've also tried various other things like
mmapping the file and searching it at once, etc, but almost all of the
time is spent in the regexp engine so optimizing other things only gives
marginal improvement.

Kris
Jul 10 '08 #48
