Bytes | Software Development & Data Engineering Community

Regex speed

Hello,

I recently ported a simple utility script that analyzes a data file from
Perl to Python. It uses regex substitutions no more complex than

re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

When run without these regex substitutions, the two scripts' speeds are
nearly equal. With the regexes applied, however, the Python version's
running time is five times that of the Perl version, or more.

So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

--or--

Is there a Python interface for the PCRE library out there?

Thanks

Reinhold

--
[Windows is like] the German railway: you don't have to worry about
anything, you pay a surcharge for every little thing, the service is
lousy, strangers can board at any time, it is inflexible and
incompatible with all other means of transport. -- Florian Diesch in dcoulm
Jul 18 '05 #1
On Fri, 29 Oct 2004 20:06:05 +0200,
Reinhold Birkenfeld <re************************@wolke7.net> wrote:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')
You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('"') and s.endswith('"')'.
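A sketch of that equivalence (my own illustration with a made-up helper name, not code from the original posts):

```python
def strip_quotes(s):
    # Roughly what re3.sub(r"\1", s) does: without re.M (and ignoring
    # embedded newlines), the pattern only rewrites a string that is
    # quoted from end to end.
    if len(s) >= 2 and s.startswith('"') and s.endswith('"'):
        return s[1:-1]
    return s
```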
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?
The regex engine *is* implemented in C; look at Modules/_sre.c.
Is there a Python interface for the PCRE library out there?


PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.

--amk
Jul 18 '05 #2
A.M. Kuchling wrote:
On Fri, 29 Oct 2004 20:06:05 +0200,
Reinhold Birkenfeld <re************************@wolke7.net> wrote:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')


You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('"') and s.endswith('"')'.


You're right. The sub calls are these:

fro = re1.sub("", fro)
fro = re2.sub(r"\1", fro)
fro = re3.sub(r"\1", fro)

So at least the \s* are justified.

But my actual question is why Perl can run the same regexes in what
seems no time at all.
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?


The regex engine *is* implemented in C; look at Modules/_sre.c.


But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?
Is there a Python interface for the PCRE library out there?


PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.


You're right again. Is the pre module using the PCRE C library?

Reinhold

Jul 18 '05 #3
On Fri, 29 Oct 2004 20:35:28 +0200,
Reinhold Birkenfeld <re************************@wolke7.net> wrote:
But my actual question is why Perl can run the same regexes in what
seems no time at all.
Probably Perl's engine does some particular bit of optimization that
Python's engine isn't doing. I have no idea what optimization that might
be.
But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?
That code is the compiler that turns expressions into the bytecode that the
regex engine runs. Having it be in Python only matters to the performance
of re.compile(), and that function usually isn't a bottleneck.
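A rough way to see this (my sketch, not from the thread): time compilation and matching separately. re.purge() empties the module's pattern cache, so each re.compile() call actually runs the Python-level compiler; matching afterwards runs entirely in the C engine.

```python
import re
import timeit

def compile_once():
    re.purge()  # clear the pattern cache so the compiler really runs
    re.compile(r"\s*<.*>\s*")

pat = re.compile(r"\s*<.*>\s*")

t_compile = timeit.timeit(compile_once, number=200)
t_sub = timeit.timeit(lambda: pat.sub("", "foo <bar> baz"), number=200)
print(t_compile, t_sub)
```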
You're right again. Is the pre module using the PCRE C library?


An extensively hacked version of it. Modules/pypcre.c contains the bulk of
it; pcremodule.c is the Python interface to it.

--amk
Jul 18 '05 #4
Reinhold Birkenfeld wrote:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')
BTW, do you want those or
re1 = re.compile(r"\s*<[^>]*>\s*")
re2 = re.compile(r".*\(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\(([^)]*)\)")
then using re2.search instead of re2.match?
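That variant might look like this (my sketch; it falls back to the whole string on no match, which is what the sub() version effectively does, and like the suggested pattern it takes the first parenthesised group where the original greedy pattern takes the last):

```python
import re

re2 = re.compile(r"\(([^)]*)\)")

def inparens(text):
    # search() scans for the first parenthesised group; there is no
    # leading or trailing .* for the engine to match and backtrack over.
    m = re2.search(text)
    return m.group(1) if m is not None else text
```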
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?
It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than Perl. For example, you may be able to use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.find(end_c, i)
    if j == -1:
        return None
    return text[i+1:j]

(If you instead want what your original regexp says, use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.rfind(end_c)
    if j < i:  # includes 'j == -1' on find failure
        return None
    return text[i+1:j]
)

def find1(text):
    return find_text_in_matching_pairs(text, "<", ">")

def find2(text):
    return find_text_in_matching_pairs(text, "(", ")")

def find3(text):
    if text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return None
Is there a Python interface for the PCRE library out there?


Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=d...m=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.

Andrew
da***@dalkescientific.com
Jul 18 '05 #5
Andrew Dalke wrote:
Reinhold Birkenfeld wrote:
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')


BTW, do you want those or
re1 = re.compile(r"\s*<[^>]*>\s*")
re2 = re.compile(r".*\(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\(([^)]*)\)")
then using re2.search instead of re2.match?
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?


It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than Perl. For example, you may be able to use

def find_text_in_matching_pairs(text, start_c = "<", end_c = ">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.find(end_c, i)
    if j == -1:
        return None
    return text[i+1:j]


OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.
Is there a Python interface for the PCRE library out there?


Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=d...m=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.


I used 2.3.4.

Reinhold

Jul 18 '05 #6
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.


Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)

-Peter
Jul 18 '05 #7
Peter Hansen wrote:
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.


Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)


Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.

Anyway, for your pleasure <wink>, the code (it takes a list of newsgroup
postings and calculates a ratio followups/postings):

------------------------------------------------------------------
First, the Perl version (not written by me):
------------------------------------------------------------------

#!/usr/bin/perl -w

use strict;

die unless @ARGV;

my (%post,%fup);
while (<STDIN>)
{
    chomp;
    my (undef,$from,$mid,$ref) = split /\|/;

    $from =~ s|\s*<.*>\s*||;
    $from = $1 if $from =~ m|\((.*)\)|;
    $from =~ s|^"(.*)"$|$1|;

    my $arr = $post{$from} ||= [];
    push @$arr,$mid;

    $fup{$ref}++;
}

my @arr;
for (keys %post)
{
    my $arr = $post{$_};

    my $m = 0;
    $m += $fup{$_} || 0 for @$arr;

    my $n = @$arr;
    push @arr,[$m/$n,$n,$m,$_];
}

@arr = sort { $b->[1] <=> $a->[1] } @arr;
$#arr = $ARGV[0]-1;

print "Pos Relevanz Posts F'ups Poster\n";
print "--- -------- ----- ----- ------------------\n";

my $i = 1;
for (sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] } @arr)
{
    printf "%2d) %2.4f %4d %4d %s\n",$i++,@$_;
}

------------------------------------------------------------------
Original Python translation (with regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os, re

if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}

re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except: continue

    fro = re1.sub("", fro)
    fro = re2.sub(r"\1", fro)
    fro = re3.sub(r"\1", fro)

    post.setdefault(fro, []).append(mid)

    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []

for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1

------------------------------------------------------------------
Optimized Python (without regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os

if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}

def inparens(text):
    i = text.find("(")
    if i == -1: return text
    j = text.rfind(")")
    if j < i: return text
    return text[i+1:j]

def removeangles(text):
    i = text.find("<")
    if i == -1: return text
    j = text.rfind(">")
    if j < i: return text
    return text[0:i].rstrip() + text[j+1:].lstrip()

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except: continue

    fro = inparens(removeangles(fro))
    if fro.startswith('"'):
        if fro.endswith('"'):
            fro = fro[1:-1]

    post.setdefault(fro, []).append(mid)

    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []

for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1

------------------------------------------------------------------

Reinhold

Jul 18 '05 #8
Reinhold Birkenfeld wrote:
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.


I'm sorry, I managed to miss the significance of that
statement in your original post.

I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by aliasing re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.
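Something like this (my own sketch, reusing the first pattern from earlier in the thread; the function name is illustrative):

```python
import re

re1_sub = re.compile(r"\s*<.*>\s*").sub  # bound method, looked up once

def clean(lines):
    sub = re1_sub  # local alias: a fast slot lookup inside the loop
    return [sub("", line) for line in lines]
```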

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.

-Peter
Jul 18 '05 #9
Peter Hansen wrote:
Reinhold Birkenfeld wrote:
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.
I'm sorry, I managed to miss the significance of that
statement in your original post.


;)
I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by alias re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.


Well, I did alias the sub methods in that way:

re1sub = re.compile("whatever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.

Reinhold

Jul 18 '05 #10
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.


If your problem is that the re.compile() setup takes too long (compared
to the time taken to do the actual matching/replacing), my guess is that
you don't operate on that much data. And if that's the case, does it
really matter if it's a fraction of a second slower?

Note that str.find() currently uses the naive approach resulting in bad
time complexity, so it's not suitable for larger data sets. See
http://online.effbot.org/2004_08_01_archive.htm#find

Erik
Jul 18 '05 #11
Reinhold Birkenfeld wrote:
re1sub = re.compile("whatever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.


The above line just led me to realize that you are
doing everything at "module level" instead of inside
a function. That means all your variables are global,
instead of local, which means all variable lookups
are being done in a dictionary rather than taking
advantage of the significant optimization provided by
the use of locals (which use simple indexing operations).

Try sticking everything, including the above line
inside a def main() and call that at the end, for
comparison.
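A minimal sketch of that suggestion (my illustration; the names are made up). Inside a function, names like 'count' and 'line' are resolved through fast indexed local slots instead of the module's global dictionary:

```python
import sys

def main(stream=None):
    # Fall back to stdin as in the original script; a list of lines
    # works too, which also makes the function easy to test.
    if stream is None:
        stream = sys.stdin
    count = 0              # local variable: indexed-slot lookup
    for line in stream:    # so is 'line'
        count += len(line)
    return count

if __name__ == "__main__":
    print(main(["one line\n", "another\n"]))
```

The same wrapping applies to the whole script: move the regex setup and the main loop into main() and call it once at the bottom.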

(It's more likely at this point, however, that you
are quite right that the re.sub operations themselves
are taking most of the time, so this still won't
result in much overall improvement, I suspect.
Profiling would confirm that, of course.)

-Peter
Jul 18 '05 #12
Erik Heneryd wrote:
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.


If your problem is that the re.compile() setup takes too long (compared
to the time taken to do the actual matching/replacing), my guess is that
you don't operate on that much data. And if that's the case, does it
really matter if it's a fraction of a second slower?


My data file is 6 MB in size; I don't think that is too little.

Reinhold

Jul 18 '05 #13
Peter Hansen wrote:
Reinhold Birkenfeld wrote:
re1sub = re.compile("whatever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.


The above line just led me to realize that you are
doing everything at "module level" instead of inside
a function. That means all your variables are global,
instead of local, which means all variable lookups
are being done in a dictionary rather than taking
advantage of the significant optimization provided by
the use of locals (which use simple indexing operations).

Try sticking everything, including the above line
inside a def main() and call that at the end, for
comparison.


Done. The results are the following:

Original: 2.579 sec
sub method aliased to variable: 2.542 sec
everything stuffed in function: 2.422 sec

Without regex substitution: 0.523 sec
(but _with_ re.compile calls,
so this can't be the problem)

Perl version: 0.811 sec
Perl version without regexes: 0.568 sec

Python version using find(): 0.870 sec

(all values are averages)

So you're right, speed is getting better.

Reinhold

Jul 18 '05 #14
Reinhold Birkenfeld wrote:
Original: 2.579 sec
sub method aliased to variable: 2.542 sec
everything stuffed in function: 2.422 sec

Without regex substitution: 0.523 sec
(but _with_ re.compile calls,
so this can't be the problem)


Ah, thought you said something about compile() being the timesink, but I
guess I misunderstood...
Erik

Jul 18 '05 #15
