473,785 Members | 2,488 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Regex speed

Hello,

I recently ported a simple utility script to analyze a data file from
Perl to Python that uses regex substitutions, not more complex than

re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')

When run without these regex substitutions, the scripts' speed is nearly
equal. However, with the regexes applied, the Python version's running
time is five times or more of the Perl version's.

So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?

--or--

Is there a Python interface for the PCRE library out there?

Thanks

Reinhold

--
[Windows ist wie] die Bahn: Man muss sich um nichts kuemmern, zahlt fuer
jede Kleinigkeit einen Aufpreis, der Service ist mies, Fremde koennen
jederzeit einsteigen, es ist unflexibel und zu allen anderen Verkehrs-
mitteln inkompatibel. -- Florian Diesch in dcoulm
Jul 18 '05 #1
14 1894
On Fri, 29 Oct 2004 20:06:05 +0200,
Reinhold Birkenfeld <re************ ************@wo lke7.net> wrote:
re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')
You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('" ') and s.endswith('"') '.
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?
The regex engine *is* implemented in C; look at Modules/_sre.c.
Is there a Python interface for the PCRE library out there?


PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.

--amk
Jul 18 '05 #2
A.M. Kuchling wrote:
On Fri, 29 Oct 2004 20:06:05 +0200,
Reinhold Birkenfeld <re************ ************@wo lke7.net> wrote:
re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')


You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('" ') and s.endswith('"') '.


You're right. The sub calls are those:

fro = re1.sub("", fro)
fro = re2.sub(r"\1", fro)
fro = re3.sub(r"\1", fro)

So at least the \s* are justified.

But my actual question is why Perl can run the same regexes in what
seems no time at all.
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?


The regex engine *is* implemented in C; look at Modules/_sre.c.


But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?
Is there a Python interface for the PCRE library out there?


PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.


You're right again. Is the pre module using the PCRE C library?

Reinhold

--
[Windows ist wie] die Bahn: Man muss sich um nichts kuemmern, zahlt fuer
jede Kleinigkeit einen Aufpreis, der Service ist mies, Fremde koennen
jederzeit einsteigen, es ist unflexibel und zu allen anderen Verkehrs-
mitteln inkompatibel. -- Florian Diesch in dcoulm
Jul 18 '05 #3
On Fri, 29 Oct 2004 20:35:28 +0200,
Reinhold Birkenfeld <re************ ************@wo lke7.net> wrote:
But my actual question is why Perl can run the same regexes in what
seems no time at all.
Probably Perl's engine does some particular bit of optimization that
Python's engine isn't doing. I have no idea what optimization that might
be.
But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?
That code is the compiler that turns expressions into the bytecode that the
regex engine runs. Having it be in Python only matters to the performance
of re.compile(), and that function usually isn't a bottleneck.
You're right again. Is the pre module using the PCRE C library?


An extensively hacked version of it. Modules/pypcre.c contains the bulk of
it; pcremodule.c is the Python interface to it.

--amk
Jul 18 '05 #4
Reinhold Birkenfeld wrote:
re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')
BTW, do you want those or
re1 = re.compile(r"\s *<[^>]*>\s*")
re2 = re.compile(r".* \(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\( ([^)]*)\)")
then using re2.search instead of re2.match?
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?
It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than Perl. For example, you may be able to use

def find_text_in_ma tching_pairs(te xt, start_c = "<", end_c = ">"):
i = text.find(start _c)
if i == -1:
return None
j = text.find(end_c , i)
if j == -1:
return None
return text[i+i:j]

(If you instead what your original regexp says, use
def find_text_in_ma tching_pairs(te xt, start_c = "<", end_c = ">"):
i = text.find(start _c)
if i == -1:
return None
j = text.rfind(end_ c)
if j < i: # includes 'j == -1' on find failure
return None
return text[i+1:j]
def find1(text):
return find_text_in_ma tching_pairs(te xt, "<", ">")

def find2(text):
return find_text_in_ma tching_pairs(te xt, "(", ")")

def find3(text):
if text.startswith ('"') and text.endswith(' "'):
return text[1:-1]
return None
Is there a Python interface for the PCRE library out there?


Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=d...m=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.

Andrew
da***@dalkescie ntific.com
Jul 18 '05 #5
Andrew Dalke wrote:
Reinhold Birkenfeld wrote:
re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')


BTW, do you want those or
re1 = re.compile(r"\s *<[^>]*>\s*")
re2 = re.compile(r".* \(([^)]*)\).*")

(For the last it doesn't make much difference. There will only be
a single backtrack.)

For that matter, what about
re2 = re.compile(r"\( ([^)]*)\)")
then using re2.search instead of re2.match?
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?


It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.

One thing about Python is that we tend to use regexps less
often than Perl. For example, you may be able to use

def find_text_in_ma tching_pairs(te xt, start_c = "<", end_c = ">"):
i = text.find(start _c)
if i == -1:
return None
j = text.find(end_c , i)
if j == -1:
return None
return text[i+i:j]


OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.
Is there a Python interface for the PCRE library out there?


Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.

The old benchmarks compared pcre and sre and found that
sre was faster. See
http://groups.google.com/groups?oi=d...m=an_588925502

Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.


I used 2.3.4.

Reinhold

--
[Windows ist wie] die Bahn: Man muss sich um nichts kuemmern, zahlt fuer
jede Kleinigkeit einen Aufpreis, der Service ist mies, Fremde koennen
jederzeit einsteigen, es ist unflexibel und zu allen anderen Verkehrs-
mitteln inkompatibel. -- Florian Diesch in dcoulm
Jul 18 '05 #6
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.


Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)

-Peter
Jul 18 '05 #7
Peter Hansen wrote:
Reinhold Birkenfeld wrote:
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.

So the Perl regex engine seems to be highly optimized, at least for
simple expressions.


Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)


Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.

Anyway, for your pleasure <wink>, the code (it takes a list of newsgroup
postings and calculates a ratio followups/postings):

------------------------------------------------------------------
At first the perl version (not coded by me):
------------------------------------------------------------------

#!/usr/bin/perl -w

use strict;

die unless @ARGV;

my (%post,%fup);
while (<STDIN>)
{
chomp;
my (undef,$from,$m id,$ref) = split /\|/;

$from =~ s|\s*<.*>\s*||;
$from = $1 if $from =~ m|\((.*)\)|;
$from =~ s|^"(.*)"$|$1| ;

my $arr = $post{$from} ||= [];
push @$arr,$mid;

$fup{$ref}++;
}

my @arr;
for (keys %post)
{
my $arr = $post{$_};

my $m = 0;
$m += $fup{$_} || 0 for @$arr;

my $n = @$arr;
push @arr,[$m/$n,$n,$m,$_];
}

@arr = sort { $b->[1] <=> $a->[1] } @arr;
$#arr = $ARGV[0]-1;

print "Pos Relevanz Posts F'ups Poster\n";
print "--- -------- ----- ----- ------------------\n";

my $i = 1;
for (sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] } @arr)
{
printf "%2d) %2.4f %4d %4d %s\n",$i++,@$_;
}

------------------------------------------------------------------
Original Python translation (with regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os, re

if len(sys.argv) == 1:
sys.exit(0)

post = {}
fup = {}

re1 = re.compile(r"\s *<.*>\s*")
re2 = re.compile(r".* \((.*)\).*")
re3 = re.compile(r'^" (.*)"$')

for line in sys.stdin:
line = line.rstrip("\n ")
try:
nix, fro, mid, ref = line.split("|")
except: continue

fro = re1.sub("", fro)
fro = re2.sub(r"\1", fro)
fro = re3.sub(r"\1", fro)

post.setdefault (fro, []).append(mid)

if ref in fup: fup[ref] += 1
else: fup[ref] = 1

res = []

for name, arr in post.iteritems( ):
m = 0
for item in arr:
if item in fup:
m += fup[item]
n = len(arr)
res.append((flo at(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
print "%2d)" % i, " %2.4f %4d %4d %s" % item
i += 1

------------------------------------------------------------------
Optimized Python (without regexes):
------------------------------------------------------------------

#!/usr/bin/python

import sys, os

if len(sys.argv) == 1:
sys.exit(0)

post = {}
fup = {}

def inparens(text):
i = text.find("(")
if i == -1: return text
j = text.rfind(")")
if j < i: return text
return text[i+1:j]

def removeangles(te xt):
i = text.find("<")
if i == -1: return text
j = text.rfind(">")
if j < i: return text
return text[0:i].rstrip() + text[j+1:].lstrip()

for line in sys.stdin:
line = line.rstrip("\n ")
try:
nix, fro, mid, ref = line.split("|")
except: continue

fro = inparens(remove angles(fro))
if fro.startswith( '"'):
if fro.endswith('" '):
fro = fro[1:-1]

post.setdefault (fro, []).append(mid)

if ref in fup: fup[ref] += 1
else: fup[ref] = 1

res = []

for name, arr in post.iteritems( ):
m = 0
for item in arr:
if item in fup:
m += fup[item]
n = len(arr)
res.append((flo at(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]

print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"

res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
print "%2d)" % i, " %2.4f %4d %4d %s" % item
i += 1

------------------------------------------------------------------

Reinhold

--
[Windows ist wie] die Bahn: Man muss sich um nichts kuemmern, zahlt fuer
jede Kleinigkeit einen Aufpreis, der Service ist mies, Fremde koennen
jederzeit einsteigen, es ist unflexibel und zu allen anderen Verkehrs-
mitteln inkompatibel. -- Florian Diesch in dcoulm
Jul 18 '05 #8
Reinhold Birkenfeld wrote:
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.


I'm sorry, I managed to miss the significance of that
statement in your original post.

I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by alias re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.

-Peter
Jul 18 '05 #9
Peter Hansen wrote:
Reinhold Birkenfeld wrote:
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.
I'm sorry, I managed to miss the significance of that
statement in your original post.


;)
I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.

If that's true, you would at least get a tiny improvement
by alias re.sub to a local name before the loop, to
avoid the global lookup for "re" and then the attribute
lookup for "sub" on each of the three calls, each time
through the loop.

If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.


Well, I did alias the sub methods in that way:

re1sub = re.compile("wha tever").sub

There was a performance gain, but it was about 1/100th of the speed
difference.

Reinhold

--
[Windows ist wie] die Bahn: Man muss sich um nichts kuemmern, zahlt fuer
jede Kleinigkeit einen Aufpreis, der Service ist mies, Fremde koennen
jederzeit einsteigen, es ist unflexibel und zu allen anderen Verkehrs-
mitteln inkompatibel. -- Florian Diesch in dcoulm
Jul 18 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
5730
by: alphatan | last post by:
Is there relative source or document for this purpose? I've searched the index of "Mastering Regular Expression", but cannot get the useful information for C. Thanks in advanced. -- Learning is to improve, but not to prove.
2
4285
by: John Grandy | last post by:
Is it advisable to compile a Regex for a massively scalable ASP.NET web-application ? How exactly does this work ? Do you create a separate class library and expose the Regex.Replace() as a method ..... ..... or can you just use the syntax :
4
4546
by: Brian Henry | last post by:
I have phone numbers like this in a data table 123-435-1234 1231231234 432.234.2321 they all have different formatting, what I want to do is get them all formatted like this (123) 123-1234
13
2376
by: Chris Lieb | last post by:
I am trying to write a regex that will parse BBcode into HTML using JavaScript. Everything was going smoothly using the string class replace() operator with regex's until I got to the list tag. Implementing the list tag itself was fairly easy. What was not was trying to handle the list items. For some reason, in BBcode, they didn't bother defining an end tag for a list item. I guess that they designed it with bad old HTML 3.2 in mind...
15
3234
by: Kay Schluehr | last post by:
I have a list of strings ls = and want to create a regular expression sx from it, such that sx.match(s) yields a SRE_Match object when s starts with an s_i for one i in . There might be relations between those strings: s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme case would be ls = . For this reason SRE_Match should provide the longest possible match. Is there a Python module able to create an optimized regex rx...
7
2589
by: Extremest | last post by:
I am using this regex. static Regex paranthesis = new Regex("(\\d*/\\d*)", RegexOptions.IgnoreCase); it should find everything between parenthesis that have some numbers onyl then a forward slash then some numbers. For some reason I am not getting that. It won't work at all in 2.0
17
1993
by: garrickp | last post by:
While creating a log parser for fairly large logs, we have run into an issue where the time to process was relatively unacceptable (upwards of 5 minutes for 1-2 million lines of logs). In contrast, using the Linux tool grep would complete the same search in a matter of seconds. The search we used was a regex of 6 elements "or"ed together, with an exclusionary set of ~3 elements. Due to the size of the files, we decided to run these line...
15
50261
by: morleyc | last post by:
Hi, i would like to remove a number of characters from my string (\t \r \n which are throughout the string), i know regex can do this but i have no idea how. Any pointers much appreciated. Chris
6
2074
by: | last post by:
Hi all, Sorry for the lengthy post but as I learned I should post concise-and-complete code. So the code belows shows that the execution of ValidateAddress consumes a lot of time. In the test it is called a 100 times but in my real app it could be called 50000 or more times. So my question is if it is somehow possible to speed this up and if so how
0
9480
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10325
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10147
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
9950
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
7499
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6739
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5511
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4050
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2879
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.