Hello,
I recently ported a simple utility script that analyzes a data file from
Perl to Python. It uses regex substitutions no more complex than
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')
Run without these regex substitutions, the two scripts are nearly equal
in speed. With the regexes applied, however, the Python version's running
time is five or more times that of the Perl version's.
So my question is: Why is the re module implemented in pure Python?
Isn't it possible to integrate it into the core or rewrite it in C?
--or--
Is there a Python interface for the PCRE library out there?
Thanks
Reinhold
--
[Windows is like] the German railway: you don't have to worry about
anything, you pay a surcharge for every little thing, the service is lousy,
strangers can board at any time, and it's inflexible and incompatible with
every other means of transport. -- Florian Diesch in dcoulm
On Fri, 29 Oct 2004 20:06:05 +0200,
Reinhold Birkenfeld <re************************@wolke7.net> wrote:
> re1 = re.compile(r"\s*<.*>\s*")
> re2 = re.compile(r".*\((.*)\).*")
> re3 = re.compile(r'^"(.*)"$')
You should post the actual code, because these substitutions could be made
more efficient. For example, why are the bracketing \s* in re1 and the
bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
to 'if s.startswith('"') and s.endswith('"')'.
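A quick sketch of that equivalence (my example, not from the original post; note the two only agree for single-line input, since `.*` doesn't cross newlines):

```python
import re

re3 = re.compile(r'^"(.*)"$')

def strip_quotes(s):
    # Without re.M, ^ and $ anchor to the whole string, so for
    # single-line input this matches re3.sub(r"\1", s).
    if len(s) >= 2 and s.startswith('"') and s.endswith('"'):
        return s[1:-1]
    return s

print(strip_quotes('"John Doe"'))    # John Doe
print(re3.sub(r"\1", '"John Doe"'))  # John Doe
```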
> So my question is: Why is the re module implemented in pure Python?
> Isn't it possible to integrate it into the core or rewrite it in C?
The regex engine *is* implemented in C; look at Modules/_sre.c.
> Is there a Python interface for the PCRE library out there?
PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4. You
could try 'import pre' to use it, but I don't expect it to be significantly
faster.
--amk
A.M. Kuchling wrote:
> On Fri, 29 Oct 2004 20:06:05 +0200, Reinhold Birkenfeld
> <re************************@wolke7.net> wrote:
>> re1 = re.compile(r"\s*<.*>\s*")
>> re2 = re.compile(r".*\((.*)\).*")
>> re3 = re.compile(r'^"(.*)"$')
> You should post the actual code, because these substitutions could be
> made more efficient. For example, why are the bracketing \s* in re1 and
> the bracketing .* in re2 there? re3 isn't using re.M, so it's equivalent
> to 'if s.startswith('"') and s.endswith('"')'.
You're right. The sub calls are these:
fro = re1.sub("", fro)
fro = re2.sub(r"\1", fro)
fro = re3.sub(r"\1", fro)
So at least the \s* are justified.
But my actual question is why Perl can run the same regexes in what
seems no time at all.

>> So my question is: Why is the re module implemented in pure Python?
>> Isn't it possible to integrate it into the core or rewrite it in C?
> The regex engine *is* implemented in C; look at Modules/_sre.c.
But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
there?

>> Is there a Python interface for the PCRE library out there?
> PCRE was used from versions 1.5 up to 2.3; it'll be gone in Python 2.4.
> You could try 'import pre' to use it, but I don't expect it to be
> significantly faster.
You're right again. Is the pre module using the PCRE C library?
Reinhold
On Fri, 29 Oct 2004 20:35:28 +0200,
Reinhold Birkenfeld <re************************@wolke7.net> wrote:
> But my actual question is why Perl can run the same regexes in what
> seems no time at all.
Probably Perl's engine does some particular bit of optimization that
Python's engine isn't doing. I have no idea what optimization that might
be.
> But /usr/lib/python2.3/sre*.py are relatively large for that; what's in
> there?
That code is the compiler that turns expressions into the bytecode that the
regex engine runs. Having it be in Python only matters to the performance
of re.compile(), and that function usually isn't a bottleneck.
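A rough way to see this split with timeit (my sketch; absolute numbers will vary by machine, and repeated re.compile() calls are mostly cache hits after the first):

```python
import re
import timeit

PAT = r"\s*<.*>\s*"
compiled = re.compile(PAT)  # the Python-level compiler runs here, once

# Per-call cost of re.compile (cache lookup after the first call)
# versus the C matching engine doing the actual substitution work.
t_compile = timeit.timeit(lambda: re.compile(PAT), number=10000)
t_sub = timeit.timeit(lambda: compiled.sub("", "Foo <foo@bar.invalid> "),
                      number=10000)
print("compile: %.4fs  sub: %.4fs" % (t_compile, t_sub))
```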
> You're right again. Is the pre module using the PCRE C library?
An extensively hacked version of it. Modules/pypcre.c contains the bulk of
it; pcremodule.c is the Python interface to it.
--amk
Reinhold Birkenfeld wrote:
> re1 = re.compile(r"\s*<.*>\s*")
> re2 = re.compile(r".*\((.*)\).*")
> re3 = re.compile(r'^"(.*)"$')
BTW, do you want those or
re1 = re.compile(r"\s*<[^>]*>\s*")
re2 = re.compile(r".*\(([^)]*)\).*")
(For the last it doesn't make much difference. There will only be
a single backtrack.)
For that matter, what about
re2 = re.compile(r"\(([^)]*)\)")
then using re2.search instead of re2.match?
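For the substitution use in the original script, the trimmed pattern with search() would look roughly like this (sample address is mine; note the two variants can differ when a string holds several paren groups, since search() takes the first and the greedy original takes the last):

```python
import re

re2_full = re.compile(r".*\((.*)\).*")  # original, used with sub(r"\1", s)
re2_trim = re.compile(r"\(([^)]*)\)")   # trimmed, used with search()

def extract_parens(s):
    # Return the contents of the first (...) group, or s unchanged.
    m = re2_trim.search(s)
    return m.group(1) if m else s

addr = "jd@example.invalid (John Doe)"
print(re2_full.sub(r"\1", addr))  # John Doe
print(extract_parens(addr))       # John Doe
```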
> So my question is: Why is the re module implemented in pure Python?
> Isn't it possible to integrate it into the core or rewrite it in C?
It isn't. It's written in C. I've not done timing tests
between Perl and Python's engines for a long time, so I can't
provide feedback on that aspect.
One thing about Python is that we tend to use regexps less
often than Perl. For example, you may be able to use
def find_text_in_matching_pairs(text, start_c="<", end_c=">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.find(end_c, i)
    if j == -1:
        return None
    return text[i+1:j]
(If you instead want what your original regexp says, use the following.)

def find_text_in_matching_pairs(text, start_c="<", end_c=">"):
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.rfind(end_c)
    if j < i:  # includes 'j == -1' on find failure
        return None
    return text[i+1:j]
def find1(text):
    return find_text_in_matching_pairs(text, "<", ">")

def find2(text):
    return find_text_in_matching_pairs(text, "(", ")")

def find3(text):
    if text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return None
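A standalone usage sketch of the rfind variant above (sample data mine; the slice is text[i+1:j], i.e. everything strictly between the delimiters):

```python
def find_text_in_matching_pairs(text, start_c="<", end_c=">"):
    # Self-contained copy of the helper above: first start_c,
    # last end_c, return what lies between them.
    i = text.find(start_c)
    if i == -1:
        return None
    j = text.rfind(end_c)
    if j < i:  # includes 'j == -1' on find failure
        return None
    return text[i + 1:j]

print(find_text_in_matching_pairs("John Doe <jd@example.invalid>"))
# jd@example.invalid
print(find_text_in_matching_pairs("jd@example.invalid (John Doe)", "(", ")"))
# John Doe
```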
> Is there a Python interface for the PCRE library out there?
Python used to use PCRE instead of its current sre, back
in the 1.5 days. Python 1.6/2.x switched to sre in part
because of the need for Unicode support.
The old benchmarks compared pcre and sre and found that
sre was faster. See http://groups.google.com/groups?oi=d...m=an_588925502
Which versions of Python and Perl are you using for
the tests? I know there has been some non-trivial work
for the 2.3 version of Python.
Andrew
da***@dalkescientific.com
Andrew Dalke wrote:
> Reinhold Birkenfeld wrote:
>> re1 = re.compile(r"\s*<.*>\s*")
>> re2 = re.compile(r".*\((.*)\).*")
>> re3 = re.compile(r'^"(.*)"$')
> BTW, do you want those or
> re1 = re.compile(r"\s*<[^>]*>\s*")
> re2 = re.compile(r".*\(([^)]*)\).*")
> (For the last it doesn't make much difference. There will only be
> a single backtrack.)
> For that matter, what about
> re2 = re.compile(r"\(([^)]*)\)")
> then using re2.search instead of re2.match?
>> So my question is: Why is the re module implemented in pure Python?
>> Isn't it possible to integrate it into the core or rewrite it in C?
> It isn't. It's written in C. I've not done timing tests between Perl
> and Python's engines for a long time, so I can't provide feedback on
> that aspect.
> One thing about Python is that we tend to use regexps less often than
> Perl. For example, you may be able to use
> def find_text_in_matching_pairs(text, start_c="<", end_c=">"):
>     i = text.find(start_c)
>     if i == -1:
>         return None
>     j = text.find(end_c, i)
>     if j == -1:
>         return None
>     return text[i+1:j]
OK, thank you. I now got rid of all the regexes, and - surprise,
surprise - the speeds are almost equal. The bitter thing about it is
that there are now twelve LOC more in Python that don't make
understanding the code easier.
So the Perl regex engine seems to be highly optimized, at least for
simple expressions.

>> Is there a Python interface for the PCRE library out there?
> Python used to use PCRE instead of its current sre, back in the 1.5
> days. Python 1.6/2.x switched to sre in part because of the need for
> Unicode support.
> The old benchmarks compared pcre and sre and found that sre was faster.
> See http://groups.google.com/groups?oi=d...m=an_588925502
> Which versions of Python and Perl are you using for the tests? I know
> there has been some non-trivial work for the 2.3 version of Python.
I used 2.3.4.
Reinhold
Reinhold Birkenfeld wrote:
> OK, thank you. I now got rid of all the regexes, and - surprise,
> surprise - the speeds are almost equal. The bitter thing about it is
> that there are now twelve LOC more in Python that don't make
> understanding the code easier.
> So the Perl regex engine seems to be highly optimized, at least for
> simple expressions.
Perhaps it's not just your regexes... as Andrew said, "You should post
the actual code ..." (Preferably both the Perl and the Python, if you
are trying to make this into a them vs. us kind of thing, as you
seem to be doing.)
-Peter
Peter Hansen wrote:
> Reinhold Birkenfeld wrote:
>> OK, thank you. I now got rid of all the regexes, and - surprise,
>> surprise - the speeds are almost equal. The bitter thing about it is
>> that there are now twelve LOC more in Python that don't make
>> understanding the code easier.
>> So the Perl regex engine seems to be highly optimized, at least for
>> simple expressions.
> Perhaps it's not just your regexes... as Andrew said, "You should post
> the actual code ..." (Preferably both the Perl and the Python, if you
> are trying to make this into a them vs. us kind of thing, as you seem
> to be doing.)
Well, commenting out the regex substitutions in both versions leads to
equal execution times, as I mentioned earlier, so it has to be my regexes.
Anyway, for your pleasure <wink>, the code (it takes a list of newsgroup
postings and calculates a ratio followups/postings):
------------------------------------------------------------------
First, the Perl version (not coded by me):
------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
die unless @ARGV;
my (%post,%fup);
while (<STDIN>)
{
    chomp;
    my (undef,$from,$mid,$ref) = split /\|/;
    $from =~ s|\s*<.*>\s*||;
    $from = $1 if $from =~ m|\((.*)\)|;
    $from =~ s|^"(.*)"$|$1|;
    my $arr = $post{$from} ||= [];
    push @$arr,$mid;
    $fup{$ref}++;
}
my @arr;
for (keys %post)
{
    my $arr = $post{$_};
    my $m = 0;
    $m += $fup{$_} || 0 for @$arr;
    my $n = @$arr;
    push @arr,[$m/$n,$n,$m,$_];
}
@arr = sort { $b->[1] <=> $a->[1] } @arr;
$#arr = $ARGV[0]-1;
print "Pos Relevanz Posts F'ups Poster\n";
print "--- -------- ----- ----- ------------------\n";
my $i = 1;
for (sort { $b->[0] <=> $a->[0] || $b->[1] <=> $a->[1] } @arr)
{
    printf "%2d) %2.4f %4d %4d %s\n",$i++,@$_;
}
------------------------------------------------------------------
Original Python translation (with regexes):
------------------------------------------------------------------
#!/usr/bin/python
import sys, os, re
if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}
re1 = re.compile(r"\s*<.*>\s*")
re2 = re.compile(r".*\((.*)\).*")
re3 = re.compile(r'^"(.*)"$')

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except ValueError:
        continue
    fro = re1.sub("", fro)
    fro = re2.sub(r"\1", fro)
    fro = re3.sub(r"\1", fro)
    post.setdefault(fro, []).append(mid)
    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []
for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]
print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"
res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1
------------------------------------------------------------------
Optimized Python (without regexes):
------------------------------------------------------------------
#!/usr/bin/python
import sys, os
if len(sys.argv) == 1:
    sys.exit(0)

post = {}
fup = {}

def inparens(text):
    i = text.find("(")
    if i == -1: return text
    j = text.rfind(")")
    if j < i: return text
    return text[i+1:j]

def removeangles(text):
    i = text.find("<")
    if i == -1: return text
    j = text.rfind(">")
    if j < i: return text
    return text[0:i].rstrip() + text[j+1:].lstrip()

for line in sys.stdin:
    line = line.rstrip("\n")
    try:
        nix, fro, mid, ref = line.split("|")
    except ValueError:
        continue
    fro = inparens(removeangles(fro))
    if fro.startswith('"') and fro.endswith('"'):
        fro = fro[1:-1]
    post.setdefault(fro, []).append(mid)
    if ref in fup: fup[ref] += 1
    else: fup[ref] = 1

res = []
for name, arr in post.iteritems():
    m = 0
    for item in arr:
        if item in fup:
            m += fup[item]
    n = len(arr)
    res.append((float(m)/n, n, m, name))

res.sort(lambda x, y: cmp(y[1], x[1]))
del res[int(sys.argv[1]):]
print "Pos Relevanz Posts F'ups Poster"
print "--- -------- ----- ----- ------------------"
res.sort(lambda x, y: cmp(y[0], x[0]) or cmp(y[1], x[1]))
i = 1
for item in res:
    print "%2d)" % i, " %2.4f %4d %4d %s" % item
    i += 1
------------------------------------------------------------------
Reinhold
Reinhold Birkenfeld wrote:
> Well, commenting out the regex substitutions in both versions leads to
> equal execution times, as I mentioned earlier, so it has to be my
> regexes.
I'm sorry, I managed to miss the significance of that
statement in your original post.
I wonder what the impact is of the fact that the re.sub
operations are function calls in Python. The overhead
of function calls is relatively high in Python, so
perhaps an experiment would be revealing. Can you try
with a dummy re.sub() call (create a dummy "re" object
that is global to your module, with a dummy .sub()
method that just does "return") and compare the
speed of that with the version without the re.sub
calls at all? Probably a waste of time, but perhaps
the actual re operations are not so slow after all,
but the calls themselves are.
If that's true, you would at least get a tiny improvement by aliasing
re.sub to a local name before the loop, to avoid the global lookup for
"re" and then the attribute lookup for "sub" on each of the three calls,
each time through the loop.
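That aliasing might look like the following (my sketch; the pattern is from the thread, the sample lines are made up):

```python
import re

# Bind the bound method once, outside the loop: each call is then a
# single local lookup instead of a global lookup plus attribute lookup.
re1_sub = re.compile(r"\s*<.*>\s*").sub

lines = ["Foo <a@b.invalid> |m1|r1", "Bar <c@d.invalid> |m2|r2"]
cleaned = [re1_sub("", line.split("|")[0]) for line in lines]
print(cleaned)  # ['Foo', 'Bar']
```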
If you can show the format of the input data, I would
be happy to try a little profiling, if you haven't already
done that to prove that the bulk of the time is actually
in the re.sub operation itself.
-Peter
Peter Hansen wrote:
> Reinhold Birkenfeld wrote:
>> Well, commenting out the regex substitutions in both versions leads to
>> equal execution times, as I mentioned earlier, so it has to be my
>> regexes.
> I'm sorry, I managed to miss the significance of that statement in your
> original post.
;)
> I wonder what the impact is of the fact that the re.sub operations are
> function calls in Python. The overhead of function calls is relatively
> high in Python, so perhaps an experiment would be revealing. Can you try
> with a dummy re.sub() call (create a dummy "re" object that is global to
> your module, with a dummy .sub() method that just does "return") and
> compare the speed of that with the version without the re.sub calls at
> all? Probably a waste of time, but perhaps the actual re operations are
> not so slow after all, but the calls themselves are.
> If that's true, you would at least get a tiny improvement by aliasing
> re.sub to a local name before the loop, to avoid the global lookup for
> "re" and then the attribute lookup for "sub" on each of the three calls,
> each time through the loop.
> If you can show the format of the input data, I would be happy to try a
> little profiling, if you haven't already done that to prove that the
> bulk of the time is actually in the re.sub operation itself.
Well, I did alias the sub methods in that way:
re1sub = re.compile("whatever").sub
There was a performance gain, but it was about 1/100th of the speed
difference.
Reinhold
Reinhold Birkenfeld wrote:
> OK, thank you. I now got rid of all the regexes, and - surprise,
> surprise - the speeds are almost equal. The bitter thing about it is
> that there are now twelve LOC more in Python that don't make
> understanding the code easier.
If your problem is that the re.compile() setup takes too long (compared
to the time taken to do the actual matching/replacing), my guess is that
you don't operate on that much data. And if that's the case, does it
really matter if it's a fraction of a second slower?
Note that str.find() currently uses the naive approach resulting in bad
time complexity, so it's not suitable for larger data sets. See http://online.effbot.org/2004_08_01_archive.htm#find
Erik
Reinhold Birkenfeld wrote:
> re1sub = re.compile("whatever").sub
> There was a performance gain, but it was about 1/100th of the speed
> difference.
The above line just led me to realize that you are
doing everything at "module level" instead of inside
a function. That means all your variables are global,
instead of local, which means all variable lookups
are being done in a dictionary rather than taking
advantage of the significant optimization provided by
the use of locals (which use simple indexing operations).
Try sticking everything, including the above line
inside a def main() and call that at the end, for
comparison.
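The move into a function might look like this (my sketch; tally() and the sample lines are made up, and it takes an iterable instead of sys.stdin so it is easy to exercise):

```python
def tally(lines):
    # Inside a function, post/fup and the loop variables are fast locals
    # (resolved by array index) instead of dictionary-based globals.
    post = {}
    fup = {}
    for line in lines:
        try:
            nix, fro, mid, ref = line.rstrip("\n").split("|")
        except ValueError:
            continue
        post.setdefault(fro, []).append(mid)
        fup[ref] = fup.get(ref, 0) + 1
    return post, fup

post, fup = tally(["1|alice|m1|r0\n", "2|bob|m2|m1\n"])
print(post)  # {'alice': ['m1'], 'bob': ['m2']}
print(fup)   # {'r0': 1, 'm1': 1}
```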
(It's more likely at this point, however, that you
are quite right that the re.sub operations themselves
are taking most of the time, so this still won't
result in much overall improvement, I suspect.
Profiling would confirm that, of course.)
-Peter
Erik Heneryd wrote:
> Reinhold Birkenfeld wrote:
>> OK, thank you. I now got rid of all the regexes, and - surprise,
>> surprise - the speeds are almost equal. The bitter thing about it is
>> that there are now twelve LOC more in Python that don't make
>> understanding the code easier.
> If your problem is that the re.compile() setup takes too long (compared
> to the time taken to do the actual matching/replacing), my guess is that
> you don't operate on that much data. And if that's the case, does it
> really matter if it's a fraction of a second slower?
My data file is 6 MB in size; I don't think that counts as too little data.
Reinhold
Peter Hansen wrote:
> Reinhold Birkenfeld wrote:
>> re1sub = re.compile("whatever").sub
>> There was a performance gain, but it was about 1/100th of the speed
>> difference.
> The above line just led me to realize that you are doing everything at
> "module level" instead of inside a function. That means all your
> variables are global, instead of local, which means all variable lookups
> are being done in a dictionary rather than taking advantage of the
> significant optimization provided by the use of locals (which use simple
> indexing operations).
> Try sticking everything, including the above line, inside a def main()
> and call that at the end, for comparison.
Done. The results are the following (all values are averages):

Original:                        2.579 sec
sub method aliased to variable:  2.542 sec
everything stuffed in function:  2.422 sec
Without regex substitution:      0.523 sec
    (but _with_ re.compile calls, so this can't be the problem)
Perl version:                    0.811 sec
Perl version without regexes:    0.568 sec
Python version using find():     0.870 sec
So you're right, speed is getting better.
Reinhold
Reinhold Birkenfeld wrote:
> Original: 2.579 sec
> sub method aliased to variable: 2.542 sec
> everything stuffed in function: 2.422 sec
> Without regex substitution: 0.523 sec (but _with_ re.compile calls, so
> this can't be the problem)
Ah, thought you said something about compile() being the timesink, but I
guess I misunderstood...
Erik