473,325 Members | 2,870 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,325 software developers and data experts.

regular expressions ... slow

Hi,

Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
?

Are there any plans to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???

Greetings, Uwe
Nov 17 '08 #1
8 3090
On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
<ro*************@googlemail.comwrote:
Hi,

Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?
Yes, it's been brought up here, on python-dev and python-ideas several
times in the past year and a half.
Are there any plans to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???
I don't think anyone has taken any concrete steps towards re-writing
the regular expression module. My understanding from previous threads
on the topic is that the core developers would be willing to accept a
re-written regular expression engine, but none of them are interested
in doing it themselves. The general consensus seemed to be that the
pathological cases hilited in that article are not very common in the
real world, and that simply switching to the alternative approach
advocated there would require giving up things like backreferences
that are actually used in the real world, which is probably
unacceptable.

Some references:
http://mail.python.org/pipermail/pyt...ch/072241.html
http://mail.python.org/pipermail/pyt...ry/427604.html
http://mail.python.org/pipermail/pyt...il/000405.html

Personally, I know very little about the nitty gritty of regular
expression engines, but there's some reference material for you to
chew on.

--
Jerry
Nov 17 '08 #2
Uwe Schmitt wrote:
Hi,

Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
?
Near the end:

While writing the text editor sam [6] in the early 1980s, Rob Pike wrote
a new regular expression implementation, which Dave Presotto extracted
into a library that appeared in the Eighth Edition. Pike's
implementation incorporated submatch tracking into an efficient NFA
simulation but, like the rest of the Eighth Edition source, was not
widely distributed. Pike himself did not realize that his technique was
anything new. Henry Spencer reimplemented the Eighth Edition library
interface from scratch, but using backtracking, and released his
implementation into the public domain. It became very widely used,
eventually serving as the basis for the slow regular expression
implementations mentioned earlier: Perl, PCRE, Python, and so on. (In
his defense, Spencer knew the routines could be slow, and he didn't know
that a more efficient algorithm existed. He even warned in the
documentation, “Many users have found the speed perfectly adequate,
although replacing the insides of egrep with this code would be a
mistake.”) Pike's regular expression implementation, extended to support
Unicode, was made freely available with sam in late 1992, but the
particularly efficient regular expression search algorithm went
unnoticed. The code is now available in many forms: as part of sam, as
Plan 9's regular expression library, or packaged separately for Unix.
Ville Laurikari independently discovered Pike's algorithm in 1999,
developing a theoretical foundation as well [2].

So, depending on the license, there appears to be a potential basis for
a Python unicode version.

Are there any plans to speed up Pythons regular expression module ?
Yes, but I don't know how much people with such plans have considered
the adaptive approach recommended.
is the example in this artricle too far from reality ???
The example is, but I don't think the problem illustrated is.

tjr


Nov 17 '08 #3
Jerry Hill wrote:
On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
<ro*************@googlemail.comwrote:
>Hi,

Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

Yes, it's been brought up here, on python-dev and python-ideas several
times in the past year and a half.
>Are there any plans to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???

I don't think anyone has taken any concrete steps towards re-writing
the regular expression module. My understanding from previous threads
on the topic is that the core developers would be willing to accept a
re-written regular expression engine, but none of them are interested
in doing it themselves. The general consensus seemed to be that the
pathological cases hilited in that article are not very common in the
real world, and that simply switching to the alternative approach
advocated there would require giving up things like backreferences
that are actually used in the real world, which is probably
unacceptable.

Some references:
http://mail.python.org/pipermail/pyt...ch/072241.html
http://mail.python.org/pipermail/pyt...ry/427604.html
http://mail.python.org/pipermail/pyt...il/000405.html

Personally, I know very little about the nitty gritty of regular
expression engines, but there's some reference material for you to
chew on.
Searching the tracker for open items with 'regular expression' in the
text brings up about 20 items to also consider.
Nov 17 '08 #4
On Nov 17, 10:24*pm, Terry Reedy <tjre...@udel.eduwrote:
Jerry Hill wrote:
On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
<rocksportroc...@googlemail.comwrote:
Hi,
Is anobody aware of this post: *http://swtch.com/~rsc/regexp/regexp1..html?
Yes, it's been brought up here, on python-dev and python-ideas several
times in the past year and a half.
Are there any plans *to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???
I don't think anyone has taken any concrete steps towards re-writing
the regular expression module. *My understanding from previous threads
on the topic is that the core developers would be willing to accept a
re-written regular expression engine, but none of them are interested
in doing it themselves. *The general consensus seemed to be that the
pathological cases hilited in that article are not very common in the
real world, and that simply switching to the alternative approach
advocated there would require giving up things like backreferences
that are actually used in the real world, which is probably
unacceptable.
Some references:
http://mail.python.org/pipermail/pyt...ch/072241.html
http://mail.python.org/pipermail/pyt...ry/427604.html
http://mail.python.org/pipermail/pyt...il/000405.html
Personally, I know very little about the nitty gritty of regular
expression engines, but there's some reference material for you to
chew on.

Searching the tracker for open items with 'regular expression' in the
text brings up about 20 items to also consider.
Work is currently being done on the re module.

I don't think the DFA approach works permits backreferences, capture
groups or non-greedy repetition, but it certainly could be used if
those features aren't required by the regular expression, so the
answer is definitely maybe! :-)
Nov 17 '08 #5
En Mon, 17 Nov 2008 19:37:18 -0200, Uwe Schmitt
<ro*************@googlemail.comescribió:
Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
?

Are there any plans to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???
It's a pathological case. There are some known cases of horrible behaviour
that are explained in many books on regular expressions. If you avoid
those constructs when writing the RE, you're reasonably safe. I for
myself avoid using RE at all :)

--
Gabriel Genellina

Nov 18 '08 #6
Kay Schluehr wrote:
All of this is prototyped in Python and it is still work in progress.
As long as development has not reached a stable state I refuse to
rebuild the system in an optimized C version.
And rightfully so:

1) the approach is algorithmically better, so it may even beat the current
C implementation by design.

2) switching languages before finishing and benchmarking the prototype is a
premature optimisation. It wouldn't be the first prototype going into
production.

3) even before considering a reimplementation, you should throw it into
Cython to translate it into C code, and then benchmark that.

Stefan
Nov 18 '08 #7
On 18 Nov., 18:47, Stefan Behnel <stefan...@behnel.dewrote:
Kay Schluehr wrote:
All of this is prototyped in Python and it is still work in progress.
As long as development has not reached a stable state I refuse to
rebuild the system in an optimized C version.

And rightfully so:

1) the approach is algorithmically better, so it may even beat the current
C implementation by design.

2) switching languages before finishing and benchmarking the prototype is a
premature optimisation. It wouldn't be the first prototype going into
production.

3) even before considering a reimplementation, you should throw it into
Cython to translate it into C code, and then benchmark that.

Stefan
I fully agree and in fact the Trail parser generator contains a single
extension module called cyTrail which is written in Cython - it's not
just a trivial recompilation of Python in Cython but it uses all kinds
of cdefs.

There is just a difference between optimizing existing code using the
best techniques available and writing code from scratch for speed. I
consider this as a different, subsequent project ( call it cTrail )
and I want to have more fine control than being possible with Cython -
actually I do want to understand the code in a simple language as C. I
have to idea what the code does, generated by Cython. There are entire
layers that can be stripped off when not dealing with Python types and
dynamic memory allocation.

Kay
Nov 19 '08 #8
Uwe Schmitt wrote:
Are there any plans to speed up Pythons regular expression module ?
Or
is the example in this artricle too far from reality ???
<http://regex.info/blog/2006-09-15/247>
Nov 19 '08 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Fuzzyman | last post by:
I'm writing a song lyric database (effectively to drive a projector - so the database contains the full song lyrics). I'm using a nice simple Python database called KirbyBase which uses regular...
20
by: Toby | last post by:
Could some tell how I could create a search replace Regular Express in .net where is would match MY_STRING_TO_BE_CONVERTED and replace with MyStringToBeConverted
3
by: Tom | last post by:
I have struggled with the issue of whether or not to use Regular Expressions for a long time now, and after implementing many text manipulating solutions both ways, I've found that writing...
5
by: James Dean | last post by:
I wanted to use regular expressions but unfortunetely it is too slow.....Should they be so slow or am i doing something wrong. I am reading in bytes from a file then converting them to char then...
4
by: Együd Csaba | last post by:
Hi All, I'd like to "compress" the following two filter expressions into one - assuming that it makes sense regarding query execution performance. .... where (adate LIKE "2004.01.10 __:30" or...
13
by: blair.bethwaite | last post by:
Hi all, Does anybody know of a module that allows you to enumerate all the strings a particular regular expression describes? Cheers, -Blair
25
by: Mike | last post by:
I have a regular expression (^(.+)(?=\s*).*\1 ) that results in matches. I would like to get what the actual regular expression is. In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART...
13
by: Wiseman | last post by:
I'm kind of disappointed with the re regular expressions module. In particular, the lack of support for recursion ( (?R) or (?n) ) is a major drawback to me. There are so many great things that can...
47
by: Henning_Thornblad | last post by:
What can be the cause of the large difference between re.search and grep? This script takes about 5 min to run on my computer: #!/usr/bin/env python import re row="" for a in range(156000):...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.