473,795 Members | 2,914 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Regex Speed

While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements. Due to the size of the files, we
decided to run these line by line, and due to the need of regex
expressions, we could not use more traditional string find methods.

We did pre-compile the regular expressions, and attempted tricks such
as map to remove as much overhead as possible.

With the known limitations of not being able to slurp the entire log
file into memory, and the need to use regular expressions, do you have
an ideas on how we might speed this up without resorting to system
calls (our current "solution") ?

Feb 20 '07
17 1994
John Machin wrote:
Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html
(which Russ Cox (author of the paper you cited) seems not to have
read).
NR-grep looks interesting, I'll read that. Thanks.
Cox uses a "pathologic al regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python perl and friends use backtracking and go off
the planet.

It might be pathological but based on the original posters timings his
situation seems to relate.
My main point was that its quite possible he isn't going to get faster
than grep regardless of
the language he uses and if grep wins, use it. I frequently do.
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :
One of these days. :)

Feb 21 '07 #11
On Feb 20, 6:14 pm, Pop User <popu...@christ est2.dc.k12us.c omwrote:
Its very hard to beat grep depending on the nature of the regex you are
searching using. The regex engines in python/perl/php/ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html
Some darned good reading. And it explains what happened fairly well.
Thanks!
And python 2.5.2.

2.5.2? Who needs crystal balls when you've got a time machine? Or did
you mean 2.5? Or 1.5.2 -- say it ain't so, Joe!
2.5. I'm not entirely sure where I got that extra 2. I blame Monday.

In short... avoid using re as a sledgehammer against every problem. I
had a feeling that would be the case.

Feb 21 '07 #12
In article <11************ *********@v33g2 000cwv.googlegr oups.com>,
"John Machin" <sj******@lexic on.netwrote:
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.

The cheap way in terms of programmer time is to pipe out to grep or
awk on this one.
Feb 21 '07 #13
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.
actually there is a python binding in the tre source with an example
python script so it is already done.

Feb 22 '07 #14
On Feb 21, 10:34 am, garri...@gmail. com wrote:
On Feb 20, 6:14 pm, Pop User <popu...@christ est2.dc.k12us.c omwrote:
http://swtch.com/~rsc/regexp/regexp1.html
Going back a bit on a tangent, the author of this citation states that
any regex can be expressed as a DFA machine. However, while
investigating this more I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct? Can you think of a deterministic method of computing
this expression? It would be easier with a NFA machine, but given that
the Python method of computing RE's involves pre-compiling a re
object, optimizing the matching engine would make the most sense to
me.

Here's what I have so far:

class State(object):
def __init__(self):
self.nextState = {}
self.nextStateK eys = []
self.prevState = None
self.isMatchSta te = True
def setNextState(se lf, chars, iNextState):
self.nextState[chars] = iNextState
self.nextStateK eys = self.nextState. keys()
self.isMatchSta te = False
def setPrevState(se lf, iPrevState):
self.prevState = iPrevState
def moveToNextState (self, testChar):
if testChar in self.nextStateK eys:
return self.nextState[testChar]
else:
return None

class CompiledRegex(o bject):
def __init__(self, startState):
self.startState = startState
def match(self, matchStr):
match_set = []
currentStates = [self.startState]
nextStates = [self.startState]
for character in matchStr:
for state in currentStates:
nextState = state.moveToNex tState(characte r)
if nextState is not None:
nextStates.appe nd(nextState)
if nextState.isMat chState:
print "Match!"
return
currentStates = nextStates
nextStates = [self.startState]
print "No Match!"

def compile(regexSt r):
startState = State()
currentState = startState
backRefState = None
lastChar = ""
for character in regexStr:
if character == "+":
currentState.se tNextState(last Char, currentState)
elif character == "|":
currentState = startState
elif character == "?":
backRefState = currentState.pr evState
elif character == "(":
# Implement "("
pass
elif character == ")":
# Implement ")"
pass
elif character == "*":
currentState = currentState.pr evState
currentState.se tNextState(last Char, currentState)
else:
testRepeatState = currentState.mo veToNextState(c haracter)
if testRepeatState is None:
newState = State()
newState.setPre vState(currentS tate)
currentState.se tNextState(char acter, newState)
if backRefState is not None:
backRefState.se tNextState(char acter, newState)
backRefState = None
currentState = newState
else:
currentState = testRepeatState
lastChar = character
return CompiledRegex(s tartState)
>>a = compile("ab+c")
a.match("abc" )
Match!
>>a.match("abbc ")
Match!
>>a.match("ac ")
No Match!
>>a = compile("ab+c|a bd")
a.match("abc" )
Match!
>>a.match("abbc ")
Match!
>>a.match("ac ")
No Match!
>>a.match("abd" )
Match!
>>a.match("abbd ")
Match!
>>>
Feb 23 '07 #15
ga******@gmail. com wrote:
the author of this citation states that
any regex can be expressed as a DFA machine. However ...
I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct?
No. Any NFA can be converted to an equivalent DFA.
This is how scanner generators like Lex work -- they
first construct an NFA from the regex, and then
convert it to a DFA. Going directly from the regex
to a DFA, like you're trying to do, would be a lot
harder, and I'd be surprised if anyone ever does
it that way.

There's a description of the NFA-to-DFA algorithm
here:

http://www.gamedev.net/reference/art...rticle2170.asp

--
Greg
Feb 24 '07 #16
On Feb 24, 10:15 am, garri...@gmail. com wrote:
On Feb 21, 10:34 am, garri...@gmail. com wrote:
On Feb 20, 6:14 pm, Pop User <popu...@christ est2.dc.k12us.c omwrote:
>http://swtch.com/~rsc/regexp/regexp1.html

Going back a bit on a tangent, the author of this citation states that
any regex can be expressed as a DFA machine. However, while
investigating this more I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct?
No.
Can you think of a deterministic method of computing
this expression?
Firstly rewrite a bit:

ab+c|abd
a(b+c|bd)
a(bb*c|bd)
ab(b*c|d)
ab(b+c|c|d)

Here's a DFA that recognises that:
State 0:
a -1
State 1:
b -2
State 2:
b -3 # start of the b+c branch
c -4 # the c branch
d -4 # the d branch
State 3:
b -3
c -4
State 4:
accepting state
It would be easier with a NFA machine,
What is "It"?
but given that
the Python method of computing RE's involves pre-compiling a re
object,
AFAIK all serious regex engines precompile a regex into an internal
representation which can be saved for reuse.
optimizing the matching engine
What does that mean?
would make the most sense to
me.
Compared to what alternatives?
>
Here's what I have so far:
[big snip]
>a = compile("ab+c|a bd")
[snip]
>a.match("abbd" )
Match!
Bzzzt. Neither "ab+c" nor "abd" match a prefix of "abbd".

HTH,
John

Feb 24 '07 #17
On Feb 24, 11:51 am, greg <g...@cosc.cant erbury.ac.nzwro te:
garri...@gmail. com wrote:
the author of this citation states that
any regex can be expressed as a DFA machine. However ...
I appear to have found one example of a regex
which breaks this assumption.
"ab+c|abd"
Am I correct?

No. Any NFA can be converted to an equivalent DFA.
Correct. However ...
This is how scanner generators like Lex work -- they
first construct an NFA from the regex, and then
convert it to a DFA. Going directly from the regex
to a DFA, like you're trying to do, would be a lot
harder, and I'd be surprised if anyone ever does
it that way.
>From "Compilers; Principles, Techniques, and Tools" aka "the dragon
book" by Aho, Sethi and Ullman, 1986, page 134: "The first algorithm
is suitable for inclusion in a Lex compiler because it constructs a
DFA directly from a regular expression, without constructing an
intermediate NFA along the way."
>
There's a description of the NFA-to-DFA algorithm
here:

http://www.gamedev.net/reference/art...rticle2170.asp
which is on a really serious site (pop-up flashing whizz-bangs
inciting one to get an iPod now!) and which uses the (a|b)*abb example
from the dragon book (and the diagram of its Thompson-constructed NFA)
without any credit or mention of the book, in fact no references or
attributions at all.
Feb 24 '07 #18

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
5730
by: alphatan | last post by:
Is there relative source or document for this purpose? I've searched the index of "Mastering Regular Expression", but cannot get the useful information for C. Thanks in advanced. -- Learning is to improve, but not to prove.
7
2590
by: Extremest | last post by:
I am using this regex. static Regex paranthesis = new Regex("(\\d*/\\d*)", RegexOptions.IgnoreCase); it should find everything between parenthesis that have some numbers onyl then a forward slash then some numbers. For some reason I am not getting that. It won't work at all in 2.0
15
50264
by: morleyc | last post by:
Hi, i would like to remove a number of characters from my string (\t \r \n which are throughout the string), i know regex can do this but i have no idea how. Any pointers much appreciated. Chris
6
2075
by: | last post by:
Hi all, Sorry for the lengthy post but as I learned I should post concise-and-complete code. So the code belows shows that the execution of ValidateAddress consumes a lot of time. In the test it is called a 100 times but in my real app it could be called 50000 or more times. So my question is if it is somehow possible to speed this up and if so how
0
9673
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10217
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10167
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10003
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
7544
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6784
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5440
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
1
4114
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3730
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.