Bytes | Developer Community

Regex Speed

While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements. Due to the size of the files, we
decided to run these line by line, and due to the need of regex
expressions, we could not use more traditional string find methods.

We did pre-compile the regular expressions, and attempted tricks such
as map to remove as much overhead as possible.

With the known limitations of not being able to slurp the entire log
file into memory, and the need to use regular expressions, do you have
an ideas on how we might speed this up without resorting to system
calls (our current "solution")?

Feb 20 '07 #1
On Feb 21, 8:29 am, garri...@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements.
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.
Due to the size of the files, we
decided to run these line by line,
I presume you mean you didn't read the whole file into memory;
correct? 2 million lines doesn't sound like much to me; what is the
average line length and what is the spec for the machine you are
running it on?
and due to the need of regex
expressions, we could not use more traditional string find methods.

We did pre-compile the regular expressions, and attempted tricks such
as map to remove as much overhead as possible.
map is a built-in function, not a trick. What "tricks"?
With the known limitations of not being able to slurp the entire log
file into memory, and the need to use regular expressions, do you have
an ideas on how we might speed this up without resorting to system
calls (our current "solution")?
What system calls? Do you mean running grep as a subprocess?

To help you, we need either (a) basic information or (b) crystal
balls. Is it possible for you to copy & paste your code into a web
browser or e-mail/news client? Telling us which version of Python you
are running might be a good idea too.

Cheers,
John

Feb 20 '07 #2
ga******@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements. Due to the size of the files, we
decided to run these line by line, and due to the need of regex
expressions, we could not use more traditional string find methods.
Just guessing (since I haven't tested this), switching from doing it line by
line to big chunks (whatever will fit in memory) at a time would help, but
I don't think you can get close to the speed of grep, e.g.:

    while True:
        chunk = thefile.read(100000000)
        if not len(chunk):
            break
        for x in theRE.findall(chunk):
            ...

Function calls in Python are expensive.
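If chunking is on the table, a middle ground (untested against the real logs; the helper name and chunk size are made up for illustration) is to read big blocks but hold back the trailing partial line, so line-oriented patterns always see complete lines:

```python
# A sketch of chunked reading that still respects line boundaries: keep
# the trailing partial line and prepend it to the next chunk.
import re

CHUNK_SIZE = 1 << 20  # 1 MiB; tune to the machine

def iter_matches(fileobj, pattern):
    """Yield every match of `pattern` found in `fileobj`."""
    regex = re.compile(pattern)
    leftover = ""
    while True:
        chunk = fileobj.read(CHUNK_SIZE)
        if not chunk:
            break
        # Hold back the last, possibly incomplete, line for the next read.
        whole, _, leftover = (leftover + chunk).rpartition("\n")
        for match in regex.finditer(whole):
            yield match.group()
    if leftover:
        for match in regex.finditer(leftover):
            yield match.group()
```

This keeps the per-line Python loop out of the hot path while avoiding matches being split across read boundaries.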

Feb 21 '07 #3
On Feb 20, 4:15 pm, "John Machin" <sjmac...@lexicon.net> wrote:
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.
My apologies. I don't have specifics right now, but it's something
along the line of this:

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

for test_text in test_file:
    if error_list.match(test_text) and not exclusion_list.match(test_text):
        # Process test_text

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
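One aside (an observation, not something raised in the thread): re.match only reports a hit at the very start of the string, while grep reports one anywhere in the line, so a grep-equivalent scan would normally use search():

```python
import re

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")

# The log line here is invented for illustration.
line = "2007-02-20 12:00:01 translator: invalid opcode"
# match() anchors at position 0, so the leading timestamp hides the hit...
assert error_list.match(line) is None
# ...while search(), like grep, scans the whole line.
assert error_list.search(line).group() == "inval"
```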
I presume you mean you didn't read the whole file into memory;
correct? 2 million lines doesn't sound like much to me; what is the
average line length and what is the spec for the machine you are
running it on?
You are correct. The individual files can be anywhere from a few bytes
to 2gig. The average is around one gig, and there are a number of
files to be iterated over (an average of 4). I do not know the machine
specs, though I can safely say it is a single core machine, sub
2.5ghz, with 2gigs of RAM running linux.
map is a built-in function, not a trick. What "tricks"?
I'm using the term "tricks" where I may be obfuscating the code in an
effort to make it run faster. In the case of map, getting rid of the
interpreted for loop overhead in favor of the implied c loop offered
by map.
What system calls? Do you mean running grep as a subprocess?
Yes. While this may not seem evil in and of itself, we are trying to
get our company to adopt Python into more widespread use. I'm guessing
the limiting factor isn't python, but us python newbies missing an
obvious way to speed up the process.
To help you, we need either (a) basic information or (b) crystal
balls. Is it possible for you to copy & paste your code into a web
browser or e-mail/news client? Telling us which version of Python you
are running might be a good idea too.
Can't copy and paste code (corp policy and all that), no crystal balls
for sale, though I hope the above information helps. Also, running a
trace on the program indicated that python was spending a lot of time
looping around lines, checking for each element of the expression in
sequence.

And python 2.5.2.

Thanks!
Feb 21 '07 #4
John Machin wrote:
[...]
To help you, we need either (a) basic information or (b) crystal
balls.
[...]

How on earth would having glass testicles help us help him?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Feb 21 '07 #5
Steve Holden wrote:
John Machin wrote:
[...]
To help you, we need either (a) basic information or (b) crystal
balls.
[...]

How on earth would having glass testicles help us help him?
John, of course, meant spheres of doped single-crystal silicon on which we
could simulate all the specific coding problems for which all possible
values of garrickp would have posted the message that he/she/it did, then
solve them in descending order of likelihood till garrickp emitted a "That
solves it, thanks!"
Feb 21 '07 #6
On Tue, 20 Feb 2007 21:40:40 -0300, <ga******@gmail.com> wrote:
My apologies. I don't have specifics right now, but it's something
along the line of this:

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
Can you relax that restriction? A regex is not always the best way,
especially if you also want speed:

>>> import timeit
>>> line = "a sample line that will not match any condition, but long enough to be meaningful in the context of this problem, or at least I think so. This has 174 characters, is it enough?"
>>> timeit.Timer('if error_list.search(line): pass',
...     'import re; error_list = re.compile(r"error|miss|issing|inval|nvalid|math"); from __main__ import line').repeat(number=10000)
[1.7704239587925394, 1.7289717746328725, 1.7057590543605246]
>>> timeit.Timer('for token in tokens:\n\tif token in line: break\nelse: pass',
...     'from __main__ import line; tokens = "error|miss|issing|inval|nvalid|math".split("|")').repeat(number=10000)
[1.0268617863829661, 1.050040144755787, 1.0677314944409151]
>>> timeit.Timer('if "error" in line or "miss" in line or "issing" in line or "inval" in line or "nvalid" in line or "math" in line: pass',
...     'from __main__ import line').repeat(number=10000)
[0.97102286155842066, 0.98341158348013913, 0.9651561957857222]

The fastest is hard-coding the tokens: if "error" in line or "miss" in
line or ...
If that is not acceptable, iterate over a list of tokens: for token in
tokens: if token in line ...
The regex is the slowest; a more carefully crafted regex is a bit faster,
but not enough:

>>> timeit.Timer('if error_list.search(line): pass',
...     'import re; error_list = re.compile(r"error|m(?:iss(?:ing)|ath)|inval(?:id)"); from __main__ import line').repeat(number=10000)
[1.3974029108719606, 1.4247005067123837, 1.4071600141470526]

--
Gabriel Genellina

Feb 21 '07 #7
On Feb 21, 11:40 am, garri...@gmail.com wrote:
On Feb 20, 4:15 pm, "John Machin" <sjmac...@lexicon.net> wrote:
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.

My apologies. I don't have specifics right now, but it's something
along the line of this:

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

for test_text in test_file:
    if error_list.match(test_text) and not exclusion_list.match(test_text):
        # Process test_text

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
You could do a quick check on the list of patterns; if they are simple
strings (no ()+*?[] etc), then you could use Danny Yoo's ahocorasick
module (from http://hkn.eecs.berkeley.edu/~dyoo/p...ahocorasick/):
>>> import ahocorasick as ac
>>> gadget = ac.KeywordTree()
>>> for text in "error|miss|issing|inval|nvalid|math".split('|'):
...     gadget.add(text)
...
>>> gadget.make()
>>> def showall(machine, line):
...     startpos = 0
...     while 1:
...         result = machine.search(line, startpos)
...         if not result: return
...         beg, end = result
...         print beg, end, repr(line[beg:end])
...         startpos = beg + 1
...
>>> showall(gadget, 'the hissing misses invalidated erroneous mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
41 45 'math'
>>> showall(gadget, 'the hissing misses invalidated terror mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
32 37 'error'
38 42 'math'
>>> showall(gadget, 'these are not the droids that you are looking for')
But of course you would just use a simple search() per line -- I'm
just showing you that it works :-)
What system calls? Do you mean running grep as a subprocess?

Yes. While this may not seem evil in and of itself, we are trying to
get our company to adopt Python into more widespread use. I'm guessing
the limiting factor isn't python, but us python newbies missing an
obvious way to speed up the process.
You can't (and don't have to) write everything in Python or any other
single language. One of Python's very good points is that you can
connect it to anything -- or if you can't, just ask in this
newsgroup :-) Look, boss, yesterday we got it running grep; today
we're calling a module written in C; not bad for a bunch of newbies,
eh?
And python 2.5.2.
2.5.2? Who needs crystal balls when you've got a time machine? Or did
you mean 2.5? Or 1.5.2 -- say it ain't so, Joe!

Cheers,
John
Feb 21 '07 #8
ga******@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.
It's very hard to beat grep, depending on the nature of the regex you are
searching with. The regex engines in Python/Perl/PHP/Ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html

This might not be your problem but if it is you can always popen grep.
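For what it's worth, a minimal sketch of that popen approach (the helper name and arguments are illustrative, and it assumes a grep binary supporting -E is on the PATH):

```python
# A sketch of piping out to grep from Python and consuming its output.
import subprocess

def grep_lines(pattern, path):
    """Yield the lines of `path` that match the extended regex `pattern`."""
    proc = subprocess.Popen(
        ["grep", "-E", pattern, path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()
```

The Python side then only pays for the (usually few) matching lines, while grep's DFA engine does the heavy scanning.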

It would be nice if there were a Thompson NFA re module.

Feb 21 '07 #9
On Feb 21, 12:14 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
garri...@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

It's very hard to beat grep, depending on the nature of the regex you are
searching with. The regex engines in Python/Perl/PHP/Ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html

This might not be your problem but if it is you can always popen grep.

It would be nice if there were a Thompson NFA re module.
Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html
(which Russ Cox (author of the paper you cited) seems not to have
read).

Cox uses a "pathological regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python, Perl and friends use backtracking and go off
the planet.

The interesting thing is that in Navarro's NR-grep, that's not
pathological at all; it's a simple case of an "extended pattern" (? +
and * operators applied to a single character (or character class)) --
takes linear time with a much easier setup than an NFA/DFA and not
much code executed per byte scanned.

Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)

Cheers,
John

Feb 21 '07 #10
John Machin wrote:
Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html
(which Russ Cox (author of the paper you cited) seems not to have
read).
NR-grep looks interesting, I'll read that. Thanks.
Cox uses a "pathological regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python, Perl and friends use backtracking and go off
the planet.

It might be pathological, but based on the original poster's timings his
situation seems to relate.
My main point was that it's quite possible he isn't going to get faster
than grep regardless of the language he uses, and if grep wins, use it. I
frequently do.
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)
One of these days. :)

Feb 21 '07 #11
On Feb 20, 6:14 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
It's very hard to beat grep, depending on the nature of the regex you are
searching with. The regex engines in Python/Perl/PHP/Ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html
Some darned good reading. And it explains what happened fairly well.
Thanks!
And python 2.5.2.

2.5.2? Who needs crystal balls when you've got a time machine? Or did
you mean 2.5? Or 1.5.2 -- say it ain't so, Joe!
2.5. I'm not entirely sure where I got that extra 2. I blame Monday.

In short... avoid using re as a sledgehammer against every problem. I
had a feeling that would be the case.

Feb 21 '07 #12
In article <11*********************@v33g2000cwv.googlegroups.com>,
"John Machin" <sj******@lexicon.net> wrote:
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.

The cheap way in terms of programmer time is to pipe out to grep or
awk on this one.
Feb 21 '07 #13
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.
Actually, there is a Python binding in the tre source, with an example
Python script, so it is already done.

Feb 22 '07 #14
On Feb 21, 10:34 am, garri...@gmail.com wrote:
On Feb 20, 6:14 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
http://swtch.com/~rsc/regexp/regexp1.html
Going back a bit on a tangent, the author of this citation states that
any regex can be expressed as a DFA machine. However, while
investigating this more I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct? Can you think of a deterministic method of computing
this expression? It would be easier with a NFA machine, but given that
the Python method of computing RE's involves pre-compiling a re
object, optimizing the matching engine would make the most sense to
me.

Here's what I have so far:

class State(object):
    def __init__(self):
        self.nextState = {}
        self.nextStateKeys = []
        self.prevState = None
        self.isMatchState = True

    def setNextState(self, chars, iNextState):
        self.nextState[chars] = iNextState
        self.nextStateKeys = self.nextState.keys()
        self.isMatchState = False

    def setPrevState(self, iPrevState):
        self.prevState = iPrevState

    def moveToNextState(self, testChar):
        if testChar in self.nextStateKeys:
            return self.nextState[testChar]
        else:
            return None

class CompiledRegex(object):
    def __init__(self, startState):
        self.startState = startState

    def match(self, matchStr):
        match_set = []
        currentStates = [self.startState]
        nextStates = [self.startState]
        for character in matchStr:
            for state in currentStates:
                nextState = state.moveToNextState(character)
                if nextState is not None:
                    nextStates.append(nextState)
                    if nextState.isMatchState:
                        print "Match!"
                        return
            currentStates = nextStates
            nextStates = [self.startState]
        print "No Match!"

def compile(regexStr):
    startState = State()
    currentState = startState
    backRefState = None
    lastChar = ""
    for character in regexStr:
        if character == "+":
            currentState.setNextState(lastChar, currentState)
        elif character == "|":
            currentState = startState
        elif character == "?":
            backRefState = currentState.prevState
        elif character == "(":
            # Implement "("
            pass
        elif character == ")":
            # Implement ")"
            pass
        elif character == "*":
            currentState = currentState.prevState
            currentState.setNextState(lastChar, currentState)
        else:
            testRepeatState = currentState.moveToNextState(character)
            if testRepeatState is None:
                newState = State()
                newState.setPrevState(currentState)
                currentState.setNextState(character, newState)
                if backRefState is not None:
                    backRefState.setNextState(character, newState)
                    backRefState = None
                currentState = newState
            else:
                currentState = testRepeatState
        lastChar = character
    return CompiledRegex(startState)

>>> a = compile("ab+c")
>>> a.match("abc")
Match!
>>> a.match("abbc")
Match!
>>> a.match("ac")
No Match!
>>> a = compile("ab+c|abd")
>>> a.match("abc")
Match!
>>> a.match("abbc")
Match!
>>> a.match("ac")
No Match!
>>> a.match("abd")
Match!
>>> a.match("abbd")
Match!
>>>
Feb 23 '07 #15
ga******@gmail.com wrote:
the author of this citation states that
any regex can be expressed as a DFA machine. However ...
I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct?
No. Any NFA can be converted to an equivalent DFA.
This is how scanner generators like Lex work -- they
first construct an NFA from the regex, and then
convert it to a DFA. Going directly from the regex
to a DFA, like you're trying to do, would be a lot
harder, and I'd be surprised if anyone ever does
it that way.
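A rough sketch of that NFA-to-DFA subset (powerset) construction, where each DFA state is a frozenset of NFA states (the tiny hand-built NFA here is my own, chosen to recognise the thread's ab+c|abd example; epsilon transitions are omitted for brevity):

```python
from collections import deque

# NFA: state -> {symbol: set of next states}; state 3 is accepting.
NFA = {
    0: {"a": {1}},
    1: {"b": {2, 4}},        # branch: the b+c path (2) or the "abd" path (4)
    2: {"b": {2}, "c": {3}},
    4: {"d": {3}},
}

def nfa_to_dfa(nfa, start, accepting):
    """Return (dfa, dfa_accepting); DFA states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        # Union the moves of every NFA state in this DFA state.
        moves = {}
        for state in current:
            for sym, targets in nfa.get(state, {}).items():
                moves.setdefault(sym, set()).update(targets)
        dfa[current] = dict((sym, frozenset(t)) for sym, t in moves.items())
        queue.extend(dfa[current].values())
    return dfa, set(s for s in dfa if s & accepting)

def accepts(dfa, dfa_accepting, start, text):
    """True if the DFA accepts the whole of `text`."""
    state = frozenset([start])
    for ch in text:
        if ch not in dfa.get(state, {}):
            return False
        state = dfa[state][ch]
    return state in dfa_accepting
```

Probing it agrees with the thread's analysis of ab+c|abd: "abc", "abbc" and "abd" are accepted, while "abbd" and "ac" are not.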

There's a description of the NFA-to-DFA algorithm
here:

http://www.gamedev.net/reference/art...rticle2170.asp

--
Greg
Feb 24 '07 #16
On Feb 24, 10:15 am, garri...@gmail.com wrote:
On Feb 21, 10:34 am, garri...@gmail.com wrote:
On Feb 20, 6:14 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
>http://swtch.com/~rsc/regexp/regexp1.html

Going back a bit on a tangent, the author of this citation states that
any regex can be expressed as a DFA machine. However, while
investigating this more I appear to have found one example of a regex
which breaks this assumption.

"ab+c|abd"

Am I correct?
No.
Can you think of a deterministic method of computing
this expression?
Firstly rewrite a bit:

ab+c|abd
a(b+c|bd)
a(bb*c|bd)
ab(b*c|d)
ab(b+c|c|d)

Here's a DFA that recognises that:
State 0:
    a -> 1
State 1:
    b -> 2
State 2:
    b -> 3    # start of the b+c branch
    c -> 4    # the c branch
    d -> 4    # the d branch
State 3:
    b -> 3
    c -> 4
State 4:
    accepting state
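Transcribed into code (a sketch; the dict layout is mine, not from the post), that table might look like:

```python
# John's state table for ab+c|abd as a dict-of-dicts DFA.
DFA = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"b": 3, "c": 4, "d": 4},
    3: {"b": 3, "c": 4},
}
ACCEPTING = 4

def dfa_match(text):
    """True if the DFA accepts some prefix of `text` (re.match-style)."""
    state = 0
    for ch in text:
        if state == ACCEPTING:
            return True
        state = DFA[state].get(ch)
        if state is None:
            return False
    return state == ACCEPTING
```

Note that dfa_match("abbd") is False, consistent with the table: state 3 has no transition on "d".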
It would be easier with a NFA machine,
What is "It"?
but given that
the Python method of computing RE's involves pre-compiling a re
object,
AFAIK all serious regex engines precompile a regex into an internal
representation which can be saved for reuse.
optimizing the matching engine
What does that mean?
would make the most sense to
me.
Compared to what alternatives?
Here's what I have so far:
[big snip]
>>> a = compile("ab+c|abd")
[snip]
>>> a.match("abbd")
Match!
Bzzzt. Neither "ab+c" nor "abd" match a prefix of "abbd".

HTH,
John

Feb 24 '07 #17
On Feb 24, 11:51 am, greg <g...@cosc.canterbury.ac.nz> wrote:
garri...@gmail.com wrote:
the author of this citation states that
any regex can be expressed as a DFA machine. However ...
I appear to have found one example of a regex
which breaks this assumption.
"ab+c|abd"
Am I correct?

No. Any NFA can be converted to an equivalent DFA.
Correct. However ...
This is how scanner generators like Lex work -- they
first construct an NFA from the regex, and then
convert it to a DFA. Going directly from the regex
to a DFA, like you're trying to do, would be a lot
harder, and I'd be surprised if anyone ever does
it that way.
From "Compilers: Principles, Techniques, and Tools" aka "the dragon
book" by Aho, Sethi and Ullman, 1986, page 134: "The first algorithm
is suitable for inclusion in a Lex compiler because it constructs a
DFA directly from a regular expression, without constructing an
intermediate NFA along the way."
There's a description of the NFA-to-DFA algorithm
here:

http://www.gamedev.net/reference/art...rticle2170.asp
which is on a really serious site (pop-up flashing whizz-bangs
inciting one to get an iPod now!) and which uses the (a|b)*abb example
from the dragon book (and the diagram of its Thompson-constructed NFA)
without any credit or mention of the book, in fact no references or
attributions at all.
Feb 24 '07 #18
