Regex Speed

While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements. Due to the size of the files, we
decided to run these line by line, and due to the need of regex
expressions, we could not use more traditional string find methods.

We did pre-compile the regular expressions, and attempted tricks such
as map to remove as much overhead as possible.

With the known limitations of not being able to slurp the entire log
file into memory, and the need to use regular expressions, do you have
an ideas on how we might speed this up without resorting to system
calls (our current "solution") ?

Feb 20 '07 #1
On Feb 21, 8:29 am, garri...@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements.
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.
Due to the size of the files, we
decided to run these line by line,
I presume you mean you didn't read the whole file into memory;
correct? 2 million lines doesn't sound like much to me; what is the
average line length and what is the spec for the machine you are
running it on?
and due to the need of regex
expressions, we could not use more traditional string find methods.

We did pre-compile the regular expressions, and attempted tricks such
as map to remove as much overhead as possible.
map is a built-in function, not a trick. What "tricks"?
>
With the known limitations of not being able to slurp the entire log
file into memory, and the need to use regular expressions, do you have
an ideas on how we might speed this up without resorting to system
calls (our current "solution") ?
What system calls? Do you mean running grep as a subprocess?

To help you, we need either (a) basic information or (b) crystal
balls. Is it possible for you to copy & paste your code into a web
browser or e-mail/news client? Telling us which version of Python you
are running might be a good idea too.

Cheers,
John

Feb 20 '07 #2
ga******@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

The search we used was a regex of 6 elements "or"ed together, with an
exclusionary set of ~3 elements. Due to the size of the files, we
decided to run these line by line, and due to the need of regex
expressions, we could not use more traditional string find methods.
Just guessing (since I haven't tested this), switching from doing it line by
line to big chunks (whatever will fit in memory) at a time would help, but
I don't think you can get close to the speed of grep (eg

    while True:
        chunk = thefile.read(100000000)
        if not len(chunk): break
        for x in theRE.findall(chunk):
            .....
)
Function calls in python are expensive.
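One wrinkle with the chunked approach: a line (and hence a match) can straddle a
chunk boundary. A minimal sketch that carries the trailing partial line over into
the next chunk (the function name and chunk size are hypothetical, not from the
original code):

```python
def iter_matches(thefile, theRE, chunksize=1 << 20):
    """Yield pattern hits from a file read in large chunks, carrying any
    trailing partial line across chunk boundaries."""
    tail = ""
    while True:
        chunk = thefile.read(chunksize)
        if not chunk:
            break
        chunk = tail + chunk
        # hold back the (possibly incomplete) last line for the next round
        chunk, _, tail = chunk.rpartition("\n")
        for m in theRE.finditer(chunk):
            yield m.group()
    # whatever is left after EOF is a complete final line
    for m in theRE.finditer(tail):
        yield m.group()
```

The generator keeps memory bounded by one chunk plus one line, while still
letting the regex engine scan big strings instead of being called once per line.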

Feb 21 '07 #3
On Feb 20, 4:15 pm, "John Machin" <sjmac...@lexicon.net> wrote:
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.
My apologies. I don't have specifics right now, but it's something
along the line of this:

    error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
    exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

    for test_text in test_file:
        if error_list.match(test_text) and not exclusion_list.match(test_text):
            # Process test_text

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
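(One thing worth double-checking in a loop like this: re.match only fires when
the pattern begins at the start of the line, while grep reports a hit anywhere
in it, so the two can disagree on what counts as an error line. A sketch of the
same filter using search, with the token lists from above; the helper name is
hypothetical:)

```python
import re

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

def interesting(test_text):
    # search() scans the whole line, matching grep's behaviour;
    # match() would only fire when the pattern begins the line
    return bool(error_list.search(test_text)) and not exclusion_list.search(test_text)
```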
I presume you mean you didn't read the whole file into memory;
correct? 2 million lines doesn't sound like much to me; what is the
average line length and what is the spec for the machine you are
running it on?
You are correct. The individual files can be anywhere from a few bytes
to 2gig. The average is around one gig, and there are a number of
files to be iterated over (an average of 4). I do not know the machine
specs, though I can safely say it is a single core machine, sub
2.5ghz, with 2gigs of RAM running linux.
map is a built-in function, not a trick. What "tricks"?
I'm using the term "tricks" for places where I may be obfuscating the code in
an effort to make it run faster. In the case of map, getting rid of the
interpreted for loop overhead in favor of the implied C loop offered
by map.
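For instance, the per-line loop can be pushed into C with filter and the
compiled pattern's bound search method (a sketch along those lines, with
made-up sample data):

```python
import re

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
lines = ["all fine", "a math error", "nothing here", "invalid input"]

# filter() drives the loop in C: error_list.search is called once per line
# and lines where it returns None (no hit) are dropped
hits = list(filter(error_list.search, lines))
```

(In 2.x filter returns the list directly; the list() wrapper matters only on
Python 3, where filter is lazy.)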
What system calls? Do you mean running grep as a subprocess?
Yes. While this may not seem evil in and of itself, we are trying to
get our company to adopt Python into more widespread use. I'm guessing
the limiting factor isn't python, but us python newbies missing an
obvious way to speed up the process.
To help you, we need either (a) basic information or (b) crystal
balls. Is it possible for you to copy & paste your code into a web
browser or e-mail/news client? Telling us which version of Python you
are running might be a good idea too.
Can't copy and paste code (corp policy and all that), no crystal balls
for sale, though I hope the above information helps. Also, running a
trace on the program indicated that python was spending a lot of time
looping around lines, checking for each element of the expression in
sequence.

And python 2.5.2.

Thanks!
Feb 21 '07 #4
John Machin wrote:
[...]
>
To help you, we need either (a) basic information or (b) crystal
balls.
[...]

How on earth would having glass testicles help us help him?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Feb 21 '07 #5
Steve Holden wrote:
John Machin wrote:
[...]
>>
To help you, we need either (a) basic information or (b) crystal
balls.
[...]

How on earth would having glass testicles help us help him?
John, of course, meant spheres of doped single crystal silicon on which we
could simulate all the specific coding problems for which all possible
values of garrickp would have posted the message that he/she/it did, then
solve them by descending order of likelihood till garrickp emitted a "That
solves it, thanks!"
Feb 21 '07 #6
On Tue, 20 Feb 2007 21:40:40 -0300, <ga******@gmail.com> wrote:
My apologies. I don't have specifics right now, but it's something
along the line of this:

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
Can you relax that restriction? A regex is not always the best way,
especially if you also want speed:

>>> import timeit
>>> line = ("a sample line that will not match any condition, but long enough "
...         "to be meaninful in the context of this problem, or at least I "
...         "thik so. This has 174 characters, is it enough?")
>>> timeit.Timer('if error_list.search(line): pass',
...     'import re; error_list = re.compile(r"error|miss|issing|inval|nvalid|math"); '
...     'from __main__ import line').repeat(number=10000)
[1.7704239587925394, 1.7289717746328725, 1.7057590543605246]
>>> timeit.Timer('for token in tokens:\n\tif token in line: break\nelse: pass',
...     'from __main__ import line; '
...     'tokens = "error|miss|issing|inval|nvalid|math".split("|")').repeat(number=10000)
[1.0268617863829661, 1.050040144755787, 1.0677314944409151]
>>> timeit.Timer('if "error" in line or "miss" in line or "issing" in line '
...     'or "inval" in line or "nvalid" in line or "math" in line: pass',
...     'from __main__ import line').repeat(number=10000)
[0.97102286155842066, 0.98341158348013913, 0.9651561957857222]

The fastest way is hard-coding the tokens: if "error" in line or "miss" in
line or...
If that is not acceptable, iterate over a list of tokens: for token in
tokens: if token in line...
The regex is the slowest; a more carefully crafted regex is a bit faster,
but not enough:

>>> timeit.Timer('if error_list.search(line): pass',
...     'import re; error_list = re.compile(r"error|m(?:iss(?:ing)|ath)|inval(?:id)"); '
...     'from __main__ import line').repeat(number=10000)
[1.3974029108719606, 1.4247005067123837, 1.4071600141470526]
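If the configured patterns are sometimes plain strings and sometimes real
regexes, you can route each pattern to the cheaper test and keep most of the
speed (a sketch; the metacharacter check and the helper name are deliberately
simplistic assumptions, not from the thread):

```python
import re

METACHARS = set(".^$*+?{}[]\\|()")

def make_matcher(patterns):
    """Build a line predicate: plain strings use the fast `in` test,
    anything containing regex metacharacters falls back to re.search."""
    plain = [p for p in patterns if not set(p) & METACHARS]
    fancy = [re.compile(p) for p in patterns if set(p) & METACHARS]
    def matches(line):
        return any(t in line for t in plain) or any(r.search(line) for r in fancy)
    return matches

check = make_matcher(["error", "miss", r"fail(ed|ure)"])
```

Since the error lists are configurable anyway, this keeps the "accepts regular
expressions" requirement while paying the regex tax only where it is needed.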

--
Gabriel Genellina

Feb 21 '07 #7
On Feb 21, 11:40 am, garri...@gmail.com wrote:
On Feb 20, 4:15 pm, "John Machin" <sjmac...@lexicon.net> wrote:
What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.

My apologies. I don't have specifics right now, but it's something
along the line of this:

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

for test_text in test_file:
    if error_list.match(test_text) and not exclusion_list.match(test_text):
        # Process test_text

Yes, I know, these are not re expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
You could do a quick check on the list of patterns; if they are simple
strings (no ()+*?[] etc), then you could use Danny Yoo's ahocorasick
module (from http://hkn.eecs.berkeley.edu/~dyoo/p...ahocorasick/):
>>> import ahocorasick as ac
>>> gadget = ac.KeywordTree()
>>> for text in "error|miss|issing|inval|nvalid|math".split('|'):
...     gadget.add(text)
...
>>> gadget.make()
>>> def showall(machine, line):
...     startpos = 0
...     while 1:
...         result = machine.search(line, startpos)
...         if not result: return
...         beg, end = result
...         print beg, end, repr(line[beg:end])
...         startpos = beg + 1
...
>>> showall(gadget, 'the hissing misses invalidated erroneous mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
41 45 'math'
>>> showall(gadget, 'the hissing misses invalidated terror mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
32 37 'error'
38 42 'math'
>>> showall(gadget, 'these are not the droids that you are looking for')
But of course you would just use a simple search() per line -- I'm
just showing you that it works :-)
>
What system calls? Do you mean running grep as a subprocess?

Yes. While this may not seem evil in and of itself, we are trying to
get our company to adopt Python into more widespread use. I'm guessing
the limiting factor isn't python, but us python newbies missing an
obvious way to speed up the process.
You don't have to write everything in Python or any other single
language. One of Python's very good points is that you can connect it
to anything -- or if you can't, just ask in this newsgroup :-) Look,
boss, yesterday we got it running grep; today we're calling a module
written in C; not bad for a bunch of newbies, eh?
And python 2.5.2.
2.5.2? Who needs crystal balls when you've got a time machine? Or did
you mean 2.5? Or 1.5.2 -- say it ain't so, Joe!

Cheers,
John
Feb 21 '07 #8
ga******@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.
It's very hard to beat grep, depending on the nature of the regex you are
searching with. The regex engines in python/perl/php/ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html

This might not be your problem but if it is you can always popen grep.
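If delegating to grep stays the pragmatic answer, the subprocess module keeps
it tidy. A sketch assuming a POSIX grep with -E on the PATH (shown with the
modern subprocess.run; the 2.5-era equivalent is subprocess.Popen):

```python
import subprocess

def grep_lines(pattern, path):
    """Return the lines of `path` matching an extended regex,
    by delegating the scan to the system grep."""
    proc = subprocess.run(
        ["grep", "-E", pattern, path],
        capture_output=True, text=True)
    # grep exits 1 when nothing matched; that is not an error here
    if proc.returncode not in (0, 1):
        raise RuntimeError(proc.stderr.strip())
    return proc.stdout.splitlines()
```

Passing the arguments as a list (no shell=True) also sidesteps quoting problems
when the configurable patterns contain shell metacharacters.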

It would be nice if there were a Thompson NFA re module.

Feb 21 '07 #9
On Feb 21, 12:14 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
garri...@gmail.com wrote:
While creating a log parser for fairly large logs, we have run into an
issue where the time to process was relatively unacceptable (upwards
of 5 minutes for 1-2 million lines of logs). In contrast, using the
Linux tool grep would complete the same search in a matter of seconds.

It's very hard to beat grep, depending on the nature of the regex you are
searching with. The regex engines in python/perl/php/ruby have traded
the speed of grep/awk for the ability to do more complex searches.

http://swtch.com/~rsc/regexp/regexp1.html

This might not be your problem but if it is you can always popen grep.

It would be nice if there were a Thompson NFA re module.
Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html
(which Russ Cox (author of the paper you cited) seems not to have
read).

Cox uses a "pathological regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python, Perl and friends use backtracking and go off
the planet.
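The blow-up is easy to reproduce at small n, since the backtracking engine's
work roughly doubles with each extra a? (a sketch; keep n small unless you
enjoy waiting):

```python
import re

def pathological(n):
    # "a?" * n + "a" * n matched against "a" * n: a Thompson-NFA engine
    # like grep's runs in linear time; a backtracking engine explores on
    # the order of 2**n alternatives before finding the match
    pattern = "a?" * n + "a" * n
    return re.match(pattern, "a" * n) is not None

# instant at n=10 or so; try n around 25+ to watch CPython's re go off the planet
```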

The interesting thing is that in Navarro's NR-grep, that's not
pathological at all; it's a simple case of an "extended pattern" (? +
and * operators applied to a single character (or character class)) --
takes linear time with a much easier setup than an NFA/DFA and not
much code executed per byte scanned.

Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)

Cheers,
John

Feb 21 '07 #10
