Speeding up multiple regex matches

Talin

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

2) Use the first character of the text string to cut down the search
size. The problem here is that since I don't know the regex patterns in
advance, I would need to inspect each regex and determine the possible
set of starting characters that could be matched. This would require
creating my own regex parser.

3) Use a lexical scannner. This is probably overkill for what I want to
do.

4) Write my own regex system that does what I want. This is more work
than I want to do.

Any thoughts?

Nov 22 '05 #1

Subscribe Reply

4830

Alex Martelli

Talin <vi*****@gmail.com> wrote:
...

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:

x=re.compile('(aa)|(bb)')
mo=x.search('zaap!')
mo.groups()

('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.
Alex

Nov 22 '05 #2

Alex Martelli

Talin <vi*****@gmail.com> wrote:
...

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:

x=re.compile('(aa)|(bb)')
mo=x.search('zaap!')
mo.groups()

('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.
Alex

Nov 22 '05 #3

Talin

OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are marched
vs. this regex, and if a match succeeds, then the match object's
'lastindenx' property is used to look up the actions to perform in the
array.

Nov 22 '05 #4

Talin

Nov 22 '05 #5

Fredrik Lundh

"Talin" wrote:

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm

</F>

Nov 22 '05 #6

Fredrik Lundh

"Talin" wrote:

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

Nov 22 '05 #7

Similar topics

Javascript RegEx problem

by: Mr.Clean | last post by:

I am working on modifying a syntax highlighter written in javascript and it uses several regexes. I need to add a language to the avail highlighters and need the following regexes modified to...