473,473 Members | 1,513 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

Speeding up multiple regex matches

I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.

2) Use the first character of the text string to cut down the search
size. The problem here is that since I don't know the regex patterns in
advance, I would need to inspect each regex and determine the possible
set of starting characters that could be matched. This would require
creating my own regex parser.

3) Use a lexical scannner. This is probably overkill for what I want to
do.

4) Write my own regex system that does what I want. This is more work
than I want to do.

Any thoughts?

Nov 22 '05 #1
6 4830
Talin <vi*****@gmail.com> wrote:
...
1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.


Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:
x=re.compile('(aa)|(bb)')
mo=x.search('zaap!')
mo.groups()

('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.
Alex
Nov 22 '05 #2
Talin <vi*****@gmail.com> wrote:
...
1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.


Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:
x=re.compile('(aa)|(bb)')
mo=x.search('zaap!')
mo.groups()

('aa', None)

There's a limit of 99 groups, so if you have unbounded number of regexes
to start with you'll have to split them up 99-or-fewer at a time, but
that shouldn't be impossibly hard.
Alex
Nov 22 '05 #3
OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are marched
vs. this regex, and if a match succeeds, then the match object's
'lastindenx' property is used to look up the actions to perform in the
array.

Nov 22 '05 #4
OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( "\s+" )
scanner.rule( "/\*", reflex.set_state( State.Comment ) )
scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( "\*/", reflex.set_state( State.Default ) )
scanner.rule( "(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are marched
vs. this regex, and if a match succeeds, then the match object's
'lastindenx' property is used to look up the actions to perform in the
array.

Nov 22 '05 #5
"Talin" wrote:
I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.


use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm

</F>

Nov 22 '05 #6
"Talin" wrote:
I've run in to this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex, and stop
when one succeeds. However, I am finding this to be too slow for my
application -- currently 30% of the run time is being taken up in the
regex matching.

I thought of a couple of approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex, and
let the regex state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regex was the
matching one.


use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm

</F>

Nov 22 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: Mr.Clean | last post by:
I am working on modifying a syntax highlighter written in javascript and it uses several regexes. I need to add a language to the avail highlighters and need the following regexes modified to...
7
by: bill tie | last post by:
I'd appreciate it if you could advise. 1. How do I replace "\" (backslash) with anything? 2. Suppose I want to replace (a) every occurrence of characters "a", "b", "c", "d" with "x", (b)...
1
by: DKode | last post by:
I have the following regex that I am using: Match match = Regex.Match(global, "%(?<column>.+?)%"); my input string (global) is this: "something\%username%\somethingelse\%match2%\some\%match3%...
10
by: Chance Hopkins | last post by:
I'm trying to match a set of matches after some initial text: mytext: "something" "somethingelse" "another thing" "maybe another" (?:mytext: )(?<mymatch>{1,1}+{1,1}+)+ I only get the last one...
2
by: Ed Brown | last post by:
I'm working on a VB.Net application that needs to do quite a bit of string pattern matching, and am having problems using the "LIKE" operator to match the same string twice in the pattern. For...
0
by: Talin | last post by:
I've run in to this problem a couple of times. Say I have a piece of text that I want to test against a large number of regular expressions, where a different action is taken based on which regex...
5
by: Chris | last post by:
How Do I use the following auto-generated code from The Regulator? '------------------------------------------------------------------------------ ' <autogenerated> ' This code was generated...
1
by: al.moorthi | last post by:
the below program is working in Suse and not working on Cent 5: can any body have the solution ? #include <regex.h> #include <stdlib.h> #include <stdio.h> int main(){ char cool =...
5
by: mikko.n | last post by:
I have recently been experimenting with GNU C library regular expression functions and noticed a problem with pattern matching. It seems to recognize only the first match but ignoring the rest of...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
1
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
0
muto222
php
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.