Seeking regex optimizer

Kay Schluehr

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
....,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay

Jun 18 '06 #1

Subscribe Post Reply

3204

Paddy

Kay Schluehr wrote:

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay

A start would be:
regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.

You say you want something that is optimised. What have have you tried?
- Pad.

Jun 18 '06 #2

Kay Schluehr

Paddy wrote:

Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay
A start would be:
regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.

For special characters there might be a workaround using escapes. This
is indeed important, but I do think one should split the problem into
separate parts.
You say you want something that is optimised. What have have you tried?

Sorting the list and checking for successor inclusions. Say you have ls
= ['x','a', 'aa', 'aab' ,'ab']

This can be mapped to:

'x|a(?:(?:ab)?|b?|a?)'

or to:

'^(x|ab|aab|aa|a)'

with reverse sorting as in your proposal.The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay

Jun 18 '06 #3

John Machin

On 19/06/2006 6:30 AM, Paddy wrote:

Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True.
Kay, what is the relevance of one string being a suffix of another? I
don't see how that could affect the outcome.

An extreme case would be ls = ['a', 'aa', ...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?
Optimised with respect to what? speed? ease of construction?

I presume that you will ensure that the members of the list are unique.

Note that the Python regex engine will consider each candidate in
Paddy's solution left to right until it gets a match or reaches the end
(that's why the reverse sort is needed to get longest possible match).
This is worst-case O(N) where N is the total of the lengths of all the
strings in your list.

As far as I can see, this is the only basic solution (modulo one or two
twiddles -- see below) using Python's re.

You could possibly consider producing "zzz|foo(?:bar)?|aaa" instead of
"zzz|foobar|foo|aaa" -- but whether that would run sufficiently faster
to offset the cost of construction is anybody's guess.

How many strings in your list? Average/maximum length? Likelihood of
ls[i].startswith(ls[j]) == True? unicode or str?

Your requirements are rather constrained: "sx.match(s) yields a
SRE_Match object" ... why do you need this? Surely all you need is
matched_len (which may be zero) such that s[:matched_len] is the matched
prefix.

I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst case
O(n) where n is the length of the longest string in your list. There may
well be a Python-callable gadget for this on the net somewhere. Google
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but
more complex problem.

A cheap hack using Python's re: divide the problem according to first
character:

prefix_dict_match = {
'a': re.compile('alpaca|alligator').match,
'f': re.compile('foobar|foo').match,
'z': re.compile('zoo|zebra').match,
}
if s and s[0] in prefix_dict_match:
match_obj = prefix_dict_match[s[0]](s)
else:
match_obj = None

Regards,
Kay

A start would be:
regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
But the above does not work if you have special characters in your
strings.

Paddy, fixing that problem, and "optimising" by removing the redundant
^() metacharacters:

regexp = "|".join(map(re.escape, sorted(ls, reverse=True)))
Hoping some of this helps,
Regards,
John

Jun 18 '06 #4

Mirco Wahab

Thus spoke Kay Schluehr (on 2006-06-18 19:07):

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.

With some ideas from Kay and Paddy, it tried to get
along with Python in doing this.

If its allowed to spread the individual strings
into alterations, the following would imho do:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
text = r'this is a text containing aaaöüöaaaµaaa and more';
lstr = [
'a',
'aa',
'aaaaa',
'aaaöüöaaaµaaa',
'aaaaaaaaaaaaaaab'
]

import re
pat = re.compile( \
'(' + \
'|'.join(sorted(lstr,lambda a,b: len(b)-len(a))) + \
')', re.VERBOSE);

hits = sorted( pat.findall(text), lambda a,b: len(b)-len(a) )
print 'Longest: ', hits[0]

This will print 'aaaöüöaaaµaaa' from the text
and won't complain about specuial characters used.

in Perl, you could build up the regex
by dynamic evaluation (??{..}), but I
didn't manage to get this working in Python,
so here is (in Perl) how I thougt it would work:

my $text = "this is a text containing aaaöüöaaaµaaa and more";
my @lstr = (
'a',
'aa',
'aaaaa',
'aaaöüöaaaµaaa',
'aaaaaaaaaaaaaaab',
);

my $re = qr{
(??{
join '|',
map { quotemeta }
sort{ length $b <=> length $a }
@lstr
})
}x;

$_ = $text;
print "Longest: ", (my $hit) = reverse sort /$re/g;
Maybe the experts can bring some light to it.

Regards

Mirco

Jun 18 '06 #5

Paddy

Kay Schluehr wrote:
<SNIP>

with reverse sorting as in your proposal.The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay

Hi Kay,
The only way to know if something is optimised enough, is for you to
test it in your application. If you haven't finished your application,
then don't sweat it. take a note of your options, stick one in, then
get the application finished.
Its only when its finished that you can really say if further
optimisations are necessary, and you would have to profile the complete
prog to see where it's spending its time.

I usually find myself using string methods where possible, then regular
expressions only for things that string methods cannot do. When I
finish my script I usually find that Pythons speed is adequate for most
of my text processing tasks, or if speed IS an issue, then i needed to
profile the completed application anyway.

- Pad.

Jun 18 '06 #6

Kay Schluehr

Mirco,

with "special characters" I mentioned control characters of regular
expressions i.e. one of ".^$()?[]{}\|+*" but not non ascii-127
characters.

For a workaround you simply have to "mangle" those using an escape
control character:

REGEXCHAR = ".^$()?[]{}\\|+*"
def mangle(s):
pattern = []
for c in s:
if c in REGEXCHAR:
pattern.append("\\")
pattern.append(c)
return "".join(pattern)

Regards,
Kay

Jun 19 '06 #7

John Machin

On 19/06/2006 7:06 PM, Kay Schluehr wrote:

Mirco,

with "special characters" I mentioned control characters of regular
expressions i.e. one of ".^$()?[]{}\|+*" but not non ascii-127
characters.

For a workaround you simply have to "mangle" those using an escape
control character:

REGEXCHAR = ".^$()?[]{}\\|+*"
def mangle(s):
pattern = []
for c in s:
if c in REGEXCHAR:
pattern.append("\\")
pattern.append(c)
return "".join(pattern)

What's wrong with re.escape()?
Have you not read (a) my response to Paddy's first posting (b) the manual?

Jun 19 '06 #8

gry

Kay Schluehr wrote:

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n]. There might
be relations between those strings: s_k.startswith(s_1) -> True or
s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
...,'aaaa...ab']. For this reason SRE_Match should provide the longest
possible match.
In a very similar case I used a simple tree of dictionaries, one node
per letter, to represent the strings.
This naturally collapses cases like ['a','aa','aaa']. Then a recursive
function finds
the desired prefix. This was WAY faster than the "re" module for large
n (tradeoff point for me was n~1000). It requires a bit more coding,
but I think it is the natural data structure for this problem.

As others have suggested, you should first try the most naive
implementation before making a hard optimization problem out of this.
Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay

Jun 19 '06 #9

andrewdalke

Kay Schluehr wrote:

I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n].

Why do you want to use a regex for this? When you have constant
strings there are other solutions. For example, this uses Nicolas
Lehuen's
pytst module from http://nicolas.lehuen.com/index.php/Pytst

import tst
tree = tst.TST()
tree["aa"] = (1, "String with 'aa'")
tree["aahhh"] = (2, "At the doctor's")
tree["a"] = (3, "indefinite article")
tree.scan("This is a bunch of text. Have you seen an aardvark?", .... tst.TupleListAction())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')), (' bunch
of text. H', -18, None), ('a', 1, (3, 'indefinite article')), ('ve you
seen ', -12, None), ('a', 1, (3, 'indefinite article')), ('n ', -2,
None), ('aa', 2, (1, "String with 'aa'")), ('rdv', -3, None), ('a', 1,
(3, 'indefinite article')), ('rk?', -3, None)]

Each returned tuple has three elements:
For a substring which is not a match these are:
- the unmatched substring
- the negative length of the substring
- None

For a substring which matches:
- the matched substring
- the positive length of the substring
- the value of the match in the TST (the tree)

It uses Aho-Corasick for the implementation which is fast and does what
you expect it to do. Nor does it have a problem of matching more than
99 possible strings as the regexp approach may have.

Andrew
da***@dalkescientific.com

Jun 19 '06 #10

Mirco Wahab

Thus spoke an*********@gmail.com (on 2006-06-19 22:51):

It uses Aho-Corasick for the implementation which is fast and does what
you expect it to do. Nor does it have a problem of matching more than
99 possible strings as the regexp approach may have.

If you pull the strings into (?>( ... )) (atomic groups),
this would't happen.

http://www.regular-expressions.info/atomic.html
...
Everything between (?>) is treated as one single token
by the regex engine, once the regex engine leaves the
group.
Because the entire group is one token, no backtracking
can take place once the regex engine has found a match
for the group.
...

Maybe Py.2.5 will have them?

Regards

Mrico

Jun 19 '06 #11

andrewdalke

Replying to me Mirco Wahab wrote:

If you pull the strings into (?>( ... )) (atomic groups),
this would't happen.

Given that Python's re engine doesn't support this feature
it doesn't really help the original poster's problem.

Even if some future Python did support it, the limit
to 100 named groups is unaffected by backtracking.

import re
re.compile("(.)"*100) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 180, in compile
return _compile(pattern, flags)
File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 225, in _compile
p = sre_compile.compile(pattern, flags)
File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre_compile.py",
line 506, in compile
raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups

There was no backtracking in "(.)"*100.

Andrew
da***@dalkescientific.com

Jun 19 '06 #12

Kay Schluehr

an*********@gmail.com wrote:

Kay Schluehr wrote:
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a SRE_Match
object when s starts with an s_i for one i in [0,...,n].

Why do you want to use a regex for this?

Because it is part of a tokenizer that already uses regexps and I do
not intend to rewrite / replace it. Certain groups of token (
operators, braces and special characters ) should be user extensible.
All others will stay as they are. I found that certain groups of token
might be represented in a more compact mannerr: for matching ['+',
'++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
there is some generic approach to solve the "inverse regexp" problem in
a non-trivial fashion.

Regards,
Kay

Jun 20 '06 #13

andrewdalke

Kay Schluehr replied to my question:

Why do you want to use a regex for this?
Because it is part of a tokenizer that already uses regexps and I do
not intend to rewrite / replace it.

Switching to pytst is not a big change - there will be little
impact on the rest of your code. On the other hand, it does
change your build and deployment process because of the
need for another package.

Because you are constraining yourself to regex there isn't
much of an improvement you can do.
I found that certain groups of token
might be represented in a more compact mannerr: for matching ['+',
'++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
there is some generic approach to solve the "inverse regexp" problem in
a non-trivial fashion.
There are a few questions which come up. Why do you want to
have a "more compact" representation? Do you think it will
make the result faster? Likely it will because of the reduced
need for backtracking but unless you have a lot of strings with
the same prefix then the number of backtracks should go as
something like the log of the number of words, which scales
pretty well.

Are you speed limited by the existing implementation? Likely
not. In which case the performance isn't an issue. Consider
this anecdote from http://www.technicat.com/writing/programming.html

I was asked in a phone interview with Google how I would
search for a number in an array of ordered numbers.
Obviously, the questioner was asking for a CS 101 answer -
binary search. But in real life, I would probably do the "wrong"
thing - search the array from beginning to end. There's no point
in taking twice the time to write twice as much code that has
to be maintained and debugged if the application performance
is good enough, and in particular if that piece of code is not
the bottleneck in the application. (And I seriously doubt you'd
have that data laid out linearly in a fixed array like that if it
was
the bottleneck)

As it turns out, the regexp implementation could do the optimization
you want. That is, it could recognize that the input sequence is
a set of fixed words (no special characters) and implement something
like Aho-Corasick, in which case you wouldn't need to change
your input.

I don't know if Python's regexp implementation does this already.
Neither do you. Test the functionality and performance first
before going starting other work.

I am not the first to give you this advice:

John Machin: Optimised with respect to what? speed? ease of construction?
gr*@ll.mit.edu: As others have suggested, you should first try the most naive
implementation before making a hard optimization problem out of this.

As for the general question of the existance of a "generic approach
to solve the "inverse regexp" problem in a non-trivial fashion."

Yes, there is. Kinda of. All (formal/theoretical) regular
expressions
can be written as a regular expression without need for backtracking.
If you look up the theory of regular expressions, backtracking occurs
when there are ambiguities in traversing the state machine. These
are called nondeterministic finite state machines, or NFAs. It turns
out that every NFA can be written as a DFA, which has no need for
backtracking.

Except. Actually, two excepts. The normal algorithm for converting
an NFA to a DFA doesn't support grouping, which you'll need to
figure out which group matched. Wikipedia says there is an alternate,
slower algorithm which does support groups.

The second except is that regexps in Python, Perl, etc. are not
regular expressions. Some patterns, like r"([A-Z][a-z]+)\1", which
matches "WikiWiki", are not "regular grammars" under the definition
used in language theory. It's possible to use regexps to match
strings of prime length, for example.

So in the general case there is no solution. There are many
possible optimizations, but no general solution.

For your problem, which is a limited subset of the problem as
it involves only constant strings, then the solution is a prefix
tree, also called a trie. See http://en.wikipedia.org/wiki/Prefix_tree

John Machin in an earlier post suggested you search along these
lines when he said I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst case
O(n) where n is the length of the longest string in your list. There may
well be a Python-callable gadget for this on the net somewhere. Google
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but
more complex problem.

For that matter I suggested it when I pointed you to pytst. The tst
means "ternary search tree" (some write it "ternary search trie") which
is a different implementation of the same idea.

Your posts read like you ignored those pointers because they didn't
fall into the category "implemented with a regular expression". If
that's the case then no, there is no solution. You're letting a sense
of purity overwhelm practicality.

If you are really interested in this problem, try taking a CS
course titled something like "Automata Theory" or "Foundations
of Computer Science". The one which covers DFAs, NFAs,
CFGs, Turing machines, etc. It's the CS class I enjoyed the most.

Andrew
da***@dalkescientific.com

Jun 20 '06 #14

Mirco Wahab

Thus spoke an*********@gmail.com (on 2006-06-20 01:39):

Hi, are you the A.Dalke from the Schulten group (VMD) as
listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi

Replying to me Mirco Wahab wrote:
If you pull the strings into (?>( ... )) (atomic groups),
this would't happen.
Given that Python's re engine doesn't support this feature
it doesn't really help the original poster's problem.

Yeah, right. I misunderstood the problem
- where it was meant to be 'literally'
a problem of capture group id's. I took
it metaphorical, which meant to me
'problems by backtracking' (which was wrong).
Even if some future Python did support it, the limit
to 100 named groups is unaffected by backtracking.

Right for that one, what I had in mind was something
like the following:

# naive regex '\d+9'
# find some number only if it ends by 9
my $str="1009900000000000000000000000000000000000";

print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;

Second line would not backtrack and not find the 9 behind \d+
because atomic grouping prevents that, so line 1 matches
10099 and line 2 nothing (but very fast, which is the point).

import re
re.compile("(.)"*100) Traceback (most recent call last):
...
...
AssertionError: sorry, but this version only supports 100 named groups

I tried:

my $re = '(\d+)\D' x 10000;
my $text = join ' ', 0..10000;

print $+ if $text =~ /$re/;

print "\n", $1001;

which prints:
9999
1000

Why is there a named group count
restriction in Python? Any reasons
one should know?

Regards & thanks

Mirco

Jun 20 '06 #15

andrewdalke

Mirco Wahab wrote:

Hi, are you the A.Dalke from the Schulten group (VMD) as
listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi
Yes. But I left there nearly a decade ago.
# naive regex '\d+9'
# find some number only if it ends by 9
my $str="1009900000000000000000000000000000000000";

print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;
What makes me less than enthusiastic about various new regexp
features is that they can be written other ways. For example,

import re
pat = re.compile("([0-8]*9)+")
pat.match("100990000000000000000000000000000000000 0") <_sre.SRE_Match object at 0x590a0> _.group(0) '10099' pat = re.compile("([0-8]*9)+(?!\d)")
pat.match("100990000000000000000000000000000000000 0")

They are different ways to reach the same goal. Understanding
the alternatives requires some non-trivial understanding about
how regexp engines work. The diversity of solutions with no
clear preference of one over the other overly promotes
idiosyncratic patterns and requires that everyone reading the
regexps must understand all the special syntax contructions
even when a person's working vocabulary only uses a subset
of those.
Second line would not backtrack and not find the 9 behind \d+
because atomic grouping prevents that, so line 1 matches
10099 and line 2 nothing (but very fast, which is the point).
While my pattern is not as fast. It's linearly slower than yours
as it does the backtracking. (Unless the engine is very smart.)
That factor is not something I'm concerned with.

I can also get performance without backtracking (though there
is setup for backtracking) using
pat = re.compile("([0-8]*9)+([0-8]*)")
m = pat.match("100990000000000000000000000000000000000 0")
if m.group(2): print "does not end in a 9" ....
does not end in a 9
m = pat.match("100990000000000000000000000000000000000 09a")
if m.group(2): print "does not end in a 9" .... Why is there a named group count
restriction in Python? Any reasons
one should know?

While not definite, I think it introduces an ambiguity with octal
escape codes. (*That's* something to omit in Python 3000!)
Here's the quote from perlre

There is no limit to the number of captured substrings that you
may
use. However Perl also uses \10, \11, etc. as aliases for \010,
\011,
etc. (Recall that 0 means octal, so \011 is the character at
number 9
in your coded character set; which would be the 10th character,
a hori-
zontal tab under ASCII.) Perl resolves this ambiguity by
interpreting
\10 as a backreference only if at least 10 left parentheses have
opened
before it. Likewise \11 is a backreference only if at least 11
left
parentheses have opened before it. And so on. \1 through \9
are
always interpreted as backreferences.

Python doesn't have "\10", "\11", etc. as aliases for "\010", "\011",
etc.
re.compile("\10") <_sre.SRE_Pattern object at 0x575c0> pat = re.compile(r"\10") Traceback (most recent call last):
File "<stdin>", line 1, in ?
File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 180, in compile
return _compile(pattern, flags)
File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 227, in _compile
raise error, v # invalid expression
sre_constants.error: bogus escape: '\\10' pat = re.compile(r"\010")

so there is no ambiguity until reaching \100.
pat = re.compile(r"\100")
pat.match("@") <_sre.SRE_Match object at 0x232560>

Most likely /F (who implemented the regex engine) decided
to refuse the temptation to guess the meaning, eg by
parens counting as Perl does.

Try this variation of your program to see how Perl changes
the meaning of "\100" based on the input string

for (my $i=98; $i<=103; $i++) {
my $re = '(\d+)\D' x $i . '\100';
my $text = join ' ', 0..$i;
$text .= " @";

print "Testing $i\n";
if ($text =~ /$re/) {
print " Number of matches: $+\n";
print " Last group: ${$i}\n";
} else {
print "No matches\n";
}
}

I get

Testing 98
Number of matches: 98
Last group: 98
Testing 99
Number of matches: 99
Last group: 99
Testing 100
No matches
Testing 101
No matches
Testing 102
No matches
Testing 103
No matches
Andrew
da***@dalkescientific.com

Jun 20 '06 #16

Seeking regex optimizer

Similar topics