470,811 Members | 1,220 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 470,811 developers. It's quick & easy.

Does my RE rock, or suck?


Here's my algorithmic problem:

Given a set of 'trigger' strings, and one 'path' string,
find the longest of the triggers that is a prefix of the
path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers. Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil. This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression that should re.find()
the longest matching trigger. For example, given:

triggers = ["Alice", "Bob" "John", "John Smith",
"John Smith Jr", "John Smith Sr"]

My re-generator returns:

re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(', but
no need to worry about that now. There are never *'s or +'s or
other loop-like things in the generated RE's. The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right." The order
should ensure finding the longest matching trigger. In (a
couple minutes of testing, it seems to work.
I've been burned using Python's RE module in the past. Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense. In particular:

Is this likely to be fast in normal cases? Is there a
clearly better approach?

What would be the worst-case time?

Am I likely to exceed recursion depth?

Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?
Thanks for any help answering these.

--
--Bryan
Jul 18 '05 #1
8 1471
Bryan Olson <fa*********@nowhere.org> writes:
Here's my algorithmic problem:

Given a set of 'trigger' strings, and one 'path' string,
find the longest of the triggers that is a prefix of the
path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers. Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil. This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression that should re.find()
the longest matching trigger. For example, given:

triggers = ["Alice", "Bob" "John", "John Smith",
"John Smith Jr", "John Smith Sr"]

My re-generator returns:

re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(', but
no need to worry about that now. There are never *'s or +'s or
other loop-like things in the generated RE's. The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right." The order
should ensure finding the longest matching trigger. In (a
couple minutes of testing, it seems to work.
I've been burned using Python's RE module in the past. Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense. In particular:

Is this likely to be fast in normal cases? Is there a
clearly better approach?
Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
"Dozens to hundreds" sounds like a fairly small number in context,
but: time it! Only you know what fast enough is.

Here's a cute approach that should have running time independent of
the number of triggers:

def build_dict(prefix_list):
d = {}
for p in prefix_list:
if len(p) > 0:
d.setdefault(p[0], []).append(p[1:])
r = {}
for p in d:
r[p] = build_dict(d[p])
if '' in d[p]:
r[p][''] = 1
return r

def longest_prefix(s, d):
i = 0
j = 0
for c in s:
if c in d:
d = d[c]
i += 1
if '' in d:
j = i
else:
return s[:j]
else:
return s[:j]

if __name__ == '__main__':
d = build_dict(["Alice", "Bob", "John", "John Smith",
"John Smith Jr", "John Smith Sr"])
print longest_prefix("John Smith", d)
print longest_prefix("John Smith", d)

Here's a less cute but probably saner approach:

def prepare(prefix_list):
d = {}
for p in prefix_list:
d.setdefault(len(p), {})[p] = p
l = d.keys()
l.sort()
l.reverse()
return d, l

def longest_prefix2(s, d, l):
for i in l:
r = d[i].get(s[:i])
if r is not None:
return r
return ''

if __name__ == '__main__':
d, l = prepare(["Alice", "Bob", "John", "John Smith",
"John Smith Jr", "John Smith Sr"])
print longest_prefix2("John Smith", d, l)
print longest_prefix2("John Smith", d, l)

this should run in time proportional to the number of different
lengths of trigger (and use rather less memory than my first version).
What would be the worst-case time?

Am I likely to exceed recursion depth?
Not in 2.4 :-) And I doubt it before that, actually.
Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?
Thanks for any help answering these.


This thread from a year or so back might interest:

http://groups.google.com/gr*********...ing.google.com

Cheers,
mwh

--
: Giant screaming pieces of excrement, they are.
I have a feeling that some of the people in here have a
MUCH more exciting time relieving themselves than I do.
-- Mike Sphar & Dave Brown, asr
Jul 18 '05 #2

Thanks Michael. I still have more investigating to do.

Michael Hudson wrote:
Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
I guess I didn't point this out except for my example: I tried
to make the RE efficient even for a search-and-backtrack engine.
Not only are there no loop-like things, there are never multiple
choices that begin with the same character. Any time it
backtracks, it will fail immediately at every choice point.

In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)

"Dozens to hundreds" sounds like a fairly small number in context,
I agree, though hundreds would make the longest RE string I've
ever used.
but: time it! Only you know what fast enough is.
It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.

Here's a cute approach that should have running time independent
of the number of triggers: [snip]

I've seen that kind of structure called a "trie" (Knuth cites
the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
my RE, I sort of use the same trie structure. My (trivial)
tests so far indicate that the RE form is faster in typical
cases. Your trie-searcher is nice in that I have much more
confidence in the worst-case run-time.
Here's a less cute but probably saner approach: [...] this should run in time proportional to the number of different
lengths of trigger (and use rather less memory than my first version).
I'm not convinced. The time to hash a string is proportional to
the length, so I think searches can take time proportional to
the sum of all the trigger lengths.
[I, Bryan asked:]
Am I likely to exceed recursion depth?


Not in 2.4 :-) And I doubt it before that, actually.


But Python is only up to ... oh, the smiley. I get it.

This thread from a year or so back might interest:

http://groups.google.com/gr*********...ing.google.com

Thanks,

--
--Bryan
Jul 18 '05 #3
Bryan Olson <fa*********@nowhere.org> writes:
Thanks Michael. I still have more investigating to do.

Michael Hudson wrote:
> Well, the re module is an NFA engine, not a DFA engine, so I'd expect
> this re to run in time proportional to the number of triggers.
I guess I didn't point this out except for my example: I tried
to make the RE efficient even for a search-and-backtrack engine.
Not only are there no loop-like things, there are never multiple
choices that begin with the same character. Any time it
backtracks, it will fail immediately at every choice point.


Hmm, you sound like you know more about this sort of thing than I do
:-)
In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)
Well, I don't know much about the re compiler. It's written in
Python, though, so it shouldn't be *that* hard to check for
yourself...
> "Dozens to hundreds" sounds like a fairly small number in context,


I agree, though hundreds would make the longest RE string I've
ever used.
> but: time it! Only you know what fast enough is.


It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.
> Here's a cute approach that should have running time independent
> of the number of triggers:

[snip]

I've seen that kind of structure called a "trie" (Knuth cites
the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
my RE, I sort of use the same trie structure.


Heh, I'd sort of heard of tries but didn't realize I was implementing
one. I just thought it was a simple minded DFA engine...
My (trivial) tests so far indicate that the RE form is faster in
typical cases. Your trie-searcher is nice in that I have much more
confidence in the worst-case run-time.
Well, the re engine is written in C, so you would hope that it beats
my Python engine...

The reason I original wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.
> Here's a less cute but probably saner approach:

[...]
> this should run in time proportional to the number of different
> lengths of trigger (and use rather less memory than my first version).


I'm not convinced. The time to hash a string is proportional to
the length, so I think searches can take time proportional to
the sum of all the trigger lengths.


I guess, but the string hashing is done in C and is "probably"
insignificant.
[I, Bryan asked:]
>> Am I likely to exceed recursion depth?

>
> Not in 2.4 :-) And I doubt it before that, actually.


But Python is only up to ... oh, the smiley. I get it.


Well, my point was there is no recursion limit in Python 2.4's sre
engine. But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.

Cheers,
mwh

--
Need to Know is usually an interesting UK digest of things that
happened last week or might happen next week. [...] This week,
nothing happened, and we don't care.
-- NTK Now, 2000-12-29, http://www.ntk.net/
Jul 18 '05 #4

I think you need some binary search or some form of tree here.

Make a list of prefixes, sort them, store this list somewhere it's quick
to access.
It is then very quick (N log N) to look in that list which key
corresponds to the beginning of a path.
Basically you do a bisection search and, of course you don't have an
exact match but the last key you end up on is your prefix.

I believe there is a btree data type in the Python library.


Jul 18 '05 #5
Hello Bryan,
Is this likely to be fast in normal cases? Is there a
clearly better approach?
Yes, it should be fast, since no repetitive backtracking will happen.
What would be the worst-case time?
Given the context (person names), your worst case will probably be
failing every name.
Am I likely to exceed recursion depth?
You won't, unless you find a name with thousands of words, and
variants for every word. With Python 2.4, you'll only get in
trouble if you exceed your system's memory.
Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?


It looks like a sane use of regular expressions.

--
Gustavo Niemeyer
http://niemeyer.net
Jul 18 '05 #6
[...]
In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
This won't happen right now. Minimum and maximum lenght is currently
only checked for the whole pattern, not for every branch.
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)


OTOH, there's some intelligence for this specific case. The first
character is checked even before "entering" the branch (IOW,
preparing the enviornment for checking the branch), so your
pattern should match really fast.

It will be something close to:

(1) Entering a branch, save current state as necessary.
(2) Check first character, does it match?
(3) If yes, get into the branch, saving state as necessary, and
checking other characters, subpatterns, etc.
(4) No, jump to next alternative, go back to (2).
(5) Yes, leave the current branch with success, restoring states
as necessary.

--
Gustavo Niemeyer
http://niemeyer.net
Jul 18 '05 #7
Pierre-Frédéric Caillaud wrote:

I think you need some binary search or some form of tree here.

Make a list of prefixes, sort them, store this list somewhere it's
quick to access.
It is then very quick (N log N) to look in that list which key
corresponds to the beginning of a path.
Basically you do a bisection search and, of course you don't have
an exact match but the last key you end up on is your prefix.


That doesn't quite work for my problem. Consider the list:

'John',
'Johnson',
'Jones'

A search for the string 'Johnston' (note the 't') will locate
the point between 'Johnson' and 'Jones'. 'Johnson' shares a
long prefix with 'Johnston', but I want the longest string in
the list that *is* a prefix of 'Johnston', and in this case that
is 'John'.
--
--Bryan
Jul 18 '05 #8

Thanks for responses Gustavo. That's most of what I need to
know.

Gustavo Niemeyer wrote:
[...]
Given the context (person names), your worst case will probably be
failing every name.
The names were just for demo. The actual application is to
match resource paths with the most-specific handler.

[...] It looks like a sane use of regular expressions.


Since it's not obviously insane, I'll include the code below,
in case anyone else wants to use it or QA it for free.
--Bryan
"""
The Problem: We're given a set of "start strings". Later
we'll be called (many times) with a "path string", and we
need to find the longest of the start strings that is a
prefix of the given path string.

Method: Build a Python regular expression (RE) from the
start strings, such that re.search(path) will match the
desired prefix. The RE has no loop-like things, and never
has two branches beginning with the same character.
Searching should be fast in all cases.

"""

import re
def build_prefix_re(str_list):
""" Given a sequence of strings, return a compiled RE such
that build_prefix_re(str_list).search(x) will match the
longest string in str_list that is a prefix of x.
"""

def trie_insert(map, str):
if str:
submap = map.setdefault(str[0], {})
trie_insert(submap, str[1:])
else:
map[""] = {}

def sequentialize(map, lst):
if map:
keys = map.keys()
# Order so that longer matches are first
keys.sort()
keys.reverse()
lst.append('(?:')
seperator = ''
for key in keys:
lst.append(seperator + re.escape(key))
submap = map[key]
while len(submap) == 1:
# While follow-set is singleton, condense.
key = submap.keys()[0]
submap = submap[key]
lst.append(re.escape(key))
sequentialize(submap, lst)
seperator = '|'
lst.append(')')

map = {}
for str in str_list:
trie_insert(map, str)
lst = ['^']
sequentialize(map, lst)
re_str = "".join(lst)
# print "Prefix finding RE is: '%s'\n" % re_str
return re.compile(re_str)
if __name__ == '__main__':
slist = ["a.b/cde/fg", "bchij", "bivwd", "cdj", "cdjwv", "cdjxi"]
prematcher = build_prefix_re(slist)
for str in slist:
match = prematcher.search(str)
assert(match.group(0) == str)
s = str + 'extraneous'
match = prematcher.search(s)
assert(match.group(0) == str)
for str in ('', 'cd', 'bb', 'cdi', 'xbchij', 'bchiij'):
match = prematcher.search(str)
assert not match, "BAD, found '%s'" % str
Jul 18 '05 #9

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

20 posts views Thread by Dan Vande More | last post: by
7 posts views Thread by TooNaive | last post: by
10 posts views Thread by Ruediger Klaehn | last post: by
458 posts views Thread by wellstone9912 | last post: by
31 posts views Thread by Bill Norton | last post: by
4 posts views Thread by nass | last post: by
11 posts views Thread by blackhacker | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.