Does my RE rock, or suck?

Bryan Olson

Here's my algorithmic problem:

Given a set of 'trigger' strings, and one 'path' string,
find the longest of the triggers that is a prefix of the
path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers. Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil. This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression that should re.find()
the longest matching trigger. For example, given:

triggers = ["Alice", "Bob" "John", "John Smith",
"John Smith Jr", "John Smith Sr"]

My re-generator returns:

re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(', but
no need to worry about that now. There are never *'s or +'s or
other loop-like things in the generated RE's. The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right." The order
should ensure finding the longest matching trigger. In (a
couple minutes of testing, it seems to work.
I've been burned using Python's RE module in the past. Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense. In particular:

Is this likely to be fast in normal cases? Is there a
clearly better approach?

What would be the worst-case time?

Am I likely to exceed recursion depth?

Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?
Thanks for any help answering these.

--
--Bryan

Jul 18 '05 #1

Subscribe Post Reply

1527

Michael Hudson

Bryan Olson <fa*********@nowhere.org> writes:

Here's my algorithmic problem:

Given a set of 'trigger' strings, and one 'path' string,
find the longest of the triggers that is a prefix of the
path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers. Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil. This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression that should re.find()
the longest matching trigger. For example, given:

triggers = ["Alice", "Bob" "John", "John Smith",
"John Smith Jr", "John Smith Sr"]

My re-generator returns:

re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(', but
no need to worry about that now. There are never *'s or +'s or
other loop-like things in the generated RE's. The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right." The order
should ensure finding the longest matching trigger. In (a
couple minutes of testing, it seems to work.
I've been burned using Python's RE module in the past. Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense. In particular:

Is this likely to be fast in normal cases? Is there a
clearly better approach?
Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
"Dozens to hundreds" sounds like a fairly small number in context,
but: time it! Only you know what fast enough is.

Here's a cute approach that should have running time independent of
the number of triggers:

def build_dict(prefix_list):
d = {}
for p in prefix_list:
if len(p) > 0:
d.setdefault(p[0], []).append(p[1:])
r = {}
for p in d:
r[p] = build_dict(d[p])
if '' in d[p]:
r[p][''] = 1
return r

def longest_prefix(s, d):
i = 0
j = 0
for c in s:
if c in d:
d = d[c]
i += 1
if '' in d:
j = i
else:
return s[:j]
else:
return s[:j]

if __name__ == '__main__':
d = build_dict(["Alice", "Bob", "John", "John Smith",
"John Smith Jr", "John Smith Sr"])
print longest_prefix("John Smith", d)
print longest_prefix("John Smith", d)

Here's a less cute but probably saner approach:

def prepare(prefix_list):
d = {}
for p in prefix_list:
d.setdefault(len(p), {})[p] = p
l = d.keys()
l.sort()
l.reverse()
return d, l

def longest_prefix2(s, d, l):
for i in l:
r = d[i].get(s[:i])
if r is not None:
return r
return ''

if __name__ == '__main__':
d, l = prepare(["Alice", "Bob", "John", "John Smith",
"John Smith Jr", "John Smith Sr"])
print longest_prefix2("John Smith", d, l)
print longest_prefix2("John Smith", d, l)

this should run in time proportional to the number of different
lengths of trigger (and use rather less memory than my first version).
What would be the worst-case time?

Am I likely to exceed recursion depth?
Not in 2.4 :-) And I doubt it before that, actually.
Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?
Thanks for any help answering these.

This thread from a year or so back might interest:

http://groups.google.com/gr*********...ing.google.com

Cheers,
mwh

--
: Giant screaming pieces of excrement, they are.
I have a feeling that some of the people in here have a
MUCH more exciting time relieving themselves than I do.
-- Mike Sphar & Dave Brown, asr

Jul 18 '05 #2

Bryan Olson

Thanks Michael. I still have more investigating to do.

Michael Hudson wrote:

Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
I guess I didn't point this out except for my example: I tried
to make the RE efficient even for a search-and-backtrack engine.
Not only are there no loop-like things, there are never multiple
choices that begin with the same character. Any time it
backtracks, it will fail immediately at every choice point.

In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)

"Dozens to hundreds" sounds like a fairly small number in context,
I agree, though hundreds would make the longest RE string I've
ever used.
but: time it! Only you know what fast enough is.
It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.

Here's a cute approach that should have running time independent
of the number of triggers: [snip]

I've seen that kind of structure called a "trie" (Knuth cites
the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
my RE, I sort of use the same trie structure. My (trivial)
tests so far indicate that the RE form is faster in typical
cases. Your trie-searcher is nice in that I have much more
confidence in the worst-case run-time.
Here's a less cute but probably saner approach: [...] this should run in time proportional to the number of different
lengths of trigger (and use rather less memory than my first version).
I'm not convinced. The time to hash a string is proportional to
the length, so I think searches can take time proportional to
the sum of all the trigger lengths.
[I, Bryan asked:]
Am I likely to exceed recursion depth?

Not in 2.4 :-) And I doubt it before that, actually.

But Python is only up to ... oh, the smiley. I get it.

This thread from a year or so back might interest:

http://groups.google.com/gr*********...ing.google.com

Thanks,

--
--Bryan

Jul 18 '05 #3

Michael Hudson

Bryan Olson <fa*********@nowhere.org> writes:

Thanks Michael. I still have more investigating to do.

Michael Hudson wrote:
> Well, the re module is an NFA engine, not a DFA engine, so I'd expect
> this re to run in time proportional to the number of triggers.
I guess I didn't point this out except for my example: I tried
to make the RE efficient even for a search-and-backtrack engine.
Not only are there no loop-like things, there are never multiple
choices that begin with the same character. Any time it
backtracks, it will fail immediately at every choice point.

Hmm, you sound like you know more about this sort of thing than I do
:-)
In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)
Well, I don't know much about the re compiler. It's written in
Python, though, so it shouldn't be *that* hard to check for
yourself...

> "Dozens to hundreds" sounds like a fairly small number in context,

I agree, though hundreds would make the longest RE string I've
ever used.
> but: time it! Only you know what fast enough is.

It seems fast. Testing doesn't allay my fears of a bad worst-
case that an adversary might discover.
> Here's a cute approach that should have running time independent
> of the number of triggers:

[snip]

I've seen that kind of structure called a "trie" (Knuth cites
the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
my RE, I sort of use the same trie structure.

Heh, I'd sort of heard of tries but didn't realize I was implementing
one. I just thought it was a simple minded DFA engine...
My (trivial) tests so far indicate that the RE form is faster in
typical cases. Your trie-searcher is nice in that I have much more
confidence in the worst-case run-time.
Well, the re engine is written in C, so you would hope that it beats
my Python engine...

The reason I original wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.

> Here's a less cute but probably saner approach:

[...]
> this should run in time proportional to the number of different
> lengths of trigger (and use rather less memory than my first version).

I'm not convinced. The time to hash a string is proportional to
the length, so I think searches can take time proportional to
the sum of all the trigger lengths.

I guess, but the string hashing is done in C and is "probably"
insignificant.
[I, Bryan asked:]
>> Am I likely to exceed recursion depth?

>
> Not in 2.4 :-) And I doubt it before that, actually.

But Python is only up to ... oh, the smiley. I get it.

Well, my point was there is no recursion limit in Python 2.4's sre
engine. But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.

Cheers,
mwh

--
Need to Know is usually an interesting UK digest of things that
happened last week or might happen next week. [...] This week,
nothing happened, and we don't care.
-- NTK Now, 2000-12-29, http://www.ntk.net/

Jul 18 '05 #4

Pierre-Frédéric Caillaud

I think you need some binary search or some form of tree here.

Make a list of prefixes, sort them, store this list somewhere it's quick
to access.
It is then very quick (N log N) to look in that list which key
corresponds to the beginning of a path.
Basically you do a bisection search and, of course you don't have an
exact match but the last key you end up on is your prefix.

I believe there is a btree data type in the Python library.

Jul 18 '05 #5

Gustavo Niemeyer

Hello Bryan,

Is this likely to be fast in normal cases? Is there a
clearly better approach?
Yes, it should be fast, since no repetitive backtracking will happen.
What would be the worst-case time?
Given the context (person names), your worst case will probably be
failing every name.
Am I likely to exceed recursion depth?
You won't, unless you find a name with thousands of words, and
variants for every word. With Python 2.4, you'll only get in
trouble if you exceed your system's memory.
Any other gotcha? Am I likely to find it when I compile the
RE, or when I try to match a path?

It looks like a sane use of regular expressions.

--
Gustavo Niemeyer
http://niemeyer.net

Jul 18 '05 #6

Gustavo Niemeyer

[...]

In my example I compile the RE:

'^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'

If the RE compiler is smart, searching should involve scanning
the longest prefix that could still match, plus looking at just
This won't happen right now. Minimum and maximum lenght is currently
only checked for the whole pattern, not for every branch.
one character for each branch-point along the way. (If it uses
a jump-table at decision points, it should be truly linear.)

OTOH, there's some intelligence for this specific case. The first
character is checked even before "entering" the branch (IOW,
preparing the enviornment for checking the branch), so your
pattern should match really fast.

It will be something close to:

(1) Entering a branch, save current state as necessary.
(2) Check first character, does it match?
(3) If yes, get into the branch, saving state as necessary, and
checking other characters, subpatterns, etc.
(4) No, jump to next alternative, go back to (2).
(5) Yes, leave the current branch with success, restoring states
as necessary.

--
Gustavo Niemeyer
http://niemeyer.net

Jul 18 '05 #7

Bryan Olson

Pierre-Frédéric Caillaud wrote:

I think you need some binary search or some form of tree here.

Make a list of prefixes, sort them, store this list somewhere it's
quick to access.
It is then very quick (N log N) to look in that list which key
corresponds to the beginning of a path.
Basically you do a bisection search and, of course you don't have
an exact match but the last key you end up on is your prefix.

That doesn't quite work for my problem. Consider the list:

'John',
'Johnson',
'Jones'

A search for the string 'Johnston' (note the 't') will locate
the point between 'Johnson' and 'Jones'. 'Johnson' shares a
long prefix with 'Johnston', but I want the longest string in
the list that *is* a prefix of 'Johnston', and in this case that
is 'John'.
--
--Bryan

Jul 18 '05 #8

Bryan Olson

Thanks for responses Gustavo. That's most of what I need to
know.

Gustavo Niemeyer wrote:
[...]

Given the context (person names), your worst case will probably be
failing every name.
The names were just for demo. The actual application is to
match resource paths with the most-specific handler.

[...] It looks like a sane use of regular expressions.

Since it's not obviously insane, I'll include the code below,
in case anyone else wants to use it or QA it for free.
--Bryan
"""
The Problem: We're given a set of "start strings". Later
we'll be called (many times) with a "path string", and we
need to find the longest of the start strings that is a
prefix of the given path string.

Method: Build a Python regular expression (RE) from the
start strings, such that re.search(path) will match the
desired prefix. The RE has no loop-like things, and never
has two branches beginning with the same character.
Searching should be fast in all cases.

"""

import re
def build_prefix_re(str_list):
""" Given a sequence of strings, return a compiled RE such
that build_prefix_re(str_list).search(x) will match the
longest string in str_list that is a prefix of x.
"""

def trie_insert(map, str):
if str:
submap = map.setdefault(str[0], {})
trie_insert(submap, str[1:])
else:
map[""] = {}

def sequentialize(map, lst):
if map:
keys = map.keys()
# Order so that longer matches are first
keys.sort()
keys.reverse()
lst.append('(?:')
seperator = ''
for key in keys:
lst.append(seperator + re.escape(key))
submap = map[key]
while len(submap) == 1:
# While follow-set is singleton, condense.
key = submap.keys()[0]
submap = submap[key]
lst.append(re.escape(key))
sequentialize(submap, lst)
seperator = '|'
lst.append(')')

map = {}
for str in str_list:
trie_insert(map, str)
lst = ['^']
sequentialize(map, lst)
re_str = "".join(lst)
# print "Prefix finding RE is: '%s'\n" % re_str
return re.compile(re_str)
if __name__ == '__main__':
slist = ["a.b/cde/fg", "bchij", "bivwd", "cdj", "cdjwv", "cdjxi"]
prematcher = build_prefix_re(slist)
for str in slist:
match = prematcher.search(str)
assert(match.group(0) == str)
s = str + 'extraneous'
match = prematcher.search(s)
assert(match.group(0) == str)
for str in ('', 'cd', 'bb', 'cdi', 'xbchij', 'bchiij'):
match = prematcher.search(str)
assert not match, "BAD, found '%s'" % str

Jul 18 '05 #9

Similar topics

Mirrors that don't suck.

by: Dan Vande More | last post by:

Hey list, I'm just wondering if anyone can point me in the direction of a mirror that doesn't suck. I generally don't do alot with postgres other than downloading and installing the newest...

PostgreSQL Database

Building Paper, Rock, Scissors

by: TooNaive | last post by:

Hello all, I'm taking a C class and am having to write a program to play a game of rock, paper scissors, and with the output of: You chose paper and I chose rock. You Win where paper is the...

C / C++

Generics in .NET 2.0 SUCK

by: Ruediger Klaehn | last post by:

Sorry about the harsh language, but I have to vent my anger to somebody who actually understands what I am talking about. Complaining to my girlfriend is not going to produce any meaningful results...

C# / C Sharp

458

On Java and C++

by: wellstone9912 | last post by:

Java programmers seem to always be whining about how confusing and overly complex C++ appears to them. I would like to introduce an explanation for this. Is it possible that Java programmers...

C / C++

Does CSS Suck?

by: Bill Norton | last post by:

Over at PCMag.com there is a debate going on over the usability of CSS. The discussion was initiated by an article by John Dvorak called "Why CSS Bugs Me"...

HTML / CSS

Tkinter--does anyone use it for sophisticated GUI development?

by: Kevin Walzer | last post by:

I'm a Tcl/Tk developer who has been working, slowly, at learning Python, in part because Python has better support for certain kinds of applications that I want to develop than Tcl/Tk does....

Python

how does aio_write() work

by: nass | last post by:

hello everyone i am looking into asynchronous file writing (appending). what the program is doing: i am running a loop and when the condition is met (basically a timer) a buffer gets appended to...

C / C++

Paper,Scissor,Rock Help

by: blackhacker | last post by:

Please can anyone help me to make a simple simple Java code to force the game scissor,paper,rock to work ? for example i know how to do it in Visual Basic but not in java,for example this is kind...

Java

rock, paper, scissor help C programming

by: jmf777 | last post by:

Hi here is my problem I want to have the outcome of my rock paper scissors game to print outcomes like: You chose paper and I chose rock. You win. The value for player_choice and machine_choice is...

C / C++

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing