Jeff Epler wrote:
The last time this question went around, I learned that REs like
a|b|c
essentially tries the alternatives one after another, rather than
compiling into an FSM
that depends somewhat on what a, b, and c happens to be.
for example,
if all alternatives consist of a single literal character, the entire
subexpression is replaced with a character set ("a|b|c" becomes
"[abc]")
for alternatives that start with literal text or a character set, the
engine never checks alternatives that cannot possible match (if
you feed "aha" to "a...|b...|c...", the second and third alternative
are never checked).
if all alternatives share a common prefix, that prefix will be checked
before any alternative is tried; if the prefix matches, only the suffixes
will be checked for each alternative. ("aa|ab|ac" becomes "a(?:a|b|c)"
becomes "a[abc]")
when searching, the engine uses a KMP-style overlap table to skip
over places where the prefix cannot possibly match. (which explains
why re.search can sometimes run faster than string.find)
</F>