Bytes IT Community

How to get the "longest possible" match with Python's RE module?

Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

Sep 12 '06 #1
32 Replies


Licheng Fang wrote:
Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'
From what I understand, this isn't Python-specific; it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.
There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'
Again, I don't think this has anything to do with Python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".

I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.
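One concrete way to do that with plain literals (a quick sketch, not tested against your real data): since '|' takes the first alternative that succeeds, putting the longer alternatives first makes the alternation prefer the longest.

```python
import re

# Sketch: sort literal alternatives longest-first so that first-match
# alternation also happens to be longest-match. Only safe for plain
# literals (hence the re.escape).
words = ["do", "dolittle"]
pattern = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
print(re.match(pattern, "dolittle").group())
```

With the longer branch first, the example from the original post returns 'dolittle' instead of 'do'.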

Regards,
Jordan

Sep 12 '06 #2

Licheng Fang wrote:
Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?
This is the way the regexp works; Python doesn't have anything to do with
it. The engine starts parsing the data with the given pattern, returns the
string matched according to the pattern, and doesn't go back to find the
other combinations.

So to get all the combinations you would probably have to give a
different pattern each time.
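Taking that literally, here is a minimal sketch (the helper name is made up): compile each alternative as its own pattern and keep whichever match is longest.

```python
import re

# Hypothetical helper: try each alternative separately and keep the
# longest match starting at pos.
def longest_alternative(alternatives, text, pos=0):
    best = None
    for alt in alternatives:
        m = re.compile(alt).match(text, pos)
        if m and (best is None or len(m.group()) > len(best.group())):
            best = m
    return best

print(longest_alternative(["do", "dolittle"], "dolittle").group())
```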

Sep 12 '06 #3


MonkeeSage wrote:
Licheng Fang wrote:
Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'
From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.
There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

Again, I don't think this has anything to do with Python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".

I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.

Regards,
Jordan
Hi, according to these regexp engine discussions, it's NOT a behavior
common to all implementations.

http://www.softec.st/en/OpenSource/D...onEngines.html
http://www.softec.st/en/OpenSource/D...onEngines.html

Python's NFA engine reads along the input string, matching it to the
pattern, and backtracking when needed. By contrast a DFA engine, to my
understanding, constructs a DFA and uses it to munch as many characters
as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

           one             self, none           selfsufficient, none
  (start) ------> ((1)) --------------> ((2)) ----------------------> ((3))

DFA ENGINE:

           one               self
  (start) ------> ((123)) ----------> ((23))
                     |
                     | selfsufficient
                     +----------------> ((3))

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?

Sep 12 '06 #4

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?
Maybe if you posted a (tested) grep example and data that do as you
want, the group could better understand what you are asking for?

- Paddy.

Sep 12 '06 #5

[Licheng Fang]
...
I want to know if there is some way to make Python RE behave like grep
does,
Not in general, no. The matching strategies couldn't be more
different, and that's both deep and intentional. See Friedl's book
for details:

http://regex.info/
or do I have to change to another engine?
Yes, if POSIX regexp semantics are what you require. Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching. You may be more motivated to search longer ;-)
Sep 12 '06 #6

Licheng Fang wrote:
Hi, according to these regexp engine discussions, it's NOT a behavior
common to all implementations.
[snip]
Well, I just double-checked in ruby (oniguruma regexp engine):

r = Regexp.new("do|dolittle")
puts r.match("dolittle")[0]
# do

r = Regexp.new("one(self)?(sufficient)?")
puts r.match("oneselfsufficient")[0]
# oneself

And perl:

if ("doolittle" =~
/(do|dolittle)/) {
print "$1\n";
# do
}

if ("oneselfsufficient" =~
/(one(self)?(selfsufficient)?)/) {
print "$1\n";
# oneself
}

And Javascript (whatever regexp engine Spidermonkey uses):

var r = new RegExp(/do|dolittle/);
alert("dolittle".match(r)[0]);

var r = new RegExp(/one(self)?(selfsufficient)?/);
alert("oneselfsufficient".match(r)[0]);

So, it seems they are all broken, or python is correct as well.

Regards,
Jordan

Sep 12 '06 #7

MonkeeSage wrote:
So, it seems they are all broken, or python is correct as well.
Aha, sorry about that, Licheng (re: Tim's post). I guess "broken"
depends on whether you are expecting Perl-compatible behavior or not.
I have my own scripts I use to do (f)grep and sed-like operations, so I
almost never use those programs and forgot about the different pattern
semantics (part of the reason I made my own scripts).

Regards,
Jordan

Sep 12 '06 #8

Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

MonkeeSage wrote:
Licheng Fang wrote:
Hi, according to these regexp engine discussions, it's NOT a behavior
common to all implementations.
[snip]

[snip examples]
So, it seems they are all broken, or python is correct as well.

Regards,
Jordan
Sep 12 '06 #9

Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
Sorry! *blush* I admit I skipped over your links. I'll have a look now.

BTW, just an idea that may or may not work. What about finding all
matches that meet the absolute baseline pattern, then taking the
longest of them... something like this maybe:

def matcher(strings, pattern):
    out = ''
    reg = re.compile(pattern)
    for string in strings:
        match = reg.match(string)
        if match and len(match.group()) >= len(out):  # should this use > or >= ?
            out = match.group()
    return out  # empty if no matches, else longest match

p = ['dodad', 'dolittle', 'dodaday']
print matcher(p, r'do.*')
# dolittle

Just a thought...

Regards,
Jordan

Sep 12 '06 #10

Thank you very much, Tim and Monkee.

In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
.... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.
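For what it's worth, the substitution step itself only needs a few lines. A sketch (the (*NAME) notation is from the definitions above; everything else here is made up):

```python
import re

# Expand (*NAME) placeholders in definition order, wrapping each
# expansion in a non-capturing group so precedence is preserved.
defs = [
    ("INT", r"\d+"),
    ("DECIMAL", r"(*INT)\.(*INT)"),
    ("FACTION", r"(*DECIMAL)/(*DECIMAL)"),
    ("NUMERALS", r"(*FACTION)|(*DECIMAL)|(*INT)"),
]

expanded = {}
for name, body in defs:
    for prev in expanded:
        body = body.replace("(*%s)" % prev, "(?:%s)" % expanded[prev])
    expanded[name] = body

print(re.match(expanded["NUMERALS"], "3.14/2.71").group())
```

This doesn't cure the '|' semantics, of course; it only keeps the long patterns manageable.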

My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?

Sep 12 '06 #11

Or maybe something like this is better:

def matcher(string, pattern):
    out = ''
    for match in re.findall(r'\S*%s\S*' % pattern, string):
        if len(match) >= len(out):
            out = match
    return out

p1 = 'dodad donkeykong dolittle dodaday'
p2 = 'oneself self-serving selfsufficient oneselfsufficient'
print matcher(p1, 'do')
# donkeykong
print matcher(p2, 'self')
# oneselfsufficient

Sep 12 '06 #12

Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
--
--Bryan
Sep 12 '06 #13

[Licheng Fang]
...
In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.

My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?
My best advice is to find a different approach entirely. For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.

Regexps are a brittle tool, best tolerated in small doses. For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with). OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).

Just sounds to me like the wrong tool for the job.
Sep 12 '06 #14

[Licheng Fang]
>Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
[Bryan Olson]
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book
Strongly disagree: it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented. It's not at
all about "regular expressions" in the CompSci sense of the term,
which appears to be your complaint.
was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.
As far as I could tell at the time, he originated it. I'm not happy
about that either.
"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.
And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.
What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
Sep 12 '06 #15


kondal wrote:
This is the way the regexp works; Python doesn't have anything to do with
it. The engine starts parsing the data with the given pattern, returns the
string matched according to the pattern, and doesn't go back to find the
other combinations.
I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.

Lorenzo Gatti

Sep 12 '06 #16


Bryan Olson wrote:
Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
--
--Bryan
Thanks for the valuable information. Indeed, when I read the pages, I
was a little confused about what they meant by 'NFA'.

But I faintly felt there might be an indirect, and not very exact,
mapping from the search-and-backtrack strategy to NFAs in the sense of
computer science, e.g. a state machine with the capability to be in
several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA
first to state 1, and then faces the choice between

           one             self, none           selfsufficient, none
  (start) ------> ((1)) --------------> ((2)) ----------------------> ((3))

1) matching 'self',
2) going on to state 2 without matching anything, or
3) just give 'one' as the matching result because state 1 is already a
terminal state

In such situations it always chooses the greedy way. To match more, it
goes to state 2, munching 'self'. Now it's left with only
'sufficient'. Here its choices are:

1) matching nothing and going to state 3
2) just give 'oneself' as result because state 2 is also a terminal
state

Again it's greedy, going on to state 3, in hope of matching more. But
here the pattern comes to an end, represented by state 3 as a terminal
state. So the engine gives 'oneself' as result and forgets about its
previously unexplored possibilities, because it only performs backtrack
when a match cannot be found.

I think if the backtracking were carried out in an exhaustive way, we
may say the engine tries every possibility on the NFA, though it's not
an NFA itself.

Sep 12 '06 #17


ga***@dsdata.it wrote:
kondal wrote:
This is the way the regexp works; Python doesn't have anything to do with
it. The engine starts parsing the data with the given pattern, returns the
string matched according to the pattern, and doesn't go back to find the
other combinations.

I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.

Lorenzo Gatti
Thanks. I think making use of the expressiveness of CFGs may be a
better idea.

Another question: my task is to find, in a given string, the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer for a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.

How can I solve this problem?

Sep 12 '06 #18

"Licheng Fang" <fa*********@gmail.com> writes:
I think if the backtracking were carried out in an exhaustive way, we
may say the engine tries every possibility on the NFA, though it's not
an NFA itself.
The backtracking engine really can recognize languages that are not
describable by classical regexps, by using backreferences, negation,
etc. But exactly what it can do is nowhere near as well-understood as
what classical regexps can do.

I seem to remember that the fully general problem of recognizing
regexps with negation is very hard, so the backtracking matcher either
has to reject some strings it should really match, or else it has to
bog down horribly with certain kinds of patterns.
Sep 12 '06 #19


Licheng Fang wrote:
Another question: my task is to find, in a given string, the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer for a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.
You can collect all successful parser results beginning from each index
in the string; this gives you all matches with that first index.
You could extend general bottom-up context-free parsing algorithms
like Earley's or Tomita's to return multiple results; for reasonable
languages most locations can be excluded for most rules at the
outset, with great performance improvements over trying over and
over again.

Lorenzo Gatti

Sep 12 '06 #20

Licheng Fang wrote:

The idea is like this:
(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution
what's hiding under those ellipses? things like "TERM" and "FACTOR" and
"EXPRESSION", perhaps?

</F>

Sep 12 '06 #21

Licheng Fang wrote:
Basically, the problem is this:

>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:

>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

Licheng,

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.
>>> targets = ['be', 'bee', 'been', 'being']
>>> targets.sort()
>>> targets.reverse()
>>> regex = '|'.join(targets)
>>> re.findall(regex, "Having been a bee in a former life, I don't mind being what I am and wouldn't want to be a bee ever again.")
['been', 'bee', 'being', 'be', 'bee']

You might also take a look at a stream editor I recently came out with:
http://cheeseshop.python.org/pypi/SE/2.2%20beta

It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.
>>> import SE
>>> Bee_Editor = SE.SE('be=BE bee=BEE been=BEEN being=BEING')
>>> Bee_Editor("Having been a bee in a former life, I don't mind being what I am and wouldn't want to be a bee ever again.")
"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.

>>> SE.SE('<EATbe==, bee==, been==, being==,')(above_string)
'been,bee,being,be,bee,'

You can do extraction filters, deletion filters, and substitutions in any combination. It does multiple passes, can take files as input instead of strings, and can output files.
>>> Key_Word_Translator = SE.SE('''
...     "*INT=int substitute"
...     "*DECIMAL=decimal substitute"
...     "*FACTION=faction substitute"
...     "*NUMERALS=numerals substitute"
...     # ... etc.
... ''')

I don't know if that could serve.

Regards

Frederic
Sep 12 '06 #22

Tim Peters wrote:
[Bryan Olson]
>Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book

Strongly disagree: [...]
I know I'm disagreeing with a lot of smart people in panning
the book.
>What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.
--
--Bryan
Sep 13 '06 #23

Licheng Fang wrote:
Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?
Yes. Here's a way, but it sucks real bad:
def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
        m = regexp.match(text)
        if m:
            return m
        text = text[:-1]
    return None
--
--Bryan
Sep 13 '06 #24

[Licheng Fang]
>Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'
....
>The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?
[Bryan Olson]
Yes. Here's a way, but it sucks real bad:
def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
        m = regexp.match(text)
        if m:
            return m
        text = text[:-1]
    return None
If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.
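Concretely, Tim's suggestion could look like this (a sketch; for the thread's examples it behaves the same as Bryan's version):

```python
import re

# Bryan's brute-force longest match, reworked to shrink the endpos slice
# argument of match() instead of copying the string on each iteration.
def longest_match(re_string, text, start=0):
    regexp = re.compile('(?:' + re_string + ')$')
    for end in range(len(text), start - 1, -1):
        m = regexp.match(text, start, end)  # '$' anchors at endpos
        if m:
            return m
    return None

print(longest_match("one(self)?(selfsufficient)?", "oneselfsufficient").group())
```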
Sep 13 '06 #25

[Bryan Olson]
>>Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book
[Tim Peters]
>Strongly disagree: [...]
[Bryan]
I know I'm disagreeing with a lot of smart people in panning
the book.
That's allowed :-)
>>What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
>Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.
My guess then is that the book moved at too slow a pace for you, and
you got bored. The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal: double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character: not a
double quote, backslash, or newline. The "special case" is a
backslash followed by any character whatsoever. The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense". It's a set of obscure & specialized learned skills.
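As a quick sanity check (mine, not from the book), the unrolled C-string-literal pattern above does what it promises under Python's engine once the explanatory spacing is removed:

```python
import re

# The unrolled C-string-literal regexp from the post, with the
# whitespace removed: "normal" is any char except quote/backslash/newline,
# "special" is a backslash followed by any byte.
c_string = re.compile(r'"[^"\\\n]*(?:\\[\x00-\xff][^"\\\n]*)*"')

sample = r'printf("hi \"there\"\n");'
print(c_string.search(sample).group())
```

The whole literal, escapes included, comes back in one pass with no backtracking needed.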
Sep 13 '06 #26

Tim Peters wrote:
[...] The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
[...]
As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
So how general is that? The book just says "certain common
expressions". I think the real answer is that one can unroll
exactly the true regular expressions. The unrolling is
essentially resolving the states of a DFA by hand.
--
--Bryan
Sep 13 '06 #27

Licheng Fang wrote:
Basically, the problem is this:
>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

A reply from Tim Peters reminded me that I once programmed a
simple special case of this problem. In particular, given a
lot of strings, my function returns a regexp that efficiently
matches the longest prefix equal to any of the strings.
If the application is like the examples, where the target RE
is just an 'or' of strings, the method should work well.
http://groups.google.com/group/comp....2eb274845653e9
Now I'm thinking of a more general version that handles all
true regular expressions.
--
--Bryan
Sep 13 '06 #28


Frederic Rentsch wrote:

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.
Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Sep 13 '06 #29

ga***@dsdata.it wrote:
Frederic Rentsch wrote:
> If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Very true! Funny you should remind me, considering that I spent quite
some time upgrading SE to allow regular expressions. Version 1 didn't
and could resolve precedence at compile time. Version 2 resolves
precedence at runtime by length of the matches and should function
correctly in this respect, although it might not function fast enough
for speed-critical applications. But then there is in general a
trade-off between convenience and speed.
Frederic

Sep 13 '06 #30

ga***@dsdata.it wrote:
Frederic Rentsch wrote:
> If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Oh yes, and my proposal to reverse-sort the targets was in response to
the OP who wanted to generate a regex from a bunch of strings. It should
work for that.
Frederic

Sep 13 '06 #31

Thank you guys. I've written a CYK parser and realized this is the
right direction. It gives every possible interpretation of the string
and I can retrieve whatever I want.

Sep 15 '06 #32

[Tim Peters]
>[...] The most valuable general technique [Friedl] (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
[Bryan Olson]
So how general is that?
Between 6 and 7 ;-)
The book just says "certain common expressions". I think the real
answer is that one can unroll exactly the true regular expressions.
I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).
The unrolling is essentially resolving the states of a DFA by hand.
Oh yes, for those who've studied the theoretical underpinnings. Most
people have not. Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works. And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.

That explanation may not work for you, but it's effective for people
who haven't studied the theory here. Such explanations also extend
naturally to gimmicks like backreferences, which have no counterpart
in CompSci regexps.
Sep 16 '06 #33
