Basically, the problem is this:
>>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.
Is there a way to do this in Python?
Licheng Fang wrote:
Basically, the problem is this:
>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.
There's another example:
>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
Again, I don't think this has anything to do with python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".
I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.
Regards,
Jordan
Licheng Fang wrote:
Basically, the problem is this:
>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.
Is there a way to do this in Python?
This is the way the regexp works; Python doesn't have anything to do
with it. It starts parsing the data with the pattern given, returns the
matched string according to the pattern, and doesn't go back to find
the other combinations.
So to get all the combinations you would probably have to give a
different pattern each time.
MonkeeSage wrote:
Licheng Fang wrote:
Basically, the problem is this:
>>p = re.compile("do|dolittle")
>>p.match("dolittle").group()
'do'
From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.
There's another example:
>>p = re.compile("one(self)?(selfsufficient)?")
>>p.match("oneselfsufficient").group()
'oneself'
Again, I don't think this has anything to do with python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".
I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.
Regards,
Jordan
Hi, according to these regexp engine discussions, it's NOT the behavior
of every implementation: http://www.softec.st/en/OpenSource/D...onEngines.html
Python's NFA engine reads along the input string, matching it to the
pattern, and backtracking when needed. By contrast a DFA engine, to my
understanding, constructs a DFA and uses it to munch as many characters
as possible. Maybe it's like this:
Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

             one           self, none          selfsufficient, none
  (start) ------> ((1)) ---------------> ((2)) --------------------> ((3))

DFA ENGINE:

             one             self
  (start) ------> ((123)) --------> ((23))
                     |
                     | selfsufficient
                     +---------------> ((3))
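To make the pictures concrete, here is a toy sketch of the DFA-style
idea for a plain alternation of literal strings (my own illustration,
not any real engine's code): all alternatives are kept alive during one
left-to-right scan, and the last alternative to complete is the longest
match.

def longest_alternative(alternatives, text):
    # how far into each literal alternative we have matched so far
    live = dict((alt, 0) for alt in alternatives)
    best = None
    for ch in text:
        for alt in list(live):
            if live[alt] < len(alt) and alt[live[alt]] == ch:
                live[alt] += 1
                if live[alt] == len(alt):
                    best = alt   # completed; keep scanning for longer ones
            else:
                del live[alt]
        if not live:
            break
    return best

print(longest_alternative(['do', 'dolittle'], 'dolittle'))
# dolittle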
I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?
> I want to know if there is some way to make Python RE behave like grep
> does, or do I have to change to another engine?
Maybe if you posted a (tested) grep example and data that do as you
want, the group could better understand what you are asking for?
- Paddy.
[Licheng Fang]
...
> I want to know if there is some way to make Python RE behave like grep
> does,
Not in general, no. The matching strategies couldn't be more
different, and that's both deep and intentional. See Friedl's book
for details: http://regex.info/
> or do I have to change to another engine?
Yes, if POSIX regexp semantics are what you require. Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching. You may be more motivated to search longer ;-)
Licheng Fang wrote:
Hi, according to these regexp engine discussions, it's NOT a behavior
true to any implementation.
[snip]
Well, I just double-checked in ruby (oniguruma regexp engine):
r = Regexp.new("do|dolittle")
puts r.match("dolittle")[0]
# do
r = Regexp.new("one(self)?(sufficient)?")
puts r.match("oneselfsufficient")[0]
# oneself
And perl:
if ("doolittle" =~
/(do|dolittle)/) {
print "$1\n";
# do
}
if ("oneselfsufficient" =~
/(one(self)?(selfsufficient)?)/) {
print "$1\n";
# oneself
}
And Javascript (whatever regexp engine Spidermonkey uses):
var r = new RegExp(/do|dolittle/);
alert("dolittle".match(r)[0]);
var r = new RegExp(/one(self)?(selfsufficient)?/);
alert("oneselfsufficient".match(r)[0]);
So, it seems they are all broken, or python is correct as well.
Regards,
Jordan
MonkeeSage wrote:
So, it seems they are all broken, or python is correct as well.
Aha, sorry about that Licheng (re: Tim's post). I guess "broken"
depends on if you are expecting perl-compatible behavior or otherwise.
I have my own scripts I use to do (f)grep and sed-like operations, so I
almost never use those programs and forgot about the different pattern
semantics (part of the reason I made my own scripts).
Regards,
Jordan
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
MonkeeSage wrote:
Licheng Fang wrote:
Hi, according to these regexp engine discussions, it's NOT a behavior
true to any implementation.
[snip]
Well, I just double-checked in ruby (oniguruma regexp engine):
r = Regexp.new("do|dolittle")
puts r.match("dolittle")[0]
# do
r = Regexp.new("one(self)?(sufficient)?")
puts r.match("oneselfsufficient")[0]
# oneself
And perl:
if ("doolittle" =~
/(do|dolittle)/) {
print "$1\n";
# do
}
if ("oneselfsufficient" =~
/(one(self)?(selfsufficient)?)/) {
print "$1\n";
# oneself
}
And Javascript (whatever regexp engine Spidermonkey uses):
var r = new RegExp(/do|dolittle/);
alert("dolittle".match(r)[0]);
var r = new RegExp(/one(self)?(selfsufficient)?/);
alert("oneselfsufficient".match(r)[0]);
So, it seems they are all broken, or python is correct as well.
Regards,
Jordan
Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
Sorry! *blush* I admit I skipped over your links. I'll have a look now.
BTW, just an idea that may or may not work. What about finding all
matches that meet the absolute baseline pattern, then taking the
longest of them... something like this maybe:

def matcher(strings, pattern):
    out = ''
    reg = re.compile(pattern)
    for string in strings:
        match = reg.match(string)
        if match and len(match.group()) >= len(out):
            out = match.group()
    return out  # empty if no matches, else the longest match

p = ['dodad', 'dolittle', 'dodaday']
print matcher(p, r'do.*')
# dolittle
Just a thought...
Regards,
Jordan
Thank you very much, Tim and Monkee.
In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:
(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
.... ...
What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.
My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?
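For what it's worth, the substitution script might look roughly like
this (the (*NAME) syntax and the definitions are from the post above;
the file name patterns.txt and the code itself are my guesses, and
definitions must appear before they are used):

import re

definitions = {}
for line in open('patterns.txt'):   # e.g.  (*INT) = \d+
    name, _, body = line.partition('=')
    body = body.strip()
    # expand references to earlier definitions, written as (*NAME)
    body = re.sub(r'\(\*(\w+)\)',
                  lambda m: '(?:%s)' % definitions[m.group(1)],
                  body)
    definitions[name.strip().lstrip('(*').rstrip(')')] = body

numerals = re.compile(definitions['NUMERALS'])
print(numerals.match('3.14/2.72').group())   # matches the (*FACTION) branch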
Or maybe something like this is better:

def matcher(string, pattern):
    out = ''
    for match in re.findall(r'\S*%s\S*' % pattern, string):
        if len(match) >= len(out):
            out = match
    return out

p1 = 'dodad donkeykong dolittle dodaday'
p2 = 'oneself self-serving selfsufficient oneselfsufficient'
print matcher(p1, 'do')
# donkeykong
print matcher(p2, 'self')
# oneselfsufficient
Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.
"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.
What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
--
--Bryan
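As an aside, the NFA/DFA equivalence Bryan mentions is constructive:
the textbook subset construction turns any NFA into a DFA that accepts
the same language. A compact sketch, in my own encoding and without
epsilon transitions for brevity:

def nfa_to_dfa(nfa_delta, start, accepting):
    # nfa_delta: {(state, char): set of states}; DFA states are frozensets
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        chars = set(c for (s, c) in nfa_delta if s in current)
        for c in chars:
            nxt = frozenset(t for s in current
                            for t in nfa_delta.get((s, c), ()))
            dfa[(current, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    dfa_accepting = set(S for S in seen if S & accepting)
    return dfa, start_set, dfa_accepting

# strings over {a, b} that end in b: the NFA loops on state 0, accepts in 1
dfa, start, accept = nfa_to_dfa({(0, 'a'): {0}, (0, 'b'): {0, 1}}, 0, {1})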
[Licheng Fang]
...
In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:
(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...
What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and in the course of
this I encountered the problem with the semantics of '|'.
My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?
My best advice is to find a different approach entirely. For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.
Regexps are a brittle tool, best tolerated in small doses. For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with). OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).
Just sounds to me like the wrong tool for the job.
[Licheng Fang]
>Oh, please do have a look at the second link I've posted. There's a table comparing the regexp engines. The engines you've tested probably all use an NFA implementation.
[Bryan Olson]
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book
Strongly disagree: it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented. It's not at
all about "regular expressions" in the CompSci sense of the term,
which appears to be your complaint.
was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.
As far as I could tell at the time, he originated it. I'm not happy
about that either.
"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.
And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.
What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
kondal wrote:
This is the way the regexp works; Python doesn't have anything to do
with it. It starts parsing the data with the pattern given, returns the
matched string according to the pattern, and doesn't go back to find
the other combinations.
I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.
Lorenzo Gatti
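Something like Lorenzo's "set of all match lengths" idea can be
sketched in Python with tiny parser combinators (a toy encoding of my
own, not his Java code): each combinator maps a start index to the set
of indices where a match can end.

import re

def lit_class(char_class):
    # match exactly one character against a regex character class
    rx = re.compile(char_class)
    def parse(s, i):
        return {i + 1} if i < len(s) and rx.match(s[i]) else set()
    return parse

def seq(*parsers):
    def parse(s, i):
        ends = {i}
        for p in parsers:
            ends = set(e2 for e in ends for e2 in p(s, e))
        return ends
    return parse

def opt(p):
    # zero or one occurrence of p
    def parse(s, i):
        return {i} | p(s, i)
    return parse

def rep(p, lo, hi):
    # between lo and hi repetitions of p
    def parse(s, i):
        ends = {i} if lo == 0 else set()
        frontier = {i}
        for n in range(1, hi + 1):
            frontier = set(e2 for e in frontier for e2 in p(s, e))
            if n >= lo:
                ends |= frontier
        return ends
    return parse

# ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})?
upper, lower, alpha = lit_class('[A-Z]'), lit_class('[a-z]'), lit_class('[A-Za-z]')
pattern = seq(opt(seq(upper, lower)), opt(seq(upper, rep(alpha, 1, 10))))

for text in ('Abcdefg', 'AbCdefg', 'ABcdefg'):
    print('%s -> %d' % (text, max(pattern(text, 0))))   # all three reach 7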
Bryan Olson wrote:
Licheng Fang wrote:
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.
Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.
"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.
What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.
--
--Bryan
Thanks for the valuable information. Indeed, when I read the pages, I
was a little confused about what it meant by 'NFA'.
But I faintly felt there might be an indirect, and not very exact,
mapping from the search-and-backtrack strategy to NFAs in the sense of
computer science, e.g. a state machine with the capability to be in
several states at one time.
Say, when reading 'oneselfsufficient', the engine goes along the NFA
first to state 1, and then faces the choice between
             one           self, none          selfsufficient, none
  (start) ------> ((1)) ---------------> ((2)) --------------------> ((3))
1) matching 'self',
2) going on to state 2 without matching anything, or
3) just giving 'one' as the matching result because state 1 is already a
terminal state
In such situations it always chooses the greedy way. To match more, it
goes to state 2, munching 'self'. And now it's left with only
'sufficient'. Here its choices are:
1) matching nothing and going to state 3, or
2) just giving 'oneself' as result because state 2 is also a terminal
state
Again it's greedy, going on to state 3, in hope of matching more. But
here the pattern comes to an end, represented by state 3 as a terminal
state. So the engine gives 'oneself' as result and forgets about its
previously unexplored possibilities, because it only performs
backtracking when a match cannot be found.
I think if the backtracking is carried out in an exhaustive way, we may
say the engine tries every possibility on the NFA, though it's not an
NFA itself.
ga***@dsdata.it wrote:
kondal wrote:
This is the way the regexp works; Python doesn't have anything to do
with it. It starts parsing the data with the pattern given, returns the
matched string according to the pattern, and doesn't go back to find
the other combinations.
I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.
Lorenzo Gatti
Thanks. I think making use of the expressiveness of CFGs may be a
better idea.
Another question: my task is to find in a given string the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer for a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.
How can I solve this problem?
"Licheng Fang" <fa*********@gmail.comwrites:
I think if the backtracking is carried out in an exhaustive way, we may
say the engine tries every possibility on the NFA, though it's not an
NFA itself.
The backtracking engine really can recognize languages that are not
describable by classical regexps, by using backreferences, negation,
etc. But exactly what it can do is nowhere near as well-understood as
what classical regexps can do.
I seem to remember that the fully general problem of recognizing
regexps with negation is very hard, so the backtracking matcher either
has to reject some strings it should really match, or else it has to
bog down horribly with certain kinds of patterns.
Licheng Fang wrote:
Another question: my task is to find in a given string the substrings
that satisfy a particular pattern. That's why the first tool that
came to my mind was regular expressions. Parsers, however, only give a
yes/no answer for a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.
You can collect all successful parser results beginning from each index
in the string; this gives you all matches with that first index.
You could extend general bottom-up context-free parsing algorithms,
like Earley's or Tomita's, to return multiple results; for reasonable
languages most locations can be excluded for most rules at the
beginning, with great performance improvements over trying over and
over again.
Lorenzo Gatti
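A brute-force Python sketch of that idea, collecting the longest match
starting at every index (purely illustrative; it reuses the
anchored-pattern trick that also appears elsewhere in this thread):

import re

def all_longest_matches(re_string, text):
    # anchoring with $ forces a match to end exactly at endpos
    regexp = re.compile('(?:%s)$' % re_string)
    results = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            m = regexp.match(text, start, end)
            if m:
                results.append((start, text[start:end]))
                break   # longest match at this start index found
    return results

print(all_longest_matches('do|dolittle', 'dodolittle'))
# [(0, 'do'), (2, 'dolittle')]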
Licheng Fang wrote:
The idea is like this:
(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...
What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution
what's hiding under those ellipses? things like "TERM" and "FACTOR" and
"EXPRESSION", perhaps?
</F>
Licheng Fang wrote:
Basically, the problem is this:
>>>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.
Is there a way to do this in Python?
Licheng,
If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.
>>> targets = ['be', 'bee', 'been', 'being']
>>> targets.sort()
>>> targets.reverse()
>>> regex = '|'.join(targets)
>>> re.findall(regex, "Having been a bee in a former life, I don't mind being what I am and wouldn't want to be a bee ever again.")
['been', 'bee', 'being', 'be', 'bee']
You might also take a look at a stream editor I recently came out with: http://cheeseshop.python.org/pypi/SE/2.2%20beta
It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.
>>> import SE
>>> Bee_Editor = SE.SE('be=BE bee=BEE been=BEEN being=BEING')
>>> Bee_Editor("Having been a bee in a former life, I don't mind being what I am and wouldn't want to be a bee ever again.")
"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."
Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.
>>> SE.SE('<EATbe==, bee==, been==, being==,')(above_string)
'been,bee,being,be,bee,'
You can do extraction filters, deletion filters, and substitutions in any combination. It does multiple passes, can take files as input instead of strings, and can output files.
>>> Key_Word_Translator = SE.SE('''
"*INT=int substitute"
"*DECIMAL=decimal substitute"
"*FACTION=faction substitute"
"*NUMERALS=numerals substitute"
# ... etc.
''')
I don't know if that could serve.
Regards
Frederic
Tim Peters wrote:
[Bryan Olson]
>Unfortunately, the stuff about NFA's is wrong. Friedl's awful book
Strongly disagree: [...]
I know I'm disagreeing with a lot of smart people in panning
the book.
>What Python uses is search-and-backtrack. Unfortunately such engines don't have much theory behind them, and it's hard to reason generally about what they do.
Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.
--
--Bryan
Licheng Fang wrote:
Basically, the problem is this:
>>>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.
Is there a way to do this in Python?
Yes. Here's a way, but it sucks real bad:
def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
        m = regexp.match(text)
        if m:
            return m
        text = text[:-1]
    return None
--
--Bryan
[Licheng Fang]
>Basically, the problem is this:
>p = re.compile("do|dolittle")
>p.match("dolittle").group()
'do'
...
>The Python regular expression engine doesn't exaust all the possibilities, but in my application I hope to get the longest possible match, starting from a given point.
Is there a way to do this in Python?
[Bryan Olson]
Yes. Here's a way, but it sucks real bad:
def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
        m = regexp.match(text)
        if m:
            return m
        text = text[:-1]
    return None
If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.
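Concretely, Bryan's function could be adapted like this (a sketch of
mine, not Tim's code): compiled patterns accept pos and endpos
arguments, so the candidate end can shrink without building new strings:

import re

def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    for end in range(len(text), 0, -1):
        m = regexp.match(text, 0, end)   # match against text[0:end], no copy
        if m:
            return m
    return None

print(longest_match('do|dolittle', 'dolittle').group())
# dolittle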
[Bryan Olson]
>>Unfortunately, the stuff about NFA's is wrong. Friedl's awful book
[Tim Peters]
>Strongly disagree: [...]
[Bryan]
I know I'm disagreeing with a lot of smart people in panning
the book.
That's allowed :-)
>>What Python uses is search-and-backtrack. Unfortunately such engines don't have much theory behind them, and it's hard to reason generally about what they do.
>Yup X 3, and the last is precisely why Friedl's book is valuable for people using "NFA" implementations: Friedl does a good job of explaining when and why you're likely to trigger atrocious runtime performance, and gives practical general techniques for avoiding those problems. None of that has anything to do with CompSci regexps either, but is vital knowledge for people hoping to make happy non-trivial use of Python/Perl/etc regexps.
I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.
My guess then is that the book moved at too slow a pace for you, and
you got bored. The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:
normal* (?: special normal* )*
where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementations, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.
For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal: double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):
" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "
The "normal case" is that it's just a regular old character: not a
double quote, backslash, or newline. The "special case" is a
backslash followed by any character whatsoever. The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.
As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).
A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense". It's a set of obscure & specialized learned skills.
Tim Peters wrote:
[...] The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:
normal* (?: special normal* )*
where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.
[...]
As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
So how general is that? The book just says "certain common
expressions". I think the real answer is that one can unroll
exactly the true regular expressions. The unrolling is
essentially resolving the states of a DFA by hand.
--
--Bryan
Licheng Fang wrote:
Basically, the problem is this:
>>>p = re.compile("do|dolittle") p.match("dolittle").group()
'do'
Python's NFA regexp engine tries only the first option, and happily
rests on that. There's another example:
>>>p = re.compile("one(self)?(selfsufficient)?") p.match("oneselfsufficient").group()
'oneself'
The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.
Is there a way to do this in Python?
A reply from Tim Peters reminded me that I once programmed a
simple special case of this problem. In particular, given a
lot of strings, my function returns a regexp that efficiently
matches the longest prefix equal to any of the strings.
If the application is like the examples, where the target RE
is just the or of strings, the method should work well.
http://groups.google.com/group/comp....2eb274845653e9
Now I'm thinking of a more general version, that handles all
true regular expressions.
--
--Bryan
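One plausible shape for such a function (my reconstruction, not the
code behind Bryan's link): build a trie of the strings and emit
alternatives so that longer continuations are tried before a shorter
string is accepted:

import re

def strings_to_regex(strings):
    # build a character trie; None marks "a string may end here"
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = None

    def emit(node):
        alts = [re.escape(ch) + emit(node[ch])
                for ch in sorted(k for k in node if k is not None)]
        if not alts:
            return ''
        body = alts[0] if len(alts) == 1 else '(?:%s)' % '|'.join(alts)
        # optional continuation: the engine tries the longer branch first
        return '(?:%s)?' % body if None in node else body

    return emit(trie)

pattern = strings_to_regex(['do', 'dolittle'])
print(pattern)                                 # do(?:little)?
print(re.match(pattern, 'dolittle').group())   # dolittle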
Frederic Rentsch wrote:
If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.
Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.
Lorenzo Gatti
ga***@dsdata.it wrote:
Frederic Rentsch wrote:
> If you need regexes, why not just reverse-sort your expressions? This seems a lot easier and faster than writing another regex compiler. Reverse-sorting places the longer ones ahead of the shorter ones.
Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.
Lorenzo Gatti
Very true! Funny you should remind me, considering that I spent quite
some time upgrading SE to allow regular expressions. Version 1 didn't,
and so could resolve precedence at compile time. Version 2 resolves
precedence at runtime by length of the matches and should function
correctly in this respect, although it might not function fast enough
for speed-critical applications. But then there is in general a
trade-off between convenience and speed.
Frederic
ga***@dsdata.it wrote:
Frederic Rentsch wrote:
> If you need regexes, why not just reverse-sort your expressions? This seems a lot easier and faster than writing another regex compiler. Reverse-sorting places the longer ones ahead of the shorter ones.
Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.
Lorenzo Gatti
Oh yes, and my proposal to reverse-sort the targets was in response to
the OP who wanted to generate a regex from a bunch of strings. It should
work for that.
Frederic
Thank you guys. I've written a CYK parser and realized this is the
right direction. It gives every possible interpretation of the string
and I can retrieve whatever I want.
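For readers following along, here is a bare-bones CYK recognizer sketch
(a toy grammar encoding of my own; Licheng's parser is not shown in the
thread). The grammar must be in Chomsky normal form; cell
[length-1][i] collects every nonterminal deriving text[i:i+length], so
every interpretation of every substring is available afterwards:

def cyk(text, unary, binary):
    # unary:  {terminal_character: set of nonterminals}
    # binary: {(B, C): set of nonterminals A with a rule A -> B C}
    n = len(text)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(text):
        table[0][i] = set(unary.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for k in range(1, length):   # split point
                for b in table[k - 1][i]:
                    for c in table[length - k - 1][i + k]:
                        table[length - 1][i] |= binary.get((b, c), set())
    return table

# toy grammar: S -> A B, A -> 'a', B -> 'b'
table = cyk('ab', {'a': {'A'}, 'b': {'B'}}, {('A', 'B'): {'S'}})
print('S' in table[1][0])   # True: the whole string derives S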
[Tim Peters]
>[...] The most valuable general technique [Friedl] (eventually ;-) explained he called "unrolling", and consists of writing a regexp in the form:
normal* (?: special normal* )*
where the sets of characters with which `normal` and `special` can start are disjoint. Under most "NFA" implementation, such a regexp is immune to most bad behaviors, whether finding very long matches, or in searching a very long string that happens not to contain any matches.
[Bryan Olson]
So how general is that?
Between 6 and 7 ;-)
The book just says "certain common
answer is that one can unroll exactly the true regular expressions.
I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).
The unrolling is essentially resolving the states of a DFA by hand.
Oh yes, for those who've studied the theoretical underpinnings. Most
people have not. Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works. And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.
That explanation may not work for you, but it's effective for people
who haven't studied the theory here. Such explanations also extend
naturally to gimmicks like backreferences, which have no counterpart
in CompSci regexps.