By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,814 Members | 1,111 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,814 IT Pros & Developers. It's quick & easy.

re.match -- not greedy?

P: n/a
the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:
>>import re;re.match('(a+)(ab)','aaab').groups()
('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

in real life, i am trying to match #defines:
>>re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

so what's the definition of greedy?

(pls copy me on responses)

ta,
jack
Nov 19 '06 #1
Share this Question
Share on Google+
6 Replies


P: n/a
EXI-Andrews, Jack wrote:
the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:
>import re;re.match('(a+)(ab)','aaab').groups()
('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching
It's called backtracking. If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.
in real life, i am trying to match #defines:
>re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.
Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)
so what's the definition of greedy?
It means match as much you can *while getting a match*.

Just remember regexp parsing is not a one way street: if it hits a dead
end, it'll go back and try another path. In fact, greediness usually
doesn't affect *whether* a regexp matches; it only affects the
groupings. I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.
(pls copy me on responses)
Ah, too bad you won't see it then....
Carl Banks

Nov 19 '06 #2

P: n/a
* Carl Banks wrote:
I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.
nope, per definitionem ;-)

greedy := longest leftmost match

nd
--
die (eval q-qq[Just Another Perl Hacker
]
;-)
# André Malo, <http://www.perlig.de/#
Nov 20 '06 #3

P: n/a
Carl Banks wrote:
EXI-Andrews, Jack wrote:
>the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:
>>>>import re;re.match('(a+)(ab)','aaab').groups()
('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

It's called backtracking. If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.
>in real life, i am trying to match #defines:
>>>>re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)
Another way that seems clearer to me is to use negative lookahead
assertions.
>>>defPatt = re.compile(r'#define\s+(\w+)\b(?!\()')
defPatt.match("#define abc(x)")
defPatt.match("#define foo_BarBaz")
<_sre.SRE_Match object at 0xb7dd5820>
>>defPatt.match("#define foo_BarBaz").groups()
('foo_BarBaz',)

In general '\w' is the same as [A-Za-z0-9_]

There are other considerations too... I don't know if

#define abc (x)

But the main thing here is the use of '\b' to require there to be a word
boundary at the end your match and a (?!\() to require that the match is
not followed by a '('

see http://docs.python.org/lib/re-syntax.html

noah

>so what's the definition of greedy?

It means match as much you can *while getting a match*.

Just remember regexp parsing is not a one way street: if it hits a dead
end, it'll go back and try another path. In fact, greediness usually
doesn't affect *whether* a regexp matches; it only affects the
groupings. I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.
>(pls copy me on responses)

Ah, too bad you won't see it then....
Carl Banks
Nov 20 '06 #4

P: n/a
"EXI-Andrews, Jack" <Ja**********@boeing.comwrote in message
news:ma**************************************@pyth on.org...

in real life, i am trying to match #defines:
>>re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.
>
Here's a pyparsing grammar that matches your #defines as you describe (skip
over #defines with parameters), and also skips over ones that are commented
out. It also parses the rest of the definition line (definitions with
end-of-line '\' continuations not included in this sample):

from pyparsing import *
defWithoutParen = "#define" + \
Word(alphas,alphanums+"_") + \
~Literal("(").leaveWhitespace() + \
Optional(restOfLine)
defWithoutParen.ignore(cStyleComment)

Here's sample code showing how to use this grammar:

sample = """
#define PI 3.141592
#define SUCCESS (href==1)
#define MIN(a,b) a<b ? a : b
#define E 2.71828
/*
#define lockMutex mockLockMutex
*/
#define WIN32
"""
matches = defWithoutParen.searchString(sample)
for m in matches:
print m

Prints out:
['#define', 'PI', '3.141592']
['#define', 'SUCCESS', '(href==1)']
['#define', 'E', '2.71828']
['#define', 'WIN32', '']

-- Paul
Nov 20 '06 #5

P: n/a
Carl Banks wrote:
I'm suddenly curious if there are any cases at all where greediness
changes whether it finds a match.
the "|" operator picks the leftmost subpattern that matches in any way,
so the *overall* greediness of a Python RE can vary. e.g.

re.match("(ab?a|a.a+)", "abaaaaaaaa")

matches three characters, not ten.

</F>

Nov 20 '06 #6

P: n/a
Ant

EXI-Andrews, Jack wrote:
the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:
>import re;re.match('(a+)(ab)','aaab').groups()
('aa', 'ab')

this is the sort of behaviour i'd expect from
'(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching
They are greedy - they consume as much as possible *without*
sacrificing the match.

You are thinking I think of possessive quantifiers, such as found in
Java, which *will* match as much as possible even if doing so causes
the match to fail. This would be written:

re.match('(a++)(ab)','aaab')

Python doesn't support these possessive quantifiers.

Nov 22 '06 #7

This discussion thread is closed

Replies have been disabled for this discussion.