By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,341 Members | 1,395 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,341 IT Pros & Developers. It's quick & easy.

Regular expression intricacies: why do REs skip some matches?

P: n/a
Hey guys and gals,
This is a followup of my "Counting all permutations of a substring"
thread (see
http://groups.google.com/group/comp....57235b3fd3966f
in Google Groups) I'm still having a difficult time figuring out the
intricacies of regular expressions and consecutive matches. Here's a
brief example:

In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?

Using redemo.py, one can see that the matches are occurring at the
following spots:
abababa
^ ^ (where ^ indicates the start of a match)
Ideally, there'd be a way to create the regular expression to get at
this match, too:
abababa
^
So that the total matches are:
abababa
^ ^ ^

Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?

Confusedly yours,
Chris

Apr 11 '06 #1
Share this Question
Share on Google+
5 Replies


P: n/a
> Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?


Yes. The reason is basically that consumed characters can't be
"unconsumed". However, if you use the search-variant with a
start-argument you can search from the last occurence start+1 to achieve
what you're after.

Diez
Apr 11 '06 #2

P: n/a
It's nothing to do with how/what a regular expression matches. It's all
to do with the definition of whatever convenience methods like
findall() that have been built on top of the basic match() method --
having found a match, where do they start looking for the next match?
Typically, one does not want overlapping matches.

Apr 11 '06 #3

P: n/a
> In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?


Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa')

which will return 3. However, each result is empty:
print re.findall('(?=aba)', 'abababa')

['','','']

You'd have to do some chicanary to get the actual pieces:

s = 'abababa'
for f in re.finditer('(?=aba)', s):
print "Found %s at %i" % (
s[f.start():f.start()+3],
f.start())

or

[s[f.start():f.start()+3] for f in
re.finditer('(?=aba)', s)]

Note that both of these know the length of the desired
piece. If not, you may have to do additional processing to
get them to work the way you want. Yippie.

All lovely hacks, but they each return all three hits.

-tim
PS: These likely only work in Python...to use them in grep
or another regexp engine, you'd have to tweak them :*)


Apr 11 '06 #4

P: n/a
Tim Chase wrote:
In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?
It's just the way regexes work. You may disagree, but it's more
intuitive that iterated pattern searching be non-overlapping by
default. See also:
'abababa'.count('aba') 2
Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa')

which will return 3. However, each result is empty:

>>> print re.findall('(?=aba)', 'abababa')
['','','']

You'd have to do some chicanary to get the actual pieces:

(snip)

Actually, you can just define a group inside the lookahead assertion:
re.findall('(?=(aba))', 'abababa')

['aba', 'aba', 'aba']

--Ben

Apr 11 '06 #5

P: n/a
Diez, John, Tim, and Ben, thank you all so much. I now "get it". It
makes logical sense now that the difficulty was actually in the
implementation of findall, which does non-overlapping matches. It also
makes sense, now, that one can get around this by using a lookahead
assertion. Thanks a bunch, guys; this really helped!

Chris

Apr 12 '06 #6

This discussion thread is closed

Replies have been disabled for this discussion.