Regular expression intricacies: why do REs skip some matches?

Hey guys and gals,
This is a follow-up to my "Counting all permutations of a substring"
thread (see
http://groups.google.com/group/comp....57235b3fd3966f
in Google Groups). I'm still having a difficult time figuring out the
intricacies of regular expressions and consecutive matches. Here's a
brief example:

In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?

Using redemo.py, one can see that the matches are occurring at the
following spots:
abababa
^   ^      (where ^ indicates the start of a match)
Ideally, there'd be a way to create the regular expression to get at
this match, too:
abababa
  ^
So that the total matches are:
abababa
^ ^ ^

Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?

Confusedly yours,
Chris

Apr 11 '06 #1
> Is this simply not the way REs work? Does this sort of matching really
> have to be home-coded?


Yes. The reason is basically that consumed characters can't be
"unconsumed". However, if you use the search variant with a
start argument, you can search from the start of the last occurrence + 1
to achieve what you're after.
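
A minimal sketch of that approach, assuming a compiled pattern and its search(string, pos) method. The helper name overlapping_starts is made up for illustration and is not part of the re module:

```python
import re

def overlapping_starts(pattern, text):
    """Return the start index of every match, including overlapping ones.

    Hypothetical helper for illustration only.
    """
    compiled = re.compile(pattern)
    starts = []
    pos = 0
    # Stop once pos has moved past the end of the text.
    while pos <= len(text):
        # Pattern.search(string, pos) begins scanning at index pos.
        match = compiled.search(text, pos)
        if match is None:
            break
        starts.append(match.start())
        # Resume one character after the START of this match, not after its end.
        pos = match.start() + 1
    return starts

print(overlapping_starts('aba', 'abababa'))  # [0, 2, 4]
```

The only difference from what findall does is the restart position: the loop steps forward by one character from the match start, so characters that were already matched can be matched again.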

Diez
Apr 11 '06 #2
It's nothing to do with how/what a regular expression matches. It's all
to do with the definition of whatever convenience methods like
findall() that have been built on top of the basic match() method --
having found a match, where do they start looking for the next match?
Typically, one does not want overlapping matches.
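
To see where the next search starts, one can print the span of each match. This is only a small illustration using finditer, which follows the same non-overlapping rule as findall:

```python
import re

text = 'abababa'
# Each span is (start, end). The first match ends at index 3, so the scan
# resumes there and the 'aba' starting at index 2 is never considered.
spans = [m.span() for m in re.finditer('aba', text)]
print(spans)  # [(0, 3), (4, 7)]
```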

Apr 11 '06 #3
> In [1]: import re
>
> In [2]: aba_re = re.compile('aba')
>
> In [3]: aba_re.findall('abababa')
> Out[3]: ['aba', 'aba']
>
> The return is two matches, whereas, I expected three. Why does this
> regular expression work this way?


Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa'))

which will return 3. However, each result is empty:

print(re.findall('(?=aba)', 'abababa'))

['', '', '']
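
The results are empty because a lookahead only asserts that 'aba' follows and consumes no characters itself. A quick way to check this, as a sketch, is to look at the match spans:

```python
import re

# Every match is zero-width: start and end are the same index.
spans = [m.span() for m in re.finditer('(?=aba)', 'abababa')]
print(spans)  # [(0, 0), (2, 2), (4, 4)]
```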

You'd have to do some chicanery to get the actual pieces:

s = 'abababa'
for f in re.finditer('(?=aba)', s):
    # slice out the 3 characters that start at each match position
    print("Found %s at %i" % (
        s[f.start():f.start()+3],
        f.start()))

or

[s[f.start():f.start()+3] for f in
re.finditer('(?=aba)', s)]

Note that both of these know the length of the desired
piece. If not, you may have to do additional processing to
get them to work the way you want. Yippie.

All lovely hacks, but they each return all three hits.

-tim
PS: These likely only work in Python...to use them in grep
or another regexp engine, you'd have to tweak them :*)


Apr 11 '06 #4
Tim Chase wrote:
>> In [1]: import re
>>
>> In [2]: aba_re = re.compile('aba')
>>
>> In [3]: aba_re.findall('abababa')
>> Out[3]: ['aba', 'aba']
>>
>> The return is two matches, whereas, I expected three. Why does this
>> regular expression work this way?

It's just the way regexes work. You may disagree, but it's more
intuitive that iterated pattern searching be non-overlapping by
default. See also:

'abababa'.count('aba')
2

> Well, if you don't need the actual results, just their
> count, you can use
>
> how_many = len(re.findall('(?=aba)', 'abababa'))
>
> which will return 3. However, each result is empty:
>
> print(re.findall('(?=aba)', 'abababa'))
> ['', '', '']
>
> You'd have to do some chicanery to get the actual pieces:
>
> (snip)

Actually, you can just define a group inside the lookahead assertion:
re.findall('(?=(aba))', 'abababa')

['aba', 'aba', 'aba']
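
The same trick can be wrapped up for patterns whose match length isn't known in advance. This is only a sketch: findall_overlapping is a made-up name, and it assumes the pattern contains no numbered backreferences, since the extra outer group shifts group numbers by one.

```python
import re

def findall_overlapping(pattern, text):
    """Return the text of every match of pattern, overlapping ones included.

    Hypothetical helper for illustration only.
    """
    # Put the whole pattern in a capturing group inside a lookahead, so the
    # engine captures the text but consumes none of it.
    wrapped = re.compile('(?=(%s))' % pattern)
    # group(1) is always the outer group, even if pattern has groups of its own.
    return [m.group(1) for m in wrapped.finditer(text)]

print(findall_overlapping('aba', 'abababa'))   # ['aba', 'aba', 'aba']
print(findall_overlapping('ab+a', 'abbabba'))  # ['abba', 'abba']
```

Note that this yields at most one match per start position, so it finds matches that overlap each other but not several different-length matches beginning at the same character.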

--Ben

Apr 11 '06 #5
Diez, John, Tim, and Ben, thank you all so much. I now "get it". It
makes logical sense now that the difficulty was actually in the
implementation of findall, which does non-overlapping matches. It also
makes sense, now, that one can get around this by using a lookahead
assertion. Thanks a bunch, guys; this really helped!

Chris

Apr 12 '06 #6
