Regular expression intricacies: why do REs skip some matches?

Chris Lasher

Hey guys and gals,
This is a follow-up to my "Counting all permutations of a substring"
thread (see
http://groups.google.com/group/comp....57235b3fd3966f
in Google Groups). I'm still having a difficult time figuring out the
intricacies of regular expressions and consecutive matches. Here's a
brief example:

In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas I expected three. Why does this
regular expression work this way?

Using redemo.py, one can see that the matches are occurring at the
following spots:
abababa
^   ^   (where ^ indicates the start of a match)
Ideally, there'd be a way to create the regular expression to get at
this match, too:
abababa
  ^
So that the total matches are:
abababa
^ ^ ^

Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?

Confusedly yours,
Chris

Apr 11 '06 #1


Diez B. Roggisch

> Is this simply not the way REs work? Does this sort of matching really
> have to be home-coded?

Yes. The reason is basically that consumed characters can't be
"unconsumed". However, if you use the search variant with a
start argument, you can search from the start of the last occurrence + 1
to achieve what you're after.
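
A minimal sketch of that suggestion, assuming a compiled pattern and its search(string, pos) method. The helper name find_overlapping is made up for illustration and is not part of the re module:

```python
import re

def find_overlapping(pattern, text):
    """Return (start, matched_text) pairs, allowing matches to overlap.

    Hypothetical helper: it restarts the search one character after the
    START of the previous match instead of at its end.
    """
    results = []
    pos = 0
    # The bound also stops patterns that can match the empty string.
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        results.append((m.start(), m.group()))
        pos = m.start() + 1
    return results

aba_re = re.compile('aba')
print(find_overlapping(aba_re, 'abababa'))
```

For 'abababa' this should report matches starting at offsets 0, 2, and 4, which are the three hits the original question was after.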

Diez

Apr 11 '06 #2

John Machin

It's nothing to do with how/what a regular expression matches. It's all
to do with the definition of convenience methods like findall() that
have been built on top of the basic match() method -- having found a
match, where do they start looking for the next match?
Typically, one does not want overlapping matches.
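
To make the "where do they start looking next?" point concrete, here is a simplified model of a non-overlapping scan. It is only a sketch, not the actual implementation inside the re module; the name scan_non_overlapping is hypothetical, and it assumes a pattern without capture groups:

```python
import re

def scan_non_overlapping(pattern, text):
    """Simplified model: resume each search where the previous match ended."""
    found = []
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        found.append(m.group())
        # Skip past the whole match; after an empty match, step one
        # character forward so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found

aba_re = re.compile('aba')
print(scan_non_overlapping(aba_re, 'abababa'))
print(aba_re.findall('abababa'))
```

Because the second search starts at offset 3, the occurrence beginning at offset 2 is never examined, so both calls should give the same two hits.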

Apr 11 '06 #3

Tim Chase

> In [1]: import re
>
> In [2]: aba_re = re.compile('aba')
>
> In [3]: aba_re.findall('abababa')
> Out[3]: ['aba', 'aba']
>
> The return is two matches, whereas I expected three. Why does this
> regular expression work this way?

Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa'))

which will return 3. However, each result is empty:

print(re.findall('(?=aba)', 'abababa'))

['', '', '']

You'd have to do some chicanery to get the actual pieces:

s = 'abababa'
for f in re.finditer('(?=aba)', s):
    print("Found %s at %i" % (
        s[f.start():f.start()+3],
        f.start()))

or

[s[f.start():f.start()+3] for f in
 re.finditer('(?=aba)', s)]

Note that both of these know the length of the desired
piece. If not, you may have to do additional processing to
get them to work the way you want. Yippie.

All lovely hacks, but they each return all three hits.

-tim
PS: These likely only work in Python...to use them in grep
or another regexp engine, you'd have to tweak them :*)

Apr 11 '06 #4

Ben Cartwright

Tim Chase wrote:

> > In [1]: import re
> >
> > In [2]: aba_re = re.compile('aba')
> >
> > In [3]: aba_re.findall('abababa')
> > Out[3]: ['aba', 'aba']
> >
> > The return is two matches, whereas I expected three. Why does this
> > regular expression work this way?

It's just the way regexes work. You may disagree, but it's more
intuitive that iterated pattern searching be non-overlapping by
default. See also:

>>> 'abababa'.count('aba')
2
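
For a fixed substring, an overlapping count can also be sketched without regexes by stepping str.find forward one character at a time. The function name count_overlapping is invented for this example:

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps.

    Hypothetical helper; sub must be non-empty.
    """
    if not sub:
        raise ValueError("sub must be non-empty")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Restart one character after the start of the last hit.
        start = text.find(sub, start + 1)
    return count

print('abababa'.count('aba'))               # non-overlapping count
print(count_overlapping('abababa', 'aba'))  # overlapping count
```
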

> Well, if you don't need the actual results, just their
> count, you can use
>
> how_many = len(re.findall('(?=aba)', 'abababa'))
>
> which will return 3. However, each result is empty:
>
> >>> print(re.findall('(?=aba)', 'abababa'))
> ['', '', '']
>
> You'd have to do some chicanery to get the actual pieces:

(snip)

Actually, you can just define a group inside the lookahead assertion:

>>> re.findall('(?=(aba))', 'abababa')
['aba', 'aba', 'aba']
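
The same trick appears to carry over to pieces whose length is not known in advance, since the group captures whatever the inner pattern matches. A small sketch, where the pattern 'ab+a' and the sample string are made up for illustration:

```python
import re

# The group inside the lookahead captures text without consuming it,
# so the scan can move on by a single character after each hit.
overlap_re = re.compile(r'(?=(ab+a))')

text = 'abbabababba'
hits = [(m.start(), m.group(1)) for m in overlap_re.finditer(text)]

for start, piece in hits:
    # start is where the zero-width match sits; piece is the captured text.
    print("Found %s at %i" % (piece, start))
```

Note that this reports at most one piece per starting position, so a pattern that could match several different lengths from the same offset would still show only one of them.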

--Ben

Apr 11 '06 #5

Chris Lasher

Diez, John, Tim, and Ben, thank you all so much. I now "get it". It
makes logical sense now that the difficulty was actually in the
implementation of findall, which does non-overlapping matches. It also
makes sense, now, that one can get around this by using a lookahead
assertion. Thanks a bunch, guys; this really helped!

Chris

Apr 12 '06 #6
