
# Searching for Regular Expressions in a string WITH overlap

I apologize in advance for the newbie question. I'm trying to figure
out a way to find all of the occurrences of a regular expression in a
string including the overlapping ones.

For example, given the string 123456789

I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:

2345
4567

Here's what I'm trying so far:
<code>
#!/usr/bin/env python

import re, repr, sys

string = "123456789"

pattern = '(((2)|(4))[0-9]{3})'

r1 = re.compile(pattern)

stringList = r1.findall(string)

for string in stringList:
    print "string type is:", type(string)
    print "string is:", string
</code>

Which produces:
<code>
string type is: <type 'tuple'>
string is: ('2345', '2', '2', '')
</code>

I understand that the findall method only returns the non-overlapping
matches. I just haven't figured out a function that gives me the
matches including the overlap. Can anyone point me in the right
direction? I'd also really like to understand why it returns a tuple
and what the '2', '2' refers to.

-Ben
Nov 21 '08 #1
'findall' returns a list with one entry per match. When the pattern
contains more than one group, each entry is a tuple of the groups. A
group is anything surrounded by parens, and the groups are ordered by
the position of the opening paren. So the first item matches the parens
you have around the whole expression, the second matches the parens
around '(2)|(4)', the third matches '(2)', and the last matches '(4)',
which is empty because that alternative did not take part in the match.
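
To make the group numbering concrete, here is a small sketch in Python 3 syntax (the original post uses Python 2 print statements). The variable names are only illustrative; the patterns are the ones from this thread.

```python
import re

text = "123456789"

# Four capturing groups, numbered by the position of each opening paren:
#   1: (((2)|(4))[0-9]{3})   2: ((2)|(4))   3: (2)   4: (4)
capturing = re.compile(r"(((2)|(4))[0-9]{3})")
tuples = capturing.findall(text)
print(tuples)  # [('2345', '2', '2', '')]

# With only non-capturing groups, findall returns each whole match as a string.
non_capturing = re.compile(r"(?:2|4)[0-9]{3}")
strings = non_capturing.findall(text)
print(strings)  # ['2345']
```

Both calls still find only the non-overlapping match, so '4567' is missing either way.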

I don't know of a way to find all overlapping strings automatically. I
would just do something like this:
>>> import re
>>> text = "0123456789"
>>> p = re.compile(r"(?:2|4)[0-9]{3}")  # (?:...) groups the alternatives without capturing them.
>>> start = 0
>>> found = []
>>> while True:
...     m = p.search(text, start)
...     if m is None:
...         break
...     start = m.start() + 1
...     found.append(m.group(0))
...
>>> found
['2345', '4567']
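
A different technique from the loop above, offered only as a sketch: wrap the pattern in a zero-width lookahead with a capturing group inside it. Because the lookahead consumes no characters, `finditer` tries every start position in turn. The helper name `find_overlapping` is made up for this example, and the code uses Python 3 syntax.

```python
import re


def find_overlapping(pattern, text):
    """Return every match of `pattern` in `text`, including overlapping ones."""
    # The lookahead (?=...) matches an empty string wherever `pattern` could
    # start, and the outer capturing group (group 1) records what it would match.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.group(1) for m in lookahead.finditer(text)]


matches = find_overlapping(r"(?:2|4)[0-9]{3}", "0123456789")
print(matches)  # ['2345', '4567']
```

Like the loop, this reports at most one match per start position, so it should give the same result here.
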
Matt
Nov 21 '08 #2
