Searching for Regular Expressions in a string WITH overlap

Ben
I apologize in advance for the newbie question. I'm trying to figure
out a way to find all of the occurrences of a regular expression in a
string including the overlapping ones.

For example, given the string 123456789

I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:

2345
4567

Here's what I'm trying so far:
<code>
#!/usr/bin/env python

import re, repr, sys

string = "123456789"

pattern = '(((2)|(4))[0-9]{3})'

r1 = re.compile(pattern)

stringList = r1.findall(string)

for string in stringList:
    print "string type is:", type(string)
    print "string is:", string
</code>

Which produces:
<code>
string type is: <type 'tuple'>
string is: ('2345', '2', '2', '')
</code>

I understand that the findall method only returns the non-overlapping
matches. I just haven't figured out a function that gives me the
matches including the overlap. Can anyone point me in the right
direction? I'd also really like to understand why it returns a tuple
and what the '2', '2' refers to.

Thanks for your help!
-Ben
Nov 21 '08 #1
1 Reply


'findall' returns a list with one entry per match. When the pattern
contains more than one group, each entry is a tuple holding the text of
every group. A group is anything surrounded by parens, and groups are
numbered by the position of their opening paren. So the first item comes
from the parens you have around the whole expression, the second from the
parens around '(2)|(4)', the third from '(2)', and the last from '(4)',
which is empty because that alternative did not take part in the match.
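
To make the group numbering concrete, here is a small sketch (written for Python 3, so `print` is a function, unlike in the original post). It runs Ben's pattern next to a variant that uses non-capturing `(?:...)` groups; the variable names are only illustrative.

```python
import re

text = "123456789"

# Ben's original pattern: four capturing groups, numbered by opening paren.
capturing = re.compile(r"(((2)|(4))[0-9]{3})")
# The same pattern with a non-capturing (?:...) group, so no groups are reported.
non_capturing = re.compile(r"(?:2|4)[0-9]{3}")

# With several groups, findall returns one tuple per match; a group that
# did not take part in the match shows up as an empty string.
tuples = capturing.findall(text)
print(tuples)   # [('2345', '2', '2', '')]

# On a match object, the same unmatched group is reported as None instead.
groups = capturing.search(text).groups()
print(groups)   # ('2345', '2', '2', None)

# With no capturing groups, findall returns plain strings.
strings = non_capturing.findall(text)
print(strings)  # ['2345']
```

Either way only '2345' is found, because the scan resumes after the end of that match and '4567' starts inside it.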

I don't know of a way to find all overlapping strings automatically. I
would just do something like this:
>>> import re
>>> text = "0123456789"
>>> p = re.compile(r"(?:2|4)[0-9]{3}")  # (?:...) is a non-capturing group: it groups the alternatives without creating a numbered group.
>>> start = 0
>>> found = []
>>> while True:
...     m = p.search(text, start)
...     if m is None:
...         break
...     start = m.start() + 1  # resume one character past the start of this match
...     found.append(m.group(0))
...
>>> found
['2345', '4567']
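
Another common approach, not mentioned in the thread, is to wrap the pattern in a lookahead with a capturing group inside it. The sketch below assumes Python 3, and `find_overlapping` is a made-up helper name, not something from the `re` module.

```python
import re

def find_overlapping(pattern, text):
    """Return every match of `pattern` in `text`, including overlapping ones.

    Hypothetical helper for illustration only.
    """
    # A lookahead (?=...) consumes no characters, so the scan moves forward
    # one position at a time. The capturing group inside the lookahead
    # records the text the pattern would have matched at that position.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.group(1) for m in lookahead.finditer(text)]

result = find_overlapping(r"(?:2|4)[0-9]{3}", "0123456789")
print(result)  # ['2345', '4567']
```

This gives the same result as the loop above in one pass. The explicit loop is easier to adapt if you also need the match positions or want to skip ahead by more than one character.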
Matt
Nov 21 '08 #2
