Searching for Regular Expressions in a string WITH overlap

Ben
I apologize in advance for the newbie question. I'm trying to figure
out a way to find all of the occurrences of a regular expression in a
string including the overlapping ones.

For example, given the string 123456789

I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:

2345
4567

Here's what I'm trying so far:
<code>
#!/usr/bin/env python

import re, repr, sys

string = "123456789"

pattern = '(((2)|(4))[0-9]{3})'

r1 = re.compile(pattern)

stringList = r1.findall(string)

for string in stringList:
    print "string type is:", type(string)
    print "string is:", string
</code>

Which produces:
<code>
string type is: <type 'tuple'>
string is: ('2345', '2', '2', '')
</code>

I understand that the findall method only returns the non-overlapping
matches. I just haven't figured out a function that gives me the
matches including the overlap. Can anyone point me in the right
direction? I'd also really like to understand why it returns a tuple
and what the '2', '2' refers to.

Thanks for your help!
-Ben
Nov 21 '08 #1
On Nov 20, 4:31 pm, Ben <bmn...@gmail.com> wrote:
[snip]

'findall' returns a list with one entry per match. When the pattern
contains more than one group (a group is anything surrounded by
parens), each entry is a tuple of the groups, ordered by the position
of the opening paren. So the first item is what the parens around the
whole expression matched, the second is what the parens around
'(2)|(4)' matched, the third is what '(2)' matched, and the last is
what '(4)' matched, which is empty because that alternative did not
take part in the match.
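
To see the tuple behaviour in isolation, here is a minimal sketch written for Python 3 (the thread itself uses Python 2). The variable names with_groups and no_groups are only for illustration and are not from the original post.

```python
import re

text = "123456789"

# Four capturing groups, as in the question: findall returns one tuple
# per match, with the groups ordered by their opening paren.
with_groups = re.findall(r"(((2)|(4))[0-9]{3})", text)
print(with_groups)

# No capturing groups: (?:...) groups the alternatives without capturing,
# so findall returns the matched strings themselves.
no_groups = re.findall(r"(?:2|4)[0-9]{3}", text)
print(no_groups)
```

Either way only '2345' is found, because the scan resumes after the end of the previous match and so never tries the '4' that sits inside it.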

I don't know of a way to find all overlapping strings automatically. I
would just do something like this:
>>> import re
>>> text = "0123456789"
>>> p = re.compile(r"(?:2|4)[0-9]{3}")  # (?:...) groups the alternatives without capturing them
>>> start = 0
>>> found = []
>>> while True:
...     m = p.search(text, start)  # resume one character past the previous match start
...     if m is None:
...         break
...     start = m.start() + 1
...     found.append(m.group(0))
...
>>> found
['2345', '4567']
Matt
Nov 21 '08 #2
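
A different technique that is sometimes used for overlapping matches, not the one given in the reply above, is to wrap the pattern in a zero-width lookahead with a capturing group inside it. Each match is then empty, so the scan advances one character at a time, while group 1 holds the text that would have matched at that position. The sketch below is written for Python 3, and the helper name find_overlapping is hypothetical.

```python
import re


def find_overlapping(pattern, text):
    """Return the match starting at each position, overlaps included.

    Hypothetical helper: wraps `pattern` in (?=(...)) so every match is
    zero-width and the real text is captured in group 1.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.group(1) for m in lookahead.finditer(text)]


matches = find_overlapping(r"(?:2|4)[0-9]{3}", "123456789")
print(matches)
```

Like the search loop, this reports at most one match per starting position. If the pattern passed in has capturing groups of its own, they are numbered from 2 onward, so group 1 still refers to the whole match.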
