Searching for Regular Expressions in a string WITH overlap

Ben
#1: Nov 21 '08
I apologize in advance for the newbie question. I'm trying to figure
out a way to find all of the occurrences of a regular expression in a
string including the overlapping ones.

For example, given the string 123456789

I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:

2345
4567

Here's what I'm trying so far:
<code>
#!/usr/bin/env python

import re

string = "123456789"

# Every pair of parentheses here is a capturing group.
pattern = '(((2)|(4))[0-9]{3})'

r1 = re.compile(pattern)

stringList = r1.findall(string)

# Use a separate loop variable so the input string is not overwritten.
for item in stringList:
    print "string type is:", type(item)
    print "string is:", item
</code>

Which produces:
<code>
string type is: <type 'tuple'>
string is: ('2345', '2', '2', '')
</code>

I understand that the findall method only returns the non-overlapping
matches. I just haven't figured out a function that gives me the
matches including the overlap. Can anyone point me in the right
direction? I'd also really like to understand why it returns a tuple
and what the '2', '2' refer to.

Thanks for your help!
-Ben

Matimus
#2: Nov 21 '08

re: Searching for Regular Expressions in a string WITH overlap


On Nov 20, 4:31 pm, Ben <bmn...@gmail.com> wrote:
> I apologize in advance for the newbie question. I'm trying to figure
> out a way to find all of the occurrences of a regular expression in a
> string including the overlapping ones.
>
> For example, given the string 123456789
>
> I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:
>
> 2345
> 4567
>
> Here's what I'm trying so far:
> <code>
> #!/usr/bin/env python
>
> import re, repr, sys
>
> string = "123456789"
>
> pattern = '(((2)|(4))[0-9]{3})'
>
> r1 = re.compile(pattern)
>
> stringList = r1.findall(string)
>
> for string in stringList:
>     print "string type is:", type(string)
>     print "string is:", string
> </code>
>
> Which produces:
> <code>
> string type is: <type 'tuple'>
> string is: ('2345', '2', '2', '')
> </code>
>
> I understand that the findall method only returns the non-overlapping
> matches. I just haven't figured out a function that gives me the
> matches including the overlap. Can anyone point me in the right
> direction? I'd also really like to understand why it returns a tuple
> and what the '2', '2' refers to.
>
> Thanks for your help!
> -Ben

'findall' returns a list of matched groups when the pattern contains
groups; with more than one group, each list item is a tuple holding one
string per group. A group is anything surrounded by parens. The groups
are ordered by the position of the opening paren. So, the first result
is matching the parens you have around the whole expression, the second
one is matching the parens that are around '(2)|(4)', the third is
matching '(2)', and the last one is matching '(4)', which is empty
because that alternative did not take part in the match.
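
To make the group numbering concrete, here is a small sketch (written for Python 3, unlike the Python 2 code in the question) that compares the original pattern with a non-capturing version. The variable names `capturing` and `non_capturing` are only illustrative.

```python
import re

text = "123456789"

# Capturing groups are numbered by the position of their opening parenthesis.
capturing = re.compile(r"(((2)|(4))[0-9]{3})")
print(capturing.groups)         # how many capturing groups the pattern has
print(capturing.findall(text))  # one tuple per match, one entry per group

# With (?:...) nothing is captured, so findall returns the whole match
# as a plain string instead of a tuple.
non_capturing = re.compile(r"(?:2|4)[0-9]{3}")
print(non_capturing.groups)
print(non_capturing.findall(text))

# A match object exposes the same groups; here a group that did not take
# part in the match is reported as None rather than as an empty string.
m = capturing.search(text)
print(m.group(0), m.groups())
```

Note that findall fills a non-participating group with '' while m.groups() uses None by default, which is why the last tuple entry looks different in the two outputs.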

I don't know of a way to find all overlapping strings automatically. I
would just do something like this:
>>> import re
>>> text = "0123456789"
>>> # (?:...) groups the alternatives without capturing them.
>>> p = re.compile(r"(?:2|4)[0-9]{3}")
>>> start = 0
>>> found = []
>>> while True:
...     m = p.search(text, start)
...     if m is None:
...         break
...     start = m.start() + 1
...     found.append(m.group(0))
...
>>> found
['2345', '4567']
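
The loop above can be wrapped in a function, and there is also a commonly used alternative that is not part of the reply itself: put the pattern inside a zero-width lookahead so that the scan advances one character at a time. The sketch below (Python 3) shows both. The function names are made up for illustration, and the lookahead version assumes the pattern contains no numbered backreferences, since the extra outer group shifts group numbers by one.

```python
import re


def find_overlapping_search(pattern, text):
    """Restart the search one character past the start of each match."""
    compiled = re.compile(pattern)
    found = []
    start = 0
    # The bound keeps the loop finite even for patterns that can match
    # the empty string at the end of the text.
    while start <= len(text):
        m = compiled.search(text, start)
        if m is None:
            break
        found.append(m.group(0))
        start = m.start() + 1
    return found


def find_overlapping_lookahead(pattern, text):
    """Wrap the pattern in a lookahead with one outer capturing group."""
    # The lookahead consumes no characters, so finditer tries every
    # position; group 1 holds the text the inner pattern matched there.
    wrapped = re.compile("(?=(" + pattern + "))")
    return [m.group(1) for m in wrapped.finditer(text)]


pattern = r"(?:2|4)[0-9]{3}"
print(find_overlapping_search(pattern, "0123456789"))     # ['2345', '4567']
print(find_overlapping_lookahead(pattern, "0123456789"))  # ['2345', '4567']
```

Both versions report at most one match per starting position, namely the one the regex engine prefers there.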


Matt