reusing parts of a string in RE matches?

John Salerno

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Thanks.

May 10 '06 #1

Subscribe Post Reply

2345

Murali

John Salerno wrote:

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

Why not something like

import re
string = 'abababababababab'
pattern = re.compile(r"^aba")
ans = []
for x in xrange(len(string)):
m = pattern.match(string[x:])
if m: ans.append( (x+m.start(),x+m.end()))

# now ans is a list of pairs (p,q) where the substring string[p:q]
matches pattern

- Murali

May 10 '06 #2

Bo Yang

John Salerno å†™é“:

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
Below from the python library reference

*|(?=...)|*
Matches if ... matches next, but doesn't consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match |'Isaac '| only if it's followed by |'Asimov'|.
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

I try the code , but I give seven results ! Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Thanks.

May 10 '06 #3

BartlebyScrivener

Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

May 10 '06 #4

John Salerno

Bo Yang wrote:

This matches all the 'ab' followed by an 'a', but it doesn't include
the 'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

I try the code , but I give seven results !

Sorry, I meant that findall() only returns 4 results when searching for
'aba', when there are actually seven instances of 'aba'. This doesn't
involve the look-ahead RE.

May 10 '06 #5

John Salerno

BartlebyScrivener wrote:

Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

Heh heh, I'm sure you're right, but this is more just an exercise for me
in REs, so I'm curious how you might do it, unless the answer is that
it's just too complicated to be worth it (like Murali's example!) That
goes beyond just an RE pattern.

May 10 '06 #6

BartlebyScrivener

I have to at least try :)

s = "abababababababab"

for x in range(len(s)):
.... try:
.... s.index("aba", x, x + 3)
.... except ValueError:
.... pass

rd

May 10 '06 #7

John Salerno

BartlebyScrivener wrote:

I have to at least try :)

s = "abababababababab"

for x in range(len(s)):
... try:
... s.index("aba", x, x + 3)
... except ValueError:
... pass

rd

yeah, looks like index() or find() can be used to do it instead of RE,
but still, i'd like to know if there's a way you can write an RE
expression to do it (and just an RE expression, without all the other
for loops and extra nonsense...otherwise i might as well just use string
methods)

May 10 '06 #8

mpeters42

>From the Python 2.4 docs:

findall( pattern, string[, flags])
Return a list of all ***non-overlapping*** matches of pattern in
string....

By design, the regex functions return non-overlapping patterns.

Without doing some kind of looping, I think you are out of luck.

If you pattern is fixed, then a solution might be:

string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)] ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

If the pattern is not fixed (i.e. 'a.a') then this method can still get
a count of overlapping matches, but cannot get the individual match
strings themselves.

A simple loop should do in this case, though:
for i in range(len(string)):

.... r= re.match(pat,string[i:])
.... if r: print r.group()
....
aba
aba
aba
aba
aba
aba
aba

May 10 '06 #9

BartlebyScrivener

>> otherwise i might as well just use string

methods

I think you're supposed to use string methods if you can, to avoid the
old adage about having two problems instead of one when using regex.

rd

May 10 '06 #10

John Salerno

mp*******@gmail.com wrote:

string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

May 10 '06 #11

John Salerno

John Salerno wrote:

mp*******@gmail.com wrote:
> string = 'abababababababab'
> pat = 'aba'
> [pat for s in re.compile('(?='+pat+')').findall(string)]

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

Nevermind, I get it! The point is that you *aren'* matching anything
(except the empty string), and this only to see how many times it
occurs, then you are replacing those occurrences with the pattern
string. So this is basically like a counter?

May 10 '06 #12

Kent Johnson

John Salerno wrote:

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches?

You can specify a start location to re.search(), and get the location of
a match from a match object. This allows you to loop, searching the
string following the last match:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')

ans = []
start = 0
while True:
m = pattern.search(string, start)
if not m: break
ans.append( (m.start(), m.end()) )
start = m.start() + 1

print ans # => [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]

Kent

May 10 '06 #13

mpeters42

Exactly,

Now this will work as long as there are no wildcards in the pattern.
Thus, only with fixed strings. But if you have a fixed string, there
is really no need to use regex, as it will complicate you life for no
real reason (as opposed to simple string methods).

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

To actually do that you will need to iterate through the string and
apply the pattern match (which matches only the beginning of a string)
to a indexed subset of the original (see example in the last post)

May 10 '06 #14

Ben Cartwright

John Salerno wrote:

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:

import re
re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

--Ben

May 11 '06 #15

Murali

> Yes, and no extra for loops are needed! You can define groups inside

the lookahead assertion:
>>> import re
>>> re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

- Murali

May 11 '06 #16

Ben Cartwright

Murali wrote:

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
>>> import re
>>> re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

Careful. That won't work as expected for *all* regexps. Example:

import re
re.findall(r'(?=(a.*a))', 'abaca') ['abaca', 'aca']

Note that this does *not* find 'aba'. You might think that making it
non-greedy might help, but:
re.findall(r'(?=(a.*?a))', 'abaca')

['aba', 'aca']

Nope, now it's not finding 'abaca'.

This is by design, though. From
http://www.regular-expressions.info/lookaround.html (a good read, by
the way):

"""As soon as the lookaround condition is satisfied, the regex engine
forgets about everything inside the lookaround. It will not backtrack
inside the lookaround to try different permutations."""

Moral of the story: keep lookahead assertions simple whenever
possible. :-)

--Ben

May 11 '06 #17

BartlebyScrivener

Thanks, Ben. Quite an education!

rick

May 11 '06 #18

Mirco Wahab

Hi mpeters42 & John

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

Lets take this as a starting point for another example
that comes to mind. You have a string of characters
interspersed with numbers: tx = 'a1a2a3A4a35a6b7b8c9c'

Now you try to find all _numbers_, which have
symmetrical characters (like a<-2->a) which
are not in 3/3/3... synced groups.

This can easy be done in P(ytho|nerl) etc. by
positive lookahead (even the same pattern does:)
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)
Pe:
$_ = 'a1a2a3A4a35a6b7b8c9c';
print /(\w)(?=(.)\1)/g;

(should find 1,2,7,9 only, python regex
written to var in order to prevent
clunky lines ;-)

BTW, Py Regex Engine seems to
be very close to the perl one:
Naive (!) matching of a pattern
with 14 o's (intersperded by
anything) against a string of
16 o's takes about exaclty the same
time here in Py(2.4.3) and Pe (5.8.7):

tl = 'oooooooooooooooo'
rg = r'o*o*o*o*o*o*o*o*o*o*o*o*o*o*[\W]'
print re.search(rg, tl)

Py: 101 sec
Pe: 109 sec

(which would find no match because there's
no \W-like character at the end of the
string here)

Regards

Mirco

May 11 '06 #19

John Salerno

Mirco Wahab wrote:

Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)

The only problem seems to be (and I ran into this with my original
example too) that what gets returned by this code isn't exactly what you
are looking for, i.e. the numbers '1', '2', etc. You get a list of
tuples, and the second item in this tuple contains the number, but also
the following \w character.

So there still seems to be some work that must be done when dealing with
overlapping patterns/look-ahead/behind.

Oh wait, a thought just hit me. Instead of doing it as you did:

rg = r'(\w)(?=(.\1))'

Could you do:

rg = r'(\w)(?=(.)\1)'

That would at least isolate the number, although you'd still have to get
it out of the list/tuple.

May 11 '06 #20

John Salerno

Ben Cartwright wrote:

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
>>> import re
>>> re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, that was like magic! :)

May 11 '06 #21

Mirco Wahab

Hi John

rg = r'(\w)(?=(.)\1)'

That would at least isolate the number, although you'd still have to get
it out of the list/tuple.

I have no idea how to do this
in Python in a terse way - but
I'll try ;-)

In Perl, its easy. Here, the
"match construct" (\w)(?=(.)\1)
returns all captures in a
list (a 1 a 2 a 4 b 7 c 9)
because we capture 2 fields per
comparision:
The first is (\w), needed for
backreference, the second is
the dot (.), which finds the
number in the center (or any-
thing else).

So, in perl you 'filter' by the
'grep' function on each list
element: grep{ /\d/ } - this
means, only numbers (\d) will
pass through:

$_ = 'a1a2a3Aa4a35a6b7b8c9c';
print grep{/\d/} /(\w)(?=(.)\1)/g;
#prints => 1 2 4 7 9

I'll try to fiddle somthing out
that works in Python too ...

Regards

M.

May 11 '06 #22

John Salerno

Mirco Wahab wrote:

I have no idea how to do this
in Python in a terse way - but
I'll try ;-)

In Perl, its easy. Here, the
"match construct" (\w)(?=(.)\1)
returns all captures in a
list (a 1 a 2 a 4 b 7 c 9)

Ah, I see the difference. In Python you get a list of tuples, so there
seems to be a little extra work to do to get the number out.

May 11 '06 #23

Mirco Wahab

Hi John

Ah, I see the difference. In Python you get a list of tuples, so there
seems to be a little extra work to do to get the number out.

Dohh, after two cups of coffee
ans several bars of chocolate
I eventually mad(e) it ;-)

In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':

char,num = list[:][:]

in a loop and using the apropriate element then:

import re

t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)

for a,b in (l[:][:]) : print b

In the moment, I find this syntax
awkward and arbitary, but my mind
should change if I'm adopted more
to this in the end ;-)

Regards,

M.

May 11 '06 #24

Mirco Wahab

Hi John

Ah, I see the difference. In Python you get a list of
tuples, so there seems to be a little extra work to do
to get the number out.

Dohh, after two cups of coffee
ans several bars of chocolate
I eventually mad(e) it ;-)

In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':

char,num = list[:][:]

in a loop and using the apropriate element then:

import re

t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)

for a,b in l : print b

(l sould implicitly be decoded
sequentially as l[:][->a : ->b]
in the loop context.)

In the moment, I find this syntax
somehow hard to remember, but my
mind should change if I'm adopted
more to this in the end ;-)

Regards,

M.

May 11 '06 #25

Fredrik Lundh

Mirco Wahab wrote:

In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':

char,num = list[:][:]

in a loop and using the apropriate element then:

import re

t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)

for a,b in (l[:][:]) : print b

In the moment, I find this syntax
awkward and arbitary, but my mind
should change if I'm adopted more
to this in the end ;-)

in contemporary Python, this is best done by a list comprehension:

l = [m[1] for m in re.findall(r, t)]

or, depending on what you want to do with the result, a generator
expression:

g = (m[1] for m in re.findall(r, t))

or

process(m[1] for m in re.findall(r, t))

if you want to avoid creating the tuples, you can use finditer instead:

l = [m.group(2) for m in re.finditer(r, t)]
g = (m.group(2) for m in re.finditer(r, t))

finditer is also a good tool to use if you need to do more things with
each match:

for m in re.finditer(r, t):
s = m.group(2)
... process s in some way ...

the code body will be executed every time the RE engine finds a match,
which can be useful if you're working on large target strings, and only
want to process the first few matches.

for m in re.finditer(r, t):
s = m.group(2)
if s == something:
break
... process s in some way ...

</F>

May 12 '06 #26

Mirco Wahab

Hi Fredrik

you brought up some terse and
somehow expressive lines with
their own beauty ...

[this] is best done by a list comprehension:
l = [m[1] for m in re.findall(r, t)]

or, [...] a generator expression:
g = (m[1] for m in re.findall(r, t))

or
process(m[1] for m in re.findall(r, t))

... avoid creating the tuples, ... finditer instead:
l = [m.group(2) for m in re.finditer(r, t)]
g = (m.group(2) for m in re.finditer(r, t))

finditer is also a good tool to use
for m in re.finditer(r, t):
s = m.group(2)
... process s in some way ...

.... which made me wish to internalize such wisdom too ;-)

This looks almost beautiful, it made me stand up and
go to some large book stores in order to grab a good
book on python.

Sadly, there were none (except one small 'dictionary',
ISBN: 3826615123). I live in a fairly large city
in Germany w/three large bookstores in the center,
where one can get loads of PHP and Java books, lots
of C/C++ and the like - even some Ruby books (some
"Rails" too) on display (WTF).

Not that I wouldn't order books (I do that all the
time for 'original versions') but it makes one
sad-faced to see the small impact of the Python
language here today on bookstore-tournarounds ...

Thanks & regards

Mirco

May 13 '06 #27

by: Eric Linders | last post by:

Hi, I'm trying to figure out the most efficient method for taking the first character in a string (which will be a number), and use it as a variable to check to see if the other numbers in the...

PHP

Regexp: unexspected splitting of string in several groups

by: Piet | last post by:

Hello, I have a very strange problem with regular expressions. The problem consists of analyzing the properties of columns of a MySQL database. When I request the column type, I get back a string...

Python

Help on string

by: SnakeS | last post by:

Hi, I want modify a string as "112233445566" in "11:22:33:44:55:66" which is the best method? Using a RegExp? and if yes how? Thanks in advance

C# / C Sharp

string routines and code libraries

by: zoro | last post by:

Hi, I am new to C#, coming from Delphi. In Delphi, I am using a 3rd party string handling library that includes some very useful string functions, in particular I'm interested in BEFORE (return...

C# / C Sharp

Removing a simple pattern from a string!

by: mehdi_mousavi | last post by:

Hi folks, Consider the following JavaScript function: function removeParam(str, name) { var rgx = new RegExp('(' + name + '=\\w*)|(' + name + '=\\w*;)'); rgx.global = true; rgx.ignoreCase =...

Javascript

Reusing connection strings in config files?

by: Jen | last post by:

..NET 2.0 introduced the <connectionStrings> section in .config files, but how do I reuse the connection strings defined under <connectionStrings> in other parts of the config files? ...

C# / C Sharp

how to split a string using ,fixed character length, variable text delimmiter

by: garyusenet | last post by:

I'm working on a data file and can't find any common delimmiters in the file to indicate the end of one row of data and the start of the next. Rows are not on individual lines but run accross...

C# / C Sharp

$parts = $struct->parts;

by: bill | last post by:

Turning on error_reporting(E_ALL); was quite an eye opener as to how much "fixing" PHP will do for the sloppy coder. I fixed all of the errors except: Notice: Undefined property: parts in...

PHP

Screen scraping, then caching.... not working on web parts page?

by: itfetish | last post by:

We have a web parts intranet at our office here I have been working on. I have been trying to create a way of getting our OCE TDS 600 plotter queue to be a web part on the page, seeing as their...

ASP.NET

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

reusing parts of a string in RE matches?

Similar topics