Regular expression help

I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
p = 'begin(.*)end'
m = re.search(p,s,re.DOTALL)


and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

TIA

David Lees

Jul 18 '05 #1
David Lees wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
>>> p = 'begin(.*)end'
>>> m = re.search(p,s,re.DOTALL)


and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.


people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"

import re

pos = 0

START = re.compile("begin")
END = re.compile("end")

while 1:
    m = START.search(text, pos)
    if not m:
        break
    start = m.end()
    m = END.search(text, start)
    if not m:
        break
    end = m.start()
    process(text[start:end])
    pos = m.end() # move forward

at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
    start = text.find("begin", pos)
    if start < 0:
        break
    start += 5
    end = text.find("end", start)
    if end < 0:
        break
    process(text[start:end])
    pos = end # move forward
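
For readers on current Python, the find-based loop above can be wrapped in a generator so the caller decides what to do with each chunk. This is only a sketch in Python 3 syntax: the name iter_between and its begin/end parameters are made up for illustration and are not part of the original post.

```python
def iter_between(text, begin="begin", end="end"):
    """Yield each substring found between a begin marker and the next end marker."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return  # no further begin marker
        start += len(begin)  # skip past the marker itself
        stop = text.find(end, start)
        if stop < 0:
            return  # unterminated block: ignore it, as the loop above does
        yield text[start:stop]
        pos = stop + len(end)  # resume after the end marker


sample = "begin s1 end\nsdfsdf\nbegin s2 end\n"
chunks = list(iter_between(sample))
print(chunks)
```

Using len(begin) instead of a hard-coded 5 keeps the offset correct if the markers change.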

</F>

<!-- (the eff-bot guide to) the python standard library (redux):
http://effbot.org/zone/librarybook-index.htm
-->


Jul 18 '05 #2
On Thu, 17 Jul 2003 04:27:23 GMT, David Lees <ab***************@verizon.net> wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
p = 'begin(.*)end'
m = re.search(p,s,re.DOTALL)
and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.

You were close. For a non-greedy match, add a question mark after the greedy quantifier:

>>> import re
>>> s = """
... begin first end
... begin
... second
... end
... begin problem begin nested end end
... begin last end
... """
>>> p = 'begin(.*?)end'
>>> rx = re.compile(p, re.DOTALL)
>>> rx.findall(s)
[' first ', '\nsecond\n', ' problem begin nested ', ' last ']

Notice what happened with the nested begin-ends. If you have nesting, you
will need more than a simple regex approach.
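
One possible way to cope with nesting is to scan for both markers and keep a depth counter, collecting only the outermost blocks. The sketch below is in Python 3 syntax and rests on assumptions that are not in the thread: the function name outermost_blocks is hypothetical, a stray "end" with no open block is silently ignored, and the markers are assumed never to occur inside other words.

```python
import re

TOKEN = re.compile(r"begin|end")


def outermost_blocks(text):
    """Return the text of each outermost begin/end block, nested markers included."""
    blocks = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "begin":
            if depth == 0:
                start = m.end()  # body starts after the outer "begin"
            depth += 1
        elif depth > 0:  # ignore a stray "end" with no open block
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.start()])
    return blocks


s = "begin first end\nbegin problem begin nested end end\nbegin last end\n"
print(outermost_blocks(s))
```

With this input the nested block comes back whole, as ' problem begin nested end ', instead of being cut at the first "end".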

Regards,
Bengt Richter
Jul 18 '05 #3
Fredrik,

Not sure about the original poster, but I can use that. Thanks!

--Alan

Jul 18 '05 #4
On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fr*****@pythonware.com> wrote:
David Lees wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.

I tried:
>>> p = 'begin(.*)end'
>>> m = re.search(p,s,re.DOTALL)
and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.


people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of
backtracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

Would you say so for this case? Or how much like this case does it have to be?
For the above case, wouldn't the regex compile to a state machine
that just has a few states: recognize e out of .*, then revert to .*
if the next character is not n, and if it is, look for d similarly, and if not,
revert to .*, and so on until it finishes? For a short terminating match, it would seem
relatively cheap.

at this point, it's also obvious that you don't really have to use
regular expressions:

pos = 0

while 1:
    start = text.find("begin", pos)
    if start < 0:
        break
    start += 5
    end = text.find("end", start)
    if end < 0:
        break
    process(text[start:end])
    pos = end # move forward

</F>


Or breaking your loop with an exception instead of tests:

>>> text = """begin s1 end
... sdfsdf
... begin s2 end
... """
>>> def process(s): print 'processing(%r)'%s
...
>>> try:
...     end = 0 # end of previous search
...     while 1:
...         start = text.index("begin", end) + 5
...         end = text.index("end", start)
...         process(text[start:end])
... except ValueError:
...     pass
...
processing(' s1 ')
processing(' s2 ')

Or if you're guaranteed that every begin has an end, you could also write

>>> for begxxx in text.split('begin')[1:]:
...     process(begxxx.split('end')[0])
...
processing(' s1 ')
processing(' s2 ')
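
To see why that guarantee matters, here is a small sketch in Python 3 syntax that runs the split approach on a complete input and on a truncated one. The helper name split_blocks and the truncated sample are invented for illustration.

```python
def split_blocks(text, begin="begin", end="end"):
    """Split-based extraction; only safe when every begin has a matching end."""
    return [part.split(end)[0] for part in text.split(begin)[1:]]


complete = "begin s1 end\nsdfsdf\nbegin s2 end\n"
truncated = "begin s1 end\nbegin s2"  # last block has no end marker

print(split_blocks(complete))   # every begin is closed
print(split_blocks(truncated))  # the unterminated tail is returned too
```

Unlike the find and index loops, which stop at a missing "end", the split version returns whatever follows the last "begin" as if it were a finished block.
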
Regards,
Bengt Richter
Jul 18 '05 #5
Andrew Bennetts wrote:
On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a


^^^^^^^^

How about re.findall?

E.g.:
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
[' foo ', ' bar ']

-Andrew.


Actually this fails with the multi-line type of file I was asking about.
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']

Jul 18 '05 #6
On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <ab***************@verizon.net> wrote:
Andrew Bennetts wrote:
On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a


^^^^^^^^

How about re.findall?

E.g.:
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')

[' foo ', ' bar ']

-Andrew.


Actually this fails with the multi-line type of file I was asking about.
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']

It works if you include the DOTALL flag (?s) at the beginning, which makes
. also match \n. (BTW, (?si) would additionally make it case-insensitive.)

>>> import re
>>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' foo\nmumble ', ' bar ']
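
The inline (?s) prefix and the re.DOTALL argument should behave the same way, and (?si) corresponds to combining re.DOTALL with re.IGNORECASE. A short sketch in Python 3 syntax; the sample string with a lowercase second pair is made up to show the case-insensitive variant.

```python
import re

text = "BEGIN foo\nmumble END begin bar end"

# Inline flag at the start of the pattern.
inline = re.findall(r"(?s)BEGIN(.*?)END", text)
# The same flag passed as an argument.
flagged = re.findall(r"BEGIN(.*?)END", text, re.DOTALL)
# Equivalent of (?si): dot matches newline and case is ignored.
both = re.findall(r"BEGIN(.*?)END", text, re.DOTALL | re.IGNORECASE)

print(inline)
print(flagged)
print(both)
```

Only the case-insensitive call picks up the lowercase begin/end pair.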

Regards,
Bengt Richter
Jul 18 '05 #7
I just tried to benchmark both of Fredrik's suggestions along with Bengt's,
using the same input file. The results (in seconds, looping 200 times over
the 400k file) are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149

Interesting how much faster the non-regex approach is.

Thanks again.

David Lees

The code (which I have not carefully checked) is:

import re, time

def timeBengt(s,N):
    p = 'begin msc(.*?)end msc'
    rx = re.compile(p,re.DOTALL)
    t0 = time.clock()
    for i in xrange(N):
        x = rx.findall(s)
    t1 = time.clock()
    return t1-t0

def timeFredrik1(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0

        START = re.compile("begin")
        END = re.compile("end")

        while 1:
            m = START.search(text, pos)
            if not m:
                break
            start = m.end()
            m = END.search(text, start)
            if not m:
                break
            end = m.start()
            pass
            pos = m.end() # move forward
    t1 = time.clock()
    return t1-t0

def timeFredrik(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0
        while 1:
            start = text.find("begin msc", pos)
            if start < 0:
                break
            start += 9
            end = text.find("end msc", start)
            if end < 0:
                break
            pass
            pos = end # move forward

    t1 = time.clock()
    return t1-t0

fh = open('scu.cfg','rb')
s = fh.read()
fh.close()

N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N)
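
The script above is Python 2: xrange and the print statement are gone in Python 3, and time.clock was removed in Python 3.8. A rough Python 3 equivalent using timeit might look like the sketch below. The input is a synthetic stand-in, since scu.cfg is not available, and the names with_regex and with_find are invented here. Unlike the original, both functions collect their results, so the two approaches can be checked against each other before timing.

```python
import re
import timeit

# Synthetic stand-in for the 400k config file (an assumption, not the real data).
text = "begin msc payload line\nmore payload\nend msc\nfiller\n" * 2000

RX = re.compile(r"begin msc(.*?)end msc", re.DOTALL)


def with_regex(s):
    """Non-greedy regex extraction, as in Bengt's suggestion."""
    return RX.findall(s)


def with_find(s, begin="begin msc", end="end msc"):
    """Plain str.find extraction, as in Fredrik's second suggestion."""
    out, pos = [], 0
    while True:
        start = s.find(begin, pos)
        if start < 0:
            break
        start += len(begin)
        stop = s.find(end, start)
        if stop < 0:
            break
        out.append(s[start:stop])
        pos = stop + len(end)
    return out


# Both approaches should extract the same blocks before being compared.
assert with_regex(text) == with_find(text)

for func in (with_regex, with_find):
    seconds = timeit.timeit(lambda: func(text), number=20)
    print(f"{func.__name__}: {seconds:.4f} s")
```

Timings will depend on the interpreter version and the shape of the input, so the ratios reported above should not be expected to carry over unchanged.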

Jul 18 '05 #8
