I forget how to find multiple instances of stuff between tags using
regular expressions. Specifically I want to find all the text between a
series of begin/end pairs in a multiline file.
I tried: p = 'begin(.*)end' m = re.search(p,s,re.DOTALL)
and got everything between the first begin and last end. I guess
because of a greedy match. What I want to do is a list where each
element is the text between another begin/end pair.
TIA
David Lees 7 2412
David Lees wrote: I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a series of begin/end pairs in a multiline file.
I tried: >>> p = 'begin(.*)end' >>> m = re.search(p,s,re.DOTALL)
and got everything between the first begin and last end. I guess because of a greedy match. What I want to do is a list where each element is the text between another begin/end pair.
people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).
a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"
import re
pos = 0
START = re.compile("begin")
END = re.compile("end")
while 1:
m = START.search(text, pos)
if not m:
break
start = m.end()
m = END.search(text, start)
if not m:
break
end = m.start()
process(text[start:end])
pos = m.end() # move forward
at this point, it's also obvious that you don't really have to use
regular expressions:
pos = 0
while 1:
start = text.find("begin", pos)
if start < 0:
break
start += 5
end = text.find("end", start)
if end < 0:
break
process(text[start:end])
pos = end # move forward
</F>
<!-- (the eff-bot guide to) the python standard library (redux): http://effbot.org/zone/librarybook-index.htm
-->
On Thu, 17 Jul 2003 04:27:23 GMT, David Lees <ab***************@verizon.net> wrote: I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a series of begin/end pairs in a multiline file.
I tried: p = 'begin(.*)end' m = re.search(p,s,re.DOTALL) and got everything between the first begin and last end. I guess because of a greedy match. What I want to do is a list where each element is the text between another begin/end pair.
You were close. For non-greedy add the question mark after the greedy expression: import re s = """
... begin first end
... begin
... second
... end
... begin problem begin nested end end
... begin last end
... """ p = 'begin(.*?)end' rx =re.compile(p,re.DOTALL) rx.findall(s)
[' first ', '\nsecond\n', ' problem begin nested ', ' last ']
Notice what happened with the nested begin-ends. If you have nesting, you
will need more than a simple regex approach.
Regards,
Bengt Richter
Fredrik,
Not sure about the original poster, but I can use that. Thanks!
--Alan
"Fredrik Lundh" <fr*****@pythonware.com> wrote in message news:<ma**********************************@python. org>... David Lees wrote:
I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a series of begin/end pairs in a multiline file.
I tried: >>> p = 'begin(.*)end' >>> m = re.search(p,s,re.DOTALL)
and got everything between the first begin and last end. I guess because of a greedy match. What I want to do is a list where each element is the text between another begin/end pair.
people will tell you to use non-greedy matches, but that's often a bad idea in cases like this: the RE engine has to store lots of back- tracking information, and your program will consume a lot more memory than it has to (and may run out of stack and/or memory).
a better approach is to do two searches: first search for a "begin", and once you've found that, look for an "end"
import re
pos = 0
START = re.compile("begin") END = re.compile("end")
while 1: m = START.search(text, pos) if not m: break start = m.end() m = END.search(text, start) if not m: break end = m.start() process(text[start:end]) pos = m.end() # move forward
at this point, it's also obvious that you don't really have to use regular expressions:
pos = 0
while 1: start = text.find("begin", pos) if start < 0: break start += 5 end = text.find("end", start) if end < 0: break process(text[start:end]) pos = end # move forward
</F>
<!-- (the eff-bot guide to) the python standard library (redux): http://effbot.org/zone/librarybook-index.htm -->
On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fr*****@pythonware.com> wrote: David Lees wrote:
I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a series of begin/end pairs in a multiline file.
I tried: >>> p = 'begin(.*)end' >>> m = re.search(p,s,re.DOTALL) and got everything between the first begin and last end. I guess because of a greedy match. What I want to do is a list where each element is the text between another begin/end pair.
people will tell you to use non-greedy matches, but that's often a bad idea in cases like this: the RE engine has to store lots of back-
would you say so for this case? Or how like this case?
tracking information, and your program will consume a lot more memory than it has to (and may run out of stack and/or memory).
For the above case, wouldn't the regex compile to a state machine
that just has a few states to recognize e out of .* and then revert to .*
if the next is not n, and if it is, then look for d similarly, and if not,
revert to .*, etc or finish? For a short terminating match, it would seem
relatively cheap?
at this point, it's also obvious that you don't really have to use regular expressions:
pos = 0
while 1: start = text.find("begin", pos) if start < 0: break start += 5 end = text.find("end", start) if end < 0: break process(text[start:end]) pos = end # move forward
</F>
Or breaking your loop with an exception instead of tests: text = """begin s1 end
... sdfsdf
... begin s2 end
... """
def process(s): print 'processing(%r)'%s
... try:
... end = 0 # end of previous search
... while 1:
... start = text.index("begin", end) + 5
... end = text.index("end", start)
... process(text[start:end])
... except ValueError:
... pass
...
processing(' s1 ')
processing(' s2 ')
Or if you're guaranteed that every begin has an end, you could also write
for begxxx in text.split('begin')[1:]:
... process(begxxx.split('end')[0])
...
processing(' s1 ')
processing(' s2 ')
Regards,
Bengt Richter
Andrew Bennetts wrote: On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a
^^^^^^^^
How about re.findall?
E.g.:
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END') [' foo ', ' bar ']
-Andrew.
Actually this fails with the multi-line type of file I was asking about. re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']
On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <ab***************@verizon.net> wrote: Andrew Bennetts wrote: On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a
^^^^^^^^
How about re.findall?
E.g.:
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END') [' foo ', ' bar ']
-Andrew.
Actually this fails with the multi-line type of file I was asking about.
re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')[' bar ']
It works if you include the DOTALL flag (?s) at the beginning, which makes
.. also match \n: (BTW, (?si) would make it case-insensitive). import re re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' foo\nmumble ', ' bar ']
Regards,
Bengt Richter
Bengt Richter wrote: On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <ab***************@verizon.net> wrote:
Andrew Bennetts wrote:
On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
I forget how to find multiple instances of stuff between tags using regular expressions. Specifically I want to find all the text between a
^^^^^^^^
How about re.findall?
E.g.:
>>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END') [' foo ', ' bar ']
-Andrew.
Actually this fails with the multi-line type of file I was asking about.
>re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' bar ']
It works if you include the DOTALL flag (?s) at the beginning, which makes . also match \n: (BTW, (?si) would make it case-insensitive).
>>> import re >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
[' foo\nmumble ', ' bar ']
Regards, Bengt Richter
I just tried to benchmark both Fredrik's suggestions along with Bengt's
using the same input file. The results (looping 200 times over the 400k
file) are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149
Interesting how much faster the non-regex approach is.
Thanks again.
David Lees
The code (which I have not carefully checked) is:
import re, time
def timeBengt(s,N):
p = 'begin msc(.*?)end msc'
rx =re.compile(p,re.DOTALL)
t0 = time.clock()
for i in xrange(N):
x = x = rx.findall(s)
t1 = time.clock()
return t1-t0
def timeFredrik1(text,N):
t0 = time.clock()
for i in xrange(N):
pos = 0
START = re.compile("begin")
END = re.compile("end")
while 1:
m = START.search(text, pos)
if not m:
break
start = m.end()
m = END.search(text, start)
if not m:
break
end = m.start()
pass
pos = m.end() # move forward
t1 = time.clock()
return t1-t0
def timeFredrik(text,N):
t0 = time.clock()
for i in xrange(N):
pos = 0
while 1:
start = text.find("begin msc", pos)
if start < 0:
break
start += 9
end = text.find("end msc", start)
if end < 0:
break
pass
pos = end # move forward
t1 = time.clock()
return t1-t0
fh = open('scu.cfg','rb')
s = fh.read()
fh.close()
N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N) This discussion thread is closed Replies have been disabled for this discussion. Similar topics
5 posts
views
Thread by Bradley Plett |
last post: by
|
4 posts
views
Thread by Neri |
last post: by
|
10 posts
views
Thread by Lee Kuhn |
last post: by
|
3 posts
views
Thread by James D. Marshall |
last post: by
|
7 posts
views
Thread by Billa |
last post: by
|
9 posts
views
Thread by Pete Davis |
last post: by
|
3 posts
views
Thread by Zach |
last post: by
|
25 posts
views
Thread by Mike |
last post: by
|
3 posts
views
Thread by Mr.Steskal |
last post: by
|
18 posts
views
Thread by Lit |
last post: by
| | | | | | | | | | |