Flyzone:
I have a problem with the split function and regexp.
I have a file that I want to split using the date as the token.
My first try:
data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""
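import re
# Match a trailing "hh:mm:ss yyyy" timestamp, as in "Mon Apr 9 22:30:18 2007"
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
section = []
for line in data.splitlines():
    if date_find.search(line):
        # A date line closes the current section and opens a new one
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:
            section.append(line)
# Print the last section
print("\n" + "-" * 10 + "\n" + "\n".join(section))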
itertools.groupby() is suited to splitting sequences like:
1111100011111100011100101011111
as:
11111 000 111111 000 111 00 1 0 1 0 11111
While here we have a sequence like:
100001000101100001000000010000
that has to be split as:
10000 1000 10 1 10000 10000000 10000
A standard itertool could be added for this fairly common situation too.
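To make the difference concrete, here is a small sketch (Python 3 syntax). The names runs and split_at_ones are made up for this illustration, they are not standard itertools functions. The first helper uses groupby to cut a string into runs of equal characters; the second starts a new chunk at every "1", which is the behaviour wanted for the date lines:

```python
from itertools import groupby

def runs(bits):
    """Split a string into runs of equal characters (what groupby does)."""
    return ["".join(g) for _, g in groupby(bits)]

def split_at_ones(bits):
    """Hypothetical helper: start a new chunk at every '1'."""
    chunks = []
    for ch in bits:
        if ch == "1" or not chunks:
            chunks.append(ch)      # a '1' (or the very first char) opens a chunk
        else:
            chunks[-1] += ch       # anything else extends the current chunk
    return chunks

print(" ".join(runs("1111100011111100011100101011111")))
# 11111 000 111111 000 111 00 1 0 1 0 11111
print(" ".join(split_at_ones("100001000101100001000000010000")))
# 10000 1000 10 1 10000 10000000 10000
```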
Along those lines I have devised this different (and maybe over-
engineered) version:
from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True

    def __call__(self, el):
        # Flip the key whenever the predicate matches, so that groupby
        # starts a new group at each matching element
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)
sections = ("\n".join(g) for h, g in groupby(data.splitlines(),
                                             key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)
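A quick way to see why the toggling key works with groupby is to look at the keys it produces. This is only a sketch in Python 3 syntax: Toggle is a cut-down stand-in for the Splitter idea, and "D" stands in for a date line (both are assumptions made for the illustration):

```python
from itertools import groupby

class Toggle:
    """Stateful key: flips between True and False at each match."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.state = True

    def __call__(self, el):
        if self.predicate(el):
            self.state = not self.state
        return self.state

def is_date(s):
    return s == "D"   # placeholder for date_find.search

items = ["x", "D", "a", "b", "D", "c", "D", "D", "d"]

keys = [Toggle(is_date)(el) for el in items[:1]]  # a fresh Toggle starts at True
keys = list(map(Toggle(is_date), items))
print(keys)
# [True, False, False, False, True, True, False, True, True]

groups = [list(g) for _, g in groupby(items, key=Toggle(is_date))]
print(groups)
# [['x'], ['D', 'a', 'b'], ['D', 'c'], ['D'], ['D', 'd']]
```

Since the key object keeps state between calls, a fresh instance is needed for every pass over the data; reusing one may start the next pass with the wrong state.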
The Splitter class + the groupby can become a single simpler
generator, like in this version:
def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            # A matching element starts a new group
            if group:
                yield group
            group = [part]
        else:
            group.append(part)
    yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))
Maybe that grouper can be modified to manage groups lazily, like
groupby does, instead of building a real list for each group.
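One possible way to do that, as an untested-in-anger sketch in Python 3 syntax (lazy_grouper and its helper names are invented here, nothing from the standard library): each group is a small generator reading from the shared iterator, and whatever the caller leaves unread is drained before the next group starts, much like groupby invalidates a group once you move on.

```python
def lazy_grouper(seq, key=bool):
    """Yield groups lazily; a new group starts at each item where key(item) is true."""
    it = iter(seq)
    end = object()            # sentinel marking exhaustion
    pending = next(it, end)   # one-item lookahead

    def one_group():
        nonlocal pending
        # The first item always belongs to the group, separator or not
        yield pending
        pending = next(it, end)
        while pending is not end and not key(pending):
            yield pending
            pending = next(it, end)

    while pending is not end:
        group = one_group()
        yield group
        for _ in group:       # skip what the caller did not consume
            pass

lines = ["a", "D1", "b", "c", "D2", "d"]

def is_sep(s):
    return s.startswith("D")

print([list(g) for g in lazy_grouper(lines, is_sep)])
# [['a'], ['D1', 'b', 'c'], ['D2', 'd']]
print([next(g) for g in lazy_grouper(lines, is_sep)])
# ['a', 'D1', 'D2']
```

Two differences from the list-based grouper, as far as I can tell: an empty input yields no groups at all instead of one empty list, and a group must be read before asking for the next one.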
Flyzone (seen later):
>Amm.. no! I need to get the text block between the two dates, not the dates! :)
Then you can modify the code like this:
def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group:
                yield group
            group = []  # changed
        else:
            group.append(part)
    yield group
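For example, something like this should give only the text blocks (Python 3 syntax; the sample text and the names sample and blocks are made up here, with the same shape as the log in the question):

```python
import re

def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group:
                yield group
            group = []          # the matching (date) line is dropped
        else:
            group.append(part)
    yield group

# Made-up sample with distinct lines so the blocks are easy to tell apart
sample = """\
error text
Mon Apr 9 22:30:18 2007
first a
first b
Mon Apr 9 22:31:10 2007
second a
Mon Apr 10 22:31:10 2007
third a
third b
"""

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
blocks = list(grouper(sample.splitlines(), date_find.search))
print(blocks)
# [['error text'], ['first a', 'first b'], ['second a'], ['third a', 'third b']]
```

Note that if the input ends with a date line, the last group yielded is an empty list, so you may want to skip empty blocks in the loop that uses them.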
Bye,
bearophile