Flyzone:
I have a problem with the split function and regexp.
I have a file that I want to split using the date as a token.
My first try:
data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""
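import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
section = []
for line in data.splitlines():
    if date_find.search(line):
        # A date line starts a new section: print the finished one first.
        if section:
            print("\n" + "-" * 10 + "\n" + "\n".join(section))
        section = [line]
    else:
        if line:  # skip empty lines
            section.append(line)
# Print the last section, which no following date line terminates.
print("\n" + "-" * 10 + "\n" + "\n".join(section))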
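Since the question was about split and regexps, here is a minimal sketch of doing it with re.split() directly. It assumes the whole file fits in one string, and the names date_line and blocks are only for illustration. Note that in this form the date lines themselves are discarded; as far as I know, wrapping the pattern in a capturing group makes re.split() keep them in the result list.

```python
import re

data = """
error text
Mon Apr 9 22:30:18 2007
text
text
Mon Apr 9 22:31:10 2007
text
text
Mon Apr 10 22:31:10 2007
text
text
"""

# With re.MULTILINE, ^ and $ match at every line boundary, so this
# pattern consumes one whole line ending in "HH:MM:SS YYYY",
# plus the newline after it when there is one.
date_line = re.compile(r"^.*\d\d:\d\d:\d\d \d{4}$\n?", re.MULTILINE)

# Split on the date lines, then drop surrounding whitespace and empty pieces.
blocks = [block.strip() for block in date_line.split(data)]
blocks = [block for block in blocks if block]

for block in blocks:
    print("\n" + "-" * 10 + "\n" + block)
```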
itertools.groupby() is well suited to splitting sequences like:
1111100011111100011100101011111
as:
11111 000 111111 000 111 00 1 0 1 0 11111
While here we have a sequence like:
100001000101100001000000010000
that has to be split as:
10000 1000 10 1 10000 10000000 10000
A standard itertools function could be added for this fairly common situation too.
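To make the difference concrete, here is a small sketch of both behaviours on the two sequences above. split_before is a hypothetical helper written for this example, not something that exists in itertools.

```python
from itertools import groupby

def split_before(seq, is_start):
    # Hypothetical helper: begin a new chunk at every item
    # for which is_start(item) is true.
    chunk = []
    for item in seq:
        if is_start(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk

# groupby() cuts wherever the value changes, giving runs of equal items.
runs = ["".join(g) for _, g in groupby("1111100011111100011100101011111")]

# Here every "1" opens a new chunk, whatever follows it.
chunks = ["".join(c) for c in
          split_before("100001000101100001000000010000",
                       lambda ch: ch == "1")]

print(" ".join(runs))
print(" ".join(chunks))
```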
Along those lines I have devised this different (and maybe
over-engineered) version:
from itertools import groupby
import re

class Splitter(object):
    # Not tested much
    def __init__(self, predicate):
        self.predicate = predicate
        self.precedent_el = None
        self.state = True

    def __call__(self, el):
        # Flip the key each time the predicate matches, so that
        # groupby() starts a new group at every matching element.
        if self.predicate(el):
            self.state = not self.state
        self.precedent_el = el  # stored but not used yet
        return self.state

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
splitter = Splitter(date_find.search)
sections = ("\n".join(g) for h, g in groupby(data.splitlines(),
                                             key=splitter))
for section in sections:
    if section:
        print("\n" + "-" * 10 + "\n" + section)
The Splitter class plus groupby can be collapsed into a single, simpler
generator, as in this version:
def grouper(seq, key=bool):
    # A fast identity function can be used instead of bool()
    # Not tested much
    group = []
    for part in seq:
        if key(part):
            # A matching item starts a new group.
            if group:
                yield group
            group = [part]
        else:
            group.append(part)
    if group:  # do not yield an empty group for an empty seq
        yield group

import re
date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
for section in grouper(data.splitlines(), date_find.search):
    print("\n" + "-" * 10 + "\n" + "\n".join(section))
Maybe that grouper can be modified to manage groups lazily, like
groupby does, instead of building a real list.
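One possible way to get lazy groups, as a rough sketch and not tested much either: let groupby do the work, keyed on a counter that goes up by one at every matching item. The name lazy_grouper and the counter trick are my own assumptions for this example. As with groupby itself, each group shares the underlying iterator, so it has to be consumed before moving on to the next one.

```python
from itertools import groupby
import re

def lazy_grouper(seq, key=bool):
    # One-element list used as a mutable counter, so the nested
    # function can update it on old and new Python versions alike.
    counter = [0]

    def section_id(part):
        # Every matching item bumps the counter, so it and the items
        # after it get a new key and land in a new group.
        if key(part):
            counter[0] += 1
        return counter[0]

    for _, group in groupby(seq, section_id):
        yield group  # a lazy iterator, not a list

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
lines = ["error text",
         "Mon Apr 9 22:30:18 2007", "text", "text",
         "Mon Apr 9 22:31:10 2007", "text"]
sections = [list(g) for g in lazy_grouper(lines, date_find.search)]
for section in sections:
    print("\n" + "-" * 10 + "\n" + "\n".join(section))
```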
Flyzone (seen later):
>Amm..not! I need to get the text block between the two dates, not the dates! :)
Then you can modify the code like this:
def grouper(seq, key=bool):
    group = []
    for part in seq:
        if key(part):
            if group:
                yield group
            group = []  # changed: the matching item is dropped
        else:
            group.append(part)
    if group:  # skip an empty final group
        yield group
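A short usage sketch of this date-dropping variant, on a shortened list of lines made up for the example. The guard on the final yield is my own assumption: it avoids an empty last group when the input ends with a date line. Note that the text before the first date still comes out as a group of its own, so skip the first group if only the blocks between dates are wanted.

```python
import re

def grouper(seq, key=bool):
    # Same idea as the modified version: matching items only
    # separate the groups and are not kept.
    group = []
    for part in seq:
        if key(part):
            if group:
                yield group
            group = []
        else:
            group.append(part)
    if group:  # assumption: do not yield an empty trailing group
        yield group

date_find = re.compile(r"\d\d:\d\d:\d\d \d{4}$")
lines = ["error text",
         "Mon Apr 9 22:30:18 2007", "text a", "text b",
         "Mon Apr 9 22:31:10 2007", "text c"]
blocks = list(grouper(lines, date_find.search))
for block in blocks:
    print("\n" + "-" * 10 + "\n" + "\n".join(block))
```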
Bye,
bearophile