(sorry long)
i think i have missed something in the code below, i would like to
design some kind of detector with python, but i feel totally in a no
way now and need some advices to advance :(
data = "it is an <atag> example of the kind of </atag> data it must
handle and another kind of data".split(" ")
(actually data are splitted line by line in a file, and contained
other than simple words so using ' '<space> is just to post here)
i would like to be able to write some kind of easy rule like :
detect1 = """th.* kind of data"""
or better :
detect2 = """th.* * data""" ### second '*' could be seen like a joker,
as in re, some sort of "skip zero or more line"
which would give me spans where it matched, here :
[(6, 11), (15, 19)]
i have written code below which may handle detect1 , but still unable
to adapt it to detect2. i think i may miss some step back in case of
failed match.
def ignore(s): if s.startswith("<"):
return True
return False
class Rule: def __init__(self, rule, separator = " "):
self.rule = tuple(rule.split(separator))
self.length = len(self.rule)
self.compiled = []
self.filled = 0
for i in range(self.length):
current = self.rule[i]
if current == '*':
### special case, one may advance...
self.compiled.append('*')
else:
self.filled += 1
self.compiled.append(re.compile(current))
self.compiled = tuple(self.compiled)
###
def match(self, lines, ignore = None):
spans = []
i, current, memorized, matched = 0, 0, None, None
while 1:
if i == len(lines):
break
line = lines[i]
i += 1
print "%3d: %s (%s)" % (i, line, current),
if ignore and ignore(line):
print ' - ignored'
continue
regexp = self.compiled[current]
if regexp == '*':
### HERE I NEED SOME ADVICES...
elif hasattr(regexp, 'search') and regexp.search(line):
### match current pattern
print ' + matched',
matched = True
else:
current, memorized, matched = 0, None, None
if matched:
if memorized is None:
memorized = i - 1
if current == self.filled - 1:
print " + detected!",
spans.append((memorized, i))
current, memorized = 0, None
current += 1
return spans
data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ") detect = """th.* kind of data"""
r = Rule(detect, ' ') ; r.match(data, ignore)
1: it (0)
2: is (0)
3: an (0)
4: <atag> (0) - ignored
5: example (0)
6: of (0)
7: the (0) + matched
8: kind (1) + matched
9: of (2) + matched
10: </atag> (3) - ignored
11: data (3) + matched + detected!
12: it (1)
13: must (0)
14: handle (0)
15: and (0)
16: another (0) + matched
17: kind (1) + matched
18: of (2) + matched
19: data (3) + matched + detected!
[(6, 11), (15, 19)] ### actually they are indexes in list and +1 to
have line numbers