some kind of detector, need advices...


(sorry long)

I think I have missed something in the code below. I would like to
design some kind of detector in Python, but I feel completely stuck
and need some advice to move forward :(

data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
(actually the data is split line by line from a file and contains
more than simple words; splitting on ' ' <space> is just for posting here)

I would like to be able to write some kind of easy rule like:
detect1 = """th.* kind of data"""
or better:
detect2 = """th.* * data""" ### the second '*' acts as a joker, a bit like in re: "skip zero or more lines"
which would give me the spans where it matched, here:
[(6, 11), (15, 19)]

I have written the code below, which handles detect1, but I am still unable
to adapt it to detect2. I think I am missing a step back (backtracking)
when a match fails.
import re

def ignore(s):
    # skip markup-like lines such as "<atag>" and "</atag>"
    return s.startswith("<")

class Rule:
    def __init__(self, rule, separator = " "):
        self.rule = tuple(rule.split(separator))
        self.length = len(self.rule)
        self.compiled = []
        self.filled = 0
        for i in range(self.length):
            current = self.rule[i]
            if current == '*':
                ### special case, one may advance...
                self.compiled.append('*')
            else:
                self.compiled.append(re.compile(current))
            self.filled += 1
        self.compiled = tuple(self.compiled)

    def match(self, lines, ignore = None):
        spans = []
        i, current, memorized, matched = 0, 0, None, None
        while 1:
            if i == len(lines):
                break
            line = lines[i]
            i += 1
            print("%3d: %s (%s)" % (i, line, current), end="")
            if ignore and ignore(line):
                print(' - ignored')
                continue
            regexp = self.compiled[current]
            if regexp == '*':
                pass ### joker: no special handling yet
            elif hasattr(regexp, 'search') and regexp.search(line):
                ### match current pattern
                print(' + matched', end="")
                matched = True
            else:
                ### failed match: start over from the first pattern
                current, memorized, matched = 0, None, None
            if matched:
                if memorized is None:
                    memorized = i - 1
                if current == self.filled - 1:
                    print(" + detected!", end="")
                    spans.append((memorized, i))
                    current, memorized = 0, None
                current += 1
            print()
        return spans

data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
detect = """th.* kind of data"""
r = Rule(detect, ' ')
print(r.match(data, ignore))

1: it (0)
2: is (0)
3: an (0)
4: <atag> (0) - ignored
5: example (0)
6: of (0)
7: the (0) + matched
8: kind (1) + matched
9: of (2) + matched
10: </atag> (3) - ignored
11: data (3) + matched + detected!
12: it (1)
13: must (0)
14: handle (0)
15: and (0)
16: another (0) + matched
17: kind (1) + matched
18: of (2) + matched
19: data (3) + matched + detected!
[(6, 11), (15, 19)] ### the start values are list indexes; add 1 to
get line numbers
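
The "step back" described above can be sketched with a small recursive matcher that tries the joker with zero lines first and extends it only when the rest of the rule fails. This is an illustration under assumptions, not a fix of the class above: `find_spans` and `_match_from` are hypothetical names, the joker is assumed to be non-greedy (shortest match first), ignored lines are stepped over anywhere inside a span, and spans use the same convention as the trace (start index, end index + 1).

```python
import re

def ignore(token):
    """Treat markup-like tokens such as <atag> as invisible."""
    return token.startswith("<")

def _match_from(patterns, tokens, p, t):
    """Return the end index (exclusive) of a match of patterns[p:]
    starting at tokens[t], or None if there is no match."""
    if p == len(patterns):
        return t
    if patterns[p] == "*":
        # Joker: let it consume 0, 1, 2, ... tokens and retry the rest
        # of the rule each time. This loop is the "step back".
        for skip in range(t, len(tokens) + 1):
            end = _match_from(patterns, tokens, p + 1, skip)
            if end is not None:
                return end
        return None
    # Step over ignored tokens without consuming a pattern.
    while t < len(tokens) and ignore(tokens[t]):
        t += 1
    if t < len(tokens) and patterns[p].search(tokens[t]):
        return _match_from(patterns, tokens, p + 1, t + 1)
    return None

def find_spans(rule, tokens):
    """Return non-overlapping (start, end) spans where the rule matches."""
    patterns = [w if w == "*" else re.compile(w) for w in rule.split(" ")]
    spans, t = [], 0
    while t < len(tokens):
        end = None if ignore(tokens[t]) else _match_from(patterns, tokens, 0, t)
        if end is not None and end > t:
            spans.append((t, end))
            t = end  # resume after the span
        else:
            t += 1
    return spans

data = ("it is an <atag> example of the kind of </atag> data "
        "it must handle and another kind of data").split(" ")
print(find_spans("th.* kind of data", data))  # [(6, 11), (15, 19)]
print(find_spans("th.* * data", data))        # [(6, 11), (15, 19)]
```

Because the joker is unbounded here, a rule such as "th.* * data" can stretch across many lines; a maximum gap could be added to the `range` in the joker branch if that is not wanted.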
Jul 18 '05 #1
On 14 Jul 2004 23:37:37 -0700, Joh <jo******@yahoo.fr> wrote:

(sorry long)


am I the only one who didn't understand this?

John Lenton (jl*****@gmail.com) -- Random fortune:
bash: fortune: command not found
Jul 18 '05 #2
I'm not really sure I followed that either, but here's a restatement
of the problem according to my feeble understanding:

Joh wants to break a string into tokens (in this case, just separated
by whitespace) and then perform pattern matching based on the tokens.
He then wants to find the span of matched patterns, in terms of token
indices.

For example, given the input "this is a test string" and the pattern
"is .* test", the program will first tokenize the string:

tokens = ['this', 'is', 'a', 'test', 'string']

...and then will match ['is', 'a', 'test'] and return the indices of
the matched interval [(1, 3)]. (I don't know if that interval is
supposed to be inclusive)
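
To make the inclusive question concrete, here is a small sketch of the two possible conventions for the same match. The variable names are only illustrative; the spans in the original post, such as (6, 11), appear to follow the second (half-open) convention.

```python
tokens = "this is a test string".split()

inclusive = (1, 3)   # index of first and of last matched token
half_open = (1, 4)   # index of first matched token, one past the last

# Both describe the same three tokens.
assert tokens[inclusive[0]:inclusive[1] + 1] == ["is", "a", "test"]
assert tokens[half_open[0]:half_open[1]] == ["is", "a", "test"]
```

The half-open form plugs directly into list slicing, which is one reason to prefer it.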

Note that .* would match zero or more tokens, not characters.
Additionally, it seems that xml-ish tags ("<atag>" in the example)
would be ignored.

Implementing this from scratch would require hacking together some
kind of LR parser. Blah. Fortunately, because tokenization is
trivial, it is possible to translate all of the "detectors" (aka
token-matching patterns) directly into regular expressions; then, all
you have to do is correlate the match object intervals (indexed by
character) into token intervals (indexed by token).

To translate a "detector" into a proper regular expression, just make
a few substitutions:

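import re

def detector_to_re(detector):
    """ translate a token pattern into a regular expression """
    # could be more efficient, but this is good for readability

    # "." => "(\S+)" (match one token)
    # (a function replacement avoids escape processing of "\S")
    detector = re.sub(r'\.', lambda m: r'(\S+)', detector)

    # whitespace block => "(\s+)" (match a stretch of whitespace)
    detector = re.sub(r'\s+', lambda m: r'(\s+)', detector)

    return detector

def apply_detector(detector, data):

    # compile mapping of character -> token indices
    i = 0
    token_indices = {}
    for match in re.finditer(r'\S+', data):
        token_indices[match.start()] = i
        token_indices[match.end()] = i
        i += 1

    # ignore tags (blank them out rather than delete them, so that
    # character offsets still line up with token_indices)
    data = re.sub(r'<.*?>', lambda m: ' ' * len(m.group()), data)

    detector_re = re.compile(detector_to_re(detector))

    intervals = []

    for match in detector_re.finditer(data):
        # keep only matches that start and end on token boundaries
        if match.start() in token_indices and match.end() in token_indices:
            intervals.append((token_indices[match.start()],
                              token_indices[match.end()]))

    return intervals
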
On second thought, probably best to throw out this whole scheme and
just use regular expressions, even if it's less convenient. This way
you don't mix syntaxes of regexp vs. token exp (like "th.* *" would
Jul 18 '05 #3
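
The plain regular expression route suggested in the last reply might look like the sketch below. It is an illustration under assumptions rather than anyone's posted code: `token_spans` is a hypothetical helper, tags are blanked out with spaces so that character offsets stay aligned with the original text, the pattern is written by hand for the rule "a token containing th, then as few tokens as possible, then the token data", and spans are (first token index, last token index + 1).

```python
import re
from bisect import bisect_right

def token_spans(pattern, text):
    """Run a character-level regex and report matches as token index spans."""
    # Start offset of every whitespace-separated token in the original text.
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    # Blank out tags with spaces so offsets still line up with `starts`.
    blanked = re.sub(r"<[^>]*>", lambda m: " " * len(m.group()), text)
    spans = []
    for m in re.finditer(pattern, blanked):
        first = bisect_right(starts, m.start()) - 1    # token holding the first char
        last = bisect_right(starts, m.end() - 1) - 1   # token holding the last char
        spans.append((first, last + 1))
    return spans

text = ("it is an <atag> example of the kind of </atag> data "
        "it must handle and another kind of data")
# One token containing "th", a lazy run of whole tokens, then the token "data".
pattern = r"\S*th\S*(?:\s+\S+)*?\s+data(?!\S)"
print(token_spans(pattern, text))  # [(6, 11), (15, 19)]
```

The cost of this approach is that each rule has to be written as a character-level pattern, with `\S+` standing in for "one token" and `(?:\s+\S+)*?` for the joker.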
