some kind of detector, need advices...

Joh
Hello,

(sorry long)

I think I have missed something in the code below. I would like to
design some kind of detector in Python, but I feel completely stuck
now and need some advice to move forward :(

data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
(Actually the data is split line by line from a file, and the lines
contain more than simple words, so splitting on ' ' (space) is just
for posting here.)

I would like to be able to write some kind of easy rule like:
detect1 = """th.* kind of data"""
or better:
detect2 = """th.* * data""" ### the second '*' could be seen as a joker,
### as in re: some sort of "skip zero or more lines"
which would give me the spans where it matched, here:
[(6, 11), (15, 19)]

I have written the code below, which handles detect1, but I am still
unable to adapt it to detect2. I think I am missing some kind of step
back (backtracking) when a match fails.
import re

def ignore(s):
    ### skip tag-like lines such as <atag> or </atag>
    if s.startswith("<"):
        return True
    return False

class Rule:
    def __init__(self, rule, separator=" "):
        self.rule = tuple(rule.split(separator))
        self.length = len(self.rule)
        self.compiled = []
        self.filled = 0
        for i in range(self.length):
            current = self.rule[i]
            if current == '*':
                ### special case, one may advance...
                self.compiled.append('*')
            else:
                self.filled += 1
                self.compiled.append(re.compile(current))
        self.compiled = tuple(self.compiled)

    def match(self, lines, ignore=None):
        spans = []
        i, current, memorized, matched = 0, 0, None, None
        while 1:
            if i == len(lines):
                break
            line = lines[i]
            i += 1
            print("%3d: %s (%s)" % (i, line, current), end=" ")
            if ignore and ignore(line):
                print(' - ignored')
                continue
            regexp = self.compiled[current]
            if regexp == '*':
                ### HERE I NEED SOME ADVICE...
                pass
            elif hasattr(regexp, 'search') and regexp.search(line):
                ### match current pattern
                print(' + matched', end=" ")
                matched = True
            else:
                current, memorized, matched = 0, None, None
            if matched:
                if memorized is None:
                    memorized = i - 1
                if current == self.filled - 1:
                    print(" + detected!", end=" ")
                    spans.append((memorized, i))
                    current, memorized = 0, None
                current += 1
            print()
        return spans

data = "it is an <atag> example of the kind of </atag> data it must handle and another kind of data".split(" ")
detect = """th.* kind of data"""
r = Rule(detect, ' ')
r.match(data, ignore)

1: it (0)
2: is (0)
3: an (0)
4: <atag> (0) - ignored
5: example (0)
6: of (0)
7: the (0) + matched
8: kind (1) + matched
9: of (2) + matched
10: </atag> (3) - ignored
11: data (3) + matched + detected!
12: it (1)
13: must (0)
14: handle (0)
15: and (0)
16: another (0) + matched
17: kind (1) + matched
18: of (2) + matched
19: data (3) + matched + detected!
[(6, 11), (15, 19)] ### actually these are indexes into the list; add 1
### to get line numbers
Jul 18 '05 #1
On 14 Jul 2004 23:37:37 -0700, Joh <jo******@yahoo.fr> wrote:
Hello,

(sorry long)

[snip]


am I the only one who didn't understand this?

--
John Lenton (jl*****@gmail.com) -- Random fortune:
bash: fortune: command not found
Jul 18 '05 #2
I'm not really sure I followed that either, but here's a restatement
of the problem according to my feeble understanding-

Joh wants to break a string into tokens (in this case, just separated
by whitespace) and then perform pattern matching based on the tokens.
He then wants to find the span of matched patterns, in terms of token
numbers.

For example, given the input "this is a test string" and the pattern
"is .* test", the program will first tokenize the string:

tokens = ['this', 'is', 'a', 'test', 'string']

...and then will match ['is', 'a', 'test'] and return the indices of
the matched interval [(1, 3)]. (I don't know if that interval is
supposed to be inclusive.)

Note that .* would match zero or more tokens, not characters.
Additionally, it seems that xml-ish tags ("<atag>" in the example)
would be ignored.
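
A minimal sketch of that restatement, working directly on tokens rather than on characters. It is not the original poster's code: `find_spans`, `match_at` and `is_tag` are hypothetical names, the joker is written `*` as in the `detect2` rule, and the "step back" is done by plain recursion (letting the joker absorb 0, 1, 2, ... tokens until the rest of the rule matches). Spans are returned as (start, end) with `end` exclusive, an assumption chosen to reproduce the `[(6, 11), (15, 19)]` result from the first post.

```python
import re


def is_tag(token):
    """Hypothetical helper: XML-ish tags are tokens to skip."""
    return token.startswith("<")


def match_at(patterns, tokens, p, t):
    """Try to match patterns[p:] against tokens[t:].

    Returns the index just past the last matched token, or None.
    """
    if p == len(patterns):
        return t
    if patterns[p] == "*":
        # Joker: let it absorb 0, 1, 2, ... tokens, shortest first.
        # Coming back to this loop after a failure is the "step back".
        for skip in range(t, len(tokens) + 1):
            end = match_at(patterns, tokens, p + 1, skip)
            if end is not None:
                return end
        return None
    # Ignored tokens never consume a pattern element.
    while t < len(tokens) and is_tag(tokens[t]):
        t += 1
    if t < len(tokens) and patterns[p].search(tokens[t]):
        return match_at(patterns, tokens, p + 1, t + 1)
    return None


def find_spans(rule, tokens):
    """Return non-overlapping (start, end) token spans, end exclusive."""
    patterns = [p if p == "*" else re.compile(p) for p in rule.split(" ")]
    spans, t = [], 0
    while t < len(tokens):
        end = None if is_tag(tokens[t]) else match_at(patterns, tokens, 0, t)
        if end is None or end == t:
            t += 1  # no match (or an empty one) starting here
        else:
            spans.append((t, end))
            t = end
    return spans


data = ("it is an <atag> example of the kind of </atag> data "
        "it must handle and another kind of data").split(" ")
print(find_spans("th.* kind of data", data))
print(find_spans("th.* * data", data))
```

Because the joker takes the shortest stretch that lets the rest of the rule match, a rule like `th.* * data` can still span a great many tokens when `data` only shows up much later; a cap on the number of skipped tokens may be worth adding.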

Implementing this from scratch would require hacking together some
kind of LR parser. Blah. Fortunately, because tokenization is
trivial, it is possible to translate all of the "detectors" (aka
token-matching patterns) directly into regular expressions; then, all
you have to do is correlate the match object intervals (indexed by
character) with token intervals (indexed by token).

To translate a "detector" into a proper regular expression, just make
a few substitutions:

import re

def detector_to_re(detector):
    """ translate a token pattern into a regular expression """
    # could be more efficient, but this is good for readability

    # "." => "(\S+)" (match one token); the backslash is doubled because
    # re.sub() interprets escapes in the replacement string
    detector = re.sub(r'\.', r'(\\S+)', detector)

    # whitespace block => "(\s+)" (match a stretch of whitespace)
    detector = re.sub(r'\s+', r'(\\s+)', detector)

    return detector

def apply_detector(detector, data):

    # compile mapping of character -> token indices; every position from
    # the start to the end of a token maps to that token's index
    token_indices = {}
    for i, match in enumerate(re.finditer(r'\S+', data)):
        for pos in range(match.start(), match.end() + 1):
            token_indices[pos] = i

    # ignore tags: blank them out with spaces of the same length so that
    # the character offsets computed above stay valid
    data = re.sub(r'<.*?>', lambda m: ' ' * len(m.group()), data)

    detector_re = re.compile(detector_to_re(detector))

    intervals = []

    for match in detector_re.finditer(data):
        intervals.append(
            (
                token_indices[match.start()],
                token_indices[match.end()]
            )
        )

    return intervals

On second thought, it's probably best to throw out this whole scheme
and just use regular expressions, even if it's less convenient. This
way you don't mix the syntaxes of regexp vs. token exp (like "th.* *"
would do...)
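
For what it's worth, the "just use regular expressions" route could look roughly like this sketch. It assumes the whole input is available as one string with whitespace-separated tokens; `token_spans` is a hypothetical helper, and the pattern given for `detect2` is only one possible hand translation (a token containing "th", then as few whole tokens as possible, then "data"). Tags get no special treatment here: they count as ordinary tokens, and the returned intervals are inclusive on both ends.

```python
import bisect
import re


def token_spans(pattern, text):
    """Return (first_token, last_token) index pairs, both inclusive."""
    # Character offset at which each whitespace-separated token starts.
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    spans = []
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # skip empty matches
        # The token containing an offset is the last one starting at or
        # before that offset.
        first = bisect.bisect_right(starts, m.start()) - 1
        last = bisect.bisect_right(starts, m.end() - 1) - 1
        spans.append((first, last))
    return spans


text = ("it is an <atag> example of the kind of </atag> data "
        "it must handle and another kind of data")

# Assumed translation of detect2; the lookarounds pin the match to
# whole tokens so that it cannot start or stop in the middle of a word.
detect2 = r"(?<!\S)\S*th\S*(?:\s+\S+)*?\s+data(?!\S)"
print(token_spans(detect2, text))  # [(6, 10), (15, 18)]
```

A rule like `detect1` would need the tags allowed for explicitly (for instance an optional `(?:\s+<\S+>)*` between words), which is the price of dropping the token-level syntax.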
Jul 18 '05 #3
