By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
439,932 Members | 1,944 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 439,932 IT Pros & Developers. It's quick & easy.

Sequence and/or pattern matching

P: n/a
Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :

1) I have a list of connexion between some computers. This list has
this format :

Ip A Date Ip B
.... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6
.... ... ...

2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A

3) Then, the software gives the sequences it has found and how many
times they appear...

I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?

Thanks

Séb

Oct 19 '05 #1
Share this Question
Share on Google+
7 Replies


P: n/a
Séb <sc***@unil.ch> wrote:
Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A


Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?
Alex
Oct 19 '05 #2

P: n/a
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.
Exactly, but the sequence has to be discovered by the piece of code !
Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?


Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?

Thank you very much

Oct 19 '05 #3

P: n/a
Séb wrote:
1) I have a list of connexion between some computers. This list has
this format :
It looks like you want graph theory.
Ip A Date Ip B
... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6
That looks like a list of edges between graph nodes, you see. Each node
corresponds to an address and each edge is a connection between two
nodes - ip addresses, in your case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A
That looks like finding a path between Node A and Node C. This is a
common application of graph theory, and especially when finding routes
(eg. for train journeys, or for AI characters in computer games).
3) Then, the software gives the sequences it has found and how many
times they appear...
You can find all the routes between 1 node and others by using
depth-first search (or indeed, any other simple graph search
algorithm). Basically, this says that, for any node, examine all the
nodes it leads to. Then examine those nodes... repeat until you run out
of nodes or find where you're looking for. The only complication is
remembering the route you took.
I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?


I found a couple of Python graph libraries on Google but I can't vouch
for their quality. However, writing your own simple graph traversal
algorithm should be quite trivial. I would start by populating a
dictionary where the keys are instances of address A and the values are
lists of address B values for address A. Add each link again with
address B as the key and address A as the value if you need to
represent bidirectional connections. Then you can perform search on
that as required. The only slight complication is avoiding infinite
loops - I might use a dictionary of address->boolean values here and
check off each address as the algorithm progresses.

--
Ben Sizer

Oct 19 '05 #4

P: n/a
"Séb" <sc***@unil.ch> writes:
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.


Exactly, but the sequence has to be discovered by the piece of code !
Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?


Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?


As Ben Sizer pointed out, this is a fairly well-known graph
algorithm. You just want to add information to add ordering
information to each edge, so you can verify that the an edge that is
further down the path is "older" than the last edge added. Given your
ordering requirement, simply numbering the edges and checking that the
"older" edge has a larger number than the previous edge should do.

With my admin hat on, I have to wonder - don't you really want to deal
with duration? I.e - each connection has a "start" and "end" time, and
you're really only interested in paths that happen while all the
connections actually exist? Just a wild ass guess, mind you. However,
it doesn't radically change the test. You now tag each edge with a
start and stop time, and check that all previous edges on the path
exist while the one you are adding exists.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 19 '05 #5

P: n/a
Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)

Thank you !

Ps Python community is very nice, I'm glad I learn this language !

Oct 20 '05 #6

P: n/a
"Séb" <sc***@unil.ch> writes:
Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)


You're right - I was confused. That's why I asked about the timing
issue. But I also indicated that I might be way off base with that.

Everyone may be confused as well - we thought you were looking for
*any* series of connections that started at one computer and ended at
a second computer. That problem is the well-understood problem of
finding a path through a graph, where "path" is used in a
graph-theoretic sense and not a network sense.

From what you say above, it looks like you're looking for a specfic
sequence of connections in your connetions record. That's a much
easier problem. Roughly, you'd do:

def matches(desired, history):
"""Determines if history contains the desired path.
desired = # List of connections we're looking for.
history = # List of connections that were actually made."""

for con in desired:
try:
while not con.matches(history[hp]):
hp += 1
except IndexError:
return False
return True
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 20 '05 #7

P: n/a
Sorry for the confusion, I think my example was unclear. Thank you Mike
for this piece of code who solves a part of my problem. In fact, the
sequences are unknown at the beginning, so the first part of the code
has to find possible sequences and if those sequences are repeated,
counts how many time they appear (as your code does).

I have found this morning that there's a software produced by i2
software who does this kind of job, but for telephone call analysis.
Maybe the description could help to better understand my goal :

http://www.i2.co.uk/products/Pattern_Tracer/default.asp

Séb

Oct 21 '05 #8

This discussion thread is closed

Replies have been disabled for this discussion.