473,382 Members | 1,387 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

Sequence and/or pattern matching

Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :

1) I have a list of connexion between some computers. This list has
this format :

Ip A Date Ip B
.... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6
.... ... ...

2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A

3) Then, the software gives the sequences it has found and how many
times they appear...

I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?

Thanks

Séb

Oct 19 '05 #1
7 2794
Séb <sc***@unil.ch> wrote:
Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A


Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?
Alex
Oct 19 '05 #2
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.
Exactly, but the sequence has to be discovered by the piece of code !
Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?


Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?

Thank you very much

Oct 19 '05 #3
Séb wrote:
1) I have a list of connexion between some computers. This list has
this format :
It looks like you want graph theory.
Ip A Date Ip B
... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6
That looks like a list of edges between graph nodes, you see. Each node
corresponds to an address and each edge is a connection between two
nodes - ip addresses, in your case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A
That looks like finding a path between Node A and Node C. This is a
common application of graph theory, and especially when finding routes
(eg. for train journeys, or for AI characters in computer games).
3) Then, the software gives the sequences it has found and how many
times they appear...
You can find all the routes between 1 node and others by using
depth-first search (or indeed, any other simple graph search
algorithm). Basically, this says that, for any node, examine all the
nodes it leads to. Then examine those nodes... repeat until you run out
of nodes or find where you're looking for. The only complication is
remembering the route you took.
I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?


I found a couple of Python graph libraries on Google but I can't vouch
for their quality. However, writing your own simple graph traversal
algorithm should be quite trivial. I would start by populating a
dictionary where the keys are instances of address A and the values are
lists of address B values for address A. Add each link again with
address B as the key and address A as the value if you need to
represent bidirectional connections. Then you can perform search on
that as required. The only slight complication is avoiding infinite
loops - I might use a dictionary of address->boolean values here and
check off each address as the algorithm progresses.

--
Ben Sizer

Oct 19 '05 #4
"Séb" <sc***@unil.ch> writes:
Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.


Exactly, but the sequence has to be discovered by the piece of code !
Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?


Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?


As Ben Sizer pointed out, this is a fairly well-known graph
algorithm. You just want to add information to add ordering
information to each edge, so you can verify that the an edge that is
further down the path is "older" than the last edge added. Given your
ordering requirement, simply numbering the edges and checking that the
"older" edge has a larger number than the previous edge should do.

With my admin hat on, I have to wonder - don't you really want to deal
with duration? I.e - each connection has a "start" and "end" time, and
you're really only interested in paths that happen while all the
connections actually exist? Just a wild ass guess, mind you. However,
it doesn't radically change the test. You now tag each edge with a
start and stop time, and check that all previous edges on the path
exist while the one you are adding exists.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 19 '05 #5
Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)

Thank you !

Ps Python community is very nice, I'm glad I learn this language !

Oct 20 '05 #6
"Séb" <sc***@unil.ch> writes:
Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)


You're right - I was confused. That's why I asked about the timing
issue. But I also indicated that I might be way off base with that.

Everyone may be confused as well - we thought you were looking for
*any* series of connections that started at one computer and ended at
a second computer. That problem is the well-understood problem of
finding a path through a graph, where "path" is used in a
graph-theoretic sense and not a network sense.

From what you say above, it looks like you're looking for a specfic
sequence of connections in your connetions record. That's a much
easier problem. Roughly, you'd do:

def matches(desired, history):
"""Determines if history contains the desired path.
desired = # List of connections we're looking for.
history = # List of connections that were actually made."""

for con in desired:
try:
while not con.matches(history[hp]):
hp += 1
except IndexError:
return False
return True
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 20 '05 #7
Sorry for the confusion, I think my example was unclear. Thank you Mike
for this piece of code who solves a part of my problem. In fact, the
sequences are unknown at the beginning, so the first part of the code
has to find possible sequences and if those sequences are repeated,
counts how many time they appear (as your code does).

I have found this morning that there's a software produced by i2
software who does this kind of job, but for telephone call analysis.
Maybe the description could help to better understand my goal :

http://www.i2.co.uk/products/Pattern_Tracer/default.asp

Séb

Oct 21 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: | last post by:
Hi, I'm fairly new to regular expressions, and this may be a rather dumb question, but so far I haven't found the answer in any tutorial or reference yet... If I have f.i. the string "The...
8
by: gsv2com | last post by:
One of my weaknesses has always been pattern matching. Something I definitely need to study up on and maybe you guys can give me a pointer here. I'm looking to remove all of this code and just...
176
by: Thomas Reichelt | last post by:
Moin, short question: is there any language combining the syntax, flexibility and great programming experience of Python with static typing? Is there a project to add static typing to Python? ...
9
by: Xah Lee | last post by:
# -*- coding: utf-8 -*- # Python # Matching string patterns # # Sometimes you want to know if a string is of # particular pattern. Let's say in your website # you have converted all images...
5
by: olaufr | last post by:
Hi, I'd need to perform simple pattern matching within a string using a list of possible patterns. For example, I want to know if the substring starting at position n matches any of the string I...
43
by: Roger L. Cauvin | last post by:
Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g. "xyz123aaabbaabbbbababbbbaaabb" I'm...
8
by: regis | last post by:
Greetings, about scanf matching nonempty sequences using the "%" matches a nonempty sequence of anything except '-' "%" matches a nonempty sequence of anything except ']" matches a nonempty...
8
by: Daneel | last post by:
Hello! I'm looking for an algorithm which finds all occurences of a bit sequence (e.g., "0001") in a file. This sequence can start at any bit in the file (it is not byte aligned). I have some...
8
by: Slaunger | last post by:
Hi all, I am a Python novice, and I have run into a problem in a project I am working on, which boils down to identifying the patterns in a sequence of integers, for example ..... 1 6 6 1 6 6...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.