Middle matching - any Python library functions (besides re)?

EP
Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools (or if I have to scratch my head real hard
for an appropriate algorithm...). I'd hate to build an inferior
solution to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin, there will be a matching pattern of bytes
in the middle, though it may be offset relative to each other. I want
to find which files share this common pattern of origin with which
other files. The cruft portions should be a small percentage of the
overall file lengths.

Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP

Aug 27 '06 #1
"EP" <er***********@gmail.comwrites:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
Aug 28 '06 #2
If you want to avoid an O(n^2) algorithm, you may need to compute a
signature for each file. Then you use those signatures to compute
hashes, and group them with a dict (the dict values may be the file
names). Later you can weed out the few false collisions produced by the
possibly approximate signature. If you can determine, for a given file,
where its heading garbage stops, then you can compute the signature by
just taking the Python hash of some of the following bytes (30 or 200
bytes may suffice).
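
A minimal sketch of that idea, assuming a skip_cruft() helper
(hypothetical here) that returns the offset where the heading garbage
of a file ends:

# skip_cruft() is a hypothetical helper you would have to supply yourself.
def signature(filename, skip_cruft, nbytes=200):
    f = open(filename, 'rb')
    f.seek(skip_cruft(filename))   # jump past the front cruft
    sig = f.read(nbytes)           # a small slice of the common "origin" bytes
    f.close()
    return hash(sig)               # Python's built-in hash of that slice

def group_by_signature(filenames, skip_cruft):
    groups = {}                    # signature hash -> list of file names
    for fn in filenames:
        groups.setdefault(signature(fn, skip_cruft), []).append(fn)
    return groups  # files in the same bucket are candidates for a real compare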

Bye,
bearophile

Aug 28 '06 #3

EP wrote:
Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools (or if I have to scratch my head real hard
for an appropriate algorithm... ) I'd hate to build an inferior
solution to that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin, there will be a matching pattern of bytes
in the middle, though it may be offset relative to each other. I want
to find which files have in common with which other files the same
pattern of origin within them. The cruft portions should be a small %
of the overall file lengths.
Are they source files?
Is the cruft comments?

Have you done an exhaustive search for info on the files and the
histories? If they are systematically generated then there may be a way
to systematically uncover the matching info from their histories, such
as source code management histories, file pathnames, file dates?

What more can you find out about cruft? Is there a way to strip the
cruft by program? If so, file sizes and checksums could be compared on
the stripped files.
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.
Can you change the flow that creates these files so that the
information you want is generated alongside the files?

Can you use data later in the flow to more easily extract the info you
want?
TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP
Just some questions that I would ask myself if given the task.

- Paddy.

Aug 28 '06 #4
Paul Rubin wrote:
"EP" <er***********@gmail.comwrites:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :-)
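
For the difflib route mentioned above, a minimal sketch might look like
the following (file_a and file_b are placeholder names, and this is only
practical on pairs you have already pre-filtered, since SequenceMatcher
gets slow on large inputs):

import difflib

def common_middle(file_a, file_b):
    # Longest block of bytes the two files share: the candidate common origin.
    data_a = open(file_a, 'rb').read()
    data_b = open(file_b, 'rb').read()
    sm = difflib.SequenceMatcher(None, data_a, data_b)
    i, j, size = sm.find_longest_match(0, len(data_a), 0, len(data_b))
    return data_a[i:i + size]      # empty string if nothing matches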

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon

Aug 28 '06 #5
Simon Forman wrote:
Paul Rubin wrote:
>"EP" <er***********@gmail.comwrites:
>>Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.

If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :-)

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon
Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?
Utilizing the functions below may be of some help.
#!/usr/bin/python
#
#
# Function: generate and compare checksums on a file
import md5, sys

def getsum(filename):
    """
    Generate the checksum based on received chunks of the file
    """
    md5sum = md5.new()
    f = open(filename, 'r')
    for line in getblocks(f):
        md5sum.update(line)
    f.close()
    return md5sum.hexdigest()

def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory
    """
    while True:
        s = f.read(blocksize)
        if not s: break
        yield s

def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful
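
A quick usage sketch (the file names here are just made-up placeholders):

sum_a = getsum('file_a.dat')
sum_b = getsum('file_b.dat')
if not checksum_compare(None, sum_a, sum_b):   # 0 means the checksums matched
    print 'checksums match'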


--
Adversity: That which does not kill me only postpones the inevitable.
Aug 28 '06 #6
Andrew Robert wrote:
Simon Forman wrote:
Paul Rubin wrote:
"EP" <er***********@gmail.comwrites:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :-)

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon

Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?
The thing about a checksum algorithm is that if you call it with some
data, then change even one byte of the data and call it again, the
resulting checksums will be (should be) very different.

Checksumming the entire file contents won't help, but as bearophile
said: "If you can determine, for a given file, where its heading garbage
stops, then you can compute the signature by just taking the Python hash
of some of the following bytes (30 or 200 bytes may suffice)."

If the OP can do that then yes, it would likely be more efficient
(although computing and comparing checksums on just 200 bytes or less
might not be a significant gain on simply comparing the strings
themselves.) But if he can't then the next best thing would be to take
a subsection of the file somewhere after the heading cruft and search
(using string 'in' string form) for that subsection in other files.
(Actually there may be a better option than that, but I'm just not
bright enough to have thought of it...)
Utilizing the functions below may be of some help.
#!/usr/bin/python
#
#
# Function: generate and compare checksums on a file
import md5, sys

def getsum(filename):
    """
    Generate the checksum based on received chunks of the file
    """
    md5sum = md5.new()
    f = open(filename, 'r')
    for line in getblocks(f):
        md5sum.update(line)
    f.close()
    return md5sum.hexdigest()

def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory
    """
    while True:
        s = f.read(blocksize)
        if not s: break
        yield s

def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful
I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon

Aug 28 '06 #7
Simon Forman wrote:
Andrew Robert wrote:
>Simon Forman wrote:
>>Paul Rubin wrote:
"EP" <er***********@gmail.comwrites:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :-)

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon
Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?

The thing about a checksum algorithm is that if you call it with some
data, then change even one byte of the data and call it again, the
resulting checksums will be (should be) very different.

Checksumming the entire file contents won't help, but as bearophile
said: "If you can determine, for a given file, where its heading garbage
stops, then you can compute the signature by just taking the Python hash
of some of the following bytes (30 or 200 bytes may suffice)."

If the OP can do that then yes, it would likely be more efficient
(although computing and comparing checksums on just 200 bytes or less
might not be a significant gain on simply comparing the strings
themselves.) But if he can't then the next best thing would be to take
a subsection of the file somewhere after the heading cruft and search
(using string 'in' string form) for that subsection in other files.
(Actually there may be a better option than that, but I'm just not
bright enough to have thought of it...)
>Utilizing the functions below may be of some help.
#!/usr/bin/python
#
#
# Function: generate and compare checksums on a file
import md5, sys

def getsum(filename):
    """
    Generate the checksum based on received chunks of the file
    """
    md5sum = md5.new()
    f = open(filename, 'r')
    for line in getblocks(f):
        md5sum.update(line)
    f.close()
    return md5sum.hexdigest()

def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory
    """
    while True:
        s = f.read(blocksize)
        if not s: break
        yield s

def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful

I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon
Because I was lazy..

The checksum_compare function came from something else I wrote that had
special logging and e-mailer calls in it.

Should have ripped the reference to caller and file name out..
Aug 28 '06 #8
Andrew Robert wrote:
Because I was lazy..

The checksum_compare function came from something else I wrote that had
special logging and e-mailer calls in it.

Should have ripped the reference to caller and file name out..
Aaaahh the subtle joys of cut-and-paste programming... :-D

(I've done it myself... ;-)

Aug 28 '06 #9
