473,322 Members | 1,398 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,322 software developers and data experts.

String similarity

Introduction
------------
The need to find files that "resembled" in the name has pushed me to write
an utility that unlike the other it was not based on the content of the files
but on its name. Initially I start adding this functionality to
one "C" program for Unix called "fdupes" witch give me good performance
and good precision.
The algorithm that I have chosen for the comparison between string was
"Levenshtein Distance".
For reasons that I leave you to imagine I have rewrite that utility using
Python reducing the code from 1394 lines of "C" to 152 lines of Python
enclosed comments :)
Initially I have rearranged part of "C" code in one module "simil"
to which I have added an helper and others two algorithms that in reality
are two versions of the same algorithm "Ratcliff-Obershelp".
The small script works great scanning my download directory or my audio files
collection finding duplicates that I'll never be able to find manually.

Considerations
--------------
During the test I have been able to verify that the performances of the
Python module "difflib" are light years far from the algorithms of my
module "simil", initially I thought that the bottle neck was the
fact of having to compare all the possible combinations (without repetitions)
of files. This is bad IMHO.
This implementation is still in an embryonic phase however I have decided
of posting in this group for being able to discuss it and to carry ahead,
remaining in my HD would not have some future ;)
I'm sorry for the quality of Python code, I'm a beginner on that language..

Installation
------------
Once decompressed the file: http://spazioinwind.libero.it/montec...fdupes-0.7.tgz
run the Build.sh command in order to compile and to install the "simil" module.
To the test the modules you need to modify the Test.sh shell before the execution.
( the "int" mode is intentionally commented because on slower machine you could
never see the end, 515 files will generate 132355 iterations! )
Comments, credits, TODO and FIXME can be found inside the source code.

Ideas and optimization
----------------------
In the comparison between the content of files where the key is the size, the
iterations can be limited using a binary tree structure.
Comparing string this is not possible (IMHO) and the only way that I have found
was a "preliminary Check" that avoids to make comparison between string of a great
different length.

Critics, ideas, comments, fix and help are welcomes
luca

PS: sorry for my english

Jul 18 '05 #1
0 1766

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

11
by: Bosconian | last post by:
I'm looking for a method to compare two strings and grade them for similarity. My idea is to strip out common words and punctuation and create a checksum of each remaining string. I would then...
5
by: Achim Domma | last post by:
Hi, I have a list of lets say 100-1000 strings and want to know which one is most similar to a reference string. Does somebody know such a library for Python? I don't need complicated scientific...
21
by: Chris S. | last post by:
I have a number of strings, containing wildcards (e.g. 'abc#e#' where # is anything), which I want to match with a test string (e.g 'abcdef'). What would be the best way for me to store my strings...
0
by: Anibal Acosta | last post by:
Somebody know an algorithm for determine the similarity between two string for example: string 1: "Hello what is your name?, where are you from?" string 2: "Hello man, where are you from?" ...
0
by: himoundary | last post by:
hi all i am using microsoft speech sdk (sapi 5.1). i have the following requirements: 1) i need to compare current input sound (through mic) with recorded sound file (say .wav). based of...
4
by: almurph | last post by:
Hi, Hope you can help me with this one. I'm looking for some nice string comparison algorithms. I want to be able to compare 2 strings (fairly smallish, less than 50 characters) and return a %...
9
by: Rajarshi | last post by:
Hi, I have some code that takes a string and obtains a compressed version using zlib.compress Does anybody know how I can remove the header portion of the compressed bytes, such that I only have...
7
by: IanWright | last post by:
I'm trying to come up with an effective assignment algorithm to allocate nodes to a nearest outlet, where each node should be have a time slot checked against the already allocated nodes for...
3
by: Jeff | last post by:
I've got a series of data like this: Long Sleeve White P/C Sm 32/33 Long Sleeve White P/C Med 32/33 .... What I'd like to do is extract the differences and the similarity. In this case: ...
6
by: aznimah | last post by:
hi, i'm work on image comparison. i'm using the similarity measurement which i need to: 1) convert the image into the binary form since the algorithm that i've use works with binary data for the...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.