
Fuzzy Lookups

I have some CDs and have been archiving them on a PC. I wrote a Python
script that scans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example, the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways: jolie, joli, blon, blond,
etc. In addition, the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match, and it turns out
that there is a very good one: the Levenshtein distance, which is the
number of insertions, deletions and substitutions needed to morph one
string into another. In my application I match the desired song title
against all song titles in my list and return the ones with the lowest
Levenshtein distances. This is remarkably, one might even say
stunningly, effective, easily finding all the versions of Jolie Blon in
the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    c = {}
    n = len(a); m = len(b)

    # First row and column: turning a prefix into the empty string
    # costs one edit per character.
    for i in range(0, n + 1):
        c[i, 0] = i
    for j in range(0, m + 1):
        c[0, j] = j

    # Fill in the rest of the cost table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x = c[i - 1, j] + 1          # deletion
            y = c[i, j - 1] + 1          # insertion
            if a[i - 1] == b[j - 1]:
                z = c[i - 1, j - 1]      # characters match, no cost
            else:
                z = c[i - 1, j - 1] + 1  # substitution
            c[i, j] = min(x, y, z)
    return c[n, m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?
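
For context, the lookup itself is roughly the following (just a sketch;
find_song, catalog and the cutoff of 3 edits are illustrative names and
choices, not anything standard):

def find_song(title, catalog, max_dist=3):
    """Return catalog rows whose song title is within max_dist edits."""
    scored = []
    for genre, artist, album, song in catalog:
        d = distance(title.lower(), song.lower())
        if d <= max_dist:
            scored.append((d, genre, artist, album, song))
    scored.sort()   # lowest Levenshtein distance first
    return scored

So find_song("Jolie Blond", catalog) also turns up rows whose title is
"Joli Blon", "Jolie Blonde" and so on.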

jab

Jan 30 '06
I wonder which algorithm determines the similarity between two strings better?
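
(For a concrete comparison -- just a sketch using the standard library
plus the distance() function from the first post; the test titles and
the formatting are made up:)

import difflib

def gestalt_score(a, b):
    """difflib's Ratcliff/Obershelp-style similarity, in the range 0..1."""
    return difflib.SequenceMatcher(None, a, b).ratio()

wanted = "jolie blond"
for candidate in ["Jolie Blond", "Joli Blon", "Jole Blonde", "Hey Joe"]:
    print("%-12s edits=%d  ratio=%.2f" % (candidate,
          distance(wanted, candidate.lower()),
          gestalt_score(wanted, candidate.lower())))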

On 1/31/06, Kent Johnson <ke**@kentsjohnson.com> wrote:
Gregory Piñero wrote:
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches = difflib.get_close_matches(AFileName, AllFiles, 20, .7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similar algorithm to your function.


A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.

Kent
--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #11
On Tue, 31 Jan 2006 10:51:44 -0500, Gregory Piñero wrote:
http://www.blendedtechnologies.com/r...zy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking to find exact matches regardless of name
- Put each set of duplicates in its own subfolder.


This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?
--
Steven.

Jan 31 '06 #12
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.
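
Roughly (a sketch -- it groups with a dict keyed by digest rather than
literally sorting a list, but the idea is the same; the file paths and
the choice of md5 are just for illustration):

import hashlib

def find_duplicates(paths):
    """Group file paths by the MD5 digest of their contents."""
    by_hash = {}
    for path in paths:
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        by_hash.setdefault(h.hexdigest(), []).append(path)
    # Keep only the digests shared by more than one file.
    return [group for group in by_hash.values() if len(group) > 1]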
Jan 31 '06 #13
On Tue, 31 Jan 2006, it was written:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare?
I often wonder that!
Is it purely a caching thing (once you have the checksum, you don't
need to read the file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates (i.e.
if files 637 and 2945 have the same contents, you want to discover
that). The obvious way is make a list of hashes, and sort the list.


Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, they all differ within the first 20 bytes - probably due to the ID3
tags, so this should work for these.

Better yet, note that if two files are identical, they must have the same
length, and that finding the length can be done very cheaply, so a quicker
yet approach is to make a list of lengths, sort that, and look for
duplicates; when you find one, do a byte-by-byte comparison of the files
(probably terminating in the first block) to see if they really are the
same.
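
A rough sketch of that approach (the helper names are made up, and the
byte-by-byte comparison is delegated to the standard library's filecmp
for brevity):

import os, filecmp

def find_duplicates_by_size(paths):
    """Group candidates by size, then confirm with a full content compare."""
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique length can't have a duplicate
        # Same-sized groups are usually tiny, so compare every pair in them.
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if filecmp.cmp(group[i], group[j], shallow=False):
                    duplicates.append((group[i], group[j]))
    return duplicates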

By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within the
first 18 bytes (mostly, within the first 9 bytes). Therefore, i could rule
out duplication with just 22 data blocks read from disk (plus rather more
blocks of directory information and inodes, of course). A hash-based
approach would have had to wade through a touch over 13 GB of data before
it could even get started.

Of course, there are situations where this is the wrong approach - if you
have a collection of serialised sparse matrices, for example, which
consist of identically-sized blocks of zeroes with a scattering of ones
throughout, then lengths and prefixes will be useless, whereas hashes will
work perfectly. However, here, we're looking at MP3s, where lengths and
prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one is a song from that and The Doors' 'Horse Latitudes', and
one is between two Calexico songs ('The Ride (Pt II)' and 'Minas De
Cobre'). Why on earth are eleven of the twelve pairs pairs of songs from
the same artist? Is it really that they're pairs of songs from the same
compressor (those tracks aren't from CD), i wonder?

--
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme. -- Select Committee
on Transport
Feb 1 '06 #14
On Tue, 31 Jan 2006 13:38:50 -0800, Paul Rubin wrote:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.


Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)

I can't see any, but I thought maybe that's because I'm not thinking
outside the box.
--
Steven.

Feb 1 '06 #15
Tom Anderson <tw**@urchin.earth.li> writes:
The obvious way is make a list of hashes, and sort the list.
Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, ...


The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for. Hashing the first
few hundred bytes would be more reliable, since it would include some
actual audio in the hash. But then you're back to a list of hashes.
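
(A sketch of that variant -- the 512-byte prefix and the choice of md5
are arbitrary:)

import hashlib

def prefix_digest(path, nbytes=512):
    """Hash just the first nbytes of a file, the 'first few hundred bytes' idea."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()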
Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so
a quicker yet approach is to make a list of lengths, sort that, and
look for duplicates; when you find one, do a byte-by-byte comparison
of the files (probably terminating in the first block) to see if they
really are the same.
Yes, checking the file lengths first is an obvious heuristic, but if
you find you have a bunch of files with the same length, what do you
do? You're back to a list of hashes.
By way of example, of the 2690 music files in my iTunes library, i
have twelve pairs of same-sized files [1], and all of these differ
within the first 18 bytes (mostly, within the first 9 bytes).


That's a small enough set of matches that you don't need a general
purpose algorithm.
Feb 2 '06 #16
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)


No of course not, except in special situations, like some problem
opening and reading both files simultaneously. E.g.: the files are on
two different DVD-R's, they are too big to fit in ram, and you only
have one DVD drive. If you want to compare byte by byte, you have to
either copy one of the DVD's to your hard disk (if you have the space
available) or else swap DVD's back and forth in the DVD drive reading
and comparing a bufferload at a time. But you can easily read in the
first DVD and compute its hash on the fly, then read and hash the
second DVD and compare the hashes.

If it's a normal situation with two files on HD, just open both files
simultaneously, and use large buffers to keep the amount of seeking
reasonable. That will be faster than big md5 computations, and more
reliable (there are known ways to construct pairs of distinct files
that have the same md5 hash.)
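
Something like this, roughly (a sketch; the 1 MB buffer size is
arbitrary):

def files_equal(path1, path2, bufsize=1024 * 1024):
    """Compare two files a buffer at a time, stopping at the first difference."""
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            b1 = f1.read(bufsize)
            b2 = f2.read(bufsize)
            if b1 != b2:
                return False      # differs (or one file ended early)
            if not b1:
                return True       # both exhausted at the same point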
Feb 2 '06 #17
Steven D'Aprano wrote:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


Because if you store a hash, then you can keep that around even when the
original file is archived, moved elsewhere, or deleted. It's awfully
helpful for building databases of files you've seen before.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
Everyone wants to look good at his own funeral.
-- Louis Wu
Feb 2 '06 #18
On Thu, 1 Feb 2006, it was written:
Tom Anderson <tw**@urchin.earth.li> writes:
The obvious way is make a list of hashes, and sort the list.


Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, ...


The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for.


So read from the end of the file, rather than the beginning.
Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so a
quicker yet approach is to make a list of lengths, sort that, and look
for duplicates; when you find one, do a byte-by-byte comparison of the
files (probably terminating in the first block) to see if they really
are the same.


Yes, checking the file lengths first is an obvious heuristic, but if you
find you have a bunch of files with the same length, what do you do?
You're back to a list of hashes.


Or prefixes or suffixes.
By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within
the first 18 bytes (mostly, within the first 9 bytes).


That's a small enough set of matches that you don't need a general
purpose algorithm.


True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe it is representative of
most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases. Could you suggest a common kind of file with degenerate
lengths, prefixes and suffixes?

The only one that springs to mind is a set of same-sized image files in
some noncompressed format, recording similar images (frames in a movie,
say), where the differences might be buried deep in the pixel data. As it
happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.

tom

--
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt
Feb 2 '06 #19
Diez B. Roggisch wrote:
I did a Levenshtein fuzzy search myself; however, I enhanced my version by
normalizing the distance in the following way:

def relative(a, b):
    """
    Computes a relative distance between two strings. Its in the range
    (0-1] where 1 means total equality.
    @type a: string
    @param a: arg one

<snip>

Hello,

I adapted your approach to my needs and thought you might like to see
the result

def LevenshteinRelative(a, b):
    """
    Returns the Levenshtein distance between two strings
    as a relative quantity in the range 1 to 0 where
    1.0 is a perfect match.
    """
    # Calculate the Levenshtein distance. (returns an integer)
    dist = LevenshteinDistance(a, b)
    # dist is at most the length of the longer string.
    max_dist = float(max(len(a), len(b)))
    # dist is always at least the difference of the sizes of the two strings.
    min_dist = max_dist - float(min(len(a), len(b)))
    try:  # If max_dist and min_dist are equal use simple form.
        relative = 1.0 - (dist - min_dist) / (max_dist - min_dist)
    except ZeroDivisionError:
        relative = 1.0 - dist / max_dist
    return relative
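
A quick usage sketch (wanted and all_songs are illustrative names, the
0.8 cutoff is arbitrary, and LevenshteinDistance is assumed to be the
distance() function from the first post under another name):

scored = [(LevenshteinRelative(wanted, song), song) for song in all_songs]
matches = [pair for pair in sorted(scored, reverse=True) if pair[0] >= 0.8]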

Thanks,

jab

Feb 3 '06 #20
