473,786 Members | 2,407 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Fuzzy Lookups

I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Jan 30 '06 #1
24 2693
> As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?


I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = distance(a,b)
longer = float(max((len( a), len(b))))
shorter = float(min((len( a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r
distance is a run-of-the mill levenshtein-distance as yours.

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3, but the normalized distance is
0.729591836735

and

0.015306122449

respectively - making you chose the better match :)

regards,

Diez

Jan 30 '06 #2
Diez B. Roggisch wrote:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3

distance("Angel ina Jolie", "AngelinaJolei" ) 3 distance("Angel ina Jolie", "Bob")

13

what did I miss ?

</F>

Jan 30 '06 #3
Fredrik Lundh wrote:
Diez B. Roggisch wrote:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3

distance("Angel ina Jolie", "AngelinaJolei" ) 3 distance("Angel ina Jolie", "Bob")

13

what did I miss ?


Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki...htein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And my implementation I got from there (or better from Magnus Lie Hetland
whoms python version is referenced there)

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results by far better.

Sorry to cause confusion.

Diez
Jan 30 '06 #4
Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:


Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab

Jan 30 '06 #5

BBands wrote:
Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:


Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab


i noticed this guy, who's quite a good ruby developer spent some time
on distances:

http://ruby.brian-schroeder.de/editierdistanz/

and also look at soundex, other algorithms (Double Metaphone, NYSIIS,
Phonex, I have notes to investigate but I haven't looked at them
myhself)

Jan 31 '06 #6

BBands wrote:
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab


Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1][i] == n[1][i]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.

Jan 31 '06 #7
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=di fflib.get_close _matches(AFileN ame,AllFiles,20 ,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

-Greg
On 30 Jan 2006 17:28:47 -0800, ajones <aj*****@gmail. com> wrote:

BBands wrote:
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab


Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1][i] == n[1][i]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #8
> Thanks for that, I'll have a look. (So many packages, so little
time...)
Yes, there's a standard library for everything it seems! Except for a
MySQL api :-(
I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.


I would be very interested it seeing that.


Done, see:
http://www.blendedtechnologies.com/r...zy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking for find exact matches regardless of name
- Put each set of duplicates in its own subfolder.

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #9
Gregory Piñero wrote:
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=di fflib.get_close _matches(AFileN ame,AllFiles,20 ,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.


A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.

Kent
Jan 31 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

17
14073
by: Andrew McLean | last post by:
I have a problem that is suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success. Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More precisely, for each address in database A I want to find a single matching address in database B. I'm 90% of the way there, in the sense that I have a simplistic...
1
4198
by: Ray Gardener | last post by:
I was wondering if anyone had tried implementing fuzzy logic set concepts in C++, because in FL, the concept of "type" or "class" is fuzzy; things belong (or are of) a given type only by degree. e.g., in a hypothetical fuzzy C++ language one could say: class pickle : public vegetable 0.2 { // pickle is not so much a vegetable as, say, onion is. };
1
1462
by: Russell | last post by:
Hi, I've been reading a lot about not using lookups in tables lately. Thus far, what I've been able to understand from my reading is this: - Do not use lookup fields in tables because they hide what's really stored (e.g. 'Joe Bloggs' is displayed, '4' is stored as the ID) - Instead, use combo boxes on forms, taking their values from tables or queries.
5
4952
by: AnAnimal | last post by:
We are moving a rather large application to the VB.NET platform. This application has a ton of table lookups. ie. code & description or code, category & description or code, class, category & description Most of the users know the more common codes and will just enter them but occasionally will need to look these codes up (especially when a temp is using the system!!). How would the VB.NET veterans handle this?
24
14416
by: cassetti | last post by:
Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate phone numbers, and emails and addresses even person names. I need to sift through all this data (roughly 300,000+ records and use fuzzy logic to break it down, so that i have only unique records.
6
5420
by: mac | last post by:
Summary: 1. I want to define a column in anMS Access table to be lookups on other tables. 2. The table that is the data source (e.g the "parent" table) has a composite primary key. 3. When the "child" table does the lookup, it should pass all the columns necessary to properly restrict the data returned to the litbso for he lookup. 4. How do I accomplish this in a lookup?
14
13496
by: Steve Bergman | last post by:
I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in various ways. One might include a '-' or have leading zeros, or have a single character missing, or a zero that is typed as a letter 'O'. That kind of thing. These tables currently reside in a mysql database. I was wondering if there is a...
2
2903
by: King Ron | last post by:
Ola all. In responding to a recent post requesting help with a search issue, I recommended using a combo box lookup in the table design. "paii, Ron" (no relation) posted this reply: " There are many posts in this group explaining the pitfalls of using the lookup feature in tables. Best practice appears to be, keep the lookup in the forms using combo box or list box."
4
4387
by: netnewbie78 | last post by:
Hello All, I don't have a problem (but maybe I will after I explain). I have a question with regards to something I saw in Access 2007. But first, a little backstory: I'm writing a very small stock database. For now, it will simply track what products come in (Stocks bought by Project) and what products go out (Stocks sold to by Project) . There are three tables: Products, Transactions and Projects.
0
9647
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10360
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10163
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10108
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8988
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
6744
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5397
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5532
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
2894
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.