Fuzzy Lookups - Python

BBands

I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Jan 30 '06 #1

Subscribe Reply

2693

Diez B. Roggisch

> As mentioned above this works quite well and I am happy with it, but I

wonder if there is a more Pythonic way of doing this type of lookup?

I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = distance(a,b)
longer = float(max((len( a), len(b))))
shorter = float(min((len( a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r
distance is a run-of-the mill levenshtein-distance as yours.

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3, but the normalized distance is
0.729591836735

and

0.015306122449

respectively - making you chose the better match :)

regards,

Diez

Jan 30 '06 #2

Fredrik Lundh

Diez B. Roggisch wrote:

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3

distance("Angel ina Jolie", "AngelinaJolei" ) 3 distance("Angel ina Jolie", "Bob")

13

what did I miss ?

</F>

Jan 30 '06 #3

Diez B. Roggisch

Fredrik Lundh wrote:

Diez B. Roggisch wrote:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJo lei"

and

"Bob"

Both have a l-dist of 3

distance("Angel ina Jolie", "AngelinaJolei" ) 3 distance("Angel ina Jolie", "Bob")

13

what did I miss ?

Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki...htein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And my implementation I got from there (or better from Magnus Lie Hetland
whoms python version is referenced there)

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results by far better.

Sorry to cause confusion.

Diez

Jan 30 '06 #4

BBands

Diez B. Roggisch wrote:

I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab

Jan 30 '06 #5

gene tani

BBands wrote:

Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab

i noticed this guy, who's quite a good ruby developer spent some time
on distances:

http://ruby.brian-schroeder.de/editierdistanz/

and also look at soundex, other algorithms (Double Metaphone, NYSIIS,
Phonex, I have notes to investigate but I haven't looked at them
myhself)

Jan 31 '06 #6

ajones

BBands wrote:

I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1][i] == n[1][i]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.

Jan 31 '06 #7

Gregory Piñero

Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=di fflib.get_close _matches(AFileN ame,AllFiles,20 ,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

-Greg
On 30 Jan 2006 17:28:47 -0800, ajones <aj*****@gmail. com> wrote:

BBands wrote:
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1][i] == n[1][i]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Jan 31 '06 #8

Gregory Piñero

> Thanks for that, I'll have a look. (So many packages, so little

time...)
Yes, there's a standard library for everything it seems! Except for a
MySQL api :-(

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

I would be very interested it seeing that.

Done, see:
http://www.blendedtechnologies.com/r...zy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking for find exact matches regardless of name
- Put each set of duplicates in its own subfolder.

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Jan 31 '06 #9

Kent Johnson

Gregory Piñero wrote:

Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=di fflib.get_close _matches(AFileN ame,AllFiles,20 ,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.

Kent

Jan 31 '06 #10

Similar topics

14073

Fuzzy matching of postal addresses

by: Andrew McLean | last post by:

I have a problem that is suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success. Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More precisely, for each address in database A I want to find a single matching address in database B. I'm 90% of the way there, in the sense that I have a simplistic...

Python

4198

Fuzzy logic question

by: Ray Gardener | last post by:

I was wondering if anyone had tried implementing fuzzy logic set concepts in C++, because in FL, the concept of "type" or "class" is fuzzy; things belong (or are of) a given type only by degree. e.g., in a hypothetical fuzzy C++ language one could say: class pickle : public vegetable 0.2 { // pickle is not so much a vegetable as, say, onion is. };

C / C++

1462

Alternative to lookups in tables - questions

by: Russell | last post by:

Hi, I've been reading a lot about not using lookups in tables lately. Thus far, what I've been able to understand from my reading is this: - Do not use lookup fields in tables because they hide what's really stored (e.g. 'Joe Bloggs' is displayed, '4' is stored as the ID) - Instead, use combo boxes on forms, taking their values from tables or queries.

Microsoft Access / VBA

4952

How should I handle code table lookups?

by: AnAnimal | last post by:

We are moving a rather large application to the VB.NET platform. This application has a ton of table lookups. ie. code & description or code, category & description or code, class, category & description Most of the users know the more common codes and will just enter them but occasionally will need to look these codes up (especially when a temp is using the system!!). How would the VB.NET veterans handle this?

Visual Basic .NET

14416

Can Access use Fuzzy Logic

by: cassetti | last post by:

Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate phone numbers, and emails and addresses even person names. I need to sift through all this data (roughly 300,000+ records and use fuzzy logic to break it down, so that i have only unique records.

Microsoft Access / VBA

5420

Lookups with Composite primary keys - How To?

by: mac | last post by:

Summary: 1. I want to define a column in anMS Access table to be lookups on other tables. 2. The table that is the data source (e.g the "parent" table) has a composite primary key. 3. When the "child" table does the lookup, it should pass all the columns necessary to properly restrict the data returned to the litbso for he lookup. 4. How do I accomplish this in a lookup?

Microsoft Access / VBA

13496

Fuzzy string comparison

by: Steve Bergman | last post by:

I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in various ways. One might include a '-' or have leading zeros, or have a single character missing, or a zero that is typed as a letter 'O'. That kind of thing. These tables currently reside in a mysql database. I was wondering if there is a...

Python

2903

Discussion: using combobox & listbox lookups in table field design

by: King Ron | last post by:

Ola all. In responding to a recent post requesting help with a search issue, I recommended using a combo box lookup in the table design. "paii, Ron" (no relation) posted this reply: " There are many posts in this group explaining the pitfalls of using the lookup feature in tables. Best practice appears to be, keep the lookup in the forms using combo box or list box."

Microsoft Access / VBA

4387

Relationships, Lookups and Access 2007 tables

by: netnewbie78 | last post by:

Hello All, I don't have a problem (but maybe I will after I explain). I have a question with regards to something I saw in Access 2007. But first, a little backstory: I'm writing a very small stock database. For now, it will simply track what products come in (Stocks bought by Project) and what products go out (Stocks sold to by Project) . There are three tables: Products, Transactions and Projects.

Microsoft Access / VBA

9647

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

10360

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10163

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10108

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

8988

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

6744

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

5397

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5532

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2894

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General