Looking for library to estimate likeness of two strings

agenkin

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

Feb 6 '08 #1

Subscribe Post Reply

2181

Tim Chase

Are there any Python libraries implementing measurement of similarity

of two strings of Latin characters?

It sounds like you're interested in calculating the Levenshtein
distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

which gives you a measure of how different they are. A measure
of "0" is that the inputs are the same. The more different the
two strings are, the greater the resulting output of the function.

Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
N=len(word2)) from my understanding of the code I've seen.
However it really is the best approximation I've seen of a "how
similar are these two strings" function. Googling for

python levenshtein distance

brings up oodles of hits.

-tkc

Feb 6 '08 #2

Steven D'Aprano

On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:

Jeff Schwab wrote:

....

>If the strings happen to be the same length, the Levenshtein distance
is equivalent to the Hamming distance.

....

I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:

....

In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6

In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2

I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.
For what it's worth, here's my implementation, which I wrote some years
ago. I doubt it's optimal.

def levenshtein(s1, s2):
"""Return the Levenshtein edit distance between two strings.

s1 and s2 should be two arbitrary length strings. Returns an
integer count of the edit distance between the strings.
"""
m, n = len(s1), len(s2)
if m == 0: return n
elif n == 0: return m
# Create an array with rows 0..m and columns 0..n.
array = [None]*(m+1)
for i in range(m+1):
array[i] = [None]*(n+1)
# Initialise the first row to 0..n.
for i in range(n+1):
array[0][i] = i
# Initialize the first column to 0..m.
for i in range(m+1):
array[i][0] = i
# Measure the differences.
for i in range(1, m+1):
c1 = s1[i-1]
for j in range(1, n+1):
c2 = s2[j-1]
cost = int(c1 != c2)
x = array[i][j-1] + 1 # Cell immediately above plus one.
y = array[i-1][j] + 1 # Cell immediately left plus one.
z = array[i-1][j-1] + cost # Cell diagonally above and
# to the left, plus the cost.
array[i][j] = min(x, y, z)
# When done, the bottom-right cell contains the Levenshtein distance.
return array[m][n]

--
Steven

Feb 7 '08 #3

Jeff Schwab

Steven D'Aprano wrote:

On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:

>Jeff Schwab wrote:
...

>>If the strings happen to be the same length, the Levenshtein distance
is equivalent to the Hamming distance.

...
>I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
...

I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.

Whoops! My mistake.

Feb 7 '08 #4

Daniel Fetchinson

Are there any Python libraries implementing measurement of similarity

of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.

Very concretely if I have an sqlite db with SQLObject as the ORM what
would be the most effective way of doing such a query?

Cheers,
Daniel

Feb 7 '08 #5

Lee Capps

At 14:01 Wed 06 Feb 2008, ag*****@gmail.com wrote:

>Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

I used difflib.get_close_matches for something similar:

http://docs.python.org/lib/module-difflib.html

HTH.

--
Lee Capps
Technology Specialist
CTE Resource Center

Feb 7 '08 #6

JKPeck

On Feb 7, 6:11 am, Lee Capps <lca...@cteresource.orgwrote:

At 14:01 Wed 06 Feb 2008, agen...@gmail.com wrote:

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

I used difflib.get_close_matches for something similar:

http://docs.python.org/lib/module-difflib.html

HTH.

--
Lee Capps
Technology Specialist
CTE Resource Center

Algorithms typically used for name comparisons include soundex,
nysiis, and levenshtein distance. The last is more general and
variations are used in spell checkers. You can probably Google for
Python versions. You can find implementations of these comparison
functions at
www.spss.com/devcentral in the extendedTransforms.py module.
(Requires a login but free).

HTH,
Jon Peck

Feb 7 '08 #7

agenkin

Hi, All:

Many thanks for the excellent leads. I've also found several
functions to find phonetic similarity between English names: the
mentioned above soundex, then, also, one called metaphone. I'm now
thinking of the best way to use some combination of these functions.

Feb 7 '08 #8

agenkin

On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@googlemail.com>
wrote:

Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.

I have never used sqlite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john');
SELECT * FROM s WHERE difference(name, 'john') 2;

http://www.postgresql.org/docs/8.3/s...ystrmatch.html

Feb 7 '08 #9

Guilherme Polo

2008/2/7, ag*****@gmail.com <ag*****@gmail.com>:

On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@googlemail.com>
wrote:

Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.

I have never used sqlite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john');
SELECT * FROM s WHERE difference(name, 'john') 2;

http://www.postgresql.org/docs/8.3/s...ystrmatch.html

SQLite supports soundex, but it is disabled by default, you need to
compile it with -DSQLITE_SOUNDEX=1

--
http://mail.python.org/mailman/listinfo/python-list

--
-- Guilherme H. Polo Goncalves

Feb 7 '08 #10

Similar topics

Comparing strings

by: Dennieku | last post by:

Does anybody have an idea how to compare two strings? What I want to know is the percentage of the differences between the strings Thx, Dennieku

.NET Framework

Looking for a tool to make plain text document out of a simple HTML document

by: Akseli Mäki | last post by:

Hi, Hopefully this is not too much offtopic. I'm working on a FAQ. I want to make two versions of it, plain text and HTML. I'm looking for a tool that will make a plain text doc out of the...

HTML / CSS

Needed quality freeware library for manipulating strings...

by: Aeryn | last post by:

Can any one recommend a good VB.NET freeware library that offers lots of string related functions that aren't present in the standard .NET framework?

Visual Basic .NET

python's library support

by: Sean | last post by:

I am a newbie in python, and I have a feeling that python provides less library support than perl www.cpan.org This seems a big discussion topic. I want to know if there is extensive algorithm...

Python

Amaltheia C++ library: No Main() function?

by: wayne.denier | last post by:

Ran across this well documented, nice looking library for C++ gaming called Amaltheia ( http://home.gna.org/amaltheia/ ), but they don't have a community :P. I'm not sure if anybody has used it...

C / C++

unable to read char * strings from a buffer

by: nass | last post by:

hello everyone, i have a bit of problem reading char * strings from a buffer (a shared memory, pointed to by 'file_memory'). basically i have a structure in memory 'ShMem' that can be accessed by...

C / C++

C Strings

by: cman | last post by:

Can you "walk across" C strings or char pointers (using *(sz+1)) like you can with arrays. If not, why not? If so, how? cman

C / C++

Percentage Likeness between strings

by: Travis | last post by:

Is there a way to determine the percentage likeness two strings share? So that given an parameter string1 and a list of strings to search, I could display only those values that are say, a 75% or...

C / C++

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server