473,785 Members | 2,434 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Looking for library to estimate likeness of two strings

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!
Feb 6 '08 #1
9 2208
Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?
It sounds like you're interested in calculating the Levenshtein
distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

which gives you a measure of how different they are. A measure
of "0" is that the inputs are the same. The more different the
two strings are, the greater the resulting output of the function.

Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
N=len(word2)) from my understanding of the code I've seen.
However it really is the best approximation I've seen of a "how
similar are these two strings" function. Googling for

python levenshtein distance

brings up oodles of hits.

-tkc

Feb 6 '08 #2
On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
Jeff Schwab wrote:
....
>If the strings happen to be the same length, the Levenshtein distance
is equivalent to the Hamming distance.
....
I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
....
In [4]: hamdist('abcdef ', 'fabcde')
Out[4]: 6

In [5]: levenshtein('ab cdef', 'fabcde')
Out[5]: 2

I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.
For what it's worth, here's my implementation, which I wrote some years
ago. I doubt it's optimal.

def levenshtein(s1, s2):
"""Return the Levenshtein edit distance between two strings.

s1 and s2 should be two arbitrary length strings. Returns an
integer count of the edit distance between the strings.
"""
m, n = len(s1), len(s2)
if m == 0: return n
elif n == 0: return m
# Create an array with rows 0..m and columns 0..n.
array = [None]*(m+1)
for i in range(m+1):
array[i] = [None]*(n+1)
# Initialise the first row to 0..n.
for i in range(n+1):
array[0][i] = i
# Initialize the first column to 0..m.
for i in range(m+1):
array[i][0] = i
# Measure the differences.
for i in range(1, m+1):
c1 = s1[i-1]
for j in range(1, n+1):
c2 = s2[j-1]
cost = int(c1 != c2)
x = array[i][j-1] + 1 # Cell immediately above plus one.
y = array[i-1][j] + 1 # Cell immediately left plus one.
z = array[i-1][j-1] + cost # Cell diagonally above and
# to the left, plus the cost.
array[i][j] = min(x, y, z)
# When done, the bottom-right cell contains the Levenshtein distance.
return array[m][n]


--
Steven
Feb 7 '08 #3
Steven D'Aprano wrote:
On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>Jeff Schwab wrote:
...
>>If the strings happen to be the same length, the Levenshtein distance
is equivalent to the Hamming distance.
...
>I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
...
I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.
Whoops! My mistake.
Feb 7 '08 #4
Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.

Very concretely if I have an sqlite db with SQLObject as the ORM what
would be the most effective way of doing such a query?

Cheers,
Daniel
Feb 7 '08 #5
At 14:01 Wed 06 Feb 2008, ag*****@gmail.c om wrote:
>Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!
I used difflib.get_clo se_matches for something similar:

http://docs.python.org/lib/module-difflib.html

HTH.

--
Lee Capps
Technology Specialist
CTE Resource Center
Feb 7 '08 #6
On Feb 7, 6:11 am, Lee Capps <lca...@ctereso urce.orgwrote:
At 14:01 Wed 06 Feb 2008, agen...@gmail.c om wrote:
Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?
I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.
Many thanks in advance!

I used difflib.get_clo se_matches for something similar:

http://docs.python.org/lib/module-difflib.html

HTH.

--
Lee Capps
Technology Specialist
CTE Resource Center
Algorithms typically used for name comparisons include soundex,
nysiis, and levenshtein distance. The last is more general and
variations are used in spell checkers. You can probably Google for
Python versions. You can find implementations of these comparison
functions at
www.spss.com/devcentral in the extendedTransfo rms.py module.
(Requires a login but free).

HTH,
Jon Peck
Feb 7 '08 #7
Hi, All:

Many thanks for the excellent leads. I've also found several
functions to find phonetic similarity between English names: the
mentioned above soundex, then, also, one called metaphone. I'm now
thinking of the best way to use some combination of these functions.
Feb 7 '08 #8
On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@goo glemail.com>
wrote:
Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.
I have never used sqlite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john') ;
SELECT * FROM s WHERE difference(name , 'john') 2;

http://www.postgresql.org/docs/8.3/s...ystrmatch.html
Feb 7 '08 #9
2008/2/7, ag*****@gmail.c om <ag*****@gmail. com>:
On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@goo glemail.com>
wrote:
Hi folks, just went through this thread and a related one from 2006
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something that is close to it (close being defined by one of
the string metrics discussed above) it seems I have to select each and
every word from the database and compare it with the query word which
is very ineffective.


I have never used sqlite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john') ;
SELECT * FROM s WHERE difference(name , 'john') 2;

http://www.postgresql.org/docs/8.3/s...ystrmatch.html
SQLite supports soundex, but it is disabled by default, you need to
compile it with -DSQLITE_SOUNDEX =1
--
http://mail.python.org/mailman/listinfo/python-list

--
-- Guilherme H. Polo Goncalves
Feb 7 '08 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
3403
by: Dennieku | last post by:
Does anybody have an idea how to compare two strings? What I want to know is the percentage of the differences between the strings Thx, Dennieku
14
6905
by: Akseli Mäki | last post by:
Hi, Hopefully this is not too much offtopic. I'm working on a FAQ. I want to make two versions of it, plain text and HTML. I'm looking for a tool that will make a plain text doc out of the HTML doc. The HTML version doesn't have anything fancy, just internal links. So the tool must be able to delete internal links and anchors from the HTML version, but leave external links in simplified form. That is, the HTML version would say <a...
5
1110
by: Aeryn | last post by:
Can any one recommend a good VB.NET freeware library that offers lots of string related functions that aren't present in the standard .NET framework?
4
1389
by: Sean | last post by:
I am a newbie in python, and I have a feeling that python provides less library support than perl www.cpan.org This seems a big discussion topic. I want to know if there is extensive algorithm library support in python. I know there is a pretty neat module in perl to implement graph theory. Is there a graph theory module in python? Thanks,
1
2184
by: wayne.denier | last post by:
Ran across this well documented, nice looking library for C++ gaming called Amaltheia ( http://home.gna.org/amaltheia/ ), but they don't have a community :P. I'm not sure if anybody has used it before, but the guys tell you not to write a main() function, that their library does it for you. You specify constructor, loop, a destructor functions. It all sounds good except that after compiling, my linker throws up, saying.. Error:...
4
3577
by: nass | last post by:
hello everyone, i have a bit of problem reading char * strings from a buffer (a shared memory, pointed to by 'file_memory'). basically i have a structure in memory 'ShMem' that can be accessed by 2 applications (or will be at least when it is done). the structure is declared in the procedure under the pointer infoVar. members tXXL are integer lengths of the strings that as all saved (concatenated) at address of oAndS_str, which is of type...
74
3168
by: cman | last post by:
Can you "walk across" C strings or char pointers (using *(sz+1)) like you can with arrays. If not, why not? If so, how? cman
4
3060
by: Travis | last post by:
Is there a way to determine the percentage likeness two strings share? So that given an parameter string1 and a list of strings to search, I could display only those values that are say, a 75% or more match?
0
9645
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9950
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8972
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7499
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6739
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5511
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4050
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3645
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2879
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.