
# Comparing 2 similar strings?

 P: n/a How do you compare 2 strings and determine how "close" they are to each other? E.g.

    aqwerty
    qwertyb

are similar to each other, except for the first/last char. But how do I quantify that? I guess you can say for the above 2 strings that, at most, 6 chars out of 7 are in the same sequence --> 85% max. But for

    qawerty
    qwerbty

the max correlation is 3 chars out of 7 in the same sequence --> 42% max. (Crossposted to 3 of my favourite newsgroups.) -- William Park , Toronto, Canada ThinFlash: Linux thin-client on USB key (flash) drive http://home.eol.ca/~parkw/thinflash.html Jul 19 '05 #1
26 Replies

 P: n/a Hello William, How do you compare 2 strings, and determine how much they are "close" to each other? This is a classic problem of computer science. Have a look at these: http://odur.let.rug.nl/~kleiweg/lev/ http://www.levenshtein.net/ This solution has applications in speech recognition and also in DNA research. Jul 19 '05 #2
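To make the suggestion above concrete, here is a minimal Levenshtein (edit) distance sketch, just a standard dynamic-programming implementation, not code from either of the linked pages:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("aqwerty", "qwertyb"))  # 2: drop the 'a', add the 'b'
print(levenshtein("kitten", "sitting"))   # 3
```

One can turn the distance d into a similarity score with, e.g., 1 - d/max(len(a), len(b)); note this counts edit operations, which is a different measure from the OP's "chars in the same sequence" percentages.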

 P: n/a William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? Eg. aqwerty qwertyb are similar to each other, except for first/last char. But, how do I quantify that? I guess you can say for the above 2 strings that - at max, 6 chars out of 7 are same sequence --> 85% max But, for qawerty qwerbty max correlation is - 3 chars out of 7 are the same sequence --> 42% max (Crossposted to 3 of my favourite newsgroup.) "However you like" is probably the right answer, but one way might be to compare their soundex encoding (http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out percentage difference based on comparing the numeric part. Ed. Jul 19 '05 #3

 P: n/a On Wed, 18 May 2005 15:06:53 -0500, Ed Morton wrote: William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? ... "However you like" is probably the right answer, but one way might be to compare their soundex encoding (http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out the percentage difference based on comparing the numeric part. Fantastic suggestion. Here's a tiny piece of real-life test data: compare the surnames "Mousaferiadis" and "McPherson". Jul 19 '05 #4

 P: n/a William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? ... http://www.personal.psu.edu/staff/i/u/iua1/python/apse/ http://www.nist.gov/dads/HTML/editDistance.html Jul 19 '05 #5

 P: n/a On Wed, 18 May 2005 15:48:32 -0400, William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? ... 1. Google for such topics as "fuzzy matching", "edit distance", "approximate comparison". 2. Closer to home, look at the thread in comp.lang.python around 2004-11-18 -- search for "Pingel Hyyro" [and yes you do mean "hyyro", not "hydro"!!] 3. Steadfastly ignore any (presumably) well-intentioned profferings of soundex. HTH, John Jul 19 '05 #6

 P: n/a On Wed, 18 May 2005 13:45:30 -0700, Don wrote: http://www.personal.psu.edu/staff/i/u/iua1/python/apse/ The above is broken, not meeting one of the elementary conditions for a distance metric: distance(a, b) == distance(b, a) Quoting from its docs: | Note: The definition of the goodness of an approximate match is the | number of steps required to bring the string pattern to a form that is | entirely contained in the string to which it is being matched. Note: "entirely contained in", rather than "equal to". Now read on: | The matching | is not commutative. The pattern that you instantiate the class with will be | matched against the input. For example the word "funky" can be made to | match the word "funnybone" with an edit distance of one. However, using | "funnybone" as a pattern that will be matched to "funky" the distance | will become five. | | Example: | | >>> from Apse import Approx | >>> a = Approx("funky") | >>> a.dist("funnybone") | 1 | >>> a = Approx("funnybone") | >>> a.dist("funky") | 5 Jul 19 '05 #7

 P: n/a [trimmed to c.l.py] In article <24**************************@PRIMUS.CA>, William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? ... difflib -- use the string as the sequence -- Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/ "And if that makes me an elitist...I couldn't be happier." --JMS Jul 19 '05 #8
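Since difflib is in the standard library, this suggestion needs nothing extra; a minimal sketch:

```python
import difflib

a, b = "aqwerty", "qwertyb"
sm = difflib.SequenceMatcher(None, a, b)

# ratio() is 2*M / (len(a) + len(b)), where M is the total length of
# the matching blocks -- here the single shared block "qwerty".
print(round(sm.ratio(), 3))   # 0.857, i.e. the OP's "85% max"

# The blocks themselves: (index in a, index in b, length)
for block in sm.get_matching_blocks():
    print(block)
```

Note that SequenceMatcher counts all matching blocks, not just the longest one, so its ratio can come out higher than the OP's hand count for scrambled pairs like "qawerty"/"qwerbty".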

 P: n/a Could hit a few snags. Quick out-of-the-library compression using standards like zlib will have headers that will dilute the difference on short strings, and on long strings block compression (zlib, bzip2) will not pick up similarities because the similarities will be in different blocks. With blocks of around 100k-1M in these algos by default (IIRC), this could work well for strings between oh say 1k-50k. But I need to underscore Aahz's posting above: ***Check out difflib, it's in the library.*** Perfect package for what the OP wants AFAICT. Jul 19 '05 #9

 P: n/a On Thu, 19 May 2005 06:38:59 +1000, John Machin wrote: [soundex discussion snipped] I remember a word processor, MultiMate, which used Soundex to do matching for its spelling-correction suggestions. One of my cow-orkers typed the word 'agains' -- it was supposed to be 'against', but 'again' would also have been a sensible suggestion. MultiMate, however, suggested neither of those reasonable words; it suggested "iguanas" and "Utahns"... (I wonder what it does with "Talliafero" and "Tolliver", and with "Featherstone-Howe" and "Fanshaw"...) The answer to the OP, which someone just pointed out to me on comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A google search on either produces some good descriptions in the first few links, including http://www.merriampark.com/ld.htm which has not only a description of the algorithm but also source code in Java, C++ and Visual Basic (no Awk, though there are links to pages with implementations in Perl, Python, Delphi and many more)... [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' in Schwaebisch (south German) fashion, but apparently the author of the algorithm was one Vladimir I. Levenshtein and that is how he is credited on the IEEE site. There are also a number of Google hits under the 'stein' spelling, though... Chris C Jul 19 '05 #10

 P: n/a On Thu, 19 May 2005 07:07:56 +1000, John Machin wrote: On Wed, 18 May 2005 13:45:30 -0700, Don wrote: http://www.personal.psu.edu/staff/i/u/iua1/python/apse/ The above is broken, not meeting one of the elementary conditions for a distance metric: distance(a, b) == distance(b, a) I agree that this makes the edit distance broken in the context of text strings, but you should be aware that distance is only commutative if there are no constraints on travel. If you've ever driven around cities like Sydney that use lots of one-way streets, you will know that the distance from point A to point B is not necessarily the same as the distance from B back to A. -- Steven Jul 19 '05 #11

 P: n/a On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote: None of the other approaches make the mistake of preserving the first letter -- this alone is almost enough reason for jettisoning soundex. Off-topic now, but you've made me curious. Why is this a bad idea? How would you handle the case of "barow" and "marow"? (Barrow and marrow, naturally.) Without the first letter, they sound identical. Why is throwing that information away a good thing? Thanks, -- Steven Jul 19 '05 #12

 P: n/a Steven D'Aprano wrote: On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote: None of the other approaches make the mistake of preserving the first letter -- this alone is almost enough reason for jettisoning soundex. Off-topic now, but you've made me curious. Why is this a bad idea? Because of situations like, for example, my mother's last name, which originally started with "Y" but got anglicized to a name beginning with "E". Same name, different Soundex codes, and the problem occurs only because of Soundex's preservation of the exact initial letter. How would you handle the case of "barow" and "marow"? (Barrow and marrow, naturally.) Without the first letter, they sound identical. Why is throwing that information away a good thing? No one's suggesting throwing away the first letter's information, just removing the special treatment for it. "Barow" becomes 1600 and "Marow" becomes 5600. Jul 19 '05 #13

 P: n/a On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano wrote: On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote: None of the other approaches make the mistake of preserving the first letter -- this alone is almost enough reason for jettisoning soundex. Off-topic now, but you've made me curious. Why is this a bad idea? Why is the first letter any more important than any other? How would you handle the case of "barow" and "marow"? (Barrow and marrow, naturally.) Without the first letter, they sound identical. Why is throwing that information away a good thing? Well, Soundex will quite possibly throw the information away anyway, certainly it regards several letters as the same. But why is the difference between barrow and marrow more important than that between help and held? Or between hatter and hammer? Regarding 'agains' as similar to 'iguanas' and 'Utahns', but not to 'again' or 'against', is silly... Chris C Jul 19 '05 #14

 P: n/a On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano wrote: On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote: None of the other approaches make the mistake of preserving the first letter -- this alone is almost enough reason for jettisoning soundex. Off-topic now, but you've made me curious. Why is this a bad idea? How would you handle the case of "barow" and "marow"? (Barrow and marrow, naturally.) Without the first letter, they sound identical. Why is throwing that information away a good thing? Sorry if that was unclear. By "preserving the first letter", I meant that in "standard" soundex, the first letter is not transformed into a digit. Karen -> K650 Kieran -> K650 (R->6, N->5; vowels->0 and then are squeezed out) Now compare this: Aaron -> A650 Erin -> E650 Bearing in mind that the usual application of soundex is "all or nothing", the result is Karen == Kieran, but Aaron != Erin, which is at the very least extremely inconsistent. A better phonetic-key creator would produce the same result for each of the first pair, and for each of the second pair -- e.g. KARAN and ARAN respectively. Also consider Catherine vs Katherine. Cheers, John Jul 19 '05 #15
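For reference, here is a minimal sketch of the classic Soundex encoding under discussion; it ignores the full algorithm's finer 'h'/'w' separator rules, but it is enough to reproduce the keys above:

```python
# Digit codes for the consonant groups; unlisted letters
# (vowels, h, w, y) act as separators and are dropped from the key.
CODE = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1):
    for ch in letters:
        CODE[ch] = str(digit)

def soundex(name):
    """Simplified Soundex: initial letter kept as-is, remaining
    consonants coded as digits, adjacent duplicate codes collapsed,
    result padded/truncated to 4 characters."""
    name = name.lower()
    digits = [CODE.get(ch, "0") for ch in name]  # "0" marks separators
    # collapse runs of the same code, then drop the separator markers
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (name[0].upper() + tail + "000")[:4]

for n in ("Karen", "Kieran", "Aaron", "Erin"):
    print(n, "->", soundex(n))  # K650, K650, A650, E650
```

This makes John's point runnable: the keys collapse Karen/Kieran but keep Aaron/Erin apart solely because the initial letter escapes encoding.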

 P: n/a In article <15********************************@4ax.com>, John Machin wrote: % None of the other approaches make the mistake of preserving the first % letter -- this alone is almost enough reason for jettisoning soundex. Metaphone does, at least in the case of Aaron and Erin. -- Patrick TJ McPhee North York Canada pt**@interlog.com Jul 19 '05 #16

 P: n/a On 20 May 2005 04:09:26 GMT, pt**@interlog.com (Patrick TJ McPhee) wrote: In article <15********************************@4ax.com>, John Machin wrote: % None of the other approaches make the mistake of preserving the first % letter -- this alone is almost enough reason for jettisoning soundex. Metaphone does, at least in the case of Aaron and Erin. You're quite correct; moreover NYSIIS does that too. Dunno what came over me, must be gamma rays or something :-) Jul 19 '05 #17

 P: n/a Dennis Lee Bieber wrote: On Wed, 18 May 2005 20:03:53 -0500, Ed Morton declaimed the following in comp.lang.python: Fantastic test data set. I know how to pronounce McPherson but I'd never have guessed that Mousaferiadis sounds like it. I suppose non-Celts probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, Ceilidh, or Concobarh are pronounced either. Since "soundex" is initial letter (consonant?) and a code for the next three syllables (or close to it), really long multi-syllabic names are effectively truncated... Howe'er... When Maire Brennan releases an album as "Moya", following sister's "Enya" (Eithne, as I seem to recall reading)... I'd not attempt to pronounce most of the names you supply... "Dalziell" doesn't look Celtic... "Culzean" almost looks Aztec/Mayan... "Ceilidh" => kay-lee? Okay, I think I can manage bain sidhe and uisge (after too much of the latter, I'll be seeing the former) Well, as an Englishman who has spent a good deal of time in Scotland I'd essay the following. If there are any Scots reading they can chastise me with glee for my presumption. Dalziell: "Da'y yell" Drumnadrochit: "Dru'mnadro'ckit" (but the Scots would insist you use a guttural for the "ch"; I'm not sure how to render that phonetically. It's a bit like the sound made before spitting, or the "G" in Guido in Dutch :-). Culzean: "Ka La'ne" Ceilidh: "Ca'yli" (once had a border collie called Ceilidh). Concobarh: (is this the same as 'Conchobar'?) Co'nnahwar promoting-scottish-english-unity-ly y'rs - steve -- Steve Holden +1 703 861 4237 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Python Web Programming http://pydish.holdenweb.com/ Jul 19 '05 #18

 P: n/a Chris Croughton wrote: The answer to the OP, which someone just pointed out to me on comp.programming, is "edit distance" or "Levenshtein Distance"[1]. ... [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' in Schwaebisch (south German) fashion, but apparently the author of the algorithm was one Vladimir I. Levenshtein and that is how he is credited on the IEEE site. There are also a number of Google hits under the 'stein' spelling, though... Chris C And what's the Levenshtein distance between "levenshtein" and "levenstein"? Ironic if it were large ... regards Steve -- Steve Holden +1 703 861 4237 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Python Web Programming http://pydish.holdenweb.com/ Jul 19 '05 #19
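As it happens, the answer to Steve's question is small: one deletion. A quick check with a throwaway memoised recursive implementation (fine for strings this short; the dynamic-programming version is what you'd use in practice):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def dist(a, b):
    # edit distance by plain recursion with memoisation
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[0] == b[0] else 1
    return min(dist(a[1:], b) + 1,          # delete from a
               dist(a, b[1:]) + 1,          # insert into a
               dist(a[1:], b[1:]) + cost)   # substitute

print(dist("levenshtein", "levenstein"))  # 1 -- just the dropped 'h'
```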

 P: n/a Steve> (is this the same as 'Conchobar'?) No, that's a trendy pub in Key West... Skip Jul 19 '05 #20

 P: n/a Hello William, William Park wrote in message news:<24**************************@PRIMUS.CA>... How do you compare 2 strings, and determine how much they are "close" to each other? ... If your strings are between 1 and 16 Kb, look at the GetReuse SDK: http://getreuse.com/sdk/ It has Perl and Python bindings. You might be interested in the latter as you posted to comp.lang.python. I can't disclose the algorithm. Here is an excerpt from the home page: The formula for the calculation of the similarity is based on scientific research. Any other "good" method of calculation should produce results that are equivalent in some terms to the GetReuse results. -- olpa@ http://datahansa.com/ XML publishing and documentation management http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme Jul 19 '05 #21

 P: n/a Hello, Scott David Daniels wrote in message news:<42********@nntp0.pdx.net>... William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? Here's a really weird idea: Measure the size difference between the pair of strings compressed together and compressed separately. The idea isn't weird. The only problem is that the naive approach fails: compress(a,b) != compress(b,a). Assuming that compression is a good approximation of Kolmogorov complexity, the correct formula is a bit more complicated. -- olpa@ http://datahansa.com/ XML publishing and documentation management http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme Jul 19 '05 #22
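The "more complicated" formula alluded to is presumably something along the lines of the normalized compression distance of Cilibrasi and Vitanyi, which symmetrizes the measure. A rough zlib-based sketch, with the caveat noted earlier in the thread that compressor headers dilute the signal on short strings:

```python
import zlib

def csize(data: bytes) -> int:
    # compressed size as a stand-in for Kolmogorov complexity
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance:
    (C(ab) - min(C(a), C(b))) / max(C(a), C(b)).
    Near 0 for very similar strings, near 1 for unrelated ones."""
    x, y = a.encode(), b.encode()
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

s = "the quick brown fox jumps over the lazy dog " * 50
t = s.replace("fox", "cat")
u = "completely different material, nothing shared here at all " * 50
print(ncd(s, t) < ncd(s, u))  # similar strings score closer together
```

Note the function names and the choice of zlib are illustrative; any decent compressor can be substituted, and block-oriented compressors only help when both strings fit in one window, as pointed out above.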

 P: n/a William Park wrote: How do you compare 2 strings, and determine how much they are "close" to each other? ... Hi, If you want to use phonetic comparison, here are some algorithms that are reportedly better than Soundex: Double Metaphone, NYSIIS, Phonex. Of course, phonetic algorithms have a lot of disadvantages, the main one being that they know about only one way to pronounce words (usually a rather rigid, Anglo-Saxon way), which may not be the right way (hence the examples given before for Gaelic surnames). But these are far "better" than Soundex. Regards, Nicolas Lehuen Jul 19 '05 #23

 P: n/a wrote in message news:11**********************@g43g2000cwa.googlegroups.com... ***Check out difflib, it's in the library.*** Perfect package for what the OP wants AFAICT. The method in difflib is okay, but doesn't do that good a job. It is also relatively slow. My need for this was matching records in BitPim (eg which entries just read from a cell phone correspond to which entries just read from Outlook). Here are some links for the OP as well as ready-to-run code at the end. A paper on various algorithms and their effectiveness on various data sets: http://www.isi.edu/info-agents/works...rs/Cohen-p.pdf The same authors have a Java library of all the algorithms: http://secondstring.sourceforge.net/ There is a group at the Australian National University with a project named Febrl - Freely extensible biomedical record linkage. The idea is to be able to work out if two records from the same or different sources are the same person. All their code is in Python: http://datamining.anu.edu.au/software/febrl/febrldoc/ The solution I settled on is Jaro-Winkler. This is a quote from the paper above: A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. I did find a Python module that implemented this (it is part of the python-levenshtein package). Unfortunately there were some issues at the time dealing with some strings that may be unicode and some not. There was one implementation I saw that took copies of both strings and replaced characters with asterisks. My data sets include meaningful asterisks, so that seriously munged things. 
Ultimately I wrote my own implementation with the following properties: - The arguments can be any combination of unicode and ordinary strings - There are no special characters nor ones used internally as replacements - A bitset is used for tracking so you use an eighth of the memory - There are both Python and C implementations, so you don't have to compile anything but can if you want The code can be grabbed from: http://cvs.sf.net/viewcvs.py/bitpim/...ative/strings/ There is a detailed description at the beginning of the Python implementation: http://cvs.sf.net/viewcvs.py/bitpim/...py?view=markup Roger Jul 19 '05 #24
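Since CVS links go stale, here is an independent minimal sketch of Jaro-Winkler (not Roger's implementation -- in particular it has none of his bitset tracking or C-level speed):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, discounted
    by transpositions among the matched characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, ch in enumerate(s1):          # greedily pair up matches
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t = k = 0                            # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Winkler's extension: boost scores for a shared prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961, a textbook pair
```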

 P: n/a After reading this thread, I have wrapped up a different approach, probably not what you were looking for, but it is very good for what I wanted: comparing a command string typed by a user with all possible commands a program can accept, to do typo correction. The method returns a floating point number between 0.0 (no match) and len(s)+1.0 (if the strings are exactly equal). You can see the score as "the number of equal letters". I put the module at http://starship.python.net/crew/hooft/ It is called comparestrings.py. It is the comparison algorithm implemented by Reinhard Schneider and Chris Sander for comparison of protein sequences, adapted to compare two ASCII strings. The comparison makes use of a similarity matrix for letters: in the protein case this is normally a chemical-functionality similarity; in our case it is a matrix based on keys that are next to each other on a US Qwerty keyboard, and on "capital letters are similar to their lowercase equivalent". The algorithm does not cut corners: it is sure to find the absolute best match between the two strings. No attempt has been made to make this efficient in time or memory. Time taken and memory used are proportional to the product of the lengths of the input strings. Use on strings longer than 25 characters is entirely at your own risk. Regards, Rob Hooft Jul 19 '05 #25

 P: n/a Rob W. W. Hooft wrote: After reading this thread, I have wrapped up a different approach ... You can see the score as "the number of equal letters". [...] Interesting, but the result is a bit puzzling. Wouldn't it be easier if it were normalized to a float between 0.0 and 1.0? Furthermore, I can also produce wrong(?) results:

    $ python comparestrings.py
    s1: test
    s2: x
    Score: -0.4

Minus 0.4... Is this a bug? --Irmen Jul 19 '05 #26

 P: n/a Irmen de Jong wrote: Wouldn't it be easier if it were normalized to a float between 0.0 and 1.0? Maybe. You could do that by ignoring "negative" values, and by dividing by min(len(s1),len(s2))+1. For my application this is irrelevant: I only need a scale to compare a single word to many different words and pick the one that stands out. I can relate the (unnormalized) score to the number of typos in the word. Furthermore, I can also produce wrong(?) results: Score: -0.4. Minus 0.4... Is this a bug? Not in the procedure; it is a bug in my description. It was late at night when I wrote it. What you are doing here is aligning the two words like this:

    test
    --x-

The t opens a gap (score -0.3), e elongates the gap (-0.2), s and x are adjacent on the keyboard (score 0.4), and t opens a gap (-0.3). Total score is -0.4. If you compare "test" with "l" you even get a score of -0.7. You can make arbitrarily low numbers by comparing long strings of "a" characters with a single "l". What this is telling you is that not only are the letters all different (score 0.0), but the strings are not even equally long (hence <0.0). The gap-open and gap-elongation penalties can be adjusted in the module. The -0.3 and -0.2 values were set after some experimentation to make the relative scores of different matches "feel" natural. -- Rob W.W. Hooft || ro*@hooft.net || http://www.hooft.net/people/rob/ Jul 19 '05 #27
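The scoring Rob describes is a classic global alignment. A minimal Needleman-Wunsch sketch with a flat gap penalty (his module additionally distinguishes gap-open from gap-extension and uses a keyboard-adjacency matrix, neither of which is modelled here, so the scores differ from his):

```python
def align_score(a, b, match=1.0, mismatch=0.0, gap=-0.3):
    """Best global-alignment score between a and b
    (Needleman-Wunsch with a flat per-character gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap            # a[:i] aligned against gaps
    for j in range(1, cols):
        score[0][j] = j * gap            # gaps aligned against b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # (mis)match
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[-1][-1]

print(align_score("test", "test"))  # 4.0
print(align_score("test", "tent"))  # 3.0: three matches plus one mismatch
```

As in Rob's description, time and memory are proportional to the product of the string lengths.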
