By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,551 Members | 2,727 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

Levenshtein Approximate String Matching

Rabbit
Expert Mod 10K+
P: 12,366
INTRODUCTION
The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters.

USES
The most common use of the function is for approximate string matching. Since the function returns the minimum number of edits required to transform one string into another, the user can set the threshhold at which one string is considered a match to another.

CODE
The function below accepts two strings as its input and returns an integer representing the minimum number of required edits to transform string 1 into string 2.

Because the computation time is O(n*m), you should be wary about using it on very long strings.

A common modification of this algorithm is to allow transposition of characters. Allowing transposition of adjacent characters is the Damerau-Levenshtein Distance Algorithm.

Expand|Select|Wrap|Line Numbers
  1. Function Levenshtein(str1 As String, str2 As String) As Integer
  2.     Dim arrLev, intLen1 As Integer, intLen2 As Integer, i As Integer
  3.     Dim j As Integer, arrStr1, arrStr2, intMini, As Integer
  4.  
  5.     intLen1 = Len(str1)
  6.     ReDim arrStr1(intLen1 + 1)
  7.     intLen2 = Len(str2)
  8.     ReDim arrStr2(intLen2 + 1)
  9.     ReDim arrLev(intLen1 + 1, intLen2 + 1)
  10.  
  11.     arrLev(0, 0) = 0
  12.     For i = 1 To intLen1
  13.         arrLev(i, 0) = i
  14.         arrStr1(i) = Mid(str1, i, 1)
  15.     Next
  16.  
  17.     For j = 1 To intLen2
  18.         arrLev(0, j) = j
  19.         arrStr2(j) = Mid(str2, j, 1)
  20.     Next
  21.  
  22.     For j = 1 To intLen2
  23.         For i = 1 To intLen1
  24.             If arrStr1(i) = arrStr2(j) Then
  25.                 arrLev(i, j) = arrLev(i-1, j-1)
  26.             Else
  27.                 intMini = arrLev(i-1, j) 'deletion
  28.                 If intMini > arrLev(i, j-1) Then intMini = arrLev(i, j-1) 'insertion
  29.                 If intMini > arrLev(i-1, j-1) Then intMini = arrLev(i-1, j-1) 'deletion
  30.  
  31.                 arrLev(i, j) = intMini + 1
  32.             End If
  33.         Next
  34.     Next
  35.  
  36.     Levenshtein = arrLev(intLen1, intLen2)
  37. End Function
Feb 25 '11 #1
Share this Article
Share on Google+
6 Comments


beacon
100+
P: 579
Hi Rabbit,

This is interesting. Do you have any practical examples of when this function may come in handy?

I've had a number of occasions when I've need to compare strings to one another, but I can't think of a time when I might need to know how many character modifications it would take to make two strings match.

Thanks,
beacon
Feb 25 '11 #2

Rabbit
Expert Mod 10K+
P: 12,366
It would mostly be used to find mispelled duplicates.

For example, if I had data of the following form
Expand|Select|Wrap|Line Numbers
  1. LASTNAME
  2. smith
  3. smit
  4. smoth
  5. snith
  6. snoth
If I use the query
Expand|Select|Wrap|Line Numbers
  1. SELECT LASTNAME,
  2.    Levenshtein("smith", LASTNAME) AS Distance
  3. FROM Table1
Then the results would be
Expand|Select|Wrap|Line Numbers
  1. LASTNAME Distance
  2. smith    0
  3. smit     1
  4. smoth    1
  5. snith    1
  6. snoth    2
Then I can, say, pull everything where the distance is 1 or less and examine those more closely to see if they're duplicates. I don't have to stop at just one field though, to get a better idea of whether or not a name is a match, I could calculate the distances for first name, last name, and address and if they're all within a certain threshhold, I could make a reasonably safe assumption that it's a duplicate record.
Feb 25 '11 #3

Rabbit
Expert Mod 10K+
P: 12,366
Technically you could use it to calculate the distance between any two binary files. You could compare two jpegs and see how different they are.

Anyways, here's the Damerau-Levenshtein algorithm. Since I code mostly in VBScript now, I didn't add in the data types like I did for the first one.

This algorithm allows for transpositions of adjacent characters. Therefore, neighborhood and nieghborhood is distance 1 in this function as opposed to distance 2 in the prior function.
Expand|Select|Wrap|Line Numbers
  1. Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
  2.      Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
  3.      Dim intDB, intI1, intJ1, intD
  4.  
  5.      str1 = UCase(str1)
  6.      str2 = UCase(str2)
  7.      intLen1 = Len(str1)
  8.      intLen2 = Len(str2)
  9.      intTotalLen = intLen1 + intLen2
  10.      ReDim arrStr1(intLen1)
  11.      ReDim arrStr2(intLen2)
  12.      ReDim arrDA(intSize)
  13.      ReDim arrDistance(intLen1 + 2, intLen2 + 2)
  14.      arrDistance(0, 0) = intTotalLen
  15.  
  16.      For i = 0 To intSize - 1
  17.          arrDA(i) = 0
  18.      Next
  19.  
  20.      For i = 0 To intLen1
  21.          arrDistance(i + 1, 1) = i
  22.          arrDistance(i + 1, 0) = intTotalLen
  23.      Next
  24.  
  25.      For i = 1 To intLen1
  26.          arrStr1(i - 1) = Asc(Mid(str1, i, 1))
  27.      Next
  28.  
  29.      For j = 0 To intLen2
  30.          arrDistance(1, j + 1) = j
  31.          arrDistance(0, j + 1) = intTotalLen
  32.      Next
  33.  
  34.      For j = 1 To intLen2
  35.          arrStr2(j - 1) = Asc(Mid(str2, j, 1))
  36.      Next
  37.  
  38.      For i = 1 To intLen1
  39.          intDB = 0
  40.  
  41.          For j = 1 To intLen2
  42.              intI1 = arrDA(arrStr2(j - 1))
  43.              intJ1 = intDB
  44.  
  45.              If arrStr1(i - 1) = arrStr2(j - 1) Then
  46.                  intD = 0
  47.              Else
  48.                  intD = 1
  49.              End If
  50.  
  51.              If intD = 0 Then intDB = j
  52.  
  53.              intMini = arrDistance(i, j) + intD
  54.              If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
  55.              If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
  56.              If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
  57.  
  58.              arrDistance(i + 1, j + 1) = intMini
  59.          Next
  60.  
  61.          arrDA(arrStr1(i - 1)) = i
  62.      Next
  63.  
  64.      DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
  65.  End Function
Feb 25 '11 #4

P: 30
Thank you so much for sharing this, Rabbit. Really helped me out!

For anyone else planning to use this for matching lists of names, I would also recommend incorporating (a function to) strip all spaces and punctuation from the strings, further improving the accuracy of your matches. (Well technically I guess it makes them less accurate, but for my purposes it's more effective to disregard punctuation).
Jan 6 '16 #5

jforbes
Expert 100+
P: 1,107
Thanks for the code Rabbit. It is greatly appreciated and I learned a great deal from it.

We had a need where we wanted to compare two Bill of Materials (BOM) side by side to see their differences. I created a Form that had two SubForms showing the two Bill of Materials to be compared. So, as the user would click on a line on one Bill of Material, anything similar to that line would be highlighted on the other Bill of Material. This was accomplished in the OnPaint event of the SubFoms by passing the current line item and a reference line item (from the other BOM) to the Levenshtein method and comparing it to the Value of a slider. If the value is less than the value of the slider, the row is highlighted. It works quite well.

In developing this, I got pretty familiar with the Algorithm and decided that I wanted to try a slightly different approach. I noticed that the time to run the algorithm grew exponentially with the length of the strings, basically some initial time to call the function, then the time to perform a nested loop:
Time=Tstart + (Tcompare * S1length * S2length)

This wasn’t a shock, and after I realized this, I notice that Rabbit had already pointed this out as O(n*m) in his original post.

Knowing this, I tried a similar approach by counting the occurrence of each character for both strings and then subtracting the count from the total length, which would be the count of changes needed to make the strings identical. This changed the amount to time it would take to perform the comparison to grow linearly in comparison to the length of the strings to compare:
Time=Tstart + (Tcompare * S1length) + (Tcompare * S2length)

The two algorithms produced very similar results. I did some comparison testing on the times to run for both approaches and for me, if the string lengths were under fifteen characters the Levenshtein version was faster, but when the string lengths increased to twenty characters or longer, the getAltStringDistance version consistently took less time. Since the application I was working on had on average fifty characters per string, the getAltStringDistance version often worked faster.

One last thing to note on the getAltStringDistance version is that it is only performing counts on characters in the mMatchChars constant, currently "abcdefghijklmnopqrstuvwxyz0123456789 -()/". So, it probably wouldn’t work well for the comparison of .JPEG as this list would have to grow to include every character.

Lastly, I created a wrapper to call the most appropriate method based on string length. Best of both worlds, right? What I ended up with:
Expand|Select|Wrap|Line Numbers
  1. Public Const mMatchChars = "abcdefghijklmnopqrstuvwxyz0123456789 -()/"
  2.  
  3.     'getStringDistance
  4. Public Function getStringDistance(ByRef s1 As String, ByRef s2 As String) As Integer
  5.     If Len(s1) > 17 And Len(s2) > 17 Then
  6.         getStringDistance = getAltStringDistance(s1, s2)
  7.     Else
  8.         getStringDistance = Levenshtein(s1, s2)
  9.     End If
  10. End Function
  11.  
  12.     'getAltStringDistance
  13. Public Function getAltStringDistance(ByVal s1 As String, ByVal s2 As String) As Integer
  14.  
  15.     Dim sTemp1 As String
  16.     Dim sTemp2 As String
  17.     Dim lLargest As Long
  18.     Dim lMatches As Long
  19.     Dim lRemainder As Long
  20.     Dim lCount1 As Long
  21.     Dim lCount2 As Long
  22.     Dim lPos As Long
  23.  
  24.     sTemp1 = s1
  25.     sTemp2 = s2
  26.     If Len(s1) > Len(s2) Then
  27.         lLargest = Len(s1)
  28.     Else
  29.         lLargest = Len(s2)
  30.     End If
  31.  
  32.     ' Loop through possible characters
  33.     ' Perform the count and then remove
  34.     ' the found characters from the strings
  35.     ' A count of the Remaining characters is
  36.     ' the Distance between the strings, sorta
  37.     For lPos = 1 To Len(mMatchChars)
  38.         sTemp1 = Replace(s1, Mid(mMatchChars, lPos, 1), "")
  39.         sTemp2 = Replace(s2, Mid(mMatchChars, lPos, 1), "")
  40.         lCount1 = (Len(s1) - Len(sTemp1))
  41.         lCount2 = (Len(s2) - Len(sTemp2))
  42.         If lCount1 > lCount2 Then
  43.             lMatches = lMatches + lCount2
  44.         Else
  45.             lMatches = lMatches + lCount1
  46.         End If
  47.         s1 = sTemp1
  48.         s2 = sTemp2
  49.     Next lPos
  50.  
  51.     If Len(s1) > Len(s2) Then
  52.         lRemainder = Len(s1)
  53.     Else
  54.         lRemainder = Len(s2)
  55.     End If
  56.     getAltStringDistance = ((lLargest - lRemainder) - lMatches)
  57.  
  58. End Function
  59.  
Thanks again for the code Rabbit!
Jun 27 '16 #6

Rabbit
Expert Mod 10K+
P: 12,366
Hi jforbes, glad you were able to get some use out of the code. Your new algorithm is reminiscent of the ngram algorithm. The main difference is that the ngram algorithm takes into account the order of the characters. https://bytes.com/topic/access/insig...tring-matching

Another one you might be interested in is the double metaphone algorithm which calculates the "sound" of a word.
https://bytes.com/topic/access/insig...hm#post3799765

Both algorithms allow you to precalculate the values to make future comparisons quicker. Something that you can't do with Levenshtein.
Jun 28 '16 #7