435,551 Members | 2,727 Online
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

# Levenshtein Approximate String Matching

 Expert Mod 10K+ P: 12,366 INTRODUCTION The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters. USES The most common use of the function is for approximate string matching. Since the function returns the minimum number of edits required to transform one string into another, the user can set the threshhold at which one string is considered a match to another. CODE The function below accepts two strings as its input and returns an integer representing the minimum number of required edits to transform string 1 into string 2. Because the computation time is O(n*m), you should be wary about using it on very long strings. A common modification of this algorithm is to allow transposition of characters. Allowing transposition of adjacent characters is the Damerau-Levenshtein Distance Algorithm. Expand|Select|Wrap|Line Numbers Function Levenshtein(str1 As String, str2 As String) As Integer     Dim arrLev, intLen1 As Integer, intLen2 As Integer, i As Integer     Dim j As Integer, arrStr1, arrStr2, intMini, As Integer       intLen1 = Len(str1)     ReDim arrStr1(intLen1 + 1)     intLen2 = Len(str2)     ReDim arrStr2(intLen2 + 1)     ReDim arrLev(intLen1 + 1, intLen2 + 1)       arrLev(0, 0) = 0     For i = 1 To intLen1         arrLev(i, 0) = i         arrStr1(i) = Mid(str1, i, 1)     Next       For j = 1 To intLen2         arrLev(0, j) = j         arrStr2(j) = Mid(str2, j, 1)     Next       For j = 1 To intLen2         For i = 1 To intLen1             If arrStr1(i) = arrStr2(j) Then                 arrLev(i, j) = arrLev(i-1, j-1)             Else                 intMini = arrLev(i-1, j) 'deletion                 If intMini > arrLev(i, j-1) Then intMini = arrLev(i, j-1) 'insertion                 If intMini > arrLev(i-1, j-1) Then intMini = arrLev(i-1, j-1) 'deletion                   arrLev(i, j) = intMini + 1             End If         Next     Next       Levenshtein = arrLev(intLen1, intLen2) End Function Feb 25 '11 #1

 100+ P: 579 Hi Rabbit, This is interesting. Do you have any practical examples of when this function may come in handy? I've had a number of occasions when I've need to compare strings to one another, but I can't think of a time when I might need to know how many character modifications it would take to make two strings match. Thanks, beacon Feb 25 '11 #2

 Expert Mod 10K+ P: 12,366 It would mostly be used to find mispelled duplicates. For example, if I had data of the following form Expand|Select|Wrap|Line Numbers LASTNAME smith smit smoth snith snoth If I use the query Expand|Select|Wrap|Line Numbers SELECT LASTNAME,    Levenshtein("smith", LASTNAME) AS Distance FROM Table1 Then the results would be Expand|Select|Wrap|Line Numbers LASTNAME Distance smith    0 smit     1 smoth    1 snith    1 snoth    2 Then I can, say, pull everything where the distance is 1 or less and examine those more closely to see if they're duplicates. I don't have to stop at just one field though, to get a better idea of whether or not a name is a match, I could calculate the distances for first name, last name, and address and if they're all within a certain threshhold, I could make a reasonably safe assumption that it's a duplicate record. Feb 25 '11 #3

 Expert Mod 10K+ P: 12,366 Technically you could use it to calculate the distance between any two binary files. You could compare two jpegs and see how different they are. Anyways, here's the Damerau-Levenshtein algorithm. Since I code mostly in VBScript now, I didn't add in the data types like I did for the first one. This algorithm allows for transpositions of adjacent characters. Therefore, neighborhood and nieghborhood is distance 1 in this function as opposed to distance 2 in the prior function. Expand|Select|Wrap|Line Numbers Function DamerauLevenshtein(str1, str2, Optional intSize = 256)      Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini      Dim intDB, intI1, intJ1, intD        str1 = UCase(str1)      str2 = UCase(str2)      intLen1 = Len(str1)      intLen2 = Len(str2)      intTotalLen = intLen1 + intLen2      ReDim arrStr1(intLen1)      ReDim arrStr2(intLen2)      ReDim arrDA(intSize)      ReDim arrDistance(intLen1 + 2, intLen2 + 2)      arrDistance(0, 0) = intTotalLen        For i = 0 To intSize - 1          arrDA(i) = 0      Next        For i = 0 To intLen1          arrDistance(i + 1, 1) = i          arrDistance(i + 1, 0) = intTotalLen      Next        For i = 1 To intLen1          arrStr1(i - 1) = Asc(Mid(str1, i, 1))      Next        For j = 0 To intLen2          arrDistance(1, j + 1) = j          arrDistance(0, j + 1) = intTotalLen      Next        For j = 1 To intLen2          arrStr2(j - 1) = Asc(Mid(str2, j, 1))      Next        For i = 1 To intLen1          intDB = 0            For j = 1 To intLen2              intI1 = arrDA(arrStr2(j - 1))              intJ1 = intDB                If arrStr1(i - 1) = arrStr2(j - 1) Then                  intD = 0              Else                  intD = 1              End If                If intD = 0 Then intDB = j                intMini = arrDistance(i, j) + intD              If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1              If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1              If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1                arrDistance(i + 1, j + 1) = intMini          Next            arrDA(arrStr1(i - 1)) = i      Next        DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)  End Function Feb 25 '11 #4

 P: 30 Thank you so much for sharing this, Rabbit. Really helped me out! For anyone else planning to use this for matching lists of names, I would also recommend incorporating (a function to) strip all spaces and punctuation from the strings, further improving the accuracy of your matches. (Well technically I guess it makes them less accurate, but for my purposes it's more effective to disregard punctuation). Jan 6 '16 #5

 Expert 100+ P: 1,107 Thanks for the code Rabbit. It is greatly appreciated and I learned a great deal from it. We had a need where we wanted to compare two Bill of Materials (BOM) side by side to see their differences. I created a Form that had two SubForms showing the two Bill of Materials to be compared. So, as the user would click on a line on one Bill of Material, anything similar to that line would be highlighted on the other Bill of Material. This was accomplished in the OnPaint event of the SubFoms by passing the current line item and a reference line item (from the other BOM) to the Levenshtein method and comparing it to the Value of a slider. If the value is less than the value of the slider, the row is highlighted. It works quite well. In developing this, I got pretty familiar with the Algorithm and decided that I wanted to try a slightly different approach. I noticed that the time to run the algorithm grew exponentially with the length of the strings, basically some initial time to call the function, then the time to perform a nested loop: ` Time=Tstart + (Tcompare * S1length * S2length)` This wasn’t a shock, and after I realized this, I notice that Rabbit had already pointed this out as `O(n*m)` in his original post. Knowing this, I tried a similar approach by counting the occurrence of each character for both strings and then subtracting the count from the total length, which would be the count of changes needed to make the strings identical. This changed the amount to time it would take to perform the comparison to grow linearly in comparison to the length of the strings to compare: ` Time=Tstart + (Tcompare * S1length) + (Tcompare * S2length)` The two algorithms produced very similar results. I did some comparison testing on the times to run for both approaches and for me, if the string lengths were under fifteen characters the Levenshtein version was faster, but when the string lengths increased to twenty characters or longer, the getAltStringDistance version consistently took less time. Since the application I was working on had on average fifty characters per string, the getAltStringDistance version often worked faster. One last thing to note on the getAltStringDistance version is that it is only performing counts on characters in the mMatchChars constant, currently "abcdefghijklmnopqrstuvwxyz0123456789 -()/". So, it probably wouldn’t work well for the comparison of .JPEG as this list would have to grow to include every character. Lastly, I created a wrapper to call the most appropriate method based on string length. Best of both worlds, right? What I ended up with: Expand|Select|Wrap|Line Numbers Public Const mMatchChars = "abcdefghijklmnopqrstuvwxyz0123456789 -()/"       'getStringDistance Public Function getStringDistance(ByRef s1 As String, ByRef s2 As String) As Integer     If Len(s1) > 17 And Len(s2) > 17 Then         getStringDistance = getAltStringDistance(s1, s2)     Else         getStringDistance = Levenshtein(s1, s2)     End If End Function       'getAltStringDistance Public Function getAltStringDistance(ByVal s1 As String, ByVal s2 As String) As Integer       Dim sTemp1 As String     Dim sTemp2 As String     Dim lLargest As Long     Dim lMatches As Long     Dim lRemainder As Long     Dim lCount1 As Long     Dim lCount2 As Long     Dim lPos As Long       sTemp1 = s1     sTemp2 = s2     If Len(s1) > Len(s2) Then         lLargest = Len(s1)     Else         lLargest = Len(s2)     End If       ' Loop through possible characters     ' Perform the count and then remove     ' the found characters from the strings     ' A count of the Remaining characters is     ' the Distance between the strings, sorta     For lPos = 1 To Len(mMatchChars)         sTemp1 = Replace(s1, Mid(mMatchChars, lPos, 1), "")         sTemp2 = Replace(s2, Mid(mMatchChars, lPos, 1), "")         lCount1 = (Len(s1) - Len(sTemp1))         lCount2 = (Len(s2) - Len(sTemp2))         If lCount1 > lCount2 Then             lMatches = lMatches + lCount2         Else             lMatches = lMatches + lCount1         End If         s1 = sTemp1         s2 = sTemp2     Next lPos       If Len(s1) > Len(s2) Then         lRemainder = Len(s1)     Else         lRemainder = Len(s2)     End If     getAltStringDistance = ((lLargest - lRemainder) - lMatches)   End Function   Thanks again for the code Rabbit! Jun 27 '16 #6

 Expert Mod 10K+ P: 12,366 Hi jforbes, glad you were able to get some use out of the code. Your new algorithm is reminiscent of the ngram algorithm. The main difference is that the ngram algorithm takes into account the order of the characters. https://bytes.com/topic/access/insig...tring-matching Another one you might be interested in is the double metaphone algorithm which calculates the "sound" of a word. https://bytes.com/topic/access/insig...hm#post3799765 Both algorithms allow you to precalculate the values to make future comparisons quicker. Something that you can't do with Levenshtein. Jun 28 '16 #7