473,499 Members | 1,886 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Levenshtein Approximate String Matching

Rabbit
12,516 Recognized Expert Moderator MVP
INTRODUCTION
The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters.

USES
The most common use of the function is for approximate string matching. Since the function returns the minimum number of edits required to transform one string into another, the user can set the threshhold at which one string is considered a match to another.

CODE
The function below accepts two strings as its input and returns an integer representing the minimum number of required edits to transform string 1 into string 2.

Because the computation time is O(n*m), you should be wary about using it on very long strings.

A common modification of this algorithm is to allow transposition of characters. Allowing transposition of adjacent characters is the Damerau-Levenshtein Distance Algorithm.

Expand|Select|Wrap|Line Numbers
  1. Function Levenshtein(str1 As String, str2 As String) As Integer
  2.     Dim arrLev, intLen1 As Integer, intLen2 As Integer, i As Integer
  3.     Dim j As Integer, arrStr1, arrStr2, intMini, As Integer
  4.  
  5.     intLen1 = Len(str1)
  6.     ReDim arrStr1(intLen1 + 1)
  7.     intLen2 = Len(str2)
  8.     ReDim arrStr2(intLen2 + 1)
  9.     ReDim arrLev(intLen1 + 1, intLen2 + 1)
  10.  
  11.     arrLev(0, 0) = 0
  12.     For i = 1 To intLen1
  13.         arrLev(i, 0) = i
  14.         arrStr1(i) = Mid(str1, i, 1)
  15.     Next
  16.  
  17.     For j = 1 To intLen2
  18.         arrLev(0, j) = j
  19.         arrStr2(j) = Mid(str2, j, 1)
  20.     Next
  21.  
  22.     For j = 1 To intLen2
  23.         For i = 1 To intLen1
  24.             If arrStr1(i) = arrStr2(j) Then
  25.                 arrLev(i, j) = arrLev(i-1, j-1)
  26.             Else
  27.                 intMini = arrLev(i-1, j) 'deletion
  28.                 If intMini > arrLev(i, j-1) Then intMini = arrLev(i, j-1) 'insertion
  29.                 If intMini > arrLev(i-1, j-1) Then intMini = arrLev(i-1, j-1) 'deletion
  30.  
  31.                 arrLev(i, j) = intMini + 1
  32.             End If
  33.         Next
  34.     Next
  35.  
  36.     Levenshtein = arrLev(intLen1, intLen2)
  37. End Function
Feb 25 '11 #1
6 31442
beacon
579 Contributor
Hi Rabbit,

This is interesting. Do you have any practical examples of when this function may come in handy?

I've had a number of occasions when I've need to compare strings to one another, but I can't think of a time when I might need to know how many character modifications it would take to make two strings match.

Thanks,
beacon
Feb 25 '11 #2
Rabbit
12,516 Recognized Expert Moderator MVP
It would mostly be used to find mispelled duplicates.

For example, if I had data of the following form
Expand|Select|Wrap|Line Numbers
  1. LASTNAME
  2. smith
  3. smit
  4. smoth
  5. snith
  6. snoth
If I use the query
Expand|Select|Wrap|Line Numbers
  1. SELECT LASTNAME,
  2.    Levenshtein("smith", LASTNAME) AS Distance
  3. FROM Table1
Then the results would be
Expand|Select|Wrap|Line Numbers
  1. LASTNAME Distance
  2. smith    0
  3. smit     1
  4. smoth    1
  5. snith    1
  6. snoth    2
Then I can, say, pull everything where the distance is 1 or less and examine those more closely to see if they're duplicates. I don't have to stop at just one field though, to get a better idea of whether or not a name is a match, I could calculate the distances for first name, last name, and address and if they're all within a certain threshhold, I could make a reasonably safe assumption that it's a duplicate record.
Feb 25 '11 #3
Rabbit
12,516 Recognized Expert Moderator MVP
Technically you could use it to calculate the distance between any two binary files. You could compare two jpegs and see how different they are.

Anyways, here's the Damerau-Levenshtein algorithm. Since I code mostly in VBScript now, I didn't add in the data types like I did for the first one.

This algorithm allows for transpositions of adjacent characters. Therefore, neighborhood and nieghborhood is distance 1 in this function as opposed to distance 2 in the prior function.
Expand|Select|Wrap|Line Numbers
  1. Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
  2.      Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
  3.      Dim intDB, intI1, intJ1, intD
  4.  
  5.      str1 = UCase(str1)
  6.      str2 = UCase(str2)
  7.      intLen1 = Len(str1)
  8.      intLen2 = Len(str2)
  9.      intTotalLen = intLen1 + intLen2
  10.      ReDim arrStr1(intLen1)
  11.      ReDim arrStr2(intLen2)
  12.      ReDim arrDA(intSize)
  13.      ReDim arrDistance(intLen1 + 2, intLen2 + 2)
  14.      arrDistance(0, 0) = intTotalLen
  15.  
  16.      For i = 0 To intSize - 1
  17.          arrDA(i) = 0
  18.      Next
  19.  
  20.      For i = 0 To intLen1
  21.          arrDistance(i + 1, 1) = i
  22.          arrDistance(i + 1, 0) = intTotalLen
  23.      Next
  24.  
  25.      For i = 1 To intLen1
  26.          arrStr1(i - 1) = Asc(Mid(str1, i, 1))
  27.      Next
  28.  
  29.      For j = 0 To intLen2
  30.          arrDistance(1, j + 1) = j
  31.          arrDistance(0, j + 1) = intTotalLen
  32.      Next
  33.  
  34.      For j = 1 To intLen2
  35.          arrStr2(j - 1) = Asc(Mid(str2, j, 1))
  36.      Next
  37.  
  38.      For i = 1 To intLen1
  39.          intDB = 0
  40.  
  41.          For j = 1 To intLen2
  42.              intI1 = arrDA(arrStr2(j - 1))
  43.              intJ1 = intDB
  44.  
  45.              If arrStr1(i - 1) = arrStr2(j - 1) Then
  46.                  intD = 0
  47.              Else
  48.                  intD = 1
  49.              End If
  50.  
  51.              If intD = 0 Then intDB = j
  52.  
  53.              intMini = arrDistance(i, j) + intD
  54.              If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
  55.              If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
  56.              If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
  57.  
  58.              arrDistance(i + 1, j + 1) = intMini
  59.          Next
  60.  
  61.          arrDA(arrStr1(i - 1)) = i
  62.      Next
  63.  
  64.      DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
  65.  End Function
Feb 25 '11 #4
webbeacon
30 New Member
Thank you so much for sharing this, Rabbit. Really helped me out!

For anyone else planning to use this for matching lists of names, I would also recommend incorporating (a function to) strip all spaces and punctuation from the strings, further improving the accuracy of your matches. (Well technically I guess it makes them less accurate, but for my purposes it's more effective to disregard punctuation).
Jan 6 '16 #5
jforbes
1,107 Recognized Expert Top Contributor
Thanks for the code Rabbit. It is greatly appreciated and I learned a great deal from it.

We had a need where we wanted to compare two Bill of Materials (BOM) side by side to see their differences. I created a Form that had two SubForms showing the two Bill of Materials to be compared. So, as the user would click on a line on one Bill of Material, anything similar to that line would be highlighted on the other Bill of Material. This was accomplished in the OnPaint event of the SubFoms by passing the current line item and a reference line item (from the other BOM) to the Levenshtein method and comparing it to the Value of a slider. If the value is less than the value of the slider, the row is highlighted. It works quite well.

In developing this, I got pretty familiar with the Algorithm and decided that I wanted to try a slightly different approach. I noticed that the time to run the algorithm grew exponentially with the length of the strings, basically some initial time to call the function, then the time to perform a nested loop:
Time=Tstart + (Tcompare * S1length * S2length)

This wasn’t a shock, and after I realized this, I notice that Rabbit had already pointed this out as O(n*m) in his original post.

Knowing this, I tried a similar approach by counting the occurrence of each character for both strings and then subtracting the count from the total length, which would be the count of changes needed to make the strings identical. This changed the amount to time it would take to perform the comparison to grow linearly in comparison to the length of the strings to compare:
Time=Tstart + (Tcompare * S1length) + (Tcompare * S2length)

The two algorithms produced very similar results. I did some comparison testing on the times to run for both approaches and for me, if the string lengths were under fifteen characters the Levenshtein version was faster, but when the string lengths increased to twenty characters or longer, the getAltStringDistance version consistently took less time. Since the application I was working on had on average fifty characters per string, the getAltStringDistance version often worked faster.

One last thing to note on the getAltStringDistance version is that it is only performing counts on characters in the mMatchChars constant, currently "abcdefghijklmnopqrstuvwxyz0123456789 -()/". So, it probably wouldn’t work well for the comparison of .JPEG as this list would have to grow to include every character.

Lastly, I created a wrapper to call the most appropriate method based on string length. Best of both worlds, right? What I ended up with:
Expand|Select|Wrap|Line Numbers
  1. Public Const mMatchChars = "abcdefghijklmnopqrstuvwxyz0123456789 -()/"
  2.  
  3.     'getStringDistance
  4. Public Function getStringDistance(ByRef s1 As String, ByRef s2 As String) As Integer
  5.     If Len(s1) > 17 And Len(s2) > 17 Then
  6.         getStringDistance = getAltStringDistance(s1, s2)
  7.     Else
  8.         getStringDistance = Levenshtein(s1, s2)
  9.     End If
  10. End Function
  11.  
  12.     'getAltStringDistance
  13. Public Function getAltStringDistance(ByVal s1 As String, ByVal s2 As String) As Integer
  14.  
  15.     Dim sTemp1 As String
  16.     Dim sTemp2 As String
  17.     Dim lLargest As Long
  18.     Dim lMatches As Long
  19.     Dim lRemainder As Long
  20.     Dim lCount1 As Long
  21.     Dim lCount2 As Long
  22.     Dim lPos As Long
  23.  
  24.     sTemp1 = s1
  25.     sTemp2 = s2
  26.     If Len(s1) > Len(s2) Then
  27.         lLargest = Len(s1)
  28.     Else
  29.         lLargest = Len(s2)
  30.     End If
  31.  
  32.     ' Loop through possible characters
  33.     ' Perform the count and then remove
  34.     ' the found characters from the strings
  35.     ' A count of the Remaining characters is
  36.     ' the Distance between the strings, sorta
  37.     For lPos = 1 To Len(mMatchChars)
  38.         sTemp1 = Replace(s1, Mid(mMatchChars, lPos, 1), "")
  39.         sTemp2 = Replace(s2, Mid(mMatchChars, lPos, 1), "")
  40.         lCount1 = (Len(s1) - Len(sTemp1))
  41.         lCount2 = (Len(s2) - Len(sTemp2))
  42.         If lCount1 > lCount2 Then
  43.             lMatches = lMatches + lCount2
  44.         Else
  45.             lMatches = lMatches + lCount1
  46.         End If
  47.         s1 = sTemp1
  48.         s2 = sTemp2
  49.     Next lPos
  50.  
  51.     If Len(s1) > Len(s2) Then
  52.         lRemainder = Len(s1)
  53.     Else
  54.         lRemainder = Len(s2)
  55.     End If
  56.     getAltStringDistance = ((lLargest - lRemainder) - lMatches)
  57.  
  58. End Function
  59.  
Thanks again for the code Rabbit!
Jun 27 '16 #6
Rabbit
12,516 Recognized Expert Moderator MVP
Hi jforbes, glad you were able to get some use out of the code. Your new algorithm is reminiscent of the ngram algorithm. The main difference is that the ngram algorithm takes into account the order of the characters. https://bytes.com/topic/access/insig...tring-matching

Another one you might be interested in is the double metaphone algorithm which calculates the "sound" of a word.
https://bytes.com/topic/access/insig...hm#post3799765

Both algorithms allow you to precalculate the values to make future comparisons quicker. Something that you can't do with Levenshtein.
Jun 28 '16 #7

Sign in to post your reply or Sign up for a free account.

Similar topics

0
2720
by: Tom Warren | last post by:
I found a c program called similcmp on the net and converted it to vba if anybody wants it. I'll post the technical research on it if there is any call for it. It looks like it could be a useful...
10
9520
by: javuchi | last post by:
I'm searching for a library which makes aproximative string matching, for example, searching in a dictionary the word "motorcycle", but returns similar strings like "motorcicle". Is there such a...
2
2705
by: laura | last post by:
Hi, I'm looking for the algorithm which does the lowest number of comparison for the string matching problem (in the worst case). I know that the complexity is liniar in the length of the...
4
3948
by: Kelly B | last post by:
This is a simple string matching code from K&R2 (modified to match a user entered string from a text file) I tried to code an alternative strindex function(commented below) but it does not work...
8
2819
by: kumarboston | last post by:
Hi All, I have a string of around 500 characters(alphabets) and I have a 3 arrays storing the position "start, end and type" for that string. i am trying to do is to match these position to the...
4
3341
karlectomy
by: karlectomy | last post by:
Hi all, I am parsing text from a text file and I use Pattern and Matcher for the majority of the matching. My problem comes when I use the String.matches(String) function. Some of the lines have...
2
2032
by: Zetten | last post by:
I've got a script which basically goes line by line through an external file, and hands it to the screen in a slightly parsed form. I'd like to apply another bit of parsing to what's coming in...
2
2458
by: sajancr | last post by:
dear sir, i need help in perl program where i want to search a string from paragraph and to convert all the letters in para to X except that string... in this program i have used string matching...
4
13547
Rabbit
by: Rabbit | last post by:
INTRODUCTION An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el",...
0
7007
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
7220
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
7388
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
5470
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
3099
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
3091
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
1427
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
1
665
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
297
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.