Rabbit 12,516
Recognized Expert Moderator MVP INTRODUCTION
The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters. USES
The most common use of the function is for approximate string matching. Since the function returns the minimum number of edits required to transform one string into another, the user can set the threshhold at which one string is considered a match to another. CODE
The function below accepts two strings as its input and returns an integer representing the minimum number of required edits to transform string 1 into string 2.
Because the computation time is O(n*m), you should be wary about using it on very long strings.
A common modification of this algorithm is to allow transposition of characters. Allowing transposition of adjacent characters is the Damerau-Levenshtein Distance Algorithm. - Function Levenshtein(str1 As String, str2 As String) As Integer
-
Dim arrLev, intLen1 As Integer, intLen2 As Integer, i As Integer
-
Dim j As Integer, arrStr1, arrStr2, intMini, As Integer
-
-
intLen1 = Len(str1)
-
ReDim arrStr1(intLen1 + 1)
-
intLen2 = Len(str2)
-
ReDim arrStr2(intLen2 + 1)
-
ReDim arrLev(intLen1 + 1, intLen2 + 1)
-
-
arrLev(0, 0) = 0
-
For i = 1 To intLen1
-
arrLev(i, 0) = i
-
arrStr1(i) = Mid(str1, i, 1)
-
Next
-
-
For j = 1 To intLen2
-
arrLev(0, j) = j
-
arrStr2(j) = Mid(str2, j, 1)
-
Next
-
-
For j = 1 To intLen2
-
For i = 1 To intLen1
-
If arrStr1(i) = arrStr2(j) Then
-
arrLev(i, j) = arrLev(i-1, j-1)
-
Else
-
intMini = arrLev(i-1, j) 'deletion
-
If intMini > arrLev(i, j-1) Then intMini = arrLev(i, j-1) 'insertion
-
If intMini > arrLev(i-1, j-1) Then intMini = arrLev(i-1, j-1) 'deletion
-
-
arrLev(i, j) = intMini + 1
-
End If
-
Next
-
Next
-
-
Levenshtein = arrLev(intLen1, intLen2)
-
End Function
6 31442
Hi Rabbit,
This is interesting. Do you have any practical examples of when this function may come in handy?
I've had a number of occasions when I've need to compare strings to one another, but I can't think of a time when I might need to know how many character modifications it would take to make two strings match.
Thanks,
beacon
Rabbit 12,516
Recognized Expert Moderator MVP
It would mostly be used to find mispelled duplicates.
For example, if I had data of the following form - LASTNAME
-
smith
-
smit
-
smoth
-
snith
-
snoth
If I use the query - SELECT LASTNAME,
-
Levenshtein("smith", LASTNAME) AS Distance
-
FROM Table1
Then the results would be - LASTNAME Distance
-
smith 0
-
smit 1
-
smoth 1
-
snith 1
-
snoth 2
Then I can, say, pull everything where the distance is 1 or less and examine those more closely to see if they're duplicates. I don't have to stop at just one field though, to get a better idea of whether or not a name is a match, I could calculate the distances for first name, last name, and address and if they're all within a certain threshhold, I could make a reasonably safe assumption that it's a duplicate record.
Rabbit 12,516
Recognized Expert Moderator MVP
Technically you could use it to calculate the distance between any two binary files. You could compare two jpegs and see how different they are.
Anyways, here's the Damerau-Levenshtein algorithm. Since I code mostly in VBScript now, I didn't add in the data types like I did for the first one.
This algorithm allows for transpositions of adjacent characters. Therefore, neighborhood and nieghborhood is distance 1 in this function as opposed to distance 2 in the prior function. - Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
-
Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
-
Dim intDB, intI1, intJ1, intD
-
-
str1 = UCase(str1)
-
str2 = UCase(str2)
-
intLen1 = Len(str1)
-
intLen2 = Len(str2)
-
intTotalLen = intLen1 + intLen2
-
ReDim arrStr1(intLen1)
-
ReDim arrStr2(intLen2)
-
ReDim arrDA(intSize)
-
ReDim arrDistance(intLen1 + 2, intLen2 + 2)
-
arrDistance(0, 0) = intTotalLen
-
-
For i = 0 To intSize - 1
-
arrDA(i) = 0
-
Next
-
-
For i = 0 To intLen1
-
arrDistance(i + 1, 1) = i
-
arrDistance(i + 1, 0) = intTotalLen
-
Next
-
-
For i = 1 To intLen1
-
arrStr1(i - 1) = Asc(Mid(str1, i, 1))
-
Next
-
-
For j = 0 To intLen2
-
arrDistance(1, j + 1) = j
-
arrDistance(0, j + 1) = intTotalLen
-
Next
-
-
For j = 1 To intLen2
-
arrStr2(j - 1) = Asc(Mid(str2, j, 1))
-
Next
-
-
For i = 1 To intLen1
-
intDB = 0
-
-
For j = 1 To intLen2
-
intI1 = arrDA(arrStr2(j - 1))
-
intJ1 = intDB
-
-
If arrStr1(i - 1) = arrStr2(j - 1) Then
-
intD = 0
-
Else
-
intD = 1
-
End If
-
-
If intD = 0 Then intDB = j
-
-
intMini = arrDistance(i, j) + intD
-
If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
-
If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
-
If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
-
-
arrDistance(i + 1, j + 1) = intMini
-
Next
-
-
arrDA(arrStr1(i - 1)) = i
-
Next
-
-
DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
-
End Function
Thank you so much for sharing this, Rabbit. Really helped me out!
For anyone else planning to use this for matching lists of names, I would also recommend incorporating (a function to) strip all spaces and punctuation from the strings, further improving the accuracy of your matches. (Well technically I guess it makes them less accurate, but for my purposes it's more effective to disregard punctuation).
jforbes 1,107
Recognized Expert Top Contributor
Thanks for the code Rabbit. It is greatly appreciated and I learned a great deal from it.
We had a need where we wanted to compare two Bill of Materials (BOM) side by side to see their differences. I created a Form that had two SubForms showing the two Bill of Materials to be compared. So, as the user would click on a line on one Bill of Material, anything similar to that line would be highlighted on the other Bill of Material. This was accomplished in the OnPaint event of the SubFoms by passing the current line item and a reference line item (from the other BOM) to the Levenshtein method and comparing it to the Value of a slider. If the value is less than the value of the slider, the row is highlighted. It works quite well.
In developing this, I got pretty familiar with the Algorithm and decided that I wanted to try a slightly different approach. I noticed that the time to run the algorithm grew exponentially with the length of the strings, basically some initial time to call the function, then the time to perform a nested loop: Time=Tstart + (Tcompare * S1length * S2length)
This wasn’t a shock, and after I realized this, I notice that Rabbit had already pointed this out as O(n*m) in his original post.
Knowing this, I tried a similar approach by counting the occurrence of each character for both strings and then subtracting the count from the total length, which would be the count of changes needed to make the strings identical. This changed the amount to time it would take to perform the comparison to grow linearly in comparison to the length of the strings to compare: Time=Tstart + (Tcompare * S1length) + (Tcompare * S2length)
The two algorithms produced very similar results. I did some comparison testing on the times to run for both approaches and for me, if the string lengths were under fifteen characters the Levenshtein version was faster, but when the string lengths increased to twenty characters or longer, the getAltStringDistance version consistently took less time. Since the application I was working on had on average fifty characters per string, the getAltStringDistance version often worked faster.
One last thing to note on the getAltStringDistance version is that it is only performing counts on characters in the mMatchChars constant, currently "abcdefghijklmnopqrstuvwxyz0123456789 -()/". So, it probably wouldn’t work well for the comparison of .JPEG as this list would have to grow to include every character.
Lastly, I created a wrapper to call the most appropriate method based on string length. Best of both worlds, right? What I ended up with: -
Public Const mMatchChars = "abcdefghijklmnopqrstuvwxyz0123456789 -()/"
-
-
'getStringDistance
-
Public Function getStringDistance(ByRef s1 As String, ByRef s2 As String) As Integer
-
If Len(s1) > 17 And Len(s2) > 17 Then
-
getStringDistance = getAltStringDistance(s1, s2)
-
Else
-
getStringDistance = Levenshtein(s1, s2)
-
End If
-
End Function
-
-
'getAltStringDistance
-
Public Function getAltStringDistance(ByVal s1 As String, ByVal s2 As String) As Integer
-
-
Dim sTemp1 As String
-
Dim sTemp2 As String
-
Dim lLargest As Long
-
Dim lMatches As Long
-
Dim lRemainder As Long
-
Dim lCount1 As Long
-
Dim lCount2 As Long
-
Dim lPos As Long
-
-
sTemp1 = s1
-
sTemp2 = s2
-
If Len(s1) > Len(s2) Then
-
lLargest = Len(s1)
-
Else
-
lLargest = Len(s2)
-
End If
-
-
' Loop through possible characters
-
' Perform the count and then remove
-
' the found characters from the strings
-
' A count of the Remaining characters is
-
' the Distance between the strings, sorta
-
For lPos = 1 To Len(mMatchChars)
-
sTemp1 = Replace(s1, Mid(mMatchChars, lPos, 1), "")
-
sTemp2 = Replace(s2, Mid(mMatchChars, lPos, 1), "")
-
lCount1 = (Len(s1) - Len(sTemp1))
-
lCount2 = (Len(s2) - Len(sTemp2))
-
If lCount1 > lCount2 Then
-
lMatches = lMatches + lCount2
-
Else
-
lMatches = lMatches + lCount1
-
End If
-
s1 = sTemp1
-
s2 = sTemp2
-
Next lPos
-
-
If Len(s1) > Len(s2) Then
-
lRemainder = Len(s1)
-
Else
-
lRemainder = Len(s2)
-
End If
-
getAltStringDistance = ((lLargest - lRemainder) - lMatches)
-
-
End Function
-
Thanks again for the code Rabbit! Rabbit 12,516
Recognized Expert Moderator MVP
Hi jforbes, glad you were able to get some use out of the code. Your new algorithm is reminiscent of the ngram algorithm. The main difference is that the ngram algorithm takes into account the order of the characters. https://bytes.com/topic/access/insig...tring-matching
Another one you might be interested in is the double metaphone algorithm which calculates the "sound" of a word. https://bytes.com/topic/access/insig...hm#post3799765
Both algorithms allow you to precalculate the values to make future comparisons quicker. Something that you can't do with Levenshtein.
Sign in to post your reply or Sign up for a free account.
Similar topics |
by: Tom Warren |
last post by:
I found a c program called similcmp on the net and converted it to vba
if anybody wants it. I'll post the technical research on it if there
is any call for it. It looks like it could be a useful...
|
by: javuchi |
last post by:
I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".
Is there such a...
|
by: laura |
last post by:
Hi,
I'm looking for the algorithm which does the lowest number of
comparison for the string matching problem (in the worst case).
I know that the complexity is liniar in the length of the...
|
by: Kelly B |
last post by:
This is a simple string matching code from K&R2 (modified to match a
user entered string from a text file)
I tried to code an alternative strindex function(commented below) but it
does not work...
|
by: kumarboston |
last post by:
Hi All,
I have a string of around 500 characters(alphabets) and I have a 3 arrays storing the position "start, end and type" for that string. i am trying to do is to match these position to the...
| |
by: karlectomy |
last post by:
Hi all,
I am parsing text from a text file and I use Pattern and Matcher for the majority of the matching. My problem comes when I use the String.matches(String) function. Some of the lines have...
|
by: Zetten |
last post by:
I've got a script which basically goes line by line through an external file, and hands it to the screen in a slightly parsed form.
I'd like to apply another bit of parsing to what's coming in...
|
by: sajancr |
last post by:
dear sir,
i need help in perl program where i want to search a string from paragraph and to convert all the letters in para to X except that string... in this program i have used string matching...
|
by: Rabbit |
last post by:
INTRODUCTION
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el",...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
| |
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
|
by: 6302768590 |
last post by:
Hai team
i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...
| |