By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
424,509 Members | 1,686 Online
Bytes IT Community
Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

NGram Approximate String Matching

Rabbit
Expert Mod 10K+
P: 12,316
INTRODUCTION
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".

USES
NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.

It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.

CODE
The code below is written for VBScript but should be directly portable to VBA.

The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.

The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.

Expand|Select|Wrap|Line Numbers
  1. Function CreateNGram(strInput, intN)
  2.     Dim arrNGram, intBound, i, j, strGram, didInc, arrTemp
  3.  
  4.     If Len(strInput) = 0 Then Exit Function
  5.  
  6.     ReDim arrNGram(Len(strInput) + 1, 1)
  7.     strInput = Chr(0) & UCase(Trim(strInput)) & Chr(0)
  8.     intBound = -1
  9.  
  10.     For i = 1 To Len(strInput)-intN+1
  11.         strGram = Mid(strInput, i, intN)
  12.         didInc = False
  13.  
  14.         For j = 0 To intBound
  15.             If strGram = arrNGram(j, 0) Then
  16.                 arrNGram(j, 1) = arrNGram(j, 1) + 1
  17.                 didInc = True
  18.                 Exit For
  19.             End If
  20.         Next
  21.  
  22.         If Not didInc Then
  23.             intBound = intBound + 1
  24.             arrNGram(intBound, 0) = strGram
  25.             arrNGram(intBound, 1) = 1
  26.         End If
  27.     Next
  28.  
  29.     ReDim arrTemp(intBound, 1)
  30.     For i = 0 To intBound
  31.         arrTemp(i, 0) = arrNGram(i, 0)
  32.         arrTemp(i, 1) = arrNGram(i, 1)
  33.     Next
  34.  
  35.     CreateNGram = arrTemp
  36. End Function
  37.  
  38. Function CompareNGram(arr1, arr2)
  39.     Dim i, j, intMatches, intCount1, intCount2
  40.  
  41.     intMatches = 0
  42.     intCount1 = 0
  43.  
  44.     For i = 0 To UBound(arr1)
  45.         intCount1 = intCount1 + arr1(i, 1)
  46.         intCount2 = 0
  47.  
  48.         For j = 0 To UBound(arr2)
  49.             intCount2 = intCount2 + arr2(j, 1)
  50.  
  51.             If arr1(i, 0) = arr2(j, 0) Then
  52.                 If arr1(i, 1) >= arr2(j, 1) Then
  53.                     intMatches = intMatches + arr2(j, 1)
  54.                 Else
  55.                     intMatches = intMatches + arr1(i, 1)
  56.                 End If
  57.             End If
  58.         Next
  59.     Next
  60.  
  61.     CompareNGram = 2 * intMatches / (intCount1 + intCount2)
  62. End Function
Mar 28 '11 #1
Share this Article
Share on Google+
4 Comments


P: 5
Can u give a example queary
Jul 30 '15 #2

Rabbit
Expert Mod 10K+
P: 12,316
You mean a sample call of the function? You call it the same way you would any other function.
Expand|Select|Wrap|Line Numbers
  1. CreateNGram("sample string", 2)
Jul 30 '15 #3

P: 1
Hi Rabbit,

Thanks for sharing this information! I'm trying to use your functions (in VBA) to compare some strings, and I'm getting a Type Mismatch error in the CreateNGram function. It seems like the arrNGram array and the arrTemp array are empty at all times during the run, resulting in an error during the function call. Any ideas?
Apr 26 '18 #4

Rabbit
Expert Mod 10K+
P: 12,316
The CreateNGram function returns a 2 dimensional array. It looks like that's an issue for the CompareNGram function because it's defaulting to a 1 dimensional empty array. I modified the function prototype above to fix the CompareNGram function so it just defaults to a Variant.

If you're calling just CreateNGram by itself, then you need to treat the returned results as a 2 dimensional array.
Apr 27 '18 #5