473,326 Members | 2,182 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,326 developers and data experts.

NGram Approximate String Matching

Rabbit
12,516 Expert Mod 8TB
INTRODUCTION
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".

USES
NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.

It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.

CODE
The code below is written for VBScript but should be directly portable to VBA.

The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.

The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.

Expand|Select|Wrap|Line Numbers
  1. Function CreateNGram(strInput, intN)
  2.     Dim arrNGram, intBound, i, j, strGram, didInc, arrTemp
  3.  
  4.     If Len(strInput) = 0 Then Exit Function
  5.  
  6.     ReDim arrNGram(Len(strInput) + 1, 1)
  7.     strInput = Chr(0) & UCase(Trim(strInput)) & Chr(0)
  8.     intBound = -1
  9.  
  10.     For i = 1 To Len(strInput)-intN+1
  11.         strGram = Mid(strInput, i, intN)
  12.         didInc = False
  13.  
  14.         For j = 0 To intBound
  15.             If strGram = arrNGram(j, 0) Then
  16.                 arrNGram(j, 1) = arrNGram(j, 1) + 1
  17.                 didInc = True
  18.                 Exit For
  19.             End If
  20.         Next
  21.  
  22.         If Not didInc Then
  23.             intBound = intBound + 1
  24.             arrNGram(intBound, 0) = strGram
  25.             arrNGram(intBound, 1) = 1
  26.         End If
  27.     Next
  28.  
  29.     ReDim arrTemp(intBound, 1)
  30.     For i = 0 To intBound
  31.         arrTemp(i, 0) = arrNGram(i, 0)
  32.         arrTemp(i, 1) = arrNGram(i, 1)
  33.     Next
  34.  
  35.     CreateNGram = arrTemp
  36. End Function
  37.  
  38. Function CompareNGram(arr1, arr2)
  39.     Dim i, j, intMatches, intCount1, intCount2
  40.  
  41.     intMatches = 0
  42.     intCount1 = 0
  43.  
  44.     For i = 0 To UBound(arr1)
  45.         intCount1 = intCount1 + arr1(i, 1)
  46.         intCount2 = 0
  47.  
  48.         For j = 0 To UBound(arr2)
  49.             intCount2 = intCount2 + arr2(j, 1)
  50.  
  51.             If arr1(i, 0) = arr2(j, 0) Then
  52.                 If arr1(i, 1) >= arr2(j, 1) Then
  53.                     intMatches = intMatches + arr2(j, 1)
  54.                 Else
  55.                     intMatches = intMatches + arr1(i, 1)
  56.                 End If
  57.             End If
  58.         Next
  59.     Next
  60.  
  61.     CompareNGram = 2 * intMatches / (intCount1 + intCount2)
  62. End Function
Mar 28 '11 #1
4 13500
GORKAL
5
Can u give a example queary
Jul 30 '15 #2
Rabbit
12,516 Expert Mod 8TB
You mean a sample call of the function? You call it the same way you would any other function.
Expand|Select|Wrap|Line Numbers
  1. CreateNGram("sample string", 2)
Jul 30 '15 #3
Hi Rabbit,

Thanks for sharing this information! I'm trying to use your functions (in VBA) to compare some strings, and I'm getting a Type Mismatch error in the CreateNGram function. It seems like the arrNGram array and the arrTemp array are empty at all times during the run, resulting in an error during the function call. Any ideas?
Apr 26 '18 #4
Rabbit
12,516 Expert Mod 8TB
The CreateNGram function returns a 2 dimensional array. It looks like that's an issue for the CompareNGram function because it's defaulting to a 1 dimensional empty array. I modified the function prototype above to fix the CompareNGram function so it just defaults to a Variant.

If you're calling just CreateNGram by itself, then you need to treat the returned results as a 2 dimensional array.
Apr 27 '18 #5

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Stuart E. Wugalter | last post by:
Hiya Folks! I need some help with some string pattern matching. Let's say field1 (in Table1) contains the strings in question and it's data type is memo. I have a form that uses Table1 as its...
0
by: Tom Warren | last post by:
I found a c program called similcmp on the net and converted it to vba if anybody wants it. I'll post the technical research on it if there is any call for it. It looks like it could be a useful...
10
by: javuchi | last post by:
I'm searching for a library which makes aproximative string matching, for example, searching in a dictionary the word "motorcycle", but returns similar strings like "motorcicle". Is there such a...
2
by: laura | last post by:
Hi, I'm looking for the algorithm which does the lowest number of comparison for the string matching problem (in the worst case). I know that the complexity is liniar in the length of the...
4
by: Kelly B | last post by:
This is a simple string matching code from K&R2 (modified to match a user entered string from a text file) I tried to code an alternative strindex function(commented below) but it does not work...
8
by: kumarboston | last post by:
Hi All, I have a string of around 500 characters(alphabets) and I have a 3 arrays storing the position "start, end and type" for that string. i am trying to do is to match these position to the...
4
karlectomy
by: karlectomy | last post by:
Hi all, I am parsing text from a text file and I use Pattern and Matcher for the majority of the matching. My problem comes when I use the String.matches(String) function. Some of the lines have...
2
by: Zetten | last post by:
I've got a script which basically goes line by line through an external file, and hands it to the screen in a slightly parsed form. I'd like to apply another bit of parsing to what's coming in...
2
by: sajancr | last post by:
dear sir, i need help in perl program where i want to search a string from paragraph and to convert all the letters in para to X except that string... in this program i have used string matching...
1
jkmyoung
by: jkmyoung | last post by:
Hey, I'm trying to find an algorithm to approximately match different substrings within a master list, with some forgiveness on spelling. Say you are allowed 3 errors in the string match:...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.