473,405 Members | 2,261 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes and contribute your articles to a community of 473,405 developers and data experts.

NGram Approximate String Matching

Rabbit
12,516 Expert Mod 8TB
INTRODUCTION
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".

USES
NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.

It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.

CODE
The code below is written for VBScript but should be directly portable to VBA.

The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.

The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.

Expand|Select|Wrap|Line Numbers
  1. Function CreateNGram(strInput, intN)
  2.     Dim arrNGram, intBound, i, j, strGram, didInc, arrTemp
  3.  
  4.     If Len(strInput) = 0 Then Exit Function
  5.  
  6.     ReDim arrNGram(Len(strInput) + 1, 1)
  7.     strInput = Chr(0) & UCase(Trim(strInput)) & Chr(0)
  8.     intBound = -1
  9.  
  10.     For i = 1 To Len(strInput)-intN+1
  11.         strGram = Mid(strInput, i, intN)
  12.         didInc = False
  13.  
  14.         For j = 0 To intBound
  15.             If strGram = arrNGram(j, 0) Then
  16.                 arrNGram(j, 1) = arrNGram(j, 1) + 1
  17.                 didInc = True
  18.                 Exit For
  19.             End If
  20.         Next
  21.  
  22.         If Not didInc Then
  23.             intBound = intBound + 1
  24.             arrNGram(intBound, 0) = strGram
  25.             arrNGram(intBound, 1) = 1
  26.         End If
  27.     Next
  28.  
  29.     ReDim arrTemp(intBound, 1)
  30.     For i = 0 To intBound
  31.         arrTemp(i, 0) = arrNGram(i, 0)
  32.         arrTemp(i, 1) = arrNGram(i, 1)
  33.     Next
  34.  
  35.     CreateNGram = arrTemp
  36. End Function
  37.  
  38. Function CompareNGram(arr1, arr2)
  39.     Dim i, j, intMatches, intCount1, intCount2
  40.  
  41.     intMatches = 0
  42.     intCount1 = 0
  43.  
  44.     For i = 0 To UBound(arr1)
  45.         intCount1 = intCount1 + arr1(i, 1)
  46.         intCount2 = 0
  47.  
  48.         For j = 0 To UBound(arr2)
  49.             intCount2 = intCount2 + arr2(j, 1)
  50.  
  51.             If arr1(i, 0) = arr2(j, 0) Then
  52.                 If arr1(i, 1) >= arr2(j, 1) Then
  53.                     intMatches = intMatches + arr2(j, 1)
  54.                 Else
  55.                     intMatches = intMatches + arr1(i, 1)
  56.                 End If
  57.             End If
  58.         Next
  59.     Next
  60.  
  61.     CompareNGram = 2 * intMatches / (intCount1 + intCount2)
  62. End Function
Mar 28 '11 #1
4 13532
GORKAL
5
Can u give a example queary
Jul 30 '15 #2
Rabbit
12,516 Expert Mod 8TB
You mean a sample call of the function? You call it the same way you would any other function.
Expand|Select|Wrap|Line Numbers
  1. CreateNGram("sample string", 2)
Jul 30 '15 #3
Hi Rabbit,

Thanks for sharing this information! I'm trying to use your functions (in VBA) to compare some strings, and I'm getting a Type Mismatch error in the CreateNGram function. It seems like the arrNGram array and the arrTemp array are empty at all times during the run, resulting in an error during the function call. Any ideas?
Apr 26 '18 #4
Rabbit
12,516 Expert Mod 8TB
The CreateNGram function returns a 2 dimensional array. It looks like that's an issue for the CompareNGram function because it's defaulting to a 1 dimensional empty array. I modified the function prototype above to fix the CompareNGram function so it just defaults to a Variant.

If you're calling just CreateNGram by itself, then you need to treat the returned results as a 2 dimensional array.
Apr 27 '18 #5

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Stuart E. Wugalter | last post by:
Hiya Folks! I need some help with some string pattern matching. Let's say field1 (in Table1) contains the strings in question and it's data type is memo. I have a form that uses Table1 as its...
0
by: Tom Warren | last post by:
I found a c program called similcmp on the net and converted it to vba if anybody wants it. I'll post the technical research on it if there is any call for it. It looks like it could be a useful...
10
by: javuchi | last post by:
I'm searching for a library which makes aproximative string matching, for example, searching in a dictionary the word "motorcycle", but returns similar strings like "motorcicle". Is there such a...
2
by: laura | last post by:
Hi, I'm looking for the algorithm which does the lowest number of comparison for the string matching problem (in the worst case). I know that the complexity is liniar in the length of the...
4
by: Kelly B | last post by:
This is a simple string matching code from K&R2 (modified to match a user entered string from a text file) I tried to code an alternative strindex function(commented below) but it does not work...
8
by: kumarboston | last post by:
Hi All, I have a string of around 500 characters(alphabets) and I have a 3 arrays storing the position "start, end and type" for that string. i am trying to do is to match these position to the...
4
karlectomy
by: karlectomy | last post by:
Hi all, I am parsing text from a text file and I use Pattern and Matcher for the majority of the matching. My problem comes when I use the String.matches(String) function. Some of the lines have...
2
by: Zetten | last post by:
I've got a script which basically goes line by line through an external file, and hands it to the screen in a slightly parsed form. I'd like to apply another bit of parsing to what's coming in...
2
by: sajancr | last post by:
dear sir, i need help in perl program where i want to search a string from paragraph and to convert all the letters in para to X except that string... in this program i have used string matching...
1
jkmyoung
by: jkmyoung | last post by:
Hey, I'm trying to find an algorithm to approximately match different substrings within a master list, with some forgiveness on spelling. Say you are allowed 3 errors in the string match:...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.