473,417 Members | 1,553 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,417 software developers and data experts.

Finding near matches

I have a database with 100,000's records, each with a unique
reference, eg
A123BNK456
I would like to generate a search facility whereby we can choose an
exact match or partial match, where the partial match could be any of
the following:-

Partial match 1: 1 character maybe wrong
Partial match 2: 2 characters maybe wrong
Partial match 3: 3 characters maybe wrong
....etc

One idea is to do multiple searches with ? used for each permuatation,
although this mayget out of hand where the number of characters that
maybe wrong increase.
Has anyone got any suggestions
Brian.
Nov 13 '05 #1
2 3271
Do a google search for any of the following:
Edit distance - probably what your looking for.
Monophone
Double monophone
Soundex

HTH

Best wishes,
Simon.
"B Moor" <mo******@yahoo.co.uk> wrote in message
news:4d**************************@posting.google.c om...
I have a database with 100,000's records, each with a unique
reference, eg
A123BNK456
I would like to generate a search facility whereby we can choose an
exact match or partial match, where the partial match could be any of
the following:-

Partial match 1: 1 character maybe wrong
Partial match 2: 2 characters maybe wrong
Partial match 3: 3 characters maybe wrong
...etc

One idea is to do multiple searches with ? used for each permuatation,
although this mayget out of hand where the number of characters that
maybe wrong increase.
Has anyone got any suggestions
Brian.



Nov 13 '05 #2
"Simon Long" <si*******@goose.freeserve.co.uk> wrote in message news:<cd**********@newsg4.svr.pol.co.uk>...
Do a google search for any of the following:
Edit distance - probably what your looking for.
Monophone
Double monophone
Soundex

HTH

Best wishes,
Simon.
"B Moor" <mo******@yahoo.co.uk> wrote in message
news:4d**************************@posting.google.c om...
I have a database with 100,000's records, each with a unique
reference, eg
A123BNK456
I would like to generate a search facility whereby we can choose an
exact match or partial match, where the partial match could be any of
the following:-

Partial match 1: 1 character maybe wrong
Partial match 2: 2 characters maybe wrong
Partial match 3: 3 characters maybe wrong
...etc

One idea is to do multiple searches with ? used for each permuatation,
although this mayget out of hand where the number of characters that
maybe wrong increase.
Has anyone got any suggestions
Brian.


Hey Brian,

Both Metaphone and Soundex will discard numbers, wich seems to me
would be issue considering the structure of your reference ID.
Metaphone is much slower and the code it returns is phonic, meaning it
codes the sound the word makes, not just the sound each individual
character makes like Soundex does. The purpose of both Metaphone and
Soundex is to produce a code for words that can be stored in your
database to narrow the possibilities of like sounding words.

The Levenshtein Distance is what you've asked for but, it is real
sloooow. To run it 100,000 reference IDs may not be practical. I've
never tested Soundex on words that contain numbers, but I would first
try Soundex to see if what it returns as a group is in the park. Then
run Levenshtein Distance on that smaller group.

I'm including Soundex, (Metaphone seem useless for this app)
Levenshtein Distance LD, and SimiCmp which is 300% faster than LD but
returns a precentage.

Good luck, have fun
Tom

Public Function SimilCmp(Known As String, Match As String) As Single
Dim k() As Byte ' Known word array
Dim m() As Byte ' Match word array
Dim ks As Integer ' Known string length
Dim ms As Integer ' Match string length

ks = Len(Known): ms = Len(Match) ' initialize Known and
Match string lengths
If ks = 0 Or ms = 0 Then Exit Function ' if Known or Match are
empty, return 0%
If Known = Match Then ' if strings the same
SimilCmp = 100 ' return 100%
Else ' if strings are NOT the
same
k = StrConv(UCase(Known), vbFromUnicode) ' load Known array
m = StrConv(UCase(Match), vbFromUnicode) ' load Match array
SimilCmp = GcSubStr(k(), 0, ks, m(), 0, ms) ' get number of matching
characters
SimilCmp = (200 * SimilCmp / (ks + ms)) ' calculate precentage
Erase k ' free array resources
Erase m ' free array resources
End If ' if strings match

End Function

Private Function GcSubStr(k() As Byte, ko As Integer, ks As Integer,
m() As Byte, mo As Integer, ms As Integer) As Long
Dim ki As Integer ' Known string index
Dim mi As Integer ' Match string index
Dim kn As Integer ' Known next index
Dim mn As Integer ' Match next index
Dim kl As Integer ' Known next length
Dim ml As Integer ' Match next length
Dim cc As Integer ' current concurrent
character count
Dim ci As Integer ' concurrent character
index

' If no more substing to compare, return 0
If (ks <= ko Or ms <= mo) Then Exit Function ' if Known or Match
index is past end of string
If (ks = ko + 1 And ms = mo + 1) Then Exit Function ' if last
character of Known and Match strings
cc = 0: kl = ks: ml = ms: ' initialize concurrent
character count and string lengths

ki = ko ' initialize Known index
While (ki < kl) ' Known string loop
mi = mo ' initialize Match index
While (mi < ml) ' Match string loop
If (k(ki) = m(mi)) Then ' if a character the
same in both strings
ci = 1 ' initialize concurrent
character index
Do Until ((ki + ci >= kl) Or (mi + ci >= ml)) ' while things
match, keep trying...
If (k(ki + ci) <> m(mi + ci)) Then Exit Do ' if things NOT
matching, break loop
ci = ci + 1 ' bump concurrent
character count
Loop ' next character match
If ci > cc Then ' if concurrent count >
greatest count
cc = ci: kn = ki: mn = mi ' save indexes and
greatest count
kl = ks - cc: ml = ms - cc ' save indexes and
greatest count
End If ' if concurrent count >
greatest count
End If ' if a character the
same in both strings
mi = mi + 1 ' bump Match index
Wend ' next Match character
ki = ki + 1 ' bump Known index
Wend ' next Known character
If cc = 0 Then Exit Function ' return 0
cc = cc + GcSubStr(k(), kn + cc, ks, m(), mn + cc, ms) ' check right
hand side
cc = cc + GcSubStr(k(), ko, kn, m(), mo, mn) ' check left hand side
GcSubStr = cc ' return matched
character count of substrings
End Function

Public Function LD(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim ks As Integer ' length of Known word
Dim ms As Integer ' length of Match word
Dim ki As Integer ' Known word index
Dim mi As Integer ' Match word index
Dim kc As String ' Known character
Dim mc As String ' Match character
Dim cost As Integer ' cost
Dim a As Integer '
Dim b As Integer '
Dim c As Integer '

' Step 1
ks = Len(s) ' set Known word length
ms = Len(t) ' set Match word length
If ks = 0 Then ' if no Known word
LD = ms ' return length of Match
word
Exit Function ' bail out
End If ' if no Known word
If ms = 0 Then ' if no Match word
LD = ks ' return length of Known
word
Exit Function ' bail out
End If ' if no Match word
ReDim d(0 To ks, 0 To ms) As Integer ' build matrix based on
length of Known and Match words

' Step 2
For ki = 0 To ks ' Known word loop
d(ki, 0) = ki ' load Known word index into
matrix, vertically
Next ki ' next index of known word

For mi = 0 To ms ' Known Match loop
d(0, mi) = mi ' load Match word index into
matrix, horizontally
Next mi ' next index of Match word

For ki = 1 To ks ' Known word loop
kc = Mid$(s, ki, 1) ' get character of Known
word
For mi = 1 To ms ' Match word loop
mc = Mid$(t, mi, 1) ' get character of Match
word
If kc = mc Then ' if character from Known
and Match words NOT the same
cost = 0 ' set cost to 0
Else ' if character from Known
and Match words the same
cost = 1 ' set cost to 0
End If ' if character from Known
and Match words the same
a = d(ki - 1, mi) + 1
b = d(ki, mi - 1) + 1
c = d(ki - 1, mi - 1) + cost
If b < a Then a = b
If c < a Then
d(ki, mi) = c
Else
d(ki, mi) = a
End If

' d(ki, mi) = Minimum(d(ki - 1, mi) + 1, d(ki, mi - 1) + 1, d(ki -
1, mi - 1) + cost)
Next mi ' next Match word character
Next ki ' next Known word character

LD = d(ks, ms) ' return distance
Erase d ' free array

End Function

Public Function SoundS(ByVal Word As String, Optional ByVal Length As
Byte = 4) As String
' The purpose of the SoundS system is to code words that have similar
sounds.
' 1. SoundS "return code" always consists of the initial letter
followed by digits.
' 2. The first letter of a name appears as the first and only letter
in the soundS "return code".
' 3. The letters A, E, I, O, U, Y, W and H are never coded, but do
break consecutive code
' sequences (see next rule).
' 4. All other letters are coded EXCEPT when they immediately follow a
letter (including
' the first letter) that would be coded with the same code digit.
' 5. The soundS code guide is:
' Code Key Letters and Equivalents
' 1 = B, P, F, V
' 2 = C, S, K, G, J, Q, X, Z
' 3 = D, T
' 4 = L
' 5 = M, N
' 6 = R
' 6. Trailing zeros are appended to short codes so all names are coded
with a letter
' followed by the "Length parameter" number of digits.
' 7. Codes longer then the "Length parameter" are truncated after the
"Length" digit.
'
' Example: SCHAEFER. First letter is S so the code will start with S.
SC is not coded
' (rule 5), HAE is not coded (rule 3), F => 1, E is not coded (rule
3), R = 6. This
' leaves S16, and with zero padding this gives the soundex code of
S160.

Dim Prev, Curr As Integer ' previous and current
SxCode values
Dim i, v As Integer ' word index, current
character's ascii value
Dim WL, CL As Integer ' word length, code
length
Dim Code As String ' soundex return code
Dim a, z As Byte ' "A" and "Z" ascii
value
Static SXC(25) As Byte ' soundex codes array

If SXC(1) = 0 Then GoSub InitSXC ' if Soundex codes not
initilized
If Len(Word) = 0 Then Exit Function ' if no word to test,
bail out with blank return code ("")
Word = Trim$(UCase$(Word)) ' trim whitespace and
convert to upper case

Code = Left$(Word, 1) ' initialize return code
with word's first letter (Rule 1)
WL = Len(Word) ' initialize word length
a = Asc("A"): z = Asc("Z") ' initialize ascii "A"
and "Z" values

Prev = SXC(Asc(Left$(Word, 1)) - a) ' initialize previous
value (for first letter)
CL = 1 ' initialize return code
length (to include first letter)
i = 1 ' initialize word index

While (i < WL) And CL < Length ' loop until either end
of word or return code = length
v = Asc((Mid$(Word, i + 1, 1))) ' get word character's
ascii value
If (v >= a And v <= z) Then ' if letter
Curr = SXC(v - a) ' get letter's code
(coded if NOT zero)
If (Curr <> 0) Then ' if letter coded (rules
3,6)
If (Curr <> Prev) Then ' if current CODE Not
the same as presious CODE
Code = Code & CByte(Curr) ' add letter's CODE to
return code
CL = CL + 1 ' increament return code
length
End If ' if current CODE Not
the same as presious CODE
End If ' if letter coded
' note that we are breaking consecutive code sequences here
because Curr is NOT coded. (rules 3 + 4)
Prev = Curr ' save current Code as
new previous (for next letter)
Else ' if not letter
' note that we are breaking consecutive code sequences here
because v is NOT a letter. (rules 3 + 4)
Prev = 0 ' set previous code to
(not coded) zero (for next letter)
End If ' if letter
i = i + 1 ' increament word index
Wend ' loop until either end
of word or return code = length

SoundS = Left$(Code & "000000000", Length) ' return formated code

Exit Function
Nov 13 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Eric Linders | last post by:
Hi, I'm trying to figure out the most efficient method for taking the first character in a string (which will be a number), and use it as a variable to check to see if the other numbers in the...
1
by: Tristan | last post by:
Im trying to expand a search util by uing regular expression to allow common search criteria such as +-* and phrases "". My understanding of ereg(string pattern, string string, ) is that the...
3
by: Quack Boy | last post by:
I'm new to MS Access (97) and I need for a query to find *near* matches (not exact duplicates) mainly to weed out duplicates that may contain similar erronious data. EG Table1 ...
3
by: louise raisbeck | last post by:
Hi there, I have a brain block on finding a repeater item which matches a field from the row of the datasource. If my datasource has a column called 'somethingid' i want to find the repeateritem...
2
by: duende | last post by:
I was hoping someone will be able to help me, i've got a bit of experience with mysql, but nothing quite like this. I have a table with 3 columns 'sid', 'aid', and 'call'. What i want to do is...
1
by: psmahesh | last post by:
Hi folks, I am comparing two arrays and removing matches from the second array from the first array. Can someone take a look at this code below and mention if this is okay and perhaps if there...
4
by: Hardik Dangar | last post by:
hi friends, i have a bunch of html pages and i want to fetch records from them and i m really confused how i can do after working with regular expressions and other stuffs from last few days can...
11
by: axlq | last post by:
Does anyone have a favored way of finding the geographic location of a user's IP address, so that a php script can include content relevant to that location? Reverse-lookup of the hostname isn't...
65
by: Aditya | last post by:
Hi I am using a line of as char recvedValues = {'\0'}; in my code. I get a warning as near initialization for recvedValues. I am using gcc 3.4 Can anybody please explain the meaning. Thanks...
0
by: Lloyd Sheen | last post by:
I am looking for a method to compare strings that will give me an indication of how close the strings are. Exact matches would be 100 and things link "Don't" versus "Dont" would show as close...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.