By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
 429,301 Members | 3,575 Online Submit an Article
Got Smarts?
Share your bits of IT knowledge by writing an article on Bytes.

# Fuzzy String Matching: Double Metaphone Algorithm

 Expert Mod 10K+ P: 12,359 Introduction One of the big issues with name matching is how prone it is to error. There are many different ways people spell the same name, typos, mishearing what the other person said. There are a myriad of ways that free form language data can be corrupted. And it causes many headaches when you need to search/match on the bad data. There are many different approaches to get around it. Like the Levenshtein algorithm which calculates how many edits it would take to make one string match another string. Or the NGram algorithm that exams the smaller sequences that a string is composed of and compares them to the sequences of a nother string. Then there are phonetic algorithms that encode a string based on how it would "sound". Like the SoundEx or Double Metaphone algorithm. What is Double Metaphone? Double Metaphone is a phonetic algorithm that takes a string and produces 2 encodings on how it could be pronounced in spoken language. Two encodings are produced because a word can sometimes be pronounced multiple ways. It is used for English and contains many rules on the "sounds" of a string. It also tries to account for differences in the English pronunciation of Romanized words in Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins. How Does it Work? The algorithm works by looking at hundreds of phonetic rules to create a phonetic string. Some examples of the rules it uses are: P and B are encoded to the same sound. Unless the P is followed by an H which is then encoded to the F sound. SCI is encoded to the S sound while SCO is encoded to the SK sound. T and D are encoded to the same sound. Unless the T is followed by an H which is then encoded to the theta sound which is represented by the number 0. The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the LEFT(output, threshold) is the main pronunciation and the RIGHT(output, threshold) is a potentially different pronunciation. Hence the spaces when the the encoding produces a pronunciation with less characters than the threshold. The algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing. This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, "foxtrot" and "fixed" have the same encoding. So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish. Sample Implementation in VBScript This implementation takes a string s input along with a length i input and outputs 2 possible different phonetic encodings of the string, each of length I. For example, calling it with `DoubleMetaphone("Smith", 4)` would return `SM0 XMT `, spaces included. You can use this to precalculate the pronunciation of names or words in your tables and compare the 2 different phonetic codes against all the other phonetic codes in the table to find names or words that sound similar. Expand|Select|Wrap|Line Numbers Option Explicit   MsgBox(DoubleMetaphone(InputBox("Enter String"), 6))   Function DoubleMetaphone(strOriginal, intThreshhold)     Dim isSlavoGermanic, strPrimary, strSecondary, i, intJump, iB     Dim intLength, cP, cS, arr, x, intPad       isSlavoGermanic = False     iB = 4     intPad = 6     x = iB     intLength = Len(strOriginal) + iB - 1     strOriginal = UCase(strOriginal)       If (InStr(strOriginal, "W") + InStr(strOriginal, "K") + InStr(strOriginal, "CZ") + InStr(strOriginal, "WITZ")) <> 0 Then         isSlavoGermanic = True     End If       ReDim arr(intLength + intPad + 1)       For i = 0 To iB-1         arr(i) = vbTab     Next       For i = iB To intLength         arr(i) = Mid(strOriginal, i-iB+1, 1)     Next       For i = intLength+1 To UBound(arr)         arr(i) = vbTab     Next       Select Case (arr(x) & arr(x+1))         Case "AC"             strPrimary = "AKS"             strSecondary = "AKS"             x = x + 4         Case "GN", "KN", "PN", "PS"             x = x + 1         Case "HA", "HE", "HI", "HO", "HU", "HY"             strPrimary = "H"             strSecondary = "H"             x = x + 2         Case "WA", "WE", "WI", "WO", "WU", "WY"             strPrimary = "A"             strSecondary = "F"             x = x + 2         Case "WH"             strPrimary = "A"             strSecondary = "A"             x = x + 1         Case "SM", "SN", "SL", "SW"             strPrimary = "S"             strSecondary = "X"             x = x + 1         Case "GY"             strPrimary = "K"             strSecondary = "J"             x = x + 2     End Select       If x = iB Then         If arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "JOSE" Then             If (x = iB And arr(x+4) = " ") Then                 strPrimary = "HS"                 strSecondary = "HS"                 x = x + 4             End If         ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "SUGAR" Then             strPrimary = "XK"             strSecondary = "SK"             x = x + 5         ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CAESAR" Then             strPrimary = "SSR"             strSecondary = "SSR"             x = x + 6         ElseIf (arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARAC" Or _         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARIS" Or _         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHOR" Or _         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHYM" Or _         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHEM") And _         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) <> "CHORE" Then             strPrimary = "K"             strSecondary = "K"             x = x + 2         End If     End If       If x = iB Then         Select Case arr(x) & arr(x+1) & arr(x+2)             Case "GES", "GEP", "GEB", "GEL", "GEY", "GIB", "GIL", "GIN", "GIE", "GEI", "GER"                 strPrimary = "K"                 strSecondary = "J"                 x = x + 2             Case "GHI"                 strPrimary = "J"                 strSecondary = "J"                 x = x + 3             Case "AGN", "EGN", "IGN", "OGN", "UGN", "UGY"                 If Not isSlavoGermanic Then                     strPrimary = "AKN"                     strSecondary = "AN"                     x = x + 3                 End If         End Select     End If       If x = iB Then         Select Case arr(x)             Case "X"                 strPrimary = "S"                 strSecondary = "S"                 x = x + 1             Case "A", "E", "I", "O", "U", "Y"                 strPrimary = "A"                 strSecondary = "A"                 x = x + 1             Case "J"                 strPrimary = "J"                 strSecondary = "A"                 x = x + 1         End Select     End If       Do While x <= intLength         If Len(strPrimary) >= intThreshhold Then             Exit Do         End If           intJump = 1         cP = arr(x)         cS = arr(x)           Select Case arr(x)             Case "A", "E", "I", "O", "U", "Y"                 cP = ""                 cS = ""               Case "B"                 cP = "P"                 cS = "P"               Case "Ç"                 cP = "S"                 cS = "S"               Case "C"                 If x > iB+1 And arr(x-2) <> "A" And arr(x-2) <> "E" And arr(x-2) <> "I" And arr(x-2) <> "O" And arr(x-2) <> "U" And _                 arr(x-2) <> "Y" And arr(x-1) & arr(x+1) = "AH" And ((arr(x+2) <> "I" And arr(x+2) <> "E") Or _                 arr(x-2) & arr(x+2) & arr(x+3) = "BER" Or arr(x-2) & arr(x+2) & arr(x+3) = "MER") Then                     cP = "K"                     cS = "K"                     intJump = 2                 ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "HIA" Then                     cP = "K"                     cS = "K"                     intJump = 4                 ElseIf arr(x+1) = "H" Then                     If x > iB And arr(x+2) & arr(x+3) = "AE" Then                         cP = "K"                         cS = "X"                         intJump = 2                     ElseIf arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _                     arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _                     arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or arr(x+2) = "T" Or arr(x+2) = "S" Or _                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHES" Or _                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ARHIT" Or _                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHID" Or _                      ((arr(x-2) = "A" Or arr(x-2) = "E" Or arr(x-2) = "O" Or arr(x-2) = "U" Or x = iB) And _                     (arr(x+2) = "L" Or arr(x+2) = "R" Or arr(x+2) = "N" Or arr(x+2) = "M" Or arr(x+2) = "B" Or _                     arr(x+2) = "H" Or arr(x+2) = "F" Or arr(x+2) = "V" Or arr(x+2) = "W" Or arr(x+2) = " "))Then                         cP = "K"                         cS = "K"                         intJump = 2                     Else                         intJump = 2                           If x > iB Then                             If arr(iB) & arr(iB+1) = "MC" Then                                 cP = "K"                                 cS = "K"                             Else                                 cP = "X"                                 cS = "K"                             End If                         Else                             cP = "X"                             cS = "X"                         End If                     End If                 ElseIf arr(x+1) = "Z" And arr(x-2) & arr(x-1) <> "WI" Then                     cP = "S"                     cS = "X"                     intJump = 2                 ElseIf arr(x+1) & arr(x+2) & arr(x+2) = "CIA" Then                     cP = "X"                     cS = "X"                     intJump = 3                 ElseIf arr(x+1) = "C" And Not (x = iB+1 And arr(iB) = "M") Then                     If (arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "H") And arr(x+2) & arr(x+3) <> "HU" Then                         If arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCEE" Or _                         arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCES" Then                             cP = "KS"                             cS = "KS"                             intJump = 3                         Else                             cP = "X"                             cS = "X"                             intJump = 3                         End If                     Else                         cP = "K"                         cS = "K"                         intJump = 2                     End If                 ElseIf arr(x+1) = "K" Or arr(x+1) = "G" Or arr(x+1) = "Q" Then                     cP = "K"                     cS = "K"                     intJump = 2                 ElseIf arr(x+1) = "I" Or arr(x+1) = "E" Or arr(x+1) = "Y" Then                     If arr(x+1) & arr(x+2) = "IO" Or arr(x+1) & arr(x+2) = "IE" Or arr(x+1) & arr(x+2) = "IA" Then                         cP = "S"                         cS = "X"                         intJump = 2                     Else                         cP = "S"                         cS = "S"                         intJump = 2                     End If                 Else                     cP = "K"                     cS = "K"                       If arr(x+1) & arr(x+2) = " C" Or arr(x+1) & arr(x+2) = " Q" Or arr(x+1) & arr(x+2) = " G" Then                         intJump = 3                     Else                         If (arr(x+1) = "C" Or arr(x+1) = "K" Or arr(x+1) = "Q") And _                         arr(x+1) & arr(x+2) <> "CE" And arr(x+1) & arr(x+2) <> "CI" Then                              intJump = 2                         End If                     End If                 End If               Case "D"                 If arr(x+1) = "G" Then                     If arr(x+2) = "I" Or _                     arr(x+2) = "E" Or _                     arr(x+2) = "Y" Then                         cP = "J"                         cS = "J"                         intJump = 3                     Else                         cP = "TK"                         cS = "TK"                         intJump = 2                     End If                 ElseIf arr(x+1) = "T" Then                     cP = "T"                     cS = "T"                     intJump = 2                 Else                     cP = "T"                     cS = "T"                 End If               Case "G"                 If arr(x+1) = "H" Then                     If x <> iB And arr(x-1) <> "A" And arr(x-1) <> "E" And arr(x-1) <> "I" _                     And arr(x-1) <> "O" And arr(x-1) <> "U" And arr(x-1) <> "Y" Then                         cP = "K"                         cS = "K"                         intJump = 2                     ElseIf (x > iB+1 And (arr(x-2) = "B" Or arr(x-2) = "H" Or arr(x-2) = "D")) Or _                     (x > iB+2 And (arr(x-3) = "B" Or arr(x-3) = "H" Or arr(x-3) = "D")) Or _                     (x > iB+3 And (arr(x-4) = "B" Or arr(x-4) = "H")) Then                         cP = ""                         cS = ""                         intJump = 2                     Else                         If x > iB+2 And arr(x-1) = "U" And _                         (arr(x-3) = "C" Or arr(x-3) = "G" Or arr(x-3) = "L" Or arr(x-3) = "R" Or arr(x-3) = "T") Then                             cP = "F"                             cS = "F"                             intJump = 2                         ElseIf x > iB And arr(x-1) <> "I" Then                             cP = "K"                             cS = "K"                             intJump = 2                         Else                             cP = ""                             cS = ""                         End If                     End If                 ElseIf arr(x+1) = "N" Then                     cS = "KN"                     intJump = 2                       If arr(x+2) & arr(x+3) <> "EY" And Not isSlavoGermanic Then                         cP = "N"                     Else                         cP = "KN"                     End If                 ElseIf arr(x+1) & arr(x+2) = "LI" And Not isSlavoGermanic Then                     cP = "KL"                     cS = "L"                     intJump = 2                 ElseIf (arr(x+1) & arr(x+2) = "ER" Or arr(x+1) = "Y") And _                 arr(x-1) <> "E" And arr(x-1) <> "I" And _                 arr(x-1) & arr(x+1) <> "RY" And _                 arr(x-1) & arr(x+1) <> "OY" And _                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "DANGER" And _                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "RANGER" And _                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "MANGER" Then                     cP = "K"                     cS = "J"                     intJump = 2                 ElseIf arr(x+1) = "E" Or arr(x+1) = "I" Or arr(x+1) = "Y" Or _                 arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "AGGI" Or _                 arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "OGGI" Then                     If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _                     arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _                     arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or _                     arr(x+1) & arr(x+2) = "ET" Then                         cP = "K"                         cS = "K"                         intJump = 2                     Else                         cP = "J"                         If arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "IER " Then                             cS = "J"                             intJump = 3                         Else                             cS = "K"                             intJump = 2                         End If                     End If                 Else                     cP = "K"                     cS = "K"                 End If               Case "H"                 If (arr(x-1) = "A" Or _                 arr(x-1) = "E" Or _                 arr(x-1) = "I" Or _                 arr(x-1) = "O" Or _                 arr(x-1) = "U" Or _                 arr(x-1) = "Y") And _                 (arr(x+1) = "A" Or _                 arr(x+1) = "E" Or _                 arr(x+1) = "I" Or _                 arr(x+1) = "O" Or _                 arr(x+1) = "U" Or _                 arr(x+1) = "Y") Then                     intJump = 2                 Else                     cP = ""                     cS = ""                 End If               Case "J"                 If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "SAN " Then                     cP = "H"                     cS = "H"                 Else                     If Not isSlavoGermanic And ( _                     arr(x-1) = "A" Or _                     arr(x-1) = "E" Or _                     arr(x-1) = "I" Or _                     arr(x-1) = "O" Or _                     arr(x-1) = "U" Or _                     arr(x-1) = "Y") And ( _                     arr(x+1) = "A" Or _                     arr(x+1) = "O") Then                         cS = "H"                     Else                         If x = intLength Then                             cS = ""                         Else                             If arr(x+1) = "L" Or arr(x+1) = "T" Or arr(x+1) = "K" Or _                             arr(x+1) = "S" Or arr(x+1) = "N" Or arr(x+1) = "M" Or _                             arr(x+1) = "B" Or arr(x+1) = "Z" Or _                             arr(x-1) = "S" Or arr(x-1) = "K" Or arr(x-1) = "L" Then                                 cP = ""                                 cS = ""                             End If                         End If                     End If                 End If               Case "L"                 If arr(x+1) = "L" Then                     intJump = 2                       If ((x = intLength-2 And ( _                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLO" Or _                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLA" Or _                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE" _                     )) Or (( _                     arr(intLength-1) & arr(intLength) = "AS" Or _                     arr(intLength-1) & arr(intLength) = "OS" Or _                     arr(intLength) = "A" Or arr(intLength) = "O") And _                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE")) Then                         cS = ""                     End If                 End If               Case "M"                 If arr(x-1) & arr(x) & arr(x+1) = "UMB" And _                 (x = intLength-1 Or arr(x+2) & arr(x+3) = "ER") Then                     intJump = 2                 End If               Case "P"                 Select Case arr(x+1)                     Case "H"                         cP = "F"                         cS = "F"                         intJump = 2                     Case "B"                         intJump = 2                 End Select               Case "Q"                 cP = "K"                 cS = "K"               Case "R"                 If x = intLength And Not isSlavoGermanic And _                 arr(x-2) & arr(x-1) = "IE" And _                 arr(x-4) & arr(x-3) <> "ME" And _                 arr(x-4) & arr(x-3) <> "MA" Then                     cP = ""                 End If               Case "S"                 If arr(x+1) = "L" And (arr(x-1) = "I" Or arr(x-1) = "Y") Then                     cP = ""                     cS = ""                 ElseIf arr(x+1) = "H" And _                 arr(x+2) & arr(x+3) & arr(x+4) <> "EIM" And _                 arr(x+2) & arr(x+3) & arr(x+4) <> "OEK" And _                 arr(x+2) & arr(x+3) & arr(x+4) <> "OLM" And _                 arr(x+2) & arr(x+3) & arr(x+4) <> "OLZ" Then                     intJump = 2                     cP = "X"                     cS = "X"                 ElseIf Not isSlavoGermanic And ( _                 arr(x+1) & arr(x+2) = "IA" Or _                 arr(x+1) & arr(x+2) = "IO") Then                     intJump = 3                     cS = "X"                 ElseIf arr(x+1) = "Z" Then                     cS = "X"                     intJump = 2                 ElseIf arr(x+1) = "C" Then                     intJump = 3                       If arr(x+2) = "H" Then                         If arr(x+3) & arr(x+4) = "OO" Or _                         arr(x+3) & arr(x+4) = "ER" Or _                         arr(x+3) & arr(x+4) = "EN" Or _                         arr(x+3) & arr(x+4) = "UY" Or _                         arr(x+3) & arr(x+4) = "ED" Or _                         arr(x+3) & arr(x+4) = "EM" Then                             cS = "SK"                               If arr(x+3) & arr(x+4) = "ER" Or _                             arr(x+3) & arr(x+4) = "EN" Then                                 cP = "X"                             Else                                 cP = "SK"                             End If                         Else                             cP = "X"                               If x <> iB Or arr(iB+3) = "W" Or arr(iB+3) = "A" Or _                             arr(iB+3) = "E" Or arr(iB+3) = "I" Or arr(iB+3) = "O" Or _                             arr(iB+3) = "U" Or arr(iB+3) = "Y" Then                                 cS = "X"                             End If                         End If                     ElseIf arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "Y" Then                     Else                         cP = "SK"                         cS = "SK"                     End If                 ElseIf x = intLength And arr(x-1) = "I" And ( _                 arr(x-2) = "A" Or arr(x-2) = "O") Then                     cP = ""                 End If               Case "T"                 If arr(x+1) & arr(x+2) & arr(x+3) = "ION" _                 Or arr(x+1) & arr(x+2) = "IA" _                 Or arr(x+1) & arr(x+2) = "CH" Then                     cP = "X"                     cS = "X"                     intJump = 3                 ElseIf (arr(x+1) = "H" Or arr(x+1) & arr(x+2) = "TH") And _                 (arr(x+2) & arr(x+3) <> "OM" And _                 arr(x+2) & arr(x+3) <> "AM" And _                 arr(iB) & arr(iB+1) & arr(iB+2) <> "SCH" And _                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VAN " And _                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VON ") Then                     cP = "0"                     intJump = 2                 ElseIf arr(x+1) = "D" Then                     intJump = 2                 End If               Case "V"                 cP = "F"                 cS = "F"               Case "W"                 If arr(x+1) = "R" Then                     cP = "R"                     cS = "R"                     intJump = 2                 ElseIf arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" _                 Or (x = intLength And ( _                 arr(x-1) = "A" Or _                 arr(x-1) = "E" Or _                 arr(x-1) = "I" Or _                 arr(x-1) = "O" Or _                 arr(x-1) = "U" Or _                 arr(x-1) = "Y")) _                 Or ((arr(x-1) = "E" Or arr(x-1) = "O") And _                 (arr(x+1) & arr(x+2) & arr(x+3) = "SKI" Or _                 arr(x+1) & arr(x+2) & arr(x+3) = "SKY")) Then                     cP = ""                     cS = "F"                 ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "ICZ" _                 Or arr(x+1) & arr(x+2) & arr(x+3) = "ITZ" Then                     cP = "TS"                     cS = "FX"                     intJump = 4                 Else                     cP = ""                     cS = ""                 End If               Case "X"                 If x = intLength And _                 (arr(x-3) & arr(x-2) & arr(x-1) = "IAU" Or _                  arr(x-3) & arr(x-2) & arr(x-1) = "EAU" Or _                 arr(x-2) & arr(x-1) = "AU" Or _                 arr(x-2) & arr(x-1) = "OU") Then                     cP = ""                     cS = ""                 Else                     cP = "KS"                     cS = "KS"                 End If                   If arr(x+1) = "C" Then                     intJump = 2                 End If               Case "Z"                 If arr(x+1) = "H" Then                     cP = "J"                     cS = "J"                 ElseIf (arr(x+1) & arr(x+2) = "ZO" Or _                 arr(x+1) & arr(x+2) = "ZI" Or _                 arr(x+1) & arr(x+2) = "ZA") _                 Or (isSlavoGermanic And x <> iB And arr(x-1) = "T") Then                     cP = "S"                     cS = "TS"                 Else                     cP = "S"                     cS = "S"                 End If         End Select           strPrimary = strPrimary & cP         strSecondary = strSecondary & cS           If arr(x) = arr(x+1) And arr(x) <> "C" Then             intJump = intJump + 1         End If         x = x + intJump     Loop       For i = 1 To intThreshhold         strPrimary = strPrimary & " "         strSecondary = strSecondary & " "     Next       DoubleMetaphone = Left(strPrimary, intThreshhold) & Left(strSecondary, intThreshhold) End Function Dec 21 '15 #1
Share this Article
10 Comments

 Expert 100+ P: 1,107 This looks like a lot of fun. I can't wait to have time to experiment with it. Thanks. Dec 22 '15 #2

 Expert Mod 5K+ P: 5,397 J, Just played with it some in Rabbit's roughdraft, Code ports almost directly into Access-VBA with only a need to wrap line 3 into a sub-procedure. Dec 22 '15 #3

 Expert Mod 10K+ P: 12,359 Thanks Z. And thanks for testing it against another implementation online. That helps ease my mind on whether or not it was implemented properly. Dec 22 '15 #4

 Expert Mod 5K+ P: 5,397 Rabbit, Still playing with this, I had initially just hand typed twenty or so random words in to an on-line implementation. Now I'm playing in a database and I noticed that there is a space between the two groups, one of the online versions ran the two groups together, one did not.. and I closed the window and I have my history set to flush: so for "agencies" you get "AJNSS AKNXS" So in my people template: Expand|Select|Wrap|Line Numbers SELECT people_FirstName    , DoubleMetaphone([people_firstName],6)      AS DMP FROM tbl_people; yields Note the name Foxtrot, here there is no space.... Expand|Select|Wrap|Line Numbers people_FirstName       DMP: Alpha                  ALF   ALF    Beta                   PT    PT     Charlie                XRL   XRL    Delta                  TLT   TLT    Echo                   AX    AK     `Foxtrot                FKSTRTFKSTRT` Golf                   KLF   KLF    Hotel                  HTL   HTL    India                  ANT   ANT    Juliet                 JLT   ALT    Kilo                   KL    KL     Lima                   LM    LM     Mike                   MK    MK     November               NFMPR NFMPR  October                AKTPR AKTPR  Papa                   PP    PP     Quebec                 KPK   KPK    Romeo                  RM    RM     Sierra                 SR    SR     Tango                  TNK   TNK    Uniform                ANFRM ANFRM  Victor                 FKTR  FKTR   Whisky                 ASK   ASK    Xray                   SR    SR     Yankee                 ANK   ANK    Zulu                   SL    SL I've not stepped thru the code yet... Dec 22 '15 #5

 Expert Mod 10K+ P: 12,359 The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the `LEFT(output, threshold)` is the main pronunciation and the `RIGHT(output, threshold)` is a potentially different pronunciation. Hence the spaces when there the encoding produces a pronunciation with less characters than the threshold. Threshold of course is a misnomer. It's more of a limit or max length. Not sure why I used threshold at the time. As for Foxtrot, the algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. Such as the space before ` Foxtrot`. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing. Dec 22 '15 #6

 Expert Mod 5K+ P: 5,397 + Thank you for explaining that, not being familiar with this algorithm. Between the two different implementations and some of the articles I was skimming also smashing the two parts together in to one string made it unclear for me as to how the proper output should be. + The apparent space in "Foxtrot" is an artifact of the formatting in the posting, sorry, didn't intend to mislead you... :( SO I should be seeing "FKSTRT FKSTRT" for "foxtrot"? Directly in the immedate pane: Expand|Select|Wrap|Line Numbers ?DoubleMetaphone("Foxtrot",6) FKSTRTFKSTRT   ?DoubleMetaphone("November",6) NFMPR NFMPR   ?DoubleMetaphone("October",6) AKTPR AKTPR Using foxtrot, STOP at line 586: Expand|Select|Wrap|Line Numbers ?"'" & strPrimary & "'" & len(strPrimary) 'FKSTRT      ' 12 ?"'" & strSecondary & "'" & len(strPrimary) 'FKSTRT      ' 12 ?"'" & Left(strPrimary, intThreshhold) & "'" 'FKSTRT' ?"'" & Left(strSecondary, intThreshhold) & "'" 'FKSTRT' Should we have a check for length of strPrimary and strSecondary against the intThreshhold and add one or does the fact that the threshold matches the length of the encoding correctly merge the two? Dec 22 '15 #7

 Expert Mod 10K+ P: 12,359 Lines 587-590 will pad out the different pronunciations to the threshold if necessary. Line 592 will truncate the different pronunciations if it exceeds the threshold. When the encoding matches or exceeds the threshold, the 2 pronunciations run together without any spaces in between. Dec 22 '15 #8

 Expert Mod 5K+ P: 5,397 So, if understand correctly, this is by design? One of the implementations(http://swoodbridge.com/DoubleMetaPhone/mptest.php3) appears to be using 4 as the threshold and the result returns with "foxtrot yields 'FKST' and 'FKST'" so I am still confused about the correct output. Dec 22 '15 #9

 Expert Mod 10K+ P: 12,359 This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, foxtrot and fixed have the same encoding. So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish. Dec 22 '15 #10

 Expert Mod 10K+ P: 12,359 I've added some of the clarification that came up as a result of our discussion into the main article. Dec 22 '15 #11

### Post your comment

Sign in to post your comment or Sign up for a free account. 