473,549 Members | 2,708 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Fuzzy String Matching: Double Metaphone Algorithm

Rabbit
12,516 Recognized Expert Moderator MVP
Introduction
One of the big issues with name matching is how prone it is to error. There are many different ways people spell the same name, typos, mishearing what the other person said. There are a myriad of ways that free form language data can be corrupted. And it causes many headaches when you need to search/match on the bad data.

There are many different approaches to get around it. Like the Levenshtein algorithm which calculates how many edits it would take to make one string match another string. Or the NGram algorithm that exams the smaller sequences that a string is composed of and compares them to the sequences of a nother string. Then there are phonetic algorithms that encode a string based on how it would "sound". Like the SoundEx or Double Metaphone algorithm.

What is Double Metaphone?
Double Metaphone is a phonetic algorithm that takes a string and produces 2 encodings on how it could be pronounced in spoken language. Two encodings are produced because a word can sometimes be pronounced multiple ways.

It is used for English and contains many rules on the "sounds" of a string. It also tries to account for differences in the English pronunciation of Romanized words in Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins.

How Does it Work?
The algorithm works by looking at hundreds of phonetic rules to create a phonetic string. Some examples of the rules it uses are:
  • P and B are encoded to the same sound. Unless the P is followed by an H which is then encoded to the F sound.
  • SCI is encoded to the S sound while SCO is encoded to the SK sound.
  • T and D are encoded to the same sound. Unless the T is followed by an H which is then encoded to the theta sound which is represented by the number 0.

The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the LEFT(output, threshold) is the main pronunciation and the RIGHT(output, threshold) is a potentially different pronunciation. Hence the spaces when the the encoding produces a pronunciation with less characters than the threshold.

The algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing.

This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, "foxtrot" and "fixed" have the same encoding.

So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish.

Sample Implementation in VBScript
This implementation takes a string s input along with a length i input and outputs 2 possible different phonetic encodings of the string, each of length I.

For example, calling it with DoubleMetaphone ("Smith", 4) would return SM0 XMT , spaces included.

You can use this to precalculate the pronunciation of names or words in your tables and compare the 2 different phonetic codes against all the other phonetic codes in the table to find names or words that sound similar.

Expand|Select|Wrap|Line Numbers
  1. Option Explicit
  2.  
  3. MsgBox(DoubleMetaphone(InputBox("Enter String"), 6))
  4.  
  5. Function DoubleMetaphone(strOriginal, intThreshhold)
  6.     Dim isSlavoGermanic, strPrimary, strSecondary, i, intJump, iB
  7.     Dim intLength, cP, cS, arr, x, intPad
  8.  
  9.     isSlavoGermanic = False
  10.     iB = 4
  11.     intPad = 6
  12.     x = iB
  13.     intLength = Len(strOriginal) + iB - 1
  14.     strOriginal = UCase(strOriginal)
  15.  
  16.     If (InStr(strOriginal, "W") + InStr(strOriginal, "K") + InStr(strOriginal, "CZ") + InStr(strOriginal, "WITZ")) <> 0 Then
  17.         isSlavoGermanic = True
  18.     End If
  19.  
  20.     ReDim arr(intLength + intPad + 1)
  21.  
  22.     For i = 0 To iB-1
  23.         arr(i) = vbTab
  24.     Next
  25.  
  26.     For i = iB To intLength
  27.         arr(i) = Mid(strOriginal, i-iB+1, 1)
  28.     Next
  29.  
  30.     For i = intLength+1 To UBound(arr)
  31.         arr(i) = vbTab
  32.     Next
  33.  
  34.     Select Case (arr(x) & arr(x+1))
  35.         Case "AC"
  36.             strPrimary = "AKS"
  37.             strSecondary = "AKS"
  38.             x = x + 4
  39.         Case "GN", "KN", "PN", "PS"
  40.             x = x + 1
  41.         Case "HA", "HE", "HI", "HO", "HU", "HY"
  42.             strPrimary = "H"
  43.             strSecondary = "H"
  44.             x = x + 2
  45.         Case "WA", "WE", "WI", "WO", "WU", "WY"
  46.             strPrimary = "A"
  47.             strSecondary = "F"
  48.             x = x + 2
  49.         Case "WH"
  50.             strPrimary = "A"
  51.             strSecondary = "A"
  52.             x = x + 1
  53.         Case "SM", "SN", "SL", "SW"
  54.             strPrimary = "S"
  55.             strSecondary = "X"
  56.             x = x + 1
  57.         Case "GY"
  58.             strPrimary = "K"
  59.             strSecondary = "J"
  60.             x = x + 2
  61.     End Select
  62.  
  63.     If x = iB Then
  64.         If arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "JOSE" Then
  65.             If (x = iB And arr(x+4) = " ") Then
  66.                 strPrimary = "HS"
  67.                 strSecondary = "HS"
  68.                 x = x + 4
  69.             End If
  70.         ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "SUGAR" Then
  71.             strPrimary = "XK"
  72.             strSecondary = "SK"
  73.             x = x + 5
  74.         ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CAESAR" Then
  75.             strPrimary = "SSR"
  76.             strSecondary = "SSR"
  77.             x = x + 6
  78.         ElseIf (arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARAC" Or _
  79.         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARIS" Or _
  80.         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHOR" Or _
  81.         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHYM" Or _
  82.         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHEM") And _
  83.         arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) <> "CHORE" Then
  84.             strPrimary = "K"
  85.             strSecondary = "K"
  86.             x = x + 2
  87.         End If
  88.     End If
  89.  
  90.     If x = iB Then
  91.         Select Case arr(x) & arr(x+1) & arr(x+2)
  92.             Case "GES", "GEP", "GEB", "GEL", "GEY", "GIB", "GIL", "GIN", "GIE", "GEI", "GER"
  93.                 strPrimary = "K"
  94.                 strSecondary = "J"
  95.                 x = x + 2
  96.             Case "GHI"
  97.                 strPrimary = "J"
  98.                 strSecondary = "J"
  99.                 x = x + 3
  100.             Case "AGN", "EGN", "IGN", "OGN", "UGN", "UGY"
  101.                 If Not isSlavoGermanic Then
  102.                     strPrimary = "AKN"
  103.                     strSecondary = "AN"
  104.                     x = x + 3
  105.                 End If
  106.         End Select
  107.     End If
  108.  
  109.     If x = iB Then
  110.         Select Case arr(x)
  111.             Case "X"
  112.                 strPrimary = "S"
  113.                 strSecondary = "S"
  114.                 x = x + 1
  115.             Case "A", "E", "I", "O", "U", "Y"
  116.                 strPrimary = "A"
  117.                 strSecondary = "A"
  118.                 x = x + 1
  119.             Case "J"
  120.                 strPrimary = "J"
  121.                 strSecondary = "A"
  122.                 x = x + 1
  123.         End Select
  124.     End If
  125.  
  126.     Do While x <= intLength
  127.         If Len(strPrimary) >= intThreshhold Then
  128.             Exit Do
  129.         End If
  130.  
  131.         intJump = 1
  132.         cP = arr(x)
  133.         cS = arr(x)
  134.  
  135.         Select Case arr(x)
  136.             Case "A", "E", "I", "O", "U", "Y"
  137.                 cP = ""
  138.                 cS = ""
  139.  
  140.             Case "B"
  141.                 cP = "P"
  142.                 cS = "P"
  143.  
  144.             Case "Ç"
  145.                 cP = "S"
  146.                 cS = "S"
  147.  
  148.             Case "C"
  149.                 If x > iB+1 And arr(x-2) <> "A" And arr(x-2) <> "E" And arr(x-2) <> "I" And arr(x-2) <> "O" And arr(x-2) <> "U" And _
  150.                 arr(x-2) <> "Y" And arr(x-1) & arr(x+1) = "AH" And ((arr(x+2) <> "I" And arr(x+2) <> "E") Or _
  151.                 arr(x-2) & arr(x+2) & arr(x+3) = "BER" Or arr(x-2) & arr(x+2) & arr(x+3) = "MER") Then
  152.                     cP = "K"
  153.                     cS = "K"
  154.                     intJump = 2
  155.                 ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "HIA" Then
  156.                     cP = "K"
  157.                     cS = "K"
  158.                     intJump = 4
  159.                 ElseIf arr(x+1) = "H" Then
  160.                     If x > iB And arr(x+2) & arr(x+3) = "AE" Then
  161.                         cP = "K"
  162.                         cS = "X"
  163.                         intJump = 2
  164.                     ElseIf arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _
  165.                     arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _
  166.                     arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or arr(x+2) = "T" Or arr(x+2) = "S" Or _
  167.                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHES" Or _
  168.                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ARHIT" Or _
  169.                     arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHID" Or _ 
  170.                     ((arr(x-2) = "A" Or arr(x-2) = "E" Or arr(x-2) = "O" Or arr(x-2) = "U" Or x = iB) And _
  171.                     (arr(x+2) = "L" Or arr(x+2) = "R" Or arr(x+2) = "N" Or arr(x+2) = "M" Or arr(x+2) = "B" Or _
  172.                     arr(x+2) = "H" Or arr(x+2) = "F" Or arr(x+2) = "V" Or arr(x+2) = "W" Or arr(x+2) = " "))Then
  173.                         cP = "K"
  174.                         cS = "K"
  175.                         intJump = 2
  176.                     Else
  177.                         intJump = 2
  178.  
  179.                         If x > iB Then
  180.                             If arr(iB) & arr(iB+1) = "MC" Then
  181.                                 cP = "K"
  182.                                 cS = "K"
  183.                             Else
  184.                                 cP = "X"
  185.                                 cS = "K"
  186.                             End If
  187.                         Else
  188.                             cP = "X"
  189.                             cS = "X"
  190.                         End If
  191.                     End If
  192.                 ElseIf arr(x+1) = "Z" And arr(x-2) & arr(x-1) <> "WI" Then
  193.                     cP = "S"
  194.                     cS = "X"
  195.                     intJump = 2
  196.                 ElseIf arr(x+1) & arr(x+2) & arr(x+2) = "CIA" Then
  197.                     cP = "X"
  198.                     cS = "X"
  199.                     intJump = 3
  200.                 ElseIf arr(x+1) = "C" And Not (x = iB+1 And arr(iB) = "M") Then
  201.                     If (arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "H") And arr(x+2) & arr(x+3) <> "HU" Then
  202.                         If arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCEE" Or _
  203.                         arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCES" Then
  204.                             cP = "KS"
  205.                             cS = "KS"
  206.                             intJump = 3
  207.                         Else
  208.                             cP = "X"
  209.                             cS = "X"
  210.                             intJump = 3
  211.                         End If
  212.                     Else
  213.                         cP = "K"
  214.                         cS = "K"
  215.                         intJump = 2
  216.                     End If
  217.                 ElseIf arr(x+1) = "K" Or arr(x+1) = "G" Or arr(x+1) = "Q" Then
  218.                     cP = "K"
  219.                     cS = "K"
  220.                     intJump = 2
  221.                 ElseIf arr(x+1) = "I" Or arr(x+1) = "E" Or arr(x+1) = "Y" Then
  222.                     If arr(x+1) & arr(x+2) = "IO" Or arr(x+1) & arr(x+2) = "IE" Or arr(x+1) & arr(x+2) = "IA" Then
  223.                         cP = "S"
  224.                         cS = "X"
  225.                         intJump = 2
  226.                     Else
  227.                         cP = "S"
  228.                         cS = "S"
  229.                         intJump = 2
  230.                     End If
  231.                 Else
  232.                     cP = "K"
  233.                     cS = "K"
  234.  
  235.                     If arr(x+1) & arr(x+2) = " C" Or arr(x+1) & arr(x+2) = " Q" Or arr(x+1) & arr(x+2) = " G" Then
  236.                         intJump = 3
  237.                     Else
  238.                         If (arr(x+1) = "C" Or arr(x+1) = "K" Or arr(x+1) = "Q") And _
  239.                         arr(x+1) & arr(x+2) <> "CE" And arr(x+1) & arr(x+2) <> "CI" Then 
  240.                             intJump = 2
  241.                         End If
  242.                     End If
  243.                 End If
  244.  
  245.             Case "D"
  246.                 If arr(x+1) = "G" Then
  247.                     If arr(x+2) = "I" Or _
  248.                     arr(x+2) = "E" Or _
  249.                     arr(x+2) = "Y" Then
  250.                         cP = "J"
  251.                         cS = "J"
  252.                         intJump = 3
  253.                     Else
  254.                         cP = "TK"
  255.                         cS = "TK"
  256.                         intJump = 2
  257.                     End If
  258.                 ElseIf arr(x+1) = "T" Then
  259.                     cP = "T"
  260.                     cS = "T"
  261.                     intJump = 2
  262.                 Else
  263.                     cP = "T"
  264.                     cS = "T"
  265.                 End If
  266.  
  267.             Case "G"
  268.                 If arr(x+1) = "H" Then
  269.                     If x <> iB And arr(x-1) <> "A" And arr(x-1) <> "E" And arr(x-1) <> "I" _
  270.                     And arr(x-1) <> "O" And arr(x-1) <> "U" And arr(x-1) <> "Y" Then
  271.                         cP = "K"
  272.                         cS = "K"
  273.                         intJump = 2
  274.                     ElseIf (x > iB+1 And (arr(x-2) = "B" Or arr(x-2) = "H" Or arr(x-2) = "D")) Or _
  275.                     (x > iB+2 And (arr(x-3) = "B" Or arr(x-3) = "H" Or arr(x-3) = "D")) Or _
  276.                     (x > iB+3 And (arr(x-4) = "B" Or arr(x-4) = "H")) Then
  277.                         cP = ""
  278.                         cS = ""
  279.                         intJump = 2
  280.                     Else
  281.                         If x > iB+2 And arr(x-1) = "U" And _
  282.                         (arr(x-3) = "C" Or arr(x-3) = "G" Or arr(x-3) = "L" Or arr(x-3) = "R" Or arr(x-3) = "T") Then
  283.                             cP = "F"
  284.                             cS = "F"
  285.                             intJump = 2
  286.                         ElseIf x > iB And arr(x-1) <> "I" Then
  287.                             cP = "K"
  288.                             cS = "K"
  289.                             intJump = 2
  290.                         Else
  291.                             cP = ""
  292.                             cS = ""
  293.                         End If
  294.                     End If
  295.                 ElseIf arr(x+1) = "N" Then
  296.                     cS = "KN"
  297.                     intJump = 2
  298.  
  299.                     If arr(x+2) & arr(x+3) <> "EY" And Not isSlavoGermanic Then
  300.                         cP = "N"
  301.                     Else
  302.                         cP = "KN"
  303.                     End If
  304.                 ElseIf arr(x+1) & arr(x+2) = "LI" And Not isSlavoGermanic Then
  305.                     cP = "KL"
  306.                     cS = "L"
  307.                     intJump = 2
  308.                 ElseIf (arr(x+1) & arr(x+2) = "ER" Or arr(x+1) = "Y") And _
  309.                 arr(x-1) <> "E" And arr(x-1) <> "I" And _
  310.                 arr(x-1) & arr(x+1) <> "RY" And _
  311.                 arr(x-1) & arr(x+1) <> "OY" And _
  312.                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "DANGER" And _
  313.                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "RANGER" And _
  314.                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "MANGER" Then
  315.                     cP = "K"
  316.                     cS = "J"
  317.                     intJump = 2
  318.                 ElseIf arr(x+1) = "E" Or arr(x+1) = "I" Or arr(x+1) = "Y" Or _
  319.                 arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "AGGI" Or _
  320.                 arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "OGGI" Then
  321.                     If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _
  322.                     arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _
  323.                     arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or _
  324.                     arr(x+1) & arr(x+2) = "ET" Then
  325.                         cP = "K"
  326.                         cS = "K"
  327.                         intJump = 2
  328.                     Else
  329.                         cP = "J"
  330.                         If arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "IER " Then
  331.                             cS = "J"
  332.                             intJump = 3
  333.                         Else
  334.                             cS = "K"
  335.                             intJump = 2
  336.                         End If
  337.                     End If
  338.                 Else
  339.                     cP = "K"
  340.                     cS = "K"
  341.                 End If
  342.  
  343.             Case "H"
  344.                 If (arr(x-1) = "A" Or _
  345.                 arr(x-1) = "E" Or _
  346.                 arr(x-1) = "I" Or _
  347.                 arr(x-1) = "O" Or _
  348.                 arr(x-1) = "U" Or _
  349.                 arr(x-1) = "Y") And _
  350.                 (arr(x+1) = "A" Or _
  351.                 arr(x+1) = "E" Or _
  352.                 arr(x+1) = "I" Or _
  353.                 arr(x+1) = "O" Or _
  354.                 arr(x+1) = "U" Or _
  355.                 arr(x+1) = "Y") Then
  356.                     intJump = 2
  357.                 Else
  358.                     cP = ""
  359.                     cS = ""
  360.                 End If
  361.  
  362.             Case "J"
  363.                 If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "SAN " Then
  364.                     cP = "H"
  365.                     cS = "H"
  366.                 Else
  367.                     If Not isSlavoGermanic And ( _
  368.                     arr(x-1) = "A" Or _
  369.                     arr(x-1) = "E" Or _
  370.                     arr(x-1) = "I" Or _
  371.                     arr(x-1) = "O" Or _
  372.                     arr(x-1) = "U" Or _
  373.                     arr(x-1) = "Y") And ( _
  374.                     arr(x+1) = "A" Or _
  375.                     arr(x+1) = "O") Then
  376.                         cS = "H"
  377.                     Else
  378.                         If x = intLength Then
  379.                             cS = ""
  380.                         Else
  381.                             If arr(x+1) = "L" Or arr(x+1) = "T" Or arr(x+1) = "K" Or _
  382.                             arr(x+1) = "S" Or arr(x+1) = "N" Or arr(x+1) = "M" Or _
  383.                             arr(x+1) = "B" Or arr(x+1) = "Z" Or _
  384.                             arr(x-1) = "S" Or arr(x-1) = "K" Or arr(x-1) = "L" Then
  385.                                 cP = ""
  386.                                 cS = ""
  387.                             End If
  388.                         End If
  389.                     End If
  390.                 End If
  391.  
  392.             Case "L"
  393.                 If arr(x+1) = "L" Then
  394.                     intJump = 2
  395.  
  396.                     If ((x = intLength-2 And ( _
  397.                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLO" Or _
  398.                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLA" Or _
  399.                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE" _
  400.                     )) Or (( _
  401.                     arr(intLength-1) & arr(intLength) = "AS" Or _
  402.                     arr(intLength-1) & arr(intLength) = "OS" Or _
  403.                     arr(intLength) = "A" Or arr(intLength) = "O") And _
  404.                     arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE")) Then
  405.                         cS = ""
  406.                     End If
  407.                 End If
  408.  
  409.             Case "M"
  410.                 If arr(x-1) & arr(x) & arr(x+1) = "UMB" And _
  411.                 (x = intLength-1 Or arr(x+2) & arr(x+3) = "ER") Then
  412.                     intJump = 2
  413.                 End If
  414.  
  415.             Case "P"
  416.                 Select Case arr(x+1)
  417.                     Case "H"
  418.                         cP = "F"
  419.                         cS = "F"
  420.                         intJump = 2
  421.                     Case "B"
  422.                         intJump = 2
  423.                 End Select
  424.  
  425.             Case "Q"
  426.                 cP = "K"
  427.                 cS = "K"
  428.  
  429.             Case "R"
  430.                 If x = intLength And Not isSlavoGermanic And _
  431.                 arr(x-2) & arr(x-1) = "IE" And _
  432.                 arr(x-4) & arr(x-3) <> "ME" And _
  433.                 arr(x-4) & arr(x-3) <> "MA" Then
  434.                     cP = ""
  435.                 End If
  436.  
  437.             Case "S"
  438.                 If arr(x+1) = "L" And (arr(x-1) = "I" Or arr(x-1) = "Y") Then
  439.                     cP = ""
  440.                     cS = ""
  441.                 ElseIf arr(x+1) = "H" And _
  442.                 arr(x+2) & arr(x+3) & arr(x+4) <> "EIM" And _
  443.                 arr(x+2) & arr(x+3) & arr(x+4) <> "OEK" And _
  444.                 arr(x+2) & arr(x+3) & arr(x+4) <> "OLM" And _
  445.                 arr(x+2) & arr(x+3) & arr(x+4) <> "OLZ" Then
  446.                     intJump = 2
  447.                     cP = "X"
  448.                     cS = "X"
  449.                 ElseIf Not isSlavoGermanic And ( _
  450.                 arr(x+1) & arr(x+2) = "IA" Or _
  451.                 arr(x+1) & arr(x+2) = "IO") Then
  452.                     intJump = 3
  453.                     cS = "X"
  454.                 ElseIf arr(x+1) = "Z" Then
  455.                     cS = "X"
  456.                     intJump = 2
  457.                 ElseIf arr(x+1) = "C" Then
  458.                     intJump = 3
  459.  
  460.                     If arr(x+2) = "H" Then
  461.                         If arr(x+3) & arr(x+4) = "OO" Or _
  462.                         arr(x+3) & arr(x+4) = "ER" Or _
  463.                         arr(x+3) & arr(x+4) = "EN" Or _
  464.                         arr(x+3) & arr(x+4) = "UY" Or _
  465.                         arr(x+3) & arr(x+4) = "ED" Or _
  466.                         arr(x+3) & arr(x+4) = "EM" Then
  467.                             cS = "SK"
  468.  
  469.                             If arr(x+3) & arr(x+4) = "ER" Or _
  470.                             arr(x+3) & arr(x+4) = "EN" Then
  471.                                 cP = "X"
  472.                             Else
  473.                                 cP = "SK"
  474.                             End If
  475.                         Else
  476.                             cP = "X"
  477.  
  478.                             If x <> iB Or arr(iB+3) = "W" Or arr(iB+3) = "A" Or _
  479.                             arr(iB+3) = "E" Or arr(iB+3) = "I" Or arr(iB+3) = "O" Or _
  480.                             arr(iB+3) = "U" Or arr(iB+3) = "Y" Then
  481.                                 cS = "X"
  482.                             End If
  483.                         End If
  484.                     ElseIf arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "Y" Then
  485.                     Else
  486.                         cP = "SK"
  487.                         cS = "SK"
  488.                     End If
  489.                 ElseIf x = intLength And arr(x-1) = "I" And ( _
  490.                 arr(x-2) = "A" Or arr(x-2) = "O") Then
  491.                     cP = ""
  492.                 End If
  493.  
  494.             Case "T"
  495.                 If arr(x+1) & arr(x+2) & arr(x+3) = "ION" _
  496.                 Or arr(x+1) & arr(x+2) = "IA" _
  497.                 Or arr(x+1) & arr(x+2) = "CH" Then
  498.                     cP = "X"
  499.                     cS = "X"
  500.                     intJump = 3
  501.                 ElseIf (arr(x+1) = "H" Or arr(x+1) & arr(x+2) = "TH") And _
  502.                 (arr(x+2) & arr(x+3) <> "OM" And _
  503.                 arr(x+2) & arr(x+3) <> "AM" And _
  504.                 arr(iB) & arr(iB+1) & arr(iB+2) <> "SCH" And _
  505.                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VAN " And _
  506.                 arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VON ") Then
  507.                     cP = "0"
  508.                     intJump = 2
  509.                 ElseIf arr(x+1) = "D" Then
  510.                     intJump = 2
  511.                 End If
  512.  
  513.             Case "V"
  514.                 cP = "F"
  515.                 cS = "F"
  516.  
  517.             Case "W"
  518.                 If arr(x+1) = "R" Then
  519.                     cP = "R"
  520.                     cS = "R"
  521.                     intJump = 2
  522.                 ElseIf arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" _
  523.                 Or (x = intLength And ( _
  524.                 arr(x-1) = "A" Or _
  525.                 arr(x-1) = "E" Or _
  526.                 arr(x-1) = "I" Or _
  527.                 arr(x-1) = "O" Or _
  528.                 arr(x-1) = "U" Or _
  529.                 arr(x-1) = "Y")) _
  530.                 Or ((arr(x-1) = "E" Or arr(x-1) = "O") And _
  531.                 (arr(x+1) & arr(x+2) & arr(x+3) = "SKI" Or _
  532.                 arr(x+1) & arr(x+2) & arr(x+3) = "SKY")) Then
  533.                     cP = ""
  534.                     cS = "F"
  535.                 ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "ICZ" _
  536.                 Or arr(x+1) & arr(x+2) & arr(x+3) = "ITZ" Then
  537.                     cP = "TS"
  538.                     cS = "FX"
  539.                     intJump = 4
  540.                 Else
  541.                     cP = ""
  542.                     cS = ""
  543.                 End If
  544.  
  545.             Case "X"
  546.                 If x = intLength And _
  547.                 (arr(x-3) & arr(x-2) & arr(x-1) = "IAU" Or _ 
  548.                 arr(x-3) & arr(x-2) & arr(x-1) = "EAU" Or _
  549.                 arr(x-2) & arr(x-1) = "AU" Or _
  550.                 arr(x-2) & arr(x-1) = "OU") Then
  551.                     cP = ""
  552.                     cS = ""
  553.                 Else
  554.                     cP = "KS"
  555.                     cS = "KS"
  556.                 End If
  557.  
  558.                 If arr(x+1) = "C" Then
  559.                     intJump = 2
  560.                 End If
  561.  
  562.             Case "Z"
  563.                 If arr(x+1) = "H" Then
  564.                     cP = "J"
  565.                     cS = "J"
  566.                 ElseIf (arr(x+1) & arr(x+2) = "ZO" Or _
  567.                 arr(x+1) & arr(x+2) = "ZI" Or _
  568.                 arr(x+1) & arr(x+2) = "ZA") _
  569.                 Or (isSlavoGermanic And x <> iB And arr(x-1) = "T") Then
  570.                     cP = "S"
  571.                     cS = "TS"
  572.                 Else
  573.                     cP = "S"
  574.                     cS = "S"
  575.                 End If
  576.         End Select
  577.  
  578.         strPrimary = strPrimary & cP
  579.         strSecondary = strSecondary & cS
  580.  
  581.         If arr(x) = arr(x+1) And arr(x) <> "C" Then
  582.             intJump = intJump + 1
  583.         End If
  584.         x = x + intJump
  585.     Loop
  586.  
  587.     For i = 1 To intThreshhold
  588.         strPrimary = strPrimary & " "
  589.         strSecondary = strSecondary & " "
  590.     Next
  591.  
  592.     DoubleMetaphone = Left(strPrimary, intThreshhold) & Left(strSecondary, intThreshhold)
  593. End Function
Dec 21 '15 #1
10 12842
jforbes
1,107 Recognized Expert Top Contributor
This looks like a lot of fun. I can't wait to have time to experiment with it. Thanks.
Dec 22 '15 #2
zmbd
5,501 Recognized Expert Moderator Expert
J,
Just played with it some in Rabbit's roughdraft,
Code ports almost directly into Access-VBA with only a need to wrap line 3 into a sub-procedure.
Dec 22 '15 #3
Rabbit
12,516 Recognized Expert Moderator MVP
Thanks Z. And thanks for testing it against another implementation online. That helps ease my mind on whether or not it was implemented properly.
Dec 22 '15 #4
zmbd
5,501 Recognized Expert Moderator Expert
Rabbit,
Still playing with this, I had initially just hand typed twenty or so random words in to an on-line implementation.
Now I'm playing in a database and I noticed that there is a space between the two groups, one of the online versions ran the two groups together, one did not.. and I closed the window and I have my history set to flush:

so for "agencies" you get "AJNSS AKNXS"

So in my people template:
Expand|Select|Wrap|Line Numbers
  1. SELECT people_FirstName
  2.    , DoubleMetaphone([people_firstName],6)
  3.      AS DMP
  4. FROM tbl_people;
yields
Note the name Foxtrot, here there is no space....
Expand|Select|Wrap|Line Numbers
  1. people_FirstName       DMP:
  2. Alpha                  ALF   ALF   
  3. Beta                   PT    PT    
  4. Charlie                XRL   XRL   
  5. Delta                  TLT   TLT   
  6. Echo                   AX    AK    
  7. Foxtrot                FKSTRTFKSTRT
  8. Golf                   KLF   KLF   
  9. Hotel                  HTL   HTL   
  10. India                  ANT   ANT   
  11. Juliet                 JLT   ALT   
  12. Kilo                   KL    KL    
  13. Lima                   LM    LM    
  14. Mike                   MK    MK    
  15. November               NFMPR NFMPR 
  16. October                AKTPR AKTPR 
  17. Papa                   PP    PP    
  18. Quebec                 KPK   KPK   
  19. Romeo                  RM    RM    
  20. Sierra                 SR    SR    
  21. Tango                  TNK   TNK   
  22. Uniform                ANFRM ANFRM 
  23. Victor                 FKTR  FKTR  
  24. Whisky                 ASK   ASK   
  25. Xray                   SR    SR    
  26. Yankee                 ANK   ANK   
  27. Zulu                   SL    SL
I've not stepped thru the code yet...
Dec 22 '15 #5
Rabbit
12,516 Recognized Expert Moderator MVP
The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the LEFT(output, threshold) is the main pronunciation and the RIGHT(output, threshold) is a potentially different pronunciation. Hence the spaces when there the encoding produces a pronunciation with less characters than the threshold. Threshold of course is a misnomer. It's more of a limit or max length. Not sure why I used threshold at the time.

As for Foxtrot, the algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. Such as the space before Foxtrot. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing.
Dec 22 '15 #6
zmbd
5,501 Recognized Expert Moderator Expert
+ Thank you for explaining that, not being familiar with this algorithm. Between the two different implementations and some of the articles I was skimming also smashing the two parts together in to one string made it unclear for me as to how the proper output should be.

+ The apparent space in "Foxtrot" is an artifact of the formatting in the posting, sorry, didn't intend to mislead you... :(
SO I should be seeing "FKSTRT FKSTRT" for "foxtrot"?
Directly in the immedate pane:
Expand|Select|Wrap|Line Numbers
  1. ?DoubleMetaphone("Foxtrot",6)
  2. FKSTRTFKSTRT
  3.  
  4. ?DoubleMetaphone("November",6)
  5. NFMPR NFMPR
  6.  
  7. ?DoubleMetaphone("October",6)
  8. AKTPR AKTPR
Using foxtrot, STOP at line 586:
Expand|Select|Wrap|Line Numbers
  1. ?"'" & strPrimary & "'" & len(strPrimary)
  2. 'FKSTRT      ' 12
  3. ?"'" & strSecondary & "'" & len(strPrimary)
  4. 'FKSTRT      ' 12
  5. ?"'" & Left(strPrimary, intThreshhold) & "'"
  6. 'FKSTRT'
  7. ?"'" & Left(strSecondary, intThreshhold) & "'"
  8. 'FKSTRT'
Should we have a check for length of strPrimary and strSecondary against the intThreshhold and add one or does the fact that the threshold matches the length of the encoding correctly merge the two?
Dec 22 '15 #7
Rabbit
12,516 Recognized Expert Moderator MVP
Lines 587-590 will pad out the different pronunciations to the threshold if necessary.

Line 592 will truncate the different pronunciations if it exceeds the threshold.

When the encoding matches or exceeds the threshold, the 2 pronunciations run together without any spaces in between.
Dec 22 '15 #8
zmbd
5,501 Recognized Expert Moderator Expert
So, if understand correctly, this is by design?

One of the implementations (http://swoodbridge.com/DoubleMetaPhone/mptest.php3) appears to be using 4 as the threshold and the result returns with "foxtrot yields 'FKST' and 'FKST'"

so I am still confused about the correct output.
Dec 22 '15 #9
Rabbit
12,516 Recognized Expert Moderator MVP
This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, foxtrot and fixed have the same encoding.

So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish.
Dec 22 '15 #10

Sign in to post your reply or Sign up for a free account.

Similar topics

10
9532
by: javuchi | last post by:
I'm searching for a library which makes aproximative string matching, for example, searching in a dictionary the word "motorcycle", but returns similar strings like "motorcicle". Is there such a library?
2
2713
by: laura | last post by:
Hi, I'm looking for the algorithm which does the lowest number of comparison for the string matching problem (in the worst case). I know that the complexity is liniar in the length of the strings, but I'm interested in finding the actual number of comparisons. The lowest number (theoretical) is m-n+1, but as far as I know no algorithm...
14
13465
by: Steve Bergman | last post by:
I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in various ways. One might include a '-' or have leading zeros, or have a single character missing, or a zero that is typed as a letter 'O'. That...
15
7634
by: shiniskumar | last post by:
How to convert string to double? Ive got a double variable dTot; its value is 5.037717235E7 when i did FreemarkerTools.formatDecimal(dTot) i got it as string "50377172.35". now i want to pass it as a double value to the method format() which is shown below. when i do Double.parseDouble i get dTot as 5.037717235E7 ...
3
3875
by: JohnH | last post by:
I'm very interested in implementing double-metaphone search in my database to resolve customer names, and I've done quite a bit of searching trying to find a VBA implementation. I know that Clive Bolton spent quite a bit of time creating one several years ago (an A97 demo mdb), but it doesn't seem to be publicly available. I would greatly...
4
3951
by: Kelly B | last post by:
This is a simple string matching code from K&R2 (modified to match a user entered string from a text file) I tried to code an alternative strindex function(commented below) but it does not work ,i don't seem to understand where exactly am i going wrong Secondly using gets() can be a dangerous option if the user enters more characters than...
8
2822
by: kumarboston | last post by:
Hi All, I have a string of around 500 characters(alphabets) and I have a 3 arrays storing the position "start, end and type" for that string. i am trying to do is to match these position to the string and highlight those positions on the web page. For example: string- ABCDREGYRIWJEKSALOPRHDAGRTPRTDBRTWASERFSDHSJHDS start_pos=5 end_pos=10...
6
31483
Rabbit
by: Rabbit | last post by:
INTRODUCTION The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters. USES The most common use of the function is for approximate string matching. Since the function returns the minimum number...
4
13568
Rabbit
by: Rabbit | last post by:
INTRODUCTION An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld". USES NGram modeling is often used to model natural languages. It is also used to predict the...
0
7524
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7451
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
1
7475
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
0
6048
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5372
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3501
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
1
1944
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1061
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
766
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.