Michel Esber wrote:
Thanks for the hint.
I saw this function before posting. Unfortunately, it does not meet my
requirements. For example and according to the docs:
SELECT EMPNO, LASTNAME FROM EMPLOYEE
WHERE SOUNDEX(LASTNAME) = SOUNDEX('Loucesy')
EMPNO LASTNAME
------ ---------------
000110 LUCCHESSI
But for my application, 'Loucesy' and 'LUCCHESSI' are very different
strings, even though their sounds are similar.
Any other ideas?
I *think* what you're looking for is the "distance" between strings;
i.e. the number of changes one must make to get from one string to
another. The Levenshtein Distance algorithm provides a way to calculate
this. See:
Levenshtein Distance Article
http://en.wikipedia.org/wiki/Levenshtein_distance
Example implementations in several languages
http://en.wikisource.org/wiki/Levenshtein_distance
This algorithm returns 2 for the distance between "ABCDEFGHIJ" and
"ACBDEFGHIJ" (indicating that 2 alterations, an insertion and a
deletion, have to be made to get from one to the other). There are
refinements of the Levenshtein Distance algorithm that include swapping
characters as an operation which could return 1 for the distance.
To get a percentage similarity you could do something fairly crude like
comparing the distance to the length of the string, e.g.:
100 * (len - distance) / len
Which in this case would give 80%.
Unfortunately, looking at the implementations, the algorithm is
probably quite hard to implement efficiently in an SQL UDF. You'd
likely be better off implementing it as an external UDF in C or Java
(there are C++ and Java implementations at the link above, as well as
Lisp, Python, Ruby, Perl, Haskell, etc.)
HTH,
Dave.
--