to*****@no.spam.cox.net (Tom van Stiphout) wrote in
<if********************************@4ax.com>:
On Thu, 20 Nov 2003 12:12:32 GMT, Antitax <cn*@carrefuts.com>
wrote:
The SOUNDEX function is often used to group similar-sounding
records. I often find it not discriminating enough. I'm using the
Ratcliff-Obershelp algorithm. It has its own problems, but it
works better in my situation.
Information about both can be found by your favorite search
engine.
Soundex (and related functions) are designed for finding
similar-sounding words. An address is a collection of numbers and
words, rather than a single word, and therefore not that
susceptible to comparison with soundalike functions.
When I really need to do this kind of thing (I once had to de-dupe
250K records against 100K existing records with a likely overlap of
something on the order of 1/3), I process the address field into a
numeric part and a text part and then work with those individually.
It's also helpful to process the commonly occuring words into a
common format (Street, Str, St, etc.).
I've also used a function of my own that works on the same
principle as Simil (Levenshtein) functions, which return a number
that shows how close two strings are to each other (it's useful to
think of it as percentage of overlap, but that's not precisely how
Simil functions calculate it).
The best way to insure the ability to de-dupe on addresses (or to
group on addresses), is to have your data entry process do the
regularization. There are a couple of methods for this:
1. in the AfterUpdate, scan the input address and regularize
abbreviations, etc.
2. use multiple fields to collect the information in a structured
fashion, and then write to multiple fields or write the result to a
single field.
Some combination of the two can be used also.
The problem is, of course, that no existing data is that clean, so
unless you have CASS-certified data (i.e., formatted according to
US postal regulations by being pre-processed for regularity), you
are really working towards a future in which all the data has been
entered via your UI.
--
David W. Fenton
http://www.bway.net/~dfenton
dfenton at bway dot net
http://www.bway.net/~dfassoc