Connecting Tech Pros Worldwide Help | Site Map

Query for finding similar adress records in database

Antitax
Guest
 
Posts: n/a
#1: Nov 12 '05
I have a database with more than 800 adress records

Some of the are similar because some letters in the street adress for
example are not identical, altough they point to the same adress.

Does anyone have the syntax for a query that shows similar adresses
thar could be duplicates?

Thank you
Tom van Stiphout
Guest
 
Posts: n/a
#2: Nov 12 '05

re: Query for finding similar adress records in database


On Thu, 20 Nov 2003 12:12:32 GMT, Antitax <cnb@carrefuts.com> wrote:

The SOUNDEX function is often used to group similar-sounding records.
I often find it not discriminating enough. I'm using the
Ratcliff-Obershelp algorithm. It has its own problems, but it works
better in my situation.

Information about both can be found by your favorite search engine.

-Tom.

[color=blue]
>I have a database with more than 800 adress records
>
>Some of the are similar because some letters in the street adress for
>example are not identical, altough they point to the same adress.
>
>Does anyone have the syntax for a query that shows similar adresses
>thar could be duplicates?
>
>Thank you[/color]

David W. Fenton
Guest
 
Posts: n/a
#3: Nov 12 '05

re: Query for finding similar adress records in database


tom7744@no.spam.cox.net (Tom van Stiphout) wrote in
<ifhprvstglghj62cevfig28i844g0mrej7@4ax.com>:
[color=blue]
>On Thu, 20 Nov 2003 12:12:32 GMT, Antitax <cnb@carrefuts.com>
>wrote:
>
>The SOUNDEX function is often used to group similar-sounding
>records. I often find it not discriminating enough. I'm using the
>Ratcliff-Obershelp algorithm. It has its own problems, but it
>works better in my situation.
>
>Information about both can be found by your favorite search
>engine.[/color]

Soundex (and related functions) are designed for finding
similar-sounding words. An address is a collection of numbers and
words, rather than a single word, and therefore not that
susceptible to comparison with soundalike functions.

When I really need to do this kind of thing (I once had to de-dupe
250K records against 100K existing records with a likely overlap of
something on the order of 1/3), I process the address field into a
numeric part and a text part and then work with those individually.
It's also helpful to process the commonly occuring words into a
common format (Street, Str, St, etc.).

I've also used a function of my own that works on the same
principle as Simil (Levenshtein) functions, which return a number
that shows how close two strings are to each other (it's useful to
think of it as percentage of overlap, but that's not precisely how
Simil functions calculate it).

The best way to insure the ability to de-dupe on addresses (or to
group on addresses), is to have your data entry process do the
regularization. There are a couple of methods for this:

1. in the AfterUpdate, scan the input address and regularize
abbreviations, etc.

2. use multiple fields to collect the information in a structured
fashion, and then write to multiple fields or write the result to a
single field.

Some combination of the two can be used also.

The problem is, of course, that no existing data is that clean, so
unless you have CASS-certified data (i.e., formatted according to
US postal regulations by being pre-processed for regularity), you
are really working towards a future in which all the data has been
entered via your UI.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Closed Thread