By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
449,277 Members | 1,172 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 449,277 IT Pros & Developers. It's quick & easy.

Query for finding similar adress records in database

P: n/a
I have a database with more than 800 adress records

Some of the are similar because some letters in the street adress for
example are not identical, altough they point to the same adress.

Does anyone have the syntax for a query that shows similar adresses
thar could be duplicates?

Thank you
Nov 12 '05 #1
Share this Question
Share on Google+
2 Replies


P: n/a
On Thu, 20 Nov 2003 12:12:32 GMT, Antitax <cn*@carrefuts.com> wrote:

The SOUNDEX function is often used to group similar-sounding records.
I often find it not discriminating enough. I'm using the
Ratcliff-Obershelp algorithm. It has its own problems, but it works
better in my situation.

Information about both can be found by your favorite search engine.

-Tom.

I have a database with more than 800 adress records

Some of the are similar because some letters in the street adress for
example are not identical, altough they point to the same adress.

Does anyone have the syntax for a query that shows similar adresses
thar could be duplicates?

Thank you


Nov 12 '05 #2

P: n/a
to*****@no.spam.cox.net (Tom van Stiphout) wrote in
<if********************************@4ax.com>:
On Thu, 20 Nov 2003 12:12:32 GMT, Antitax <cn*@carrefuts.com>
wrote:

The SOUNDEX function is often used to group similar-sounding
records. I often find it not discriminating enough. I'm using the
Ratcliff-Obershelp algorithm. It has its own problems, but it
works better in my situation.

Information about both can be found by your favorite search
engine.


Soundex (and related functions) are designed for finding
similar-sounding words. An address is a collection of numbers and
words, rather than a single word, and therefore not that
susceptible to comparison with soundalike functions.

When I really need to do this kind of thing (I once had to de-dupe
250K records against 100K existing records with a likely overlap of
something on the order of 1/3), I process the address field into a
numeric part and a text part and then work with those individually.
It's also helpful to process the commonly occuring words into a
common format (Street, Str, St, etc.).

I've also used a function of my own that works on the same
principle as Simil (Levenshtein) functions, which return a number
that shows how close two strings are to each other (it's useful to
think of it as percentage of overlap, but that's not precisely how
Simil functions calculate it).

The best way to insure the ability to de-dupe on addresses (or to
group on addresses), is to have your data entry process do the
regularization. There are a couple of methods for this:

1. in the AfterUpdate, scan the input address and regularize
abbreviations, etc.

2. use multiple fields to collect the information in a structured
fashion, and then write to multiple fields or write the result to a
single field.

Some combination of the two can be used also.

The problem is, of course, that no existing data is that clean, so
unless you have CASS-certified data (i.e., formatted according to
US postal regulations by being pre-processed for regularity), you
are really working towards a future in which all the data has been
entered via your UI.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.