Bytes IT Community

Dedupe Data in a Table

P: 2
Hi,

I'm trying to dedupe about 30,000 records in an Access database. I have run the simple dedupe wizard, which worked, but as you know it only catches exact-match duplicates. Does anyone know of any Access add-ins I could use for something like fuzzy matching? The records are just simple company name, address, account number, name etc. fields.

What would be the best way to dedupe this data? I will also need to do it on a regular basis. This task has just been given to me as an add-on to my main role and I don't know where to start; I only have a basic knowledge of Access.

Is there an algorithm I can use? I don't know where to begin when it comes to SQL code.

If someone could give me a step-by-step guide on what to do, that would be great.

I look forward to your response.
Jun 11 '07 #1
8 Replies


NeoPa
Expert Mod 15k+
P: 31,186
Please remember to provide a meaningful Title for any threads started (Please Use Appropriate Titles for New Threads!). This helps to ensure that other members, and also the general public, will have a better chance of finding answers to any similar questions.

MODERATOR.
Jun 11 '07 #2

NeoPa
Expert Mod 15k+
P: 31,186
Are you asking about matching companies that are the same but some muppet has entered them differently into the database?
This is one of the reasons why data cleanliness is so critical in a database. Someone needs to explain to the management how important that is. You are NOT going to have any sort of easy task sorting this out, I'm afraid.
Jun 11 '07 #3

NeoPa
Expert Mod 15k+
P: 31,186
What I did was write a function that stripped out all the common elements in names, like "& Sons" etc., and then used the stripped names to search for duplicates. It's a very long way from perfect though.
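For example, in Python a stripping function of that kind might look something like the sketch below. This is not the actual function referred to above; the suffix list and the name strip_common_parts are purely illustrative, and in Access you would port the same idea to a VBA function.

import re

# Common endings that often differ between otherwise-identical company names.
# This list is purely illustrative - tailor it to your own data.
COMMON_SUFFIXES = ["& SONS", "AND SONS", "LIMITED", "LTD", "PLC"]

def strip_common_parts(name):
    """Upper-case a name and naively strip punctuation and common suffixes."""
    cleaned = re.sub(r"[^A-Z0-9& ]+", " ", name.upper())
    for suffix in COMMON_SUFFIXES:
        cleaned = cleaned.replace(suffix, " ")   # crude substring removal
    return " ".join(cleaned.split())             # collapse whitespace

print(strip_common_parts("A.B.C. & Sons Ltd"))   # -> "A B C"

As noted above, this is far from perfect: it is plain substring replacement, so it can mangle names that happen to contain one of the suffixes.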
Jun 11 '07 #4

P: 2
What I did was write a function that stripped out all the common elements in names, like "& Sons" etc., and then used the stripped names to search for duplicates. It's a very long way from perfect though.
Thanks,

Is there a better way of doing it? Can Access pick up a duplicate like the following?

ABC Ltd NN1 2RE 1007
ACB Ltd NN1 2RD 1070

Are there any add-in software packages that will work?

Thanks
Jun 12 '07 #5

NeoPa
Expert Mod 15k+
P: 31,186
If it could do it then you would not want it to (IMHO). This is a very dangerous thing to do automatically. This sounds like the sort of request you would get from a boss whose understanding of database issues is very limited. Without intelligent interaction, this process could very well remove perfectly valid similar entries, leaving you with bigger problems than you started with.
To answer your question:
I'm not aware of any software that allows you to do what you ask.
Sorry I couldn't be more help :(
Jun 12 '07 #6

P: 1
cCad="HOUSE MÏLLIÉ S.A."

- Remove accents, double spaces, dots, ... (use regular expressions)

cCad="HOUSE MILLE SA"

- Remove "stop words" like SA,GMBH...

cCad="HOUSE MILLE"

For each word in cCad, create trigrams (padding each word with % at the start and # at the end):

HOUSE = {%HO, HOU, OUS, USE, SE#}
MILLE = {%MI, MIL, ILL, LLE, LE#}
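A short Python sketch of this trigram step (the % and # padding characters come from the example above; the function names are just illustrative):

def trigrams(word):
    """Return the set of trigrams of a word, padded with % and #."""
    padded = "%" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def record_trigrams(normalised_name):
    """Union of the trigrams of every word in a normalised name."""
    grams = set()
    for word in normalised_name.split():
        grams |= trigrams(word)
    return grams

print(sorted(trigrams("HOUSE")))   # -> ['%HO', 'HOU', 'OUS', 'SE#', 'USE']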

Save these trigrams in two hashtable structures:

Hashtable1 maps each trigram to the records that contain it:
%MI is in records 1, 7, 9, ...

Hashtable1.add("%MI", {1, 7, 9})

Hashtable2 maps each record to its trigrams:
Record 1 has these trigrams:

Hashtable2.add(1, {%HO, HOU, OUS, ...})
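In Python the two tables could be built roughly like this (a sketch only: records is assumed to be a dictionary of record id to already-normalised name, and record_trigrams is the helper sketched above):

from collections import defaultdict

def build_indexes(records):
    """records: dict {record_id: normalised_name}.
    Returns (trigram -> record ids, record id -> trigrams)."""
    trigram_to_records = defaultdict(set)   # plays the role of Hashtable1
    record_to_trigrams = {}                 # plays the role of Hashtable2
    for rec_id, name in records.items():
        grams = record_trigrams(name)       # from the previous sketch
        record_to_trigrams[rec_id] = grams
        for g in grams:
            trigram_to_records[g].add(rec_id)
    return trigram_to_records, record_to_trigrams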


Once you have this structure, you know that records 1, 7 and 9 share a trigram, but it is not necessary to compare record 1 with every other record in the table.

Which records need to be compared?

For each trigram in record 1, fill a new hashtable with the other record ids and the number of trigrams they share with record 1.

Start with the first record, record 1:

Hashtable3
Record 7 shares 7 trigrams with record 1 and record 9 shares 1 trigram:

7, 7
9, 1

For each element in Hashtable3 you then only need to compare records whose ratio of shared trigrams exceeds a threshold, say 0.667.

Record 1 has 10 trigrams = {%HO, HOU, OUS, USE, SE#, %MI, MIL, ILL, LLE, LE#}

Record 7 shares 7 trigrams with record 1, so: ratio = 7/10 = 0.7 > 0.667
You must compare records 1 and 7.

Record 9 shares 1 trigram with record 1, so: ratio = 1/10 = 0.1 < 0.667
You do not need to compare this record with record 1.
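A sketch of that candidate-selection step in Python, reusing the two indexes from the previous sketch (the 0.667 cut-off is just the example value used above, not a fixed constant):

from collections import defaultdict

def candidates_for(rec_id, trigram_to_records, record_to_trigrams, cutoff=0.667):
    """Return the ids of records that share enough trigrams with rec_id."""
    my_grams = record_to_trigrams[rec_id]
    shared = defaultdict(int)               # plays the role of Hashtable3
    for g in my_grams:
        for other in trigram_to_records[g]:
            if other != rec_id:
                shared[other] += 1
    # Keep only the records whose share of rec_id's trigrams exceeds the cut-off.
    return [other for other, count in shared.items()
            if count / len(my_grams) > cutoff]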

To compare two records you can use this formula:

Similarity = 2C / (n + m)

Where:
C = number of trigrams the two records have in common
n = number of trigrams in record 1
m = number of trigrams in record 2

If the similarity is, for example, > 0.8, it is possible that the two records are the same...

Use the same hashtable structure with street, city, ...
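Finally, a Python sketch of the similarity formula together with some hypothetical arithmetic (the trigram counts in the comment and the usage at the end are illustrative and assume the helpers from the earlier sketches):

def similarity(grams_a, grams_b):
    """Dice-style similarity 2C / (n + m) over two trigram sets."""
    common = len(grams_a & grams_b)
    return 2 * common / (len(grams_a) + len(grams_b))

# Hypothetical counts: if record 1 had 10 trigrams, record 7 had 9 and they
# shared 7, then similarity = 2*7 / (10+9) = 14/19, about 0.74.

# Putting the sketches together (relies on normalise, build_indexes, candidates_for):
recs = {1: normalise("HOUSE MÏLLÉ S.A."), 7: normalise("HOUSE MILLE LTD")}
t2r, r2t = build_indexes(recs)
for other in candidates_for(1, t2r, r2t):
    print(other, round(similarity(r2t[1], r2t[other]), 2))   # -> 7 1.0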
Jul 15 '08 #7

FishVal
Expert 2.5K+
P: 2,653
Hello, all.

The q-gram method described by i3ose is a rather simple and effective one, though not the only one of course.

For starters you may read this article and consider which algorithm, or perhaps which combination of them, will suit your particular task.

Regards,
Fish
Jul 15 '08 #8

P: 42
There is data-cleansing software called WinPure; if you want to whack that into Google you can download a demo and see how it goes. It doesn't take too long to go through its tutorials and work out how to bring up duplicates. It is also very good at removing bad data, like numbers on the end of a company name etc. Let me know how you go with it.
Jul 16 '08 #9
