473,473 Members | 1,589 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

Dedupe Data in a Table

2 New Member
Hi,

Im trying to dedupe about 30,000 records in a access database, i have done the simple dedupe wizard which has worked, but as you know it only does exact matching dedupes, does anyone , know of any access add ins i can use to help like fuzzy logic, The records are just simple company name, address, account number, name etc fields,

What would be the best way to dedupe this data, and i need to do this on a regular basis, This task has jsut been given to me as a add on to my main role, and i dnt know where to start, i have basic knowledge of access,

Is there a algorithem that i can use, i dnt know where to start when it comes to sql code.

If someone could give me a step by step guide on what to do that would be great.

I look forward to your response.
Jun 11 '07 #1
8 11182
NeoPa
32,556 Recognized Expert Moderator MVP
Please remember to provide a meaningful Title for any threads started (Please Use Appropriate Titles for New Threads!). This helps to ensure that other members, and also the general public, will have a better chance of finding answers to any similar questions.

MODERATOR.
Jun 11 '07 #2
NeoPa
32,556 Recognized Expert Moderator MVP
Are you asking about matching companies that are the same but some muppet has entered them differently into the database?
This is one of the reasons why data cleanliness is so critical in a database. Someone needs to explain how important that is to the management. You are NOT gonna have any sort of easy task sorting this out I'm afraid.
Jun 11 '07 #3
NeoPa
32,556 Recognized Expert Moderator MVP
What I did, was write a function that stripped out all the common similarities in names, like & Sons etc, and used that to search for duplicates. It's a very long way from perfect though.
Jun 11 '07 #4
Cars
2 New Member
What I did, was write a function that stripped out all the common similarities in names, like & Sons etc, and used that to search for duplicates. It's a very long way from perfect though.
Thanks,

IS there a better way of doing it, can access pick up the following duplicate,

ABC Ltd NN1 2RE 1007
ACB Ltd NN1 2RD 1070

Is there any ADD IN software packages that will work,

Thanks
Jun 12 '07 #5
NeoPa
32,556 Recognized Expert Moderator MVP
If it could do it then you would not want it to (IMHO). This is a very dangerous thing to do automatically. This sounds like the sort of request you would get from a boss whose understanding of database issues is very limited. Without intelligent interaction, this process could very well remove perfectly valid similar entries, leaving you with bigger problems than you started with.
To answer your question :
I'm not aware of any software that allows you to do what you ask.
Sorry I couldn't be more help :(
Jun 12 '07 #6
i3ose
1 New Member
cCad="HOUSE MÏLLIÉ S.A."

- Remove accents, double spaces, dots, ... (use regular expressions)

cCad="HOUSE MILLE SA"

- Remove "stop words" like SA,GMBH...

cCad="HOUSE MILLE"

For each Word in cCad create Trigrams:
HOUSE={%HO,HOU,OUS,USE,USE#}
MILLE={%MI,MIL,ILL,LLE,LE#}

Save this trigrams in a hastable structure:

Hashtable1:
%MI is in record 1,7,9,...

Hashtable1.add("%MI",{1,7,9})

Hashtable2
Record 1 has this trigrams

Hashtable2.add(1,{%HO,HOU,OUS...})


When you have this structure ...
You know that record number 1,7,9 have the same trigram but is not necessary to compare record 1 with 7 and 9 and....

Records to compare?

for each trigram in record 1 complete a new hashtable with record and number of same trigrams

start with first record, record 1

Hashtable3
Record 7 has the same 7 trigrams, record 9 has 1 same trigram

7,7
9,1

for each element in hashtable 3 you only need compare records with thresold>0,667 for example

record 1 has 10 trigrams ={%HO,HOU,OUS,USE,USE#,%MI,MIL,ILL,LLE,LE#}

record 7 has 7 same trigrams with record 1, then: thresold=7/10=0.7>0.667
you must to compare record 1 and 7

reocord 9 have 1 same trigram with record 1, then: thresold=1/10=0.1<0.667
you don't must to compare this record with 1.

To compare two record you can use this formula:

Similarity=2C/(n+m)
Where,
C=number of common trigrams in two records
n=length cCad record 1
m=length cCad record 2

if Similarity, for example is >0.8 is possible that two records are the same...

Use the same hashtable structure with street,city...
Jul 15 '08 #7
FishVal
2,653 Recognized Expert Specialist
Hello, all.

Q-gramm method described by i3ose is rather simple and effective one.
Though not the only one of course.

For starters you may read this article and consider what algorithm or maybe combination of such will be suitable for your particular task.

Regards,
Fish
Jul 15 '08 #8
Big X
42 New Member
There is data cleansing software called winpure if you want to wack that into google you can download a demo and see how it goes. It doesn't take to long to go through its tutorials and work out how to bring up duplicates. Its is also very good for removing bad data like numbers on the end of a company name etc. Let me know how you go with it.
Jul 16 '08 #9

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Kevin Myers | last post by:
Hello, I'm an experienced application developer in some languages (including various SQL dialects), but have very little experience with MS Access or VBA, and am having trouble figuring out how...
8
by: Woody Splawn | last post by:
Lets say I have a winform that is populated with a dataset. The dataset and data table may have several rows in it. Lets say I am looking at the winform and I want to assign a value to a certain...
2
by: Steve Stover | last post by:
I want to use the caching API in .net to store data. The data would be stored in a data table class that has approx. 10 columns and 71 rows. What I need to know is this: According to Microsoft's...
2
by: Brian Mitchell | last post by:
Ok, I know this is an elementary question but I have a data grid that is bound to a data table and I can't seem to find a way to match the selected row in the grid with it's respective row in the...
1
by: Vasilis X | last post by:
Here is the question : I have a data table, UnShorted, which has a data column EventTime (type : date time) and a data column Values (type : single). I want to create a table, Shorted, that...
1
by: John | last post by:
Hi When using Table Adapter Configuration Wizard if 'Use SQL Statements' is selected as Command Type, the data table's name in dataset is retained and only its data adapter's select statements...
1
by: DH | last post by:
I have an untyped dataset with a table. I am trying to programmatically add a row to this table. This was working in VS 2003 / .net 1.1 I am receiving an error "Object reference not set to an...
1
by: laredotornado | last post by:
Hi, I have a data table on my page (buried amidst other images and extraneous text). I would like to spawn a new window that automatically prints the content of my data table, and only that...
4
by: indona | last post by:
hi, i have to enter data from a delimited file into sqlserver database table. i have been able to delimit the file and read the data into a data table, now i want enter the data table contents to...
1
by: BaseballGraphs | last post by:
Hello, I am trying to divide one value from my data table with an associated ID by another value from my data table with a different ID for the same day that the value was added to the data table....
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
1
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
1
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.