
delete duplicates after import

New member here, impressed with what I've found searching so far...

So every day, data will be imported into a table named for the month (e.g. 10_2011)... however, this data WILL contain duplicates, which I need to get rid of. I am breaking the data up into daily imports into monthly tables because each day's import runs 500-700 rows, and I wanted to keep the tables a manageable size. I'm not married to this logic, so if someone has a better idea, please share!

I have used snippets of code I found here, but I can't seem to make it work... If someone could help me, that would be great. I'm not a board-certified programmer, but I do know enough to break stuff!

  1. Private Sub cmdImport_Click()
  2.     Dim strDay As String
  3.     Dim strMonth As String
  4.     Dim strFilePath As String
  5.     Dim xlobj As Object                'Excel workbook, late bound
  6.     Dim wsobj As Object                'worksheet within it
  7.     Dim rngobj As Object               'used range on that sheet
  8.     Dim strDataRange As String
  9.  
  10.     strDay = Format(Now(), "mm" & "." & "dd" & "." & "yyyy")
  11.     strMonth = Format(Now(), "mm" & "_" & "yyyy")
  12.  
  13.     strFilePath = "<hidden>" & strDay & ".xls"
  14.  
  15.     Set xlobj = GetObject(strFilePath)
  16.     Set wsobj = xlobj.Worksheets("Sheet1")
  17.     Set rngobj = wsobj.UsedRange
  18.     strDataRange = "DataRange": rngobj.Name = strDataRange   'name the used range
  19.     xlobj.Save                         'save so the new name is in the file on disk
  20.     DoCmd.SetWarnings False
  21.     DoCmd.TransferSpreadsheet acImport, 8, strMonth, _
  22.         strFilePath, True, strDataRange   '8 = acSpreadsheetTypeExcel9 (.xls)
  23.     DoCmd.SetWarnings True
  24.  
  25.     xlobj.Close False                  'close the workbook GetObject opened
  26.     Set xlobj = Nothing
  27.  
  28.  
  29. 'routine for deleting Dups created EVERY DAY when excel is imported
  30.     Dim db As DAO.Database
  31.     Dim rs As DAO.Recordset
  32.     Dim strAccount As String
  33.  
  34.     Set db = CurrentDb
  35.     Set rs = db.OpenRecordset(strMonth)
  36.  
  37. StartFile:
  38.     rs.MoveFirst
  39.     Do Until rs.EOF                    'scan for the first record not yet processed
  40.         If rs!Temp = False Then Exit Do
  41.         rs.MoveNext
  42.     Loop
  43.     If rs.EOF Then GoTo EndFile        'every record processed: finished
  44.  
  45.     strAccount = rs!Account_No         'remember this record's key
  46.     rs.Edit
  47.     rs!Temp = True                     'mark it as processed
  48.     rs.Update
  49.     rs.MoveNext
  50.  
  51.     Do Until rs.EOF                    'delete every later record with the same key
  52.         If rs!Account_No = strAccount Then
  53.             rs.Delete
  54.         End If
  55.         rs.MoveNext
  56.     Loop
  57.  
  58.     GoTo StartFile                     'repeat for the next unprocessed record
  59. EndFile:
  60.  
  61.  'remove temp field from table
  62.  
  63.  
  64.     rs.Close
  65.     Set rs = Nothing
  66.     Set db = Nothing
  67.  
  68. End Sub
  69.  
Oct 4 '11 #1
patjones
931 Expert 512MB
In regard to your data structure, I probably would not keep so many tables around that hold exactly the same kind of data. Depending on how much information each record holds, Access can comfortably handle tens of thousands of records in a single table, or more. This is a digression from the point of your question, though.

I want to start by asking what the Temp column holds, and what you are looking to accomplish by looping through the recordset until rs!Temp is false (lines 38 through 45)?

I see at least one issue further on in the code, but if you can answer this first then we can move on...

Pat
Oct 4 '11 #2
That was a code snippet that I "borrowed" from someone else... If there is a better procedure for looping through the data I would love to see it!

I am very much a beginner, so I could totally rip out that part of the code. As you can see, the procedure runs the import and then is supposed to check for and delete the dups.
Oct 4 '11 #3
patjones
931 Expert 512MB
Generally, a copy and paste of code from some other source isn't successful without some degree of modification to fit the code to your particular situation.

So I go back to my previous questions: what is the purpose of the Temp column and why are you looping through the recordset looking for rs!Temp = False?

I'm asking because it's not clear to me how that particular section of code is related to checking for duplicates in your table, so you need to be able to explain how it's relevant.
Oct 4 '11 #4
NeoPa
32,556 Expert Mod 16PB
My first advice would be to avoid the idea of multiple tables for similar data. You're not working with spreadsheets here. You're working with a database, and that requires some different thinking. A field that reflects the month will enable you to store all the data and report on it easily by month or otherwise. Filtering by month gives you the same result as multiple tables without all the restrictions and mess that go along with that approach.

As far as handling duplicates is concerned I would decide first what you understand by a duplicate. That's an important first step, without which all the rest of it is fairly meaningless.

With that new understanding in mind, I would proceed on the following basis (a rough sketch follows the list):
  1. Import the day's data into a special import table.
  2. Remove any duplicates from the new data (If this involves comparing with previous days' data from the same month then so be it).
  3. Report on or log any removals or updates if that fits your requirements.
  4. Append the newly imported and tidied data into the main table, ensuring the month field is set appropriately.
  5. Clear down the temporary data from the daily import table for the next run.
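To make that concrete, the sketch below covers steps 1, 4 and 5 in VBA. The table names (tblDailyImport, tblMain), the field names and the file path are placeholders rather than anything from your database, so substitute your own:

    Private Sub cmdDailyImport_Click()
        ' 1. Import the day's spreadsheet into a staging table.
        DoCmd.TransferSpreadsheet acImport, 8, "tblDailyImport", _
            "C:\Imports\Daily.xls", True

        ' ...steps 2 and 3: de-duplicate the staging table and log removals...

        ' 4. Append the tidied rows to the main table, stamping each with the month.
        CurrentDb.Execute _
            "INSERT INTO tblMain (Account_No, JobDate, ImportMonth) " & _
            "SELECT Account_No, JobDate, Format(Date(), 'mm_yyyy') " & _
            "FROM tblDailyImport;", dbFailOnError

        ' 5. Clear the staging table ready for the next run.
        CurrentDb.Execute "DELETE FROM tblDailyImport;", dbFailOnError
    End Sub

The month stamp is what lets one table replace the monthly tables: filtering on ImportMonth gives you exactly what a 10_2011 table would have held.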
Oct 5 '11 #5
NeoPa
32,556 Expert Mod 16PB
Another point worth understanding so you can avoid it in future, as it's surprisingly common among less experienced coders: it is never necessary to concatenate literal string values together for the benefit of the code. Take your line #10:
    strDay = Format(Now(), "mm" & "." & "dd" & "." & "yyyy")
There is absolutely no reason why this should not be written much more simply as:
    strDay = Format(Date(), "mm.dd.yyyy")
Of course there are many reasons why it should be done that way.

NB. Using Now() instead of Date() for date values is another mistake liable to cause confusion (and well worth avoiding).
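A quick illustration of the difference (the printed values are only examples):

    Debug.Print Now()            ' e.g. 10/05/2011 2:37:14 PM - includes a time portion
    Debug.Print Date             ' e.g. 10/05/2011            - the date alone
    Debug.Print (Now() = Date)   ' False, except at exactly midnight

Any comparison of a date-only value against Now() inherits that stray time portion, which is where the confusion tends to creep in.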
Oct 5 '11 #6
The thing that is hard for me to wrap my head around is using 1 table for ALL the daily imports forever more... We're talking over 500 rows of data per day. In a month that would be over 10,000 rows, meaning over 120,000 rows per year! That seems like it would be way too big for Access to handle. Am I wrong in thinking that?

So the daily order of operations as you see it would be: import data into a "temp" table, compare the temp data to the current "master" data table, make duplicate corrections and any data cleanup, append the data from the temp table to the master table, and clear the temp table to be used the next day. Correct?
Oct 5 '11 #7
NeoPa
32,556 Expert Mod 16PB
jroycestone:
The thing that is hard for me to wrap my head around is using 1 table for ALL the daily imports forever more... We're talking over 500 rows of data per day. In a month that would be over 10,000 rows, meaning over 120,000 rows per year! That seems like it would be way too big for Access to handle. Am I wrong in thinking that?
Yes. Quite wrong. Certainly there are limits to the capacity of an Access database, but these limits are not avoided by managing a set amount of data in multiple tables. Quite the contrary, as each table has its own overheads. You will be able to manage more data in a single table than in multiple ones.

That said, if the data is vast and is never cleared down (increasing forever) it may well be wise to consider another database engine within which to store your data. It can still be accessed from an Access front-end database, but the size limitations would be less constraining. Something for you to consider.

jroycestone:
So the daily order of operations as you see it would be: import data into a "temp" table, compare the temp data to the current "master" data table, make duplicate corrections and any data cleanup, append the data from the temp table to the master table, and clear the temp table to be used the next day. Correct?
If you reread what I posted you will see I didn't say the highlighted bit (comparing the temp data to the current master table) as you have it. That may be the approach, but that depends on your definition of duplicate data. It may well be that you only need to check for duplicates within the new Temp data. Otherwise it sounds like you have the idea pretty well.
Oct 5 '11 #8
patjones
931 Expert 512MB
There is some dependence on what kind of data the table is going to hold, but it's not inconceivable for an Access table to store in excess of 100,000 records. I have one database currently in excess of 30,000 unique records, and there really hasn't been a performance hit so far.

If you have the ability to put your data on a server and link your Access front end to it, you can go into millions of records.

At some point you will probably want to consider archiving the data somewhere else; but for the sake of proper database design you should use one table and add a column for month.

The overall process as you have explained it is pretty much how you would want to do things. I recently completed an application that allows the user to search for a spreadsheet using a file-picker dialog, import the selected file into a temporary table, remove duplicates, and commit the clean data to the permanent table... all with a couple of button clicks. The import is a little slow on account of the spreadsheets containing anywhere from 10,000 records up, but it works.
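A bare-bones version of that picker-plus-import step might look like the sketch below. It assumes Access 2002 or later (for FileDialog), and the staging table name tblTemp is just a placeholder:

    Private Sub cmdPickAndImport_Click()
        Dim dlg As Object                     ' Office FileDialog, late bound

        Set dlg = Application.FileDialog(3)   ' 3 = msoFileDialogFilePicker
        With dlg
            .AllowMultiSelect = False
            .Filters.Clear
            .Filters.Add "Excel workbooks", "*.xls"
            If .Show Then                     ' returns -1 when the user picks a file
                ' Import the chosen workbook into the staging table.
                DoCmd.TransferSpreadsheet acImport, 8, "tblTemp", _
                    .SelectedItems(1), True
            End If
        End With
    End Sub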
Oct 5 '11 #9
So here is my "tentative" plan... have one table to store all the data, and archive quarterly to keep the table size somewhat limited.

Now, the duplicate compare is needed as I described it. I'll explain a little more. The spreadsheet that gets imported daily has scheduled jobs for today and for dates in the future. So each day's import will contain jobs for tomorrow and beyond, and those "jobs" will continue to appear in the spreadsheet every day until they are out of our queue. Hopefully that explains how the data carry over from day to day.
Oct 5 '11 #10
NeoPa
32,556 Expert Mod 16PB
That still leaves out much of the information needed to develop logic that reflects your requirements. Let's see what we can be getting on with anyway, though.

Your post would indicate to me that you need to ensure that any job in the [Temp] table doesn't clash with any other job in the same table, as well as any job already existing in the main table. How do you recognise which of the duplicates is required? It seems you may be in a position where a job imported today has more up-to-date information with it than the same job imported on a previous day. In that case you'd need to remove the record from the main table. Somewhat different from simply tidying up the data in the [Temp] table, but can still be done.
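If it does come to removing superseded records from the main table, one way is a delete query keyed on whatever field(s) define a duplicate for you. In the sketch below, tblMaster and tblTemp are placeholder names and Account_No stands in for your real key:

    ' Drop any main-table record that has a fresher copy in the Temp table;
    ' the tidied Temp rows can then be appended without clashing.
    CurrentDb.Execute _
        "DELETE FROM tblMaster WHERE EXISTS " & _
        "(SELECT 1 FROM tblTemp " & _
        "WHERE tblTemp.Account_No = tblMaster.Account_No);", dbFailOnError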
Oct 5 '11 #11
Jeeez, you must be experienced with data, as I didn't think of that either!

Yes, I would need the latest "version" of a row of data to overwrite the old version. There is no possible way there would be duplicates within the [Temp] table itself; they only arise when comparing data from the [Temp] table against the [Master] table.

There are multiple fields that need to be checked to make sure the data are truly duplicate; Account_No and OrderDate are the two main ones.
Oct 5 '11 #12
NeoPa
32,556 Expert Mod 16PB
So, the [Temp] table never needs to be checked for internal consistency, but it may contain new records as well as updates for existing ones.

Jet SQL has no facility for ...

WOW!! I just learned something new and very surprising when I researched this. It turns out Jet SQL can and does handle updating and appending in the same UPDATE query. That makes the suggestion somewhat simpler :

Create an UPDATE query that specifies all fields except any AutoNumbers (those need to be left to the automatic process and are unlikely to match any previous data anyway). A link I found on how to do this is ACC2000: How to Update and Append Records in One Update Query.
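Applied to the key fields mentioned above, the query from that article would look something like the sketch below. tblMaster and tblTemp are placeholder table names, and JobStatus stands in for whatever non-key fields you carry:

    Dim strSQL As String

    ' Rows matching on both key fields are overwritten with the latest import;
    ' Temp rows with no match in tblMaster are appended as brand-new records.
    strSQL = "UPDATE tblMaster RIGHT JOIN tblTemp " & _
             "ON (tblMaster.Account_No = tblTemp.Account_No) " & _
             "AND (tblMaster.OrderDate = tblTemp.OrderDate) " & _
             "SET tblMaster.Account_No = tblTemp.Account_No, " & _
             "tblMaster.OrderDate = tblTemp.OrderDate, " & _
             "tblMaster.JobStatus = tblTemp.JobStatus;"

    CurrentDb.Execute strSQL, dbFailOnError

An update query can't be undone, so it's worth backing the table up before the first trial run.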
Oct 5 '11 #13
Awesome... that seems to work perfectly! Thank you so much for your help NeoPa!!!!!!
Oct 5 '11 #14
NeoPa
32,556 Expert Mod 16PB
Thank you for the question. I've always worked on the assumption that this facility was non-existent. It's a pretty handy thing to know about and I'm very pleased to have discovered it :-)
Oct 5 '11 #15
