473,763 Members | 1,373 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Delete duplicate records from huge tables

Hi All,

I've the following table with a PK defined on an IDENTITY column
(INSERT_SEQ):

CREATE TABLE MYDATA (
MID NUMERIC(19,0) NOT NULL,
MYVALUE FLOAT NOT NULL,
TIMEKEY INTEGER NOT NULL,
TIMEKEY_DTTM DATETIME NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) IDENTITY(1,1) NOT NULL
)
GO

ALTER TABLE MYDATA
ADD CONSTRAINT PK_MYDATA
PRIMARY KEY (INSERT_SEQ)
GO

The TIMEKEY_DTTM field is generated, from the value actually inserted
into the
TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO

There is also a composite, non unique, index defined on the
tuple:(MID,IID, TIMEKEY,EID)

CREATE INDEX IX_METDATA ON MYDATA (MID,IID,TIMEKE Y,EID)
GO

As a consequence of an application design change, I would also change
this index to be UNIQUE, but when I try to drop and create it I get an
error, because the tables stores some duplicated rows...

In order to succesfully upgrade the index definition, I wrote some DML
staements
to lookup and remove the duplicated rows, keeping only the first
record inserted, i.e. the one with the lowest INSERT_SEQ:

--
-- This table stores then umber of duplicated records eventually
discovered
-- into the MYDATA table; the initial value for the NUM_DUPLICATES
field is
-- 0 (no duplicated record)
--
DROP TABLE DUPLICATES
GO
CREATE TABLE DUPLICATES (
TABLENAME VARCHAR(17),
NUM_DUPLICATES NUMERIC(19,0) )
GO
INSERT INTO DUPLICATES VALUES ('MYDATA',0)
GO
INSERT INTO DUPLICATES VALUES ('CATEGORIESDAT A',0)
GO
--
-- ///////// CLEAN UP OF MYDATA TABLE
--
DROP TABLE TMP_MYDATA
GO
CREATE TABLE TMP_MYDATA (
MID NUMERIC(19,0) NOT NULL,
TIMEKEY INTEGER NOT NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) )
GO
--
-- Insert into the TMP_MYDATA table all the duplicated records for
-- the tuple (MID,IID,TIMEKE Y,EID) and NULL for the INSERT_SEQ field
--
INSERT INTO TMP_MYDATA (MID,IID,TIMEKE Y,EID)
SELECT MID,IID,TIMEKEY ,EID
FROM MYDATA
GROUP BY MID,IID,TIMEKEY ,EID
HAVING COUNT(*)>1
GO
--
-- Updates the INSERT_SEQ field to the lowest value in the group
-- of duplicated records
--
UPDATE TMP_MYDATA
SET TMP_MYDATA.INSE RT_SEQ = (
SELECT MIN(INSERT_SEQ)
FROM MYDATA
WHERE TMP_MYDATA.MID = MYDATA.MID AND
TMP_MYDATA.IID = MYDATA.IID AND
TMP_MYDATA.TIME KEY = MYDATA.TIMEKEY AND
TMP_MYDATA.EID = MYDATA.EID )
GO
--
-- Updates the value of NUM_DUPLICATES for the MYDATA table.
--
UPDATE DUPLICATES
SET NUM_DUPLICATES = (SELECT COUNT(*) FROM TMP_MYDATA)
WHERE TABLENAME = 'MYDATA'
GO
--
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
records
-- stored into the NUM_DUPLICATES field of the DUPLICATES table for
-- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIME KEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_S EQ > TMP_MYDATA.INSE RT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant
on huge tables (>10M records)!
Do you know a better way to achieve the task of removing all the
duplicates records, preserving the lowest INSERT_SEQ betwee the
duplicates and also preserving the sequence seed, so that a new record
inserted at time t1>t0 is enumerated with an INSERT_SEQ|t1 >
max(INSERT_SEQ) |t0 ?

Thanks a lot for your help!

Patrizio

PS. sorry for such a large post!
Jul 20 '05 #1
1 10807
[posted and mailed, per favore rispondere in news]

This reply is coming late, so may be you have resolved the issue yourself,
but here is a suggestion.

Patrizio (pt********@qui po.it) writes:
The TIMEKEY_DTTM field is generated, from the value actually inserted
into the TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO
This looks a little funny. If you insert multiple rows with different
vales in TIMEKEY, then all rows will still get the same value in
TIMEKEY_DTTM.

I have to admit that I didn't see where TIMEKEY_DTTM came into your
problem of getting rid of the duplicates?
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
-- records stored into the NUM_DUPLICATES field of the DUPLICATES table
for -- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIME KEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_S EQ > TMP_MYDATA.INSE RT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant on huge tables (>10M records)!


I don't why this performs bad, but you should probably move your
shortcut to an IF statement:

IF (SELECT num_duplicates FROM duplicates WHERE...) > 0
BEGIN
DELETE ...

The way the query is now, the extra condition may lead the optimizer
astray.

But an alteraitve appraoch could be:

SELECT mid, timekey, iid, eid, min_seq = MIN(insert_seq)
INTO #duplkeys
FROM mydata
GROUP BY mid, timekey, iid, eid
HAVING COUNT(*) > 1

IF EXISTS (SELECT * FROM #duplicates)
BEGIN
SELECT m.insert_seq
INTO #seqs_to_delete
FROM #duplkeys d
JOIN mydata m ON d.mid = m.mid
AND d.timekey = m.timekey
AND d.iid = m.iid
AND d.eid = m.eid
WHERE m.insert_seq > d.insert_seq

DELETE mydata
FROM mydata m
JOIN #seqs_to_delete s ON m.insert_seq = s.insert_seq
END

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
2418
by: Alexander Anderson | last post by:
I have a DELETE statement that deletes duplicate data from a table. It takes a long time to execute, so I thought I'd seek advice here. The structure of the table is little funny. The following is NOT the table, but the representation of the data in the table: +-----------+ | a | b | +-----+-----+ | 123 | 234 | | 345 | 456 |
2
4992
by: ms | last post by:
Access 2000: I am trying to delete duplicate records imported to a staging table leaving one of the duplicates to be imported into the live table. A unique record is based on a composite key of 3 fields (vehicleID, BattID, and ChgHrs). VehicleID and BattID are a TEXT datatype and ChrHrs are a number(long int.) datatype. Since records to be imported can have duplicate records of the composite key I need to clean all but one of the...
4
6163
by: KT | last post by:
Is there any one click solution that would do the trick? I would like to create a button, so the person who maintains the database can perform clean up work to delete duplicate records which contain same information in the ID field and the account number field once a week. Thanks in advance! KT
17
2073
by: (PeteCresswell) | last post by:
I've got apps where you *really* wouldn't want to delete certain items by accident, but the users just have to have a "Delete" button. My current strategies: Plan A: ------------------------------------------------------------------------ 1) Make the cmd button black and do not give it an accelerator key. 2) Issue two levels of confirmation (Do you want to delete... Do you REALLY want
2
5425
by: Radu | last post by:
Hi. I have a "union" table which results of a union of two tables. Occasionally I could have duplicates, when the same PIN has been added to both tables, albeit at different Datees/Times, such as: PIN Name Added Date 100411 A 7/11/2007 10:12:58 AM 100411 A 7/17/2007 10:54:23 AM 100413 B 7/11/2007 10:13:28 AM
3
10947
by: rajeshkrsingh | last post by:
Hi friends, Step1- create table duplicate ( intId int, varName varchar(50) ) insert into duplicate(intId,varName) values(1,'rajesh') insert into duplicate(intId,varName) values(2,'raj12') insert into duplicate(intId,varName) values(1,'rajesh')
2
3816
by: farouqdin | last post by:
Hi all i have code which loops through table and deletes the duplicate records. This code does it for one table. How do i change it so it goes through several tables? On Error Resume Next Dim db As DAO.Database, rst As DAO.Recordset Dim strDupName As String, strSaveName As String Set db = CurrentDb()
6
5937
by: Dilip1983 | last post by:
Hi All, I want to delete duplicate records from a large table. There is one index(INDEX_U1) on 4 columns(col1,col2,col3,col4) which is in unusable state. First of all when i tried to rebuild index it showed error as unique key violation. So i want to delete duplicate records for col1,col2,col3,col4 combination. How can i delete the duplicate records from this large table?
1
4096
watertraveller
by: watertraveller | last post by:
Hi all. My ultimate goal is to return two columns, where no single value appears anywhere twice. This means that not only do I want to check that nothing from column A appears in column B and vice-versa, but I also don't want the same value appearing twice in A and twice in B. So far I have: --Diff the columns INSERT INTO @Table SELECT One, Two FROM @Column1 a FULL OUTER JOIN @Column2 b
0
9563
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10145
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
9998
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9938
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9822
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8822
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7366
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
1
3917
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2793
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.