Delete duplicate records from huge tables

Patrizio

Hi All,

I've the following table with a PK defined on an IDENTITY column
(INSERT_SEQ):

CREATE TABLE MYDATA (
MID NUMERIC(19,0) NOT NULL,
MYVALUE FLOAT NOT NULL,
TIMEKEY INTEGER NOT NULL,
TIMEKEY_DTTM DATETIME NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) IDENTITY(1,1) NOT NULL
)
GO

ALTER TABLE MYDATA
ADD CONSTRAINT PK_MYDATA
PRIMARY KEY (INSERT_SEQ)
GO

The TIMEKEY_DTTM field is generated, from the value actually inserted
into the
TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO

There is also a composite, non unique, index defined on the
tuple:(MID,IID,TIMEKEY,EID)

CREATE INDEX IX_METDATA ON MYDATA (MID,IID,TIMEKEY,EID)
GO

As a consequence of an application design change, I would also change
this index to be UNIQUE, but when I try to drop and create it I get an
error, because the tables stores some duplicated rows...

In order to succesfully upgrade the index definition, I wrote some DML
staements
to lookup and remove the duplicated rows, keeping only the first
record inserted, i.e. the one with the lowest INSERT_SEQ:

--
-- This table stores then umber of duplicated records eventually
discovered
-- into the MYDATA table; the initial value for the NUM_DUPLICATES
field is
-- 0 (no duplicated record)
--
DROP TABLE DUPLICATES
GO
CREATE TABLE DUPLICATES (
TABLENAME VARCHAR(17),
NUM_DUPLICATES NUMERIC(19,0) )
GO
INSERT INTO DUPLICATES VALUES ('MYDATA',0)
GO
INSERT INTO DUPLICATES VALUES ('CATEGORIESDATA',0)
GO
--
-- ///////// CLEAN UP OF MYDATA TABLE
--
DROP TABLE TMP_MYDATA
GO
CREATE TABLE TMP_MYDATA (
MID NUMERIC(19,0) NOT NULL,
TIMEKEY INTEGER NOT NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) )
GO
--
-- Insert into the TMP_MYDATA table all the duplicated records for
-- the tuple (MID,IID,TIMEKEY,EID) and NULL for the INSERT_SEQ field
--
INSERT INTO TMP_MYDATA (MID,IID,TIMEKEY,EID)
SELECT MID,IID,TIMEKEY,EID
FROM MYDATA
GROUP BY MID,IID,TIMEKEY,EID
HAVING COUNT(*)>1
GO
--
-- Updates the INSERT_SEQ field to the lowest value in the group
-- of duplicated records
--
UPDATE TMP_MYDATA
SET TMP_MYDATA.INSERT_SEQ = (
SELECT MIN(INSERT_SEQ)
FROM MYDATA
WHERE TMP_MYDATA.MID = MYDATA.MID AND
TMP_MYDATA.IID = MYDATA.IID AND
TMP_MYDATA.TIMEKEY = MYDATA.TIMEKEY AND
TMP_MYDATA.EID = MYDATA.EID )
GO
--
-- Updates the value of NUM_DUPLICATES for the MYDATA table.
--
UPDATE DUPLICATES
SET NUM_DUPLICATES = (SELECT COUNT(*) FROM TMP_MYDATA)
WHERE TABLENAME = 'MYDATA'
GO
--
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
records
-- stored into the NUM_DUPLICATES field of the DUPLICATES table for
-- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant
on huge tables (>10M records)!
Do you know a better way to achieve the task of removing all the
duplicates records, preserving the lowest INSERT_SEQ betwee the
duplicates and also preserving the sequence seed, so that a new record
inserted at time t1>t0 is enumerated with an INSERT_SEQ|t1 >
max(INSERT_SEQ)|t0 ?

Thanks a lot for your help!

Patrizio

PS. sorry for such a large post!

Jul 20 '05 #1

Subscribe Post Reply

10781

Erland Sommarskog

[posted and mailed, per favore rispondere in news]

This reply is coming late, so may be you have resolved the issue yourself,
but here is a suggestion.

Patrizio (pt********@quipo.it) writes:

The TIMEKEY_DTTM field is generated, from the value actually inserted
into the TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO
This looks a little funny. If you insert multiple rows with different
vales in TIMEKEY, then all rows will still get the same value in
TIMEKEY_DTTM.

I have to admit that I didn't see where TIMEKEY_DTTM came into your
problem of getting rid of the duplicates?
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
-- records stored into the NUM_DUPLICATES field of the DUPLICATES table
for -- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant on huge tables (>10M records)!

I don't why this performs bad, but you should probably move your
shortcut to an IF statement:

IF (SELECT num_duplicates FROM duplicates WHERE...) > 0
BEGIN
DELETE ...

The way the query is now, the extra condition may lead the optimizer
astray.

But an alteraitve appraoch could be:

SELECT mid, timekey, iid, eid, min_seq = MIN(insert_seq)
INTO #duplkeys
FROM mydata
GROUP BY mid, timekey, iid, eid
HAVING COUNT(*) > 1

IF EXISTS (SELECT * FROM #duplicates)
BEGIN
SELECT m.insert_seq
INTO #seqs_to_delete
FROM #duplkeys d
JOIN mydata m ON d.mid = m.mid
AND d.timekey = m.timekey
AND d.iid = m.iid
AND d.eid = m.eid
WHERE m.insert_seq > d.insert_seq

DELETE mydata
FROM mydata m
JOIN #seqs_to_delete s ON m.insert_seq = s.insert_seq
END

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 20 '05 #2

Similar topics

optimizing a query to delete duplicates

by: Alexander Anderson | last post by:

I have a DELETE statement that deletes duplicate data from a table. It takes a long time to execute, so I thought I'd seek advice here. The structure of the table is little funny. The following is...

Microsoft SQL Server

Query to delete duplicate records in staging table

by: ms | last post by:

Access 2000: I am trying to delete duplicate records imported to a staging table leaving one of the duplicates to be imported into the live table. A unique record is based on a composite key of 3...

Microsoft Access / VBA

How to delete duplicate records and leave one?

by: KT | last post by:

Is there any one click solution that would do the trick? I would like to create a button, so the person who maintains the database can perform clean up work to delete duplicate records which...

Microsoft Access / VBA

UI Question: Protecting Critical Delete?

by: (PeteCresswell) | last post by:

I've got apps where you *really* wouldn't want to delete certain items by accident, but the users just have to have a "Delete" button. My current strategies: Plan A:...

Microsoft Access / VBA

Howto: Delete every second record if duplicates

by: Radu | last post by:

Hi. I have a "union" table which results of a union of two tables. Occasionally I could have duplicates, when the same PIN has been added to both tables, albeit at different Datees/Times, such...

Microsoft SQL Server

Delete Duplicate record from SQL Server2000

by: rajeshkrsingh | last post by:

Hi friends, Step1- create table duplicate ( intId int, varName varchar(50) ) insert into duplicate(intId,varName) values(1,'rajesh') insert into duplicate(intId,varName) values(2,'raj12')...

Microsoft SQL Server

Loop through several tables and delete

by: farouqdin | last post by:

Hi all i have code which loops through table and deletes the duplicate records. This code does it for one table. How do i change it so it goes through several tables? On Error Resume Next Dim...

Microsoft Access / VBA

Delete Duplicate Records from a large table

by: Dilip1983 | last post by:

Hi All, I want to delete duplicate records from a large table. There is one index(INDEX_U1) on 4 columns(col1,col2,col3,col4) which is in unusable state. First of all when i tried to rebuild...

Oracle Database

Delete duplicates in each column, not duplicate records

by: watertraveller | last post by:

Hi all. My ultimate goal is to return two columns, where no single value appears anywhere twice. This means that not only do I want to check that nothing from column A appears in column B and...

Microsoft SQL Server

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA