468,161 Members | 1,993 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,161 developers. It's quick & easy.

Delete duplicate records from huge tables

Hi All,

I've the following table with a PK defined on an IDENTITY column
(INSERT_SEQ):

CREATE TABLE MYDATA (
MID NUMERIC(19,0) NOT NULL,
MYVALUE FLOAT NOT NULL,
TIMEKEY INTEGER NOT NULL,
TIMEKEY_DTTM DATETIME NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) IDENTITY(1,1) NOT NULL
)
GO

ALTER TABLE MYDATA
ADD CONSTRAINT PK_MYDATA
PRIMARY KEY (INSERT_SEQ)
GO

The TIMEKEY_DTTM field is generated, from the value actually inserted
into the
TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO

There is also a composite, non unique, index defined on the
tuple:(MID,IID,TIMEKEY,EID)

CREATE INDEX IX_METDATA ON MYDATA (MID,IID,TIMEKEY,EID)
GO

As a consequence of an application design change, I would also change
this index to be UNIQUE, but when I try to drop and create it I get an
error, because the tables stores some duplicated rows...

In order to succesfully upgrade the index definition, I wrote some DML
staements
to lookup and remove the duplicated rows, keeping only the first
record inserted, i.e. the one with the lowest INSERT_SEQ:

--
-- This table stores then umber of duplicated records eventually
discovered
-- into the MYDATA table; the initial value for the NUM_DUPLICATES
field is
-- 0 (no duplicated record)
--
DROP TABLE DUPLICATES
GO
CREATE TABLE DUPLICATES (
TABLENAME VARCHAR(17),
NUM_DUPLICATES NUMERIC(19,0) )
GO
INSERT INTO DUPLICATES VALUES ('MYDATA',0)
GO
INSERT INTO DUPLICATES VALUES ('CATEGORIESDATA',0)
GO
--
-- ///////// CLEAN UP OF MYDATA TABLE
--
DROP TABLE TMP_MYDATA
GO
CREATE TABLE TMP_MYDATA (
MID NUMERIC(19,0) NOT NULL,
TIMEKEY INTEGER NOT NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) )
GO
--
-- Insert into the TMP_MYDATA table all the duplicated records for
-- the tuple (MID,IID,TIMEKEY,EID) and NULL for the INSERT_SEQ field
--
INSERT INTO TMP_MYDATA (MID,IID,TIMEKEY,EID)
SELECT MID,IID,TIMEKEY,EID
FROM MYDATA
GROUP BY MID,IID,TIMEKEY,EID
HAVING COUNT(*)>1
GO
--
-- Updates the INSERT_SEQ field to the lowest value in the group
-- of duplicated records
--
UPDATE TMP_MYDATA
SET TMP_MYDATA.INSERT_SEQ = (
SELECT MIN(INSERT_SEQ)
FROM MYDATA
WHERE TMP_MYDATA.MID = MYDATA.MID AND
TMP_MYDATA.IID = MYDATA.IID AND
TMP_MYDATA.TIMEKEY = MYDATA.TIMEKEY AND
TMP_MYDATA.EID = MYDATA.EID )
GO
--
-- Updates the value of NUM_DUPLICATES for the MYDATA table.
--
UPDATE DUPLICATES
SET NUM_DUPLICATES = (SELECT COUNT(*) FROM TMP_MYDATA)
WHERE TABLENAME = 'MYDATA'
GO
--
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
records
-- stored into the NUM_DUPLICATES field of the DUPLICATES table for
-- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant
on huge tables (>10M records)!
Do you know a better way to achieve the task of removing all the
duplicates records, preserving the lowest INSERT_SEQ betwee the
duplicates and also preserving the sequence seed, so that a new record
inserted at time t1>t0 is enumerated with an INSERT_SEQ|t1 >
max(INSERT_SEQ)|t0 ?

Thanks a lot for your help!

Patrizio

PS. sorry for such a large post!
Jul 20 '05 #1
1 10473
[posted and mailed, per favore rispondere in news]

This reply is coming late, so may be you have resolved the issue yourself,
but here is a suggestion.

Patrizio (pt********@quipo.it) writes:
The TIMEKEY_DTTM field is generated, from the value actually inserted
into the TIMEKEY field, by the following trigger:

CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO
This looks a little funny. If you insert multiple rows with different
vales in TIMEKEY, then all rows will still get the same value in
TIMEKEY_DTTM.

I have to admit that I didn't see where TIMEKEY_DTTM came into your
problem of getting rid of the duplicates?
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
-- records stored into the NUM_DUPLICATES field of the DUPLICATES table
for -- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ )
GO

This tecnique works fine on a normal table (1M recs) but is not very
performant on huge tables (>10M records)!


I don't why this performs bad, but you should probably move your
shortcut to an IF statement:

IF (SELECT num_duplicates FROM duplicates WHERE...) > 0
BEGIN
DELETE ...

The way the query is now, the extra condition may lead the optimizer
astray.

But an alteraitve appraoch could be:

SELECT mid, timekey, iid, eid, min_seq = MIN(insert_seq)
INTO #duplkeys
FROM mydata
GROUP BY mid, timekey, iid, eid
HAVING COUNT(*) > 1

IF EXISTS (SELECT * FROM #duplicates)
BEGIN
SELECT m.insert_seq
INTO #seqs_to_delete
FROM #duplkeys d
JOIN mydata m ON d.mid = m.mid
AND d.timekey = m.timekey
AND d.iid = m.iid
AND d.eid = m.eid
WHERE m.insert_seq > d.insert_seq

DELETE mydata
FROM mydata m
JOIN #seqs_to_delete s ON m.insert_seq = s.insert_seq
END

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #2

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

3 posts views Thread by Alexander Anderson | last post: by
17 posts views Thread by (PeteCresswell) | last post: by
1 post views Thread by gcdp | last post: by
reply views Thread by kamranasdasdas | last post: by
reply views Thread by gcreed | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.