Hi All,
I've the following table with a PK defined on an IDENTITY column
(INSERT_SEQ):
CREATE TABLE MYDATA (
MID NUMERIC(19,0) NOT NULL,
MYVALUE FLOAT NOT NULL,
TIMEKEY INTEGER NOT NULL,
TIMEKEY_DTTM DATETIME NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) IDENTITY(1,1) NOT NULL
)
GO
ALTER TABLE MYDATA
ADD CONSTRAINT PK_MYDATA
PRIMARY KEY (INSERT_SEQ)
GO
The TIMEKEY_DTTM field is generated, from the value actually inserted
into the
TIMEKEY field, by the following trigger:
CREATE TRIGGER TIMEKEY1
ON MYDATA
FOR INSERT AS
BEGIN
DECLARE @M_TIMEKEY_DTTM DATETIME
SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY +
EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000')
FROM INSERTED INS, LOCATIONINFO EP
WHERE INS.EID = EP.EID
UPDATE MYDATA
SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM
FROM INSERTED INS, MYDATA MD
WHERE MD.INSERT_SEQ = INS.INSERT_SEQ
END
GO
There is also a composite, non unique, index defined on the
tuple:(MID,IID,TIMEKEY,EID)
CREATE INDEX IX_METDATA ON MYDATA (MID,IID,TIMEKEY,EID)
GO
As a consequence of an application design change, I would also change
this index to be UNIQUE, but when I try to drop and create it I get an
error, because the tables stores some duplicated rows...
In order to succesfully upgrade the index definition, I wrote some DML
staements
to lookup and remove the duplicated rows, keeping only the first
record inserted, i.e. the one with the lowest INSERT_SEQ:
--
-- This table stores then umber of duplicated records eventually
discovered
-- into the MYDATA table; the initial value for the NUM_DUPLICATES
field is
-- 0 (no duplicated record)
--
DROP TABLE DUPLICATES
GO
CREATE TABLE DUPLICATES (
TABLENAME VARCHAR(17),
NUM_DUPLICATES NUMERIC(19,0) )
GO
INSERT INTO DUPLICATES VALUES ('MYDATA',0)
GO
INSERT INTO DUPLICATES VALUES ('CATEGORIESDATA',0)
GO
--
-- ///////// CLEAN UP OF MYDATA TABLE
--
DROP TABLE TMP_MYDATA
GO
CREATE TABLE TMP_MYDATA (
MID NUMERIC(19,0) NOT NULL,
TIMEKEY INTEGER NOT NULL,
IID NUMERIC(19,0) NOT NULL,
EID NUMERIC(19,0) NOT NULL,
INSERT_SEQ NUMERIC(19,0) )
GO
--
-- Insert into the TMP_MYDATA table all the duplicated records for
-- the tuple (MID,IID,TIMEKEY,EID) and NULL for the INSERT_SEQ field
--
INSERT INTO TMP_MYDATA (MID,IID,TIMEKEY,EID)
SELECT MID,IID,TIMEKEY,EID
FROM MYDATA
GROUP BY MID,IID,TIMEKEY,EID
HAVING COUNT(*)>1
GO
--
-- Updates the INSERT_SEQ field to the lowest value in the group
-- of duplicated records
--
UPDATE TMP_MYDATA
SET TMP_MYDATA.INSERT_SEQ = (
SELECT MIN(INSERT_SEQ)
FROM MYDATA
WHERE TMP_MYDATA.MID = MYDATA.MID AND
TMP_MYDATA.IID = MYDATA.IID AND
TMP_MYDATA.TIMEKEY = MYDATA.TIMEKEY AND
TMP_MYDATA.EID = MYDATA.EID )
GO
--
-- Updates the value of NUM_DUPLICATES for the MYDATA table.
--
UPDATE DUPLICATES
SET NUM_DUPLICATES = (SELECT COUNT(*) FROM TMP_MYDATA)
WHERE TABLENAME = 'MYDATA'
GO
--
-- Delete from the MYDATA table all the duplicated records,
-- keeping only the row with the lowest INSERT_SEQ
-- The delete is performed only if there are duplicated recors;
-- this is achieved using a "short circuit" AND on the number of
records
-- stored into the NUM_DUPLICATES field of the DUPLICATES table for
-- the MYDATA table...
--
DELETE FROM MYDATA
WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME =
'MYDATA') > 0 AND
EXISTS ( SELECT 1
FROM TMP_MYDATA
WHERE MYDATA.MID = TMP_MYDATA.MID AND
MYDATA.IID = TMP_MYDATA.IID AND
MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND
MYDATA.EID = TMP_MYDATA.EID AND
MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ )
GO
This tecnique works fine on a normal table (1M recs) but is not very
performant
on huge tables (>10M records)!
Do you know a better way to achieve the task of removing all the
duplicates records, preserving the lowest INSERT_SEQ betwee the
duplicates and also preserving the sequence seed, so that a new record
inserted at time t1>t0 is enumerated with an INSERT_SEQ|t1 >
max(INSERT_SEQ)|t0 ?
Thanks a lot for your help!
Patrizio
PS. sorry for such a large post! 1 10781
[posted and mailed, per favore rispondere in news]
This reply is coming late, so may be you have resolved the issue yourself,
but here is a suggestion.
Patrizio (pt********@quipo.it) writes: The TIMEKEY_DTTM field is generated, from the value actually inserted into the TIMEKEY field, by the following trigger:
CREATE TRIGGER TIMEKEY1 ON MYDATA FOR INSERT AS BEGIN DECLARE @M_TIMEKEY_DTTM DATETIME SELECT @M_TIMEKEY_DTTM = DATEADD(SECOND, INS.TIMEKEY + EP.GMT_OFFSET * 0 ,'1970-01-01 00:00:00.000') FROM INSERTED INS, LOCATIONINFO EP WHERE INS.EID = EP.EID UPDATE MYDATA SET TIMEKEY_DTTM = @M_TIMEKEY_DTTM FROM INSERTED INS, MYDATA MD WHERE MD.INSERT_SEQ = INS.INSERT_SEQ END GO
This looks a little funny. If you insert multiple rows with different
vales in TIMEKEY, then all rows will still get the same value in
TIMEKEY_DTTM.
I have to admit that I didn't see where TIMEKEY_DTTM came into your
problem of getting rid of the duplicates?
-- Delete from the MYDATA table all the duplicated records, -- keeping only the row with the lowest INSERT_SEQ -- The delete is performed only if there are duplicated recors; -- this is achieved using a "short circuit" AND on the number of -- records stored into the NUM_DUPLICATES field of the DUPLICATES table for -- the MYDATA table... -- DELETE FROM MYDATA WHERE ( SELECT NUM_DUPLICATES FROM DUPLICATES WHERE TABLENAME = 'MYDATA') > 0 AND EXISTS ( SELECT 1 FROM TMP_MYDATA WHERE MYDATA.MID = TMP_MYDATA.MID AND MYDATA.IID = TMP_MYDATA.IID AND MYDATA.TIMEKEY = TMP_MYDATA.TIMEKEY AND MYDATA.EID = TMP_MYDATA.EID AND MYDATA.INSERT_SEQ > TMP_MYDATA.INSERT_SEQ ) GO
This tecnique works fine on a normal table (1M recs) but is not very performant on huge tables (>10M records)!
I don't why this performs bad, but you should probably move your
shortcut to an IF statement:
IF (SELECT num_duplicates FROM duplicates WHERE...) > 0
BEGIN
DELETE ...
The way the query is now, the extra condition may lead the optimizer
astray.
But an alteraitve appraoch could be:
SELECT mid, timekey, iid, eid, min_seq = MIN(insert_seq)
INTO #duplkeys
FROM mydata
GROUP BY mid, timekey, iid, eid
HAVING COUNT(*) > 1
IF EXISTS (SELECT * FROM #duplicates)
BEGIN
SELECT m.insert_seq
INTO #seqs_to_delete
FROM #duplkeys d
JOIN mydata m ON d.mid = m.mid
AND d.timekey = m.timekey
AND d.iid = m.iid
AND d.eid = m.eid
WHERE m.insert_seq > d.insert_seq
DELETE mydata
FROM mydata m
JOIN #seqs_to_delete s ON m.insert_seq = s.insert_seq
END
--
Erland Sommarskog, SQL Server MVP, so****@algonet.se
Books Online for SQL Server SP3 at http://www.microsoft.com/sql/techinf...2000/books.asp This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: Alexander Anderson |
last post by:
I have a DELETE statement that deletes duplicate data from a table. It
takes a long time to execute, so I thought I'd seek advice here. The
structure of the table is little funny. The following is...
|
by: ms |
last post by:
Access 2000:
I am trying to delete duplicate records imported to a staging table leaving one
of the duplicates to be imported into the live table. A unique record is based
on a composite key of 3...
|
by: KT |
last post by:
Is there any one click solution that would do the trick? I would like
to create a button, so the person who maintains the database can
perform clean up work to delete duplicate records which...
|
by: (PeteCresswell) |
last post by:
I've got apps where you *really* wouldn't want to delete certain items by
accident, but the users just have to have a "Delete" button.
My current strategies:
Plan A:...
|
by: Radu |
last post by:
Hi.
I have a "union" table which results of a union of two tables.
Occasionally I could have duplicates, when the same PIN has been added
to both tables, albeit at different Datees/Times, such...
|
by: rajeshkrsingh |
last post by:
Hi friends,
Step1-
create table duplicate
(
intId int, varName varchar(50)
)
insert into duplicate(intId,varName) values(1,'rajesh')
insert into duplicate(intId,varName) values(2,'raj12')...
|
by: farouqdin |
last post by:
Hi all i have code which loops through table and deletes the duplicate
records. This code does it for one table. How do i change it so it
goes through several tables?
On Error Resume Next
Dim...
|
by: Dilip1983 |
last post by:
Hi All,
I want to delete duplicate records from a large table.
There is one index(INDEX_U1) on 4 columns(col1,col2,col3,col4) which is in unusable state.
First of all when i tried to rebuild...
|
by: watertraveller |
last post by:
Hi all. My ultimate goal is to return two columns, where no single value appears anywhere twice. This means that not only do I want to check that nothing from column A appears in column B and...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new...
| |