Bytes | Software Development & Data Engineering Community

optimizing a query to delete duplicates

I have a DELETE statement that deletes duplicate data from a table. It
takes a long time to execute, so I thought I'd seek advice here. The
structure of the table is a little funny. The following is NOT the table,
but a representation of the data in the table:

+-----------+
| a | b |
+-----+-----+
| 123 | 234 |
| 345 | 456 |
| 123 | 123 |
+-----+-----+

As you can see, the data is tabular. This is how it is stored in the table:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 234 |
+-----+-----------+------------+

What I need is to delete all records having the same "Row" when there exists
the same set of records with a different (smaller, to be precise) "Row".
Using the example above, what I need to get is:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
+-----+-----------+------------+

A slow way of doing this seems to be:

DELETE FROM X
WHERE Row IN
    (SELECT DISTINCT Row FROM X x1
     WHERE EXISTS
         (SELECT * FROM X x2
          WHERE x2.Row < x1.Row
          AND NOT EXISTS
              (SELECT * FROM X x3
               WHERE x3.Row = x2.Row
               AND x3.FieldName = x2.FieldName
               AND x3.FieldValue <> x1.FieldValue)))

Can this be done faster, better, and cheaper?
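For what it's worth, the intended rule (delete a "Row" only when its entire set of field name/value pairs already appears under a smaller "Row") can be checked outside the query. Below is a minimal sketch in Python against an in-memory SQLite copy of the sample table; it is only an illustration of the logic, not the original environment:

```python
import sqlite3

# Sample table from the post, loaded into an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE X (Row INTEGER, FieldName TEXT, FieldValue TEXT);
INSERT INTO X VALUES
  (1,'a','123'),(1,'b','234'),
  (2,'a','345'),(2,'b','456'),
  (3,'a','123'),(3,'b','234');
""")

# Collect each Row's full set of (FieldName, FieldValue) pairs, then mark a
# Row as a duplicate only when an identical set appeared under a smaller Row.
fields = {}
for row, name, value in con.execute("SELECT Row, FieldName, FieldValue FROM X"):
    fields.setdefault(row, set()).add((name, value))

seen, dupes = set(), []
for row in sorted(fields):
    sig = frozenset(fields[row])
    if sig in seen:
        dupes.append((row,))
    else:
        seen.add(sig)

con.executemany("DELETE FROM X WHERE Row = ?", dupes)
print(sorted(r for (r,) in con.execute("SELECT DISTINCT Row FROM X")))  # [1, 2]
```

On the sample data only Row 3 is removed, since its whole field set matches Row 1's.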
Jul 20 '05 #1
My knee-jerk reaction is:

Why is it important to optimize it? I think you should delete the
duplicates, then create a constraint that prevents them from recurring.

If, for some reason, you are unable to fix the application that creates
these duplicates, and creating a constraint causes errors in the application
that you can't tolerate, then I suppose an alternative would be to create a
trigger that deletes them upon entry. Having a composite index on the
columns that are being duplicated would enable such a trigger to run
quickly.

But looking at your query, I find it strangely complex.

Why not just:

DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
WHERE x2.Row < x.Row
AND X.FieldName = x2.FieldName
AND X.FieldValue = x2.FieldValue)

Am I missing something? Your NOT EXISTS has me a bit confused... I think it
might delete data in situations other than the one described.

Also, NOT EXISTS is generally slow.
Jul 20 '05 #2
On 2004-07-15, Aaron W. West <ta******@hotmail.NO.SPAM> wrote:
Why is it important to optimize it? I think you should delete the
duplicates, then create a constraint that prevents them from recurring.
Such a constraint cannot be created. This table is a temporary table into
which data from an input file is loaded. Duplicate sets of records must be
deleted before the data goes into permanent tables, which have
constraints against duplicates.
But looking at your query, I find it strangely complex.
Me too. I'm trying to improve it. Its complexity seems to hinder its
performance.
Why not just:

DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
WHERE x2.Row < x.Row
AND X.FieldName = x2.FieldName
AND X.FieldValue = x2.FieldValue)


This would delete records that should not be deleted. Here's an example:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 666 |
+-----+-----------+------------+

Here the combination of values for "a" and "b" is different on every
"Row"; there are no duplicates. Yet the query that you proposed would
delete the second-to-last record

+-----+-----------+------------+
| 3 | a | 123 |
+-----+-----------+------------+

because it has the same FieldName and FieldValue as the first row.

Think of the data this way:

+-----+-----+
| a | b |
+-----+-----+
| 123 | 234 |
| 345 | 456 |
| 123 | 666 |
+-----+-----+

No duplicate rows here.
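The over-deletion is easy to reproduce. A quick check in SQLite (dialect details aside, this just demonstrates the logic) using the counterexample data above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE X (Row INTEGER, FieldName TEXT, FieldValue TEXT);
INSERT INTO X VALUES
  (1,'a','123'),(1,'b','234'),
  (2,'a','345'),(2,'b','456'),
  (3,'a','123'),(3,'b','666');
""")

# The proposed field-by-field delete: removes any record whose
# (FieldName, FieldValue) pair already occurs under a smaller Row.
con.execute("""
DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
              WHERE x2.Row < X.Row
                AND x2.FieldName  = X.FieldName
                AND x2.FieldValue = X.FieldValue)
""")

left = con.execute(
    "SELECT Row, FieldName, FieldValue FROM X ORDER BY Row, FieldName"
).fetchall()
print(left)  # (3, 'a', '123') is gone even though Row 3 is not a duplicate
```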
Jul 20 '05 #3
Hi

You could try selecting only the correct data when you move it into the
permanent tables. But the following may work better:

DELETE FROM X1
FROM X X1 JOIN X X2
ON x2.Row < x1.Row
AND x1.Fieldvalue = x2.Fieldvalue
AND x1.FieldName = x2.FieldName

John
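One caveat: like the earlier EXISTS version, this join compares one record at a time, so on the counterexample data upthread it would still drop Row 3's "a" record. A way to make a join-based delete row-set-aware is to require that the number of matching fields equals the field count of both Rows. A sketch (SQLite syntax, run from Python purely to illustrate the shape; table and column names as in the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE X (Row INTEGER, FieldName TEXT, FieldValue TEXT);
INSERT INTO X VALUES
  (1,'a','123'),(1,'b','234'),
  (2,'a','345'),(2,'b','456'),
  (3,'a','123'),(3,'b','234');   -- Row 3 is a genuine duplicate of Row 1
""")

# Two Rows are duplicates only if the number of matching (FieldName,
# FieldValue) pairs equals the total field count of BOTH Rows.
dupes = con.execute("""
  SELECT DISTINCT x1.Row
  FROM X x1 JOIN X x2
    ON  x2.Row < x1.Row
    AND x1.FieldName  = x2.FieldName
    AND x1.FieldValue = x2.FieldValue
  GROUP BY x1.Row, x2.Row
  HAVING COUNT(*) = (SELECT COUNT(*) FROM X WHERE Row = x1.Row)
     AND COUNT(*) = (SELECT COUNT(*) FROM X WHERE Row = x2.Row)
""").fetchall()
con.executemany("DELETE FROM X WHERE Row = ?", dupes)
print(sorted(r for (r,) in con.execute("SELECT DISTINCT Row FROM X")))  # [1, 2]
```

Whether a variant of this beats the nested NOT EXISTS on the real server likely depends on indexing (for instance, a composite index covering FieldName, FieldValue, and Row).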
"Alexander Anderson" <no@spam.com> wrote in message
news:sl*****************@Toronto-HSE-ppp3682122.sympatico.ca...
<original message snipped>

Jul 20 '05 #4
