optimizing a query to delete duplicates

Alexander Anderson

I have a DELETE statement that deletes duplicate data from a table. It
takes a long time to execute, so I thought I'd seek advice here. The
structure of the table is little funny. The following is NOT the table,
but the representation of the data in the table:

+-----------+
| a | b |
+-----+-----+
| 123 | 234 |
| 345 | 456 |
| 123 | 123 |
+-----+-----+

As you can see, the data is tabular. This is how it is stored in the table:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 234 |
+-----+-----------+------------+

What I need is to delete all records having the same "Row" when there exists
the same set of records with a different (smaller, to be precise) "Row".
Using the example above, what I need to get is:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
+-----+-----------+------------+

A slow way of doing this seem to be:

DELETE FROM X
WHERE Row IN
(SELECT DISTINCT Row FROM X x1
WHERE EXISTS
(SELECT * FROM X x2
WHERE x2.Row < x1.Row
AND NOT EXISTS
(SELECT * FROM X x3
WHERE x3.Row = x2.Row
AND x3.FieldName = x2.FieldName
AND x3.FieldValue <> x1.FieldValue)))

Can this be done faster, better, and cheaper?

Jul 20 '05 #1

Subscribe Post Reply

2391

Aaron W. West

my knee-jerk reaction is:

Why is it important to optimize it? I think you should delete the
duplicates, then create a constraint that prevents them from recurring.

If, for some reason, you are unable to fix the application that creates
these duplicates, and creating a constraint causes errors in the application
that you can't tolerate, then I suppose an alternative would be to create a
trigger that deletes them upon entry. Having a composite index on the
columns that are being duplicated would enable such a trigger to run
quickly.

But looking at your query, I find it strangely complex.

Why not just:

DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
WHERE x2.Row < x.Row
AND X.FieldName = x2.FieldName
AND X.FieldValue = x2.FieldValue)

Am I missing something? Your NOT EXISTS has me a bit confused... I think it
might delete data in situations other than described.

Also, NOT EXISTS is generally slow.

Jul 20 '05 #2

Alexander Anderson

On 2004-07-15, Aaron W. West <ta******@hotmail.NO.SPAM> wrote:

Why is it important to optimize it? I think you should delete the
duplicates, then create a constraint that prevents them from recurring.
Such constraint may not be created. This table is a temporary table, where
data from an input file is loaded. Duplicate sets of records must be
deleted because the data then goes into permanent tables. Those table have
constraints against duplicates.
But looking at your query, I find it strangely complex.
Me too. I'm trying to improve it. Its complexity seems to hinder its
performance.
Why not just:

DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
WHERE x2.Row < x.Row
AND X.FieldName = x2.FieldName
AND X.FieldValue = x2.FieldValue)

This would delete records that should not be deleted. Here's an example:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 666 |
+-----+-----------+------------+

Here the combination of values for "a" and "b" on every "Row" is
different. There are no duplicates here. The query that you proposed would
delete the second to last row

+-----+-----------+------------+
| 3 | a | 123 |
+-----+-----------+------------+

because it has the same FieldName and FieldValue as the first row.

Think of it the data this way:

+-----+-----+
| a | b |
+-----+-----+
| 123 | 234 |
| 345 | 456 |
| 123 | 666 |
+-----+-----+

No duplicate rows here.

Jul 20 '05 #3

John Bell

Hi

You could try only selecting the correct data when you move it into the
permanent tables. But the following may work better:

DELETE FROM X1
FROM X X1 JOIN X X2
ON x2.Row < x1.Row
AND x1.Fieldvalue = x2.Fieldvalue
AND x1.FieldName = x2.FieldName

John
"Alexander Anderson" <no@spam.com> wrote in message
news:sl*****************@Toronto-HSE-ppp3682122.sympatico.ca...

I have a DELETE statement that deletes duplicate data from a table. It
takes a long time to execute, so I thought I'd seek advice here. The
structure of the table is little funny. The following is NOT the table,
but the representation of the data in the table:

+-----------+
| a | b |
+-----+-----+
| 123 | 234 |
| 345 | 456 |
| 123 | 123 |
+-----+-----+

As you can see, the data is tabular. This is how it is stored in the table:
+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 234 |
+-----+-----------+------------+

What I need is to delete all records having the same "Row" when there exists the same set of records with a different (smaller, to be precise) "Row".
Using the example above, what I need to get is:

+-----+-----------+------------+
| Row | FieldName | FieldValue |
+-----+-----------+------------+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
+-----+-----------+------------+

A slow way of doing this seem to be:

DELETE FROM X
WHERE Row IN
(SELECT DISTINCT Row FROM X x1
WHERE EXISTS
(SELECT * FROM X x2
WHERE x2.Row < x1.Row
AND NOT EXISTS
(SELECT * FROM X x3
WHERE x3.Row = x2.Row
AND x3.FieldName = x2.FieldName
AND x3.FieldValue <> x1.FieldValue)))

Can this be done faster, better, and cheaper?

Jul 20 '05 #4

by: Patrizio | last post by:

Hi All, I've the following table with a PK defined on an IDENTITY column (INSERT_SEQ): CREATE TABLE MYDATA ( MID NUMERIC(19,0) NOT NULL, MYVALUE FLOAT NOT NULL, TIMEKEY ...

Microsoft SQL Server

Query to delete duplicate records in staging table

by: ms | last post by:

Access 2000: I am trying to delete duplicate records imported to a staging table leaving one of the duplicates to be imported into the live table. A unique record is based on a composite key of 3...

Microsoft Access / VBA

Removing duplicates from query, but not from table

by: tyrfboard | last post by:

I've been searching for awhile now on how to remove duplicates from a table within an Access db and have found plenty of articles on finding or deleting duplicates. All I want to do is remove them...

Microsoft Access / VBA

Optimizing SQL - Union

by: das | last post by:

Hello all, I have a table with thousands of rows and is in this format: id col1 col2 col3 col4 --- ------ ----- ------ ------ 1 nm 78 xyz pir 2 ...

Microsoft SQL Server

Changing a duplicate select query into a delete query, help!!

by: JC | last post by:

Hello, I am trying to change a select, find duplicates, query into a delete query. I want to get rid of the records in the main table "tblHurnsHistory." I changed the Find Duplicates query to...

Microsoft Access / VBA

duplicates query help & strategy for update queries with SetWarnings = False

by: ARC | last post by:

Hello all, So I'm knee deep in this import utility program, and am coming up with all sorts of "gotcha's!". 1st off. On a "Find Duplicates Query", does anyone have a good solution for...

Microsoft Access / VBA

Duplicates Updates Query

by: bcquadrant | last post by:

Hi there I am trying to make an update query that will update information in all occurances of duplicate records based on the First record for that duplicate set. There are about 30,000 records...

Microsoft Access / VBA

SQL Query to Select Largest Value

by: fperri | last post by:

Hello, I have a query that creates a table of duplicates in my database. For one set of duplicates (three records), all the fields & values are the same except for the values in one field. This...

Microsoft Access / VBA

vb.net Appen Query to Access DB

by: csin | last post by:

I have an Access DB backend for the application I am running, I want to use the built in ability in Access to remove duplicate entries... Say I have table1 with fields field1 field2 and field3,...

.NET Framework

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

optimizing a query to delete duplicates

Similar topics