
Best way for finding duplicate entries in table?

Hi all,

for a few years now I have been using the following statement to find
duplicate entries in a table:

SELECT t1.id, t2.id,...
FROM table AS t1 INNER JOIN table AS t2
ON t1.field = t2.field
WHERE t1.id < t2.id

Of course, indexes are set on id (primary key, auto-increment) and field
(a normal index), and according to EXPLAIN they are used (index "field").

This statement works quite well, but it takes ages (> 10 minutes) on a
table with tens of thousands of rows.

Now my question:
Are there better and, above all, faster ways to find duplicate entries
within a table without having to create a separate table which would then
be filled with a SELECT DISTINCT ... statement from the "old" table?

Thanks in advance for your help and time,

Andy Maurer

Jul 23 '05 #1
3 Replies


an****************@web.de wrote:
Now my question:
Are there better and, above all, faster ways to find duplicate entries
within a table without having to create a separate table which would then
be filled with a SELECT DISTINCT ... statement from the "old" table?


select id,count(field) as amount
from table group by field having amount > 1;
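
If you also want to see which rows share each duplicated value, a variant
along these lines should work. This is only a sketch and assumes MySQL 4.1
or later, since it relies on GROUP_CONCAT:

SELECT `field`,
       COUNT(*) AS amount,        -- how many rows share this value
       GROUP_CONCAT(id) AS ids    -- all ids that share it
FROM `table`
GROUP BY `field`
HAVING amount > 1;

Each result row then lists every id that carries the same value of `field`.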
Jul 23 '05 #2

Aggro wrote:
an****************@web.de wrote:
Now my question:
Are there better and, above all, faster ways to find duplicate entries
within a table without having to create a separate table which would then
be filled with a SELECT DISTINCT ... statement from the "old" table?


select id,count(field) as amount
from table group by field having amount > 1;


Hi Aggro,

thank you very much for your help. Your query does show me the duplicate
entries, but, partly because my question was not specific enough, a few
problems remain.
First of all, when I run your statement in phpMyAdmin, it reports that it
found all rows of the table in total. Do you have any idea why?
The other problem is that it shows me the duplicated values in "field" but
only one of the many keys (ids) belonging to each value. So if I wanted to
delete all but one of these entries, I would have to issue a lot of separate
SELECT and DELETE statements to clean up my table.

Of course I always could create another (temporary) table and fill it
with DISTINCT entries of table1 but this is sometimes not an option for
me.

Do you have another idea how to fix this?

Thanks again,

Andy

Jul 23 '05 #3

an****************@web.de wrote:
select id,count(field) as amount
from table group by field having amount > 1;
The other problem is that it shows me the duplicated values in "field" but
only one of the many keys (ids) belonging to each value. So if I wanted to
delete all but one of these entries, I would have to issue a lot of separate
SELECT and DELETE statements to clean up my table.


This should give the list of values of `field` that occur more than once in
the table:

SELECT t2.`field`
FROM `table` AS t2
GROUP BY t2.`field`
HAVING COUNT(t2.`field`) > 1;

This should give the list of ids of the records in which those duplicated
values of `field` occur:

SELECT t.id
FROM `table` AS t
WHERE t.`field` IN (
SELECT t2.`field`
FROM `table` AS t2
GROUP BY t2.`field`
HAVING COUNT(t2.`field`) > 1);

This should give the list of the lowest id for each given value in field:

SELECT MIN(t3.id)
FROM `table` AS t3
GROUP BY t3.`field`;

Putting it all together, this should delete the records whose value of
`field` appears on more than one row, while keeping the record with the
lowest id for each such value:

DELETE FROM `table`
WHERE id IN (
SELECT t.id
FROM `table` AS t
WHERE t.`field` IN (
SELECT t2.`field`
FROM `table` AS t2
GROUP BY t2.`field`
HAVING COUNT(t2.`field`) > 1))
AND id NOT IN (
SELECT MIN(t3.id)
FROM `table` AS t3
GROUP BY t3.`field`);

I'm assuming that you want to keep the record with the lowest id for each
value. In order to do this in a single operation, you need some kind of
uniform policy like this that applies to all cases. You could use
MAX(t3.id) instead to keep only the most recent record, but beyond those
two options the alternatives get more complex to implement.
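
One caveat: MySQL normally refuses to delete from a table that is also read
in a subquery of the same statement (typically reported as "You can't
specify target table ... for update in FROM clause"), so the combined
DELETE above may be rejected depending on your version. A commonly used
workaround, sketched here under the assumption that your server supports
derived tables (MySQL 4.1 or later), is to wrap the id list in an extra
derived table so that it is materialized before the DELETE runs:

DELETE FROM `table`
WHERE id IN (
    SELECT dup_id FROM (
        -- same id list as above, materialized into a derived table
        SELECT t.id AS dup_id
        FROM `table` AS t
        WHERE t.`field` IN (
            SELECT t2.`field`
            FROM `table` AS t2
            GROUP BY t2.`field`
            HAVING COUNT(t2.`field`) > 1)
        AND t.id NOT IN (
            SELECT MIN(t3.id)
            FROM `table` AS t3
            GROUP BY t3.`field`)
    ) AS to_delete);

As with any bulk DELETE, it is worth running the inner SELECT on its own
first (or testing on a copy of the table) before deleting for real.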

Regards,
Bill K.
Jul 23 '05 #4
