By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
437,766 Members | 1,252 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 437,766 IT Pros & Developers. It's quick & easy.

Very chalanging question - Explanation

P: n/a
Ok, here is a asample table representing the problem more clearly

A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

Tha duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user
goal: to find all duplicated rows & to delete them all accept one
instance to leave.

Note:
Finding that row 1similar to 2 in A & deleting it will loose data
because we won't know that row 1 is ALSO similar to 3 on C & later on
finding that 3 is similar to 4 & 6 on D & so on

The simple time consuming (about 2 weaks) query to acomplish the task
is:
SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1

I THANK YOU ALL

May 31 '06 #1
Share this Question
Share on Google+
3 Replies


P: n/a
> SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1
If that query takes 2 weeks to run then you've got serious and I mean
serious problems with your hardware, that type of query should take minutes
if not seconds most desktops let alone servers.

To aid performance for that specific query create your clustered index on A,
B, C and D.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials
"groupy" <li*******@gmail.com> wrote in message
news:11**********************@i39g2000cwa.googlegr oups.com... Ok, here is a asample table representing the problem more clearly

A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

Tha duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user
goal: to find all duplicated rows & to delete them all accept one
instance to leave.

Note:
Finding that row 1similar to 2 in A & deleting it will loose data
because we won't know that row 1 is ALSO similar to 3 on C & later on
finding that 3 is similar to 4 & 6 on D & so on

The simple time consuming (about 2 weaks) query to acomplish the task
is:
SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1

I THANK YOU ALL

May 31 '06 #2

P: n/a
Thanks Tony i've just managed something..
Tony Rogerson כתב:
SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1


If that query takes 2 weeks to run then you've got serious and I mean
serious problems with your hardware, that type of query should take minutes
if not seconds most desktops let alone servers.

To aid performance for that specific query create your clustered index onA,
B, C and D.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials
"groupy" <li*******@gmail.com> wrote in message
news:11**********************@i39g2000cwa.googlegr oups.com...
Ok, here is a asample table representing the problem more clearly

A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

Tha duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user
goal: to find all duplicated rows & to delete them all accept one
instance to leave.

Note:
Finding that row 1similar to 2 in A & deleting it will loose data
because we won't know that row 1 is ALSO similar to 3 on C & later on
finding that 3 is similar to 4 & 6 on D & so on

The simple time consuming (about 2 weaks) query to acomplish the task
is:
SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1

I THANK YOU ALL


May 31 '06 #3

P: n/a
A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

I am going to guess at the DDL and add another column

CREATE TABLE Foobar
(a CHAR(2) NOT NULL,
b CHAR(2) NOT NULL,
c CHAR(2) NOT NULL,
d CHAR(2) NOT NULL,
dups INTEGER DEFAULT 0 NOT NULL
CHECK(dups >= 0),
PRIMARY KEY(a, b, c, d));

INSERT INTO Foobar(a, b, c, d) VALUES ('a1', 'b1', 'c1', 'd1');
INSERT INTO Foobar(a, b, c, d) VALUES ('a1', 'b2', 'c2', 'd2');
INSERT INTO Foobar(a, b, c, d) VALUES ('a3', 'b3', 'c1', 'd3');
INSERT INTO Foobar(a, b, c, d) VALUES ('a4', 'b4', 'c4', 'd3');
INSERT INTO Foobar(a, b, c, d) VALUES ('a5', 'b5', 'c5', 'd5');
INSERT INTO Foobar(a, b, c, d) VALUES ('a6', 'b6', 'c6', 'd3');

Since you seem to want to preserve some of the information about
duplications, you can keep a tally

UPDATE Foobar
SET dups
= dups
+ (SELECT 1
FROM Foobar AS F1
WHERE F1.a = Foobar.a
HAVING COUNT(*) > 1)
+ (SELECT 1
FROM Foobar AS F1
WHERE F1.b = Foobar.b
HAVING COUNT(*) > 1)
+ (SELECT 1
FROM Foobar AS F1
WHERE F1.c = Foobar.c
HAVING COUNT(*) > 1)
+ (SELECT 1
FROM Foobar AS F1
WHERE F1.d = Foobar.d
HAVING COUNT(*) > 1);

The duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user

May 31 '06 #4

This discussion thread is closed

Replies have been disabled for this discussion.