Bytes | Software Development & Data Engineering Community

VERY challenging question

Input: a table of 1.5 million user records with 4 nvarchar
fields: A, B, C, D.
The problem: there are many records with duplicate A's, or duplicate
B's, or duplicate A+B's, or duplicate B+C+D's, and so on. Mathematically
there are 16-1 = 15 possible field combinations on which two records can
match.

Aim: find the duplicates and filter them out, leaving only the unique
users that don't have ANY duplication.

We can do it with a simple SELECT query that logically checks for
duplication using an OR operator.
But it takes about 16 days on a very fast PC.
The DB is in SQL Server; converting it to Oracle might bring that down
to 8 days.

How can I do it in a few hours?
Remember that filtering the users first by parameter A, then by
parameter B, and so on will give a wrong final result, because it
loses the information about the users already filtered out: maybe on
parameter C they are equal to other users in the table...

THANK YOU

May 30 '06
Hi There,

Sorry for providing an incorrect answer earlier.
The correct one, which you may like to try, is:
-- Sample table holding the rows from the original post
CREATE TABLE myData
(
    a varchar(2),
    b varchar(2),
    c varchar(2),
    d varchar(2)
)

-- UNION ALL avoids the needless sort/de-duplication that UNION performs
INSERT INTO myData
SELECT 'a1', 'b1', 'c1', 'd1'
UNION ALL
SELECT 'a1', 'b2', 'c2', 'd2'
UNION ALL
SELECT 'a1', 'b1', 'c3', 'd3'
UNION ALL
SELECT 'a4', 'b4', 'c4', 'd3'
UNION ALL
SELECT 'a5', 'b5', 'c5', 'd5'
UNION ALL
SELECT 'a6', 'b6', 'c6', 'd3'

-- Add a surrogate key so that individual rows can be targeted
ALTER TABLE myData ADD myID int IDENTITY(1,1)
GO  -- new batch, so the statements below can see the myID column

SELECT * FROM myData

-- Delete every row whose A, B, C or D value occurs more than once
DELETE FROM myData WHERE myID IN
(
    SELECT myID FROM myData MA,
        (SELECT A FROM myData GROUP BY A HAVING COUNT(*) > 1) AA
    WHERE MA.A = AA.A
    UNION ALL
    SELECT myID FROM myData MB,
        (SELECT B FROM myData GROUP BY B HAVING COUNT(*) > 1) BB
    WHERE MB.B = BB.B
    UNION ALL
    SELECT myID FROM myData MC,
        (SELECT C FROM myData GROUP BY C HAVING COUNT(*) > 1) CC
    WHERE MC.C = CC.C
    UNION ALL
    SELECT myID FROM myData MD,
        (SELECT D FROM myData GROUP BY D HAVING COUNT(*) > 1) DD
    WHERE MD.D = DD.D
)

-- Only the row a5, b5, c5, d5 should remain
SELECT * FROM myData

With Warm regards
Jatinder Singh
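
An editorial sketch, not part of the original reply: the same filter can probably be written as a single SELECT that keeps only the rows unique in every column, without deleting anything. It assumes SQL Server 2005 or later (for `COUNT(*) OVER`) and the `myData` table created above; the `cntA` to `cntD` and `counted` names are made up for illustration.

```sql
-- Sketch only: assumes SQL Server 2005+ and the myData table from the post above.
-- Each window count says how many rows share that row's value in one column.
SELECT a, b, c, d
FROM (
    SELECT a, b, c, d,
           COUNT(*) OVER (PARTITION BY a) AS cntA,
           COUNT(*) OVER (PARTITION BY b) AS cntB,
           COUNT(*) OVER (PARTITION BY c) AS cntC,
           COUNT(*) OVER (PARTITION BY d) AS cntD
    FROM myData
) AS counted
-- Keep a row only if its value is unique in all four columns
WHERE cntA = 1 AND cntB = 1 AND cntC = 1 AND cntD = 1;
```

On the six sample rows this returns only a5, b5, c5, d5. Because every count is computed against the full table before any row is filtered, no similarity information is lost, which is the concern raised in the original question.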

Erland Sommarskog wrote:
groupy (li*******@gmai l.com) writes:
A  | B  | C  | D
---+----+----+---
a1 | b1 | c1 | d1
a1 | b2 | c2 | d2
a1 | b1 | c3 | d3
a4 | b4 | c4 | d3
a5 | b5 | c5 | d5
a6 | b6 | c6 | d3

The duplications are:
rows 1+2+3 on A
rows 1+3 on B
rows 3+4+6 on D
The only unique (in all params) row is 5.
Note: finding first that row 1 is similar to row 2 on A and deleting it
will lose information, because we WON'T know that row 1 is similar to
row 3 on B.
The same goes for the deletion of row 3: it will cause loss of data
regarding its similarity to row 4 on D.
The simple query for retrieving all duplicated rows, which consumes the
most time, is:
SELECT COUNT(*), A, B, C, D
FROM tbl
GROUP BY A, B, C, D
HAVING COUNT(*) > 1
It takes about 2 weeks on 1.5 million rows, where all fields are
nvarchars and the DB is in SQL Server.
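
Editorial note: the GROUP BY query quoted above only reports rows that are identical in all four columns, so it would not find the per-column duplications listed before it. As a rough sketch (not from the thread), those could be surfaced with one grouped query per column. It assumes the table is named `tbl` as in the quoted query; the `col`, `val` and `cnt` aliases are made up for illustration.

```sql
-- Sketch only: lists each value that occurs more than once in a single column.
SELECT 'A' AS col, A AS val, COUNT(*) AS cnt FROM tbl GROUP BY A HAVING COUNT(*) > 1
UNION ALL
SELECT 'B', B, COUNT(*) FROM tbl GROUP BY B HAVING COUNT(*) > 1
UNION ALL
SELECT 'C', C, COUNT(*) FROM tbl GROUP BY C HAVING COUNT(*) > 1
UNION ALL
SELECT 'D', D, COUNT(*) FROM tbl GROUP BY D HAVING COUNT(*) > 1;
```

For the six sample rows this yields three rows: (A, a1, 3), (B, b1, 2) and (D, d3, 3), matching the duplications listed by hand.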


I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.

Anyway, first index each column individually. Then try:

-- In a self-join, the DELETE target must be the alias of the copy to delete
DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
      a.C > b.C OR
      a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
      a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.D

After this operation, you still have the rows that have the same values
in all four columns. But it is not clear from your description whether
you have such duplicates. If you do, this may be the best approach:

ALTER TABLE tbl ADD ident int IDENTITY

-- In each self-join, keep the row with the lowest ident and delete the rest
DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident

ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.
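
Editorial sketch of what such a repro script might look like, built from the sample rows quoted earlier in this post. The column lengths, the NOT NULL constraints and the index names are assumptions, since the thread never shows the real table definition. The indexes correspond to the "index each column individually" advice above.

```sql
-- Hypothetical repro script; nvarchar(50), NOT NULL and the ix_ names are assumed.
CREATE TABLE tbl (
    A nvarchar(50) NOT NULL,
    B nvarchar(50) NOT NULL,
    C nvarchar(50) NOT NULL,
    D nvarchar(50) NOT NULL
);

-- Sample data from the original poster
INSERT INTO tbl (A, B, C, D) VALUES (N'a1', N'b1', N'c1', N'd1');
INSERT INTO tbl (A, B, C, D) VALUES (N'a1', N'b2', N'c2', N'd2');
INSERT INTO tbl (A, B, C, D) VALUES (N'a1', N'b1', N'c3', N'd3');
INSERT INTO tbl (A, B, C, D) VALUES (N'a4', N'b4', N'c4', N'd3');
INSERT INTO tbl (A, B, C, D) VALUES (N'a5', N'b5', N'c5', N'd5');
INSERT INTO tbl (A, B, C, D) VALUES (N'a6', N'b6', N'c6', N'd3');

-- One index per column, as suggested above
CREATE INDEX ix_tbl_A ON tbl (A);
CREATE INDEX ix_tbl_B ON tbl (B);
CREATE INDEX ix_tbl_C ON tbl (C);
CREATE INDEX ix_tbl_D ON tbl (D);

-- Desired result, per the original poster: only the row a5, b5, c5, d5 remains.
```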

--
Erland Sommarskog, SQL Server MVP, es****@sommarsk og.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx


May 31 '06 #11
Thank you all very much, I think I managed something.

May 31 '06 #12

How long does it take to run the query now?

Madhivanan
groupy wrote:
Thank you all very much, i think i managed something..


Jun 2 '06 #13
It takes about 2 hours...

Jun 2 '06 #14
