VERY challenging question

input: a table of 1.5 million records describing users, with 4 nvarchar
fields: A, B, C, D
the problem: there are many records with duplicate A's or duplicate
B's or duplicate A+B's or duplicate B+C+D's & so on. Mathematically
there are 16-1 = 15 possible column combinations for each duplication.

aim: find the duplicates & filter them out, leaving only the unique users
which don't have ANY duplication.

We can do it with a simple select query that logically checks for
duplication using an OR operator.
But it takes about 16 days on a very fast PC.
The DB is in SQL Server; converting it to Oracle might bring it down to
8 days.

How can I do it in a few hours?
Remember that filtering the users first by parameter A & then by
parameter B & so on will give a wrong final result, because it
loses the information about the users already filtered out - maybe in
parameter C they are equal to other users in the table...

THANK YOU

May 30 '06 #1
Moving it to Oracle won't buy you anything. Perhaps indexing on each of the
columns to be filtered will help you.

--
Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON Canada

May 30 '06 #2
If the table contains 1.5 million rows, and the query runs for 16 days,
then there must be something wrong with the query or with the table
setup (including indexes).

From your narrative I do not really understand what you are trying to
achieve. Please post DDL (including indexes), some sample data and the
results you are trying to achieve.

Gert-Jan

May 30 '06 #3
Post the SQL you have so far.

Also post the hardware specification you are using this on.

I regularly deal with queries that consume tables with multi-millions of
rows in seconds without problem. The size of your data looks to be around
190MBytes based on 4 columns of 25 characters; basically it's piddly.

Tony.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials

May 30 '06 #4
The description is vague, but it sounds like you should run:

SELECT userid, A, B, C, D, COUNT(*)
FROM tbl
GROUP BY userid, A, B, C, D
HAVING COUNT(*) > 1

While that will not run in a snap, it should not take 16 days for 1.5
million rows.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
May 30 '06 #5
On 30 May 2006 10:39:23 -0700, groupy wrote:

> input: 1.5 million records table consisting users with 4 nvarchar
> fields: A, B, C, D
> the problem: there are many records with duplicate A's or duplicate
> B's or duplicate A+B's or duplicate B+C+D's & so on. Mathematically
> there are 16-1 possibilities for each duplication.

Hi groupy,

No. Only four possibilities: duplicate A, duplicate B, duplicate C, and
duplicate D. Combinations are just a special case (you can only have a
duplicate A+B if you have both a duplicate A and a duplicate B - though
you can have duplicate A and duplicate B but no duplicate A+B).
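
A quick way to convince yourself of this is to brute-force all 15 column combinations on a small table and compare the result with the four single-column checks. Below is a minimal Python sketch of that comparison. The sample rows and the helper name dup_ids are made up for illustration, and it is meant as a sanity check on a handful of rows, not as something to run against the real table.

```python
from collections import Counter
from itertools import combinations


def dup_ids(rows, cols):
    """Ids of rows whose values in `cols`, taken together, occur in more than one row."""
    def key(row):
        return tuple(row[c] for c in cols)

    counts = Counter(key(row) for row in rows.values())
    return {rid for rid, row in rows.items() if counts[key(row)] > 1}


# Hypothetical sample: row id -> (A, B, C, D)
rows = {
    1: ("a1", "b1", "c1", "d1"),
    2: ("a1", "b2", "c2", "d2"),
    3: ("a1", "b1", "c3", "d3"),
    4: ("a4", "b4", "c4", "d3"),
    5: ("a5", "b5", "c5", "d5"),
}

# Rows flagged by the four single-column checks
single = set().union(*(dup_ids(rows, (c,)) for c in range(4)))

# Rows flagged by any of the 15 non-empty column combinations
every_combo = set().union(
    *(dup_ids(rows, combo)
      for n in range(1, 5)
      for combo in combinations(range(4), n))
)

print(sorted(single), sorted(every_combo))
```

Both sets come out the same, because any two rows that match on A+B necessarily match on A alone, so the combinations never flag a row that a single column has not flagged already.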
> aim: find the duplicates & filter them, leave only the unique users
> which don't have ANY duplication.

This specification is incorrect. For instance, with input like this:

num A B C D
--- --- --- --- ---
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a1 b2 c3 d3
4 a2 b1 c4 d4

there are two possible result sets, both containing two rows, that have
no duplicates anymore (1 + 2 or 3 + 4).

If the answer is "I don't care - any resultset without duplicates will
do", then the code below should run pretty fast:

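-- Staging table with the same four columns as the source table
CREATE TABLE #Temp
   (A nvarchar(25) NOT NULL,
    B nvarchar(25) NOT NULL,
    C nvarchar(25) NOT NULL,
    D nvarchar(25) NOT NULL)
go
-- One unique index per column. With IGNORE_DUP_KEY = ON, an insert that
-- would violate the index discards the offending row instead of failing.
CREATE UNIQUE INDEX x_A ON #Temp(A) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_B ON #Temp(B) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_C ON #Temp(C) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_D ON #Temp(D) WITH (IGNORE_DUP_KEY = ON)
go
-- Copy everything; rows repeating an A, B, C or D already present are dropped
INSERT INTO #Temp (A, B, C, D)
SELECT A, B, C, D
FROM   YourBigTable
-- Show results
SELECT * FROM #Temp
go
DROP TABLE #Temp
go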
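
The effect of the four IGNORE_DUP_KEY indexes can be mimicked outside the database to see why "any resultset without duplicates" is the right way to describe the outcome. The Python sketch below assumes rows are taken one at a time in a fixed order, which is a simplification (SQL Server makes no promise about which of the duplicate rows it keeps), and the function name keep_first is made up for illustration. It uses the four-row example from above.

```python
def keep_first(rows):
    """Keep a row only if none of its A, B, C, D values has been seen before."""
    seen = [set(), set(), set(), set()]  # one set per column, like one unique index each
    kept = []
    for num, *values in rows:
        if any(value in column for value, column in zip(values, seen)):
            continue  # would violate a unique index, so the row is discarded
        for value, column in zip(values, seen):
            column.add(value)
        kept.append(num)
    return kept


# (num, A, B, C, D), the four-row example from this post
rows = [
    (1, "a1", "b1", "c1", "d1"),
    (2, "a2", "b2", "c2", "d2"),
    (3, "a1", "b2", "c3", "d3"),
    (4, "a2", "b1", "c4", "d4"),
]

forward = keep_first(rows)         # rows arrive as 1, 2, 3, 4
backward = keep_first(rows[::-1])  # rows arrive as 4, 3, 2, 1
print(forward, backward)
```

Feeding the rows in the listed order keeps rows 1 and 2, while the reverse order keeps rows 4 and 3. Both results are free of duplicates, so which one you get depends purely on arrival order.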
--
Hugo Kornelis, SQL Server MVP
May 30 '06 #6
OK, let's take a look at a sample table representing the problem:

A | B | C | D
--------------------
a1 b1 c1 d1
a1 b2 c2 d2
a1 b1 c3 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

The duplications are:
rows 1+2+3 on A
rows 1+3 on B
rows 3+4+6 on D
The only row that is unique (in all params) is row 5.
Note: first finding that row 1 is similar to row 2 on A & deleting it will
lose information, because we WON'T know whether row 1 is similar to row 3 on B.
The same goes for the deletion of row 3: it will lose the data
regarding its similarity to row 4 on D.
The simple query for retrieving all duplicated rows, which consumes the most
time, is:
SELECT COUNT(*), A, B, C, D
FROM tbl
GROUP BY A, B, C, D
HAVING COUNT(*) > 1
It takes about 2 weeks on 1.5 million rows, with all fields being
nvarchars & the DB in SQL Server.

THANK YOU ALL
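
For the sample above, "unique in all params" can be expressed as four per-column tests inside a single statement, so every test sees the complete table and no information is lost between steps. The sketch below runs that idea through Python's built-in sqlite3 module on an in-memory database. It is an illustration under assumptions: SQLite stands in for SQL Server, TEXT stands in for nvarchar, and it says nothing about how the query performs on 1.5 million rows.

```python
import sqlite3

# The six sample rows (A, B, C, D) from the post
sample = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b2", "c2", "d2"),
    ("a1", "b1", "c3", "d3"),
    ("a4", "b4", "c4", "d3"),
    ("a5", "b5", "c5", "d5"),
    ("a6", "b6", "c6", "d3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (A TEXT, B TEXT, C TEXT, D TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", sample)

# Keep a row only if each of its four values occurs exactly once in its column.
unique_rows = conn.execute("""
    SELECT A, B, C, D
    FROM tbl
    WHERE A IN (SELECT A FROM tbl GROUP BY A HAVING COUNT(*) = 1)
      AND B IN (SELECT B FROM tbl GROUP BY B HAVING COUNT(*) = 1)
      AND C IN (SELECT C FROM tbl GROUP BY C HAVING COUNT(*) = 1)
      AND D IN (SELECT D FROM tbl GROUP BY D HAVING COUNT(*) = 1)
""").fetchall()

print(unique_rows)
conn.close()
```

On this sample the only row returned is (a5, b5, c5, d5), matching the expected answer of row 5. Each subquery groups on one column only, which is the kind of work a single-column index can support.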

May 31 '06 #7
What's your hardware?

Please post the CREATE TABLE with any indexes for your schema.

What version of SQL Server?

--
Tony Rogerson
SQL Server MVP

May 31 '06 #8
Hi There,

If there is no identity column, then we may use:
SELECT IDENTITY(int, 1, 1) AS myid, A, B, C, D INTO tmpTable FROM BASETABLE;
CREATE INDEX tmpA ON tmpTable(A, myid);
CREATE INDEX tmpB ON tmpTable(B, myid);
CREATE INDEX tmpC ON tmpTable(C, myid);
CREATE INDEX tmpD ON tmpTable(D, myid);

Assuming that there is a column myid which is monotonically increasing
and there are as many covering indexes as there are columns, the query
can become like this:
DELETE FROM tmpTable WHERE myid IN
(
-- ids of all rows whose A, B, C or D value occurs more than once
SELECT myid FROM tmpTable WHERE A IN (SELECT A FROM tmpTable GROUP BY A HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE B IN (SELECT B FROM tmpTable GROUP BY B HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE C IN (SELECT C FROM tmpTable GROUP BY C HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE D IN (SELECT D FROM tmpTable GROUP BY D HAVING COUNT(*) > 1)
);

Hope this serves the purpose.
With Warm regards
Jatinder Singh

May 31 '06 #9
groupy (li*******@gmail.com) writes:

> The simple query for retrieving all duplicated rows, which consumes the most
> time, is:
> SELECT COUNT(*), A, B, C, D
> FROM tbl
> GROUP BY A, B, C, D
> HAVING COUNT(*) > 1
> It takes about 2 weeks on 1.5 million rows, with all fields being
> nvarchars & the DB in SQL Server.

I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.

Anyway, first index each column individually. Then try:

DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
a.C > b.C OR
a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.D

After this operation, you still have the rows that have the same values
in all four columns. But it is not clear from your description whether you
have such duplicates. If you do, this is maybe the best:

ALTER TABLE tbl ADD ident int IDENTITY

DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident

ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

May 31 '06 #10
