Input: a table of 1.5 million user records with 4 nvarchar
fields: A, B, C, D.
The problem: many records have duplicate A's, or duplicate
B's, or duplicate A+B's, or duplicate B+C+D's, and so on. Mathematically
there are 16-1 = 15 possible column combinations for each duplication.
Aim: find the duplicates and filter them out, leaving only the unique users
which don't have ANY duplication.
We can do it with a simple select query that logically checks for
duplication using an OR operator.
But it takes about 16 days on a very fast PC.
The DB is in SQL Server; converting it to Oracle might bring that down to
8 days.
How can I do it in a few hours?
Remember that filtering the users first by parameter A, then by
parameter B, and so on will give a wrong final result, because it
loses the information about the users already filtered out: maybe in
parameter C they are equal to other users in the table...
THANK YOU
Moving it to Oracle won't buy you anything. Perhaps indexing on each of the
columns to be filtered will help you.
--
Tom
----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON Canada
If the table contains 1.5 million rows, and the query runs for 16 days,
then there must be something wrong with the query or with the table
setup (including indexes).
From your narrative I do not really understand what you are trying to
achieve. Please post DDL (including indexes), some sample data and the
results you are trying to achieve.
Gert-Jan
Post the SQL you have so far.
Also post the hardware specification you are using this on.
I regularly deal with queries that consume tables with multi-millions of
rows in seconds without problem. The size of your data looks to be around
190MBytes based on 4 columns of 25 characters; basically it's piddly.
Tony.
--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL Server Consultant
http://sqlserverfaq.com - free video tutorials
groupy (li*******@gmai l.com) writes:
> We can do it by a simple select query that logically checks the
> duplication in an OR operator. But it takes about 16 days on a very fast PC.
The description is vague, but sounds like you should run:
SELECT userid, A, B, C, D, COUNT(*)
FROM tbl
GROUP BY userid, A, B, C, D
HAVING COUNT(*) >1
While that will not run in a snap, it should not take 16 days for 1.5
million rows.
--
Erland Sommarskog, SQL Server MVP, es****@sommarsk og.se
Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinf...ons/books.mspx
On 30 May 2006 10:39:23 -0700, groupy wrote:
> input: 1.5 million records table consisting users with 4 nvarchar fields: A, B, C, D
> the problem: there are many records with duplicate A's or duplicate B's or
> duplicate A+B's or duplicate B+C+D's & so on. Mathematically there are 16-1
> possibilities for each duplication.
Hi groupy,
No. Only four possibilities: duplicate A, duplicate B, duplicate C, and
duplicate D. Combinations are just a special case (you can only have a
duplicate A+B if you have both a duplicate A and a duplicate B - though
you can have duplicate A and duplicate B but no duplicate A+B).
> aim: find the duplicates & filter them, leave only the unique users which
> don't have ANY duplication.
This specification is incorrect. For instance, with the input like this:
num  A    B    C    D
---  ---  ---  ---  ---
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a1   b2   c3   d3
4    a2   b1   c4   d4
there are two possible result sets, both containing two rows, that have
no duplicates anymore (1 + 2 or 3 + 4).
If the answer is "I don't care - any resultset without duplicates will
do", then the code below should run pretty fast:
CREATE TABLE #Temp
(A nvarchar(25) NOT NULL,
B nvarchar(25) NOT NULL,
C nvarchar(25) NOT NULL,
D nvarchar(25) NOT NULL)
go
CREATE UNIQUE INDEX x_A ON #Temp(A) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_B ON #Temp(B) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_C ON #Temp(C) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_D ON #Temp(D) WITH (IGNORE_DUP_KEY = ON)
go
INSERT INTO #Temp (A, B, C, D)
SELECT A, B, C, D
FROM YourBigTable
-- Show results
SELECT * FROM #Temp
go
DROP TABLE #Temp
go
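The order dependence described above can be tried out on the four-row example. The following is only a sketch, not the SQL Server script itself: it uses Python's sqlite3 module, and SQLite's INSERT OR IGNORE on unique indexes stands in for IGNORE_DUP_KEY (an assumption made for illustration; the table name t and the helper keep_first_seen are made up here). Note also that SQL Server does not promise any particular insert order for INSERT ... SELECT without further measures, whereas this sketch fixes the order explicitly.

```python
import sqlite3

def keep_first_seen(rows):
    """Insert rows in the given order, silently skipping any row that
    repeats an A, B, C or D value already present in the table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (num INTEGER, A TEXT, B TEXT, C TEXT, D TEXT)")
    # One unique index per column, mirroring x_A .. x_D in the post.
    for col in "ABCD":
        con.execute(f"CREATE UNIQUE INDEX x_{col} ON t({col})")
    # OR IGNORE drops a row that violates any unique index (SQLite's
    # stand-in for IGNORE_DUP_KEY in this sketch).
    con.executemany("INSERT OR IGNORE INTO t VALUES (?, ?, ?, ?, ?)", rows)
    kept = [r[0] for r in con.execute("SELECT num FROM t ORDER BY num")]
    con.close()
    return kept

rows = [
    (1, "a1", "b1", "c1", "d1"),
    (2, "a2", "b2", "c2", "d2"),
    (3, "a1", "b2", "c3", "d3"),
    (4, "a2", "b1", "c4", "d4"),
]
print(keep_first_seen(rows))        # [1, 2]
print(keep_first_seen(rows[::-1]))  # [3, 4]
```

Both results are free of duplicates, which is the "any resultset will do" case; which rows survive depends only on the order in which they arrive.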
--
Hugo Kornelis, SQL Server MVP
ok, let's take a look at a sample table representing the problem:
A | B | C | D
--------------------
a1 b1 c1 d1
a1 b2 c2 d2
a1 b1 c3 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3
The duplications are:
rows 1+2+3 on A
rows 1+3 on B
rows 3+4+6 on D
The only row that is unique (in all params) is row 5.
Note: first finding that row 1 is similar to row 2 on A and deleting it will
lose information, because we WON'T know whether row 1 is similar to row 3 on B.
The same goes for the deletion of row 3: it will cause loss of data
regarding its similarity to row 4 on D.
The simple query for retrieving all duplicated rows, which consumes most of
the time, is:
SELECT COUNT(*), A, B, C, D
FROM tbl
GROUP BY A, B, C, D
HAVING COUNT(*) > 1
It takes about 2 weeks on 1.5 million rows, where all fields are
nvarchars and the DB is in SQL Server.
THANK YOU ALL
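To pin down the expected result on the sample above, here is a minimal sketch in plain Python, outside the database. It assumes the goal is "keep a row only if each of its four values occurs exactly once in its column"; the function name fully_unique and the row-number field are made up for illustration. It needs one pass to count values per column and one pass to filter, so no row is deleted before all columns have been checked.

```python
from collections import Counter

# Sample rows from the post: (row number, A, B, C, D).
rows = [
    (1, "a1", "b1", "c1", "d1"),
    (2, "a1", "b2", "c2", "d2"),
    (3, "a1", "b1", "c3", "d3"),
    (4, "a4", "b4", "c4", "d3"),
    (5, "a5", "b5", "c5", "d5"),
    (6, "a6", "b6", "c6", "d3"),
]

def fully_unique(rows):
    """Return the rows whose value in every column occurs exactly once."""
    if not rows:
        return []
    ncols = len(rows[0]) - 1  # first field is the row number
    # Pass 1: count how often each value appears, per column.
    counts = [Counter(r[i + 1] for r in rows) for i in range(ncols)]
    # Pass 2: keep a row only if all of its values have a count of 1.
    return [r for r in rows
            if all(counts[i][r[i + 1]] == 1 for i in range(ncols))]

print([r[0] for r in fully_unique(rows)])  # [5]
```

Because counting happens before any filtering, the information about row 1 matching row 3 on B, or row 3 matching row 4 on D, is never lost.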
What's your hardware?
Please post the CREATE TABLE with any indexes for your schema.
What version of SQL Server?
--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL Server Consultant
http://sqlserverfaq.com - free video tutorials
Hi There,
If there is no identity column, then we can create one:
SELECT IDENTITY(int, 1, 1) AS myid, A, B, C, D
INTO tmpTable
FROM BASETABLE;
CREATE INDEX tmpA ON tmpTable(A, myid);
CREATE INDEX tmpB ON tmpTable(B, myid);
CREATE INDEX tmpC ON tmpTable(C, myid);
CREATE INDEX tmpD ON tmpTable(D, myid);
Assuming that there is a column (myid here) which is monotonically increasing
and that there are as many covering indexes as there are columns, the query
can become something like this:
DELETE FROM tmpTable WHERE myid IN
(
SELECT myid FROM tmpTable WHERE A IN (SELECT A FROM tmpTable GROUP BY A HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE B IN (SELECT B FROM tmpTable GROUP BY B HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE C IN (SELECT C FROM tmpTable GROUP BY C HAVING COUNT(*) > 1)
UNION ALL
SELECT myid FROM tmpTable WHERE D IN (SELECT D FROM tmpTable GROUP BY D HAVING COUNT(*) > 1)
);
Hope this serves the purpose.
With Warm regards
Jatinder Singh
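A delete along these lines can be tried on groupy's six-row sample. The following is a rough sketch using Python's sqlite3 module as a stand-in for SQL Server, so AUTOINCREMENT replaces IDENTITY and the types are TEXT; those substitutions and the helper name delete_duplicated are assumptions for illustration. Each branch is written as "WHERE col IN (SELECT col ... GROUP BY col HAVING COUNT(*) > 1)" so that the ids of all rows sharing a repeated value are collected before anything is deleted.

```python
import sqlite3

def delete_duplicated(rows):
    """Load rows into a scratch table, delete every row that shares an
    A, B, C or D value with another row, and return the surviving ids."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tmpTable ("
        "myid INTEGER PRIMARY KEY AUTOINCREMENT, A TEXT, B TEXT, C TEXT, D TEXT)"
    )
    con.executemany("INSERT INTO tmpTable (A, B, C, D) VALUES (?, ?, ?, ?)", rows)
    for col in "ABCD":
        con.execute(f"CREATE INDEX tmp{col} ON tmpTable({col}, myid)")
    # One branch per column: ids of rows whose value in that column repeats.
    branches = " UNION ALL ".join(
        f"SELECT myid FROM tmpTable WHERE {col} IN "
        f"(SELECT {col} FROM tmpTable GROUP BY {col} HAVING COUNT(*) > 1)"
        for col in "ABCD"
    )
    # All four branches are evaluated against the full table in one statement.
    con.execute(f"DELETE FROM tmpTable WHERE myid IN ({branches})")
    survivors = [r[0] for r in con.execute("SELECT myid FROM tmpTable ORDER BY myid")]
    con.close()
    return survivors

sample = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b2", "c2", "d2"),
    ("a1", "b1", "c3", "d3"),
    ("a4", "b4", "c4", "d3"),
    ("a5", "b5", "c5", "d5"),
    ("a6", "b6", "c6", "d3"),
]
print(delete_duplicated(sample))  # [5]
```

On the sample only row 5 survives, which matches the result groupy describes. How this performs on 1.5 million rows in SQL Server is not something the sketch can show.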
groupy (li*******@gmai l.com) writes:
> The Simple query for retrieving all duplicated rows which consumes most time is:
> SELECT COUNT(*), A, B, C, D FROM tbl GROUP BY A, B, C, D HAVING COUNT(*) > 1
> It takes about 2 weeks on 1.5 million rows, while all fields are nvarchars
> & the DB is in SQL-Server
I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.
Anyway, first index each column individually. Then try:
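-- Rows sharing a value in A: delete a when it compares greater than b on B, C or D.
DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
      a.C > b.C OR
      a.D > b.D
-- Rows sharing a value in B.
DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
      a.D > b.D
-- Rows sharing a value in C.
DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.D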
After this operation, you still have the rows that have the same values
in all four columns. But it is not clear from your description whether you
have such duplicates. If you do, this may be the best approach:
ALTER TABLE tbl ADD ident int IDENTITY
go
-- For each shared value, delete the rows with the higher ident.
DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident
DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident
DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident
ALTER TABLE tbl DROP COLUMN ident
Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:
o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.
--
Erland Sommarskog, SQL Server MVP, es****@sommarsk og.se
Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinf...ons/books.mspx