Bytes IT Community

Need help speeding up query

The following query needs about 2 minutes to complete (finding dupes)
on a table of about 10000 addresses. Does anyone have an idea on how
to speed this up?

Thanks in advance !!!

Sebastian

Select
Top 1000 *
From
addresses ab1
Where
(
Select Count(*) From addresses base ab2 Where
(
(
(ab2.LastName = ab1.LastName And Ltrim(RTrim(ab1.LastName)) != '' )
Or
(ab2.Company = ab1.Company And (Ltrim(RTrim(ab1.Company)) != '') )
)
And
(
ab2.ZipCode = ab1.ZipCode
Or
ab1.ZipCode = ''
)
)
And ab2.Ad_Id != ab1.Ad_Id
) >= 1
Order By
LastName, FirstName
Jul 23 '05 #1
2 Replies

On 14 Feb 2005 04:06:13 -0800, Sebastian wrote:
> The following query needs about 2 minutes to complete (finding dupes)
> on a table of about 10000 addresses. Does anyone have an idea on how
> to speed this up ?

Hi Sebastian,

I hope you made a mistake while copying the query. As posted, it should
return an error message within milliseconds:

Select Count(*) From addresses base ab2 Where
                               ^^^^^^^^

A table can have a maximum of one alias, never two.

A quick win in this case is to replace the test for COUNT(*) >= 1 with a
test for EXISTS. With COUNT(*), SQL Server will go on to find a second,
third, etc., match after finding the first; with EXISTS it won't.
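
As a rough sketch, the posted query rewritten with EXISTS might look like the following. It assumes the stray "base" in the original was a copy error and that the table is simply addresses; the duplicate logic itself is left unchanged.

```sql
-- Sketch only: same duplicate test as the posted query, but with EXISTS
-- so the search can stop at the first matching row.
-- Assumption: "addresses base ab2" was meant to be "addresses ab2".
SELECT TOP 1000 ab1.*
FROM addresses AS ab1
WHERE EXISTS (
    SELECT *
    FROM addresses AS ab2
    WHERE (
              (ab2.LastName = ab1.LastName AND LTRIM(RTRIM(ab1.LastName)) <> '')
           OR (ab2.Company  = ab1.Company  AND LTRIM(RTRIM(ab1.Company))  <> '')
          )
      AND (ab2.ZipCode = ab1.ZipCode OR ab1.ZipCode = '')
      AND ab2.Ad_Id <> ab1.Ad_Id
)
ORDER BY ab1.LastName, ab1.FirstName;
```

The SELECT * inside EXISTS is harmless, because EXISTS only checks whether a row exists and never returns its columns.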

Another quick win is to not use SELECT *, but specify a column list. You
may be lucky and have a covering index that can be used to speed up the
query if you don't show all columns.
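
To illustrate the idea, here is a simplified version of the duplicate test (LastName and ZipCode only) together with an index that would cover it. The index name and key order are hypothetical, and whether the optimizer actually uses such an index depends on your data and column sizes.

```sql
-- Hypothetical index: name and key order are assumptions, not from the thread.
CREATE INDEX IX_addresses_dupes ON addresses (LastName, ZipCode, Ad_Id);

-- Only columns that are part of the index are requested, so the index
-- alone could in principle answer the query without touching the table.
SELECT TOP 1000 ab1.Ad_Id, ab1.LastName, ab1.ZipCode
FROM addresses AS ab1
WHERE EXISTS (
    SELECT *
    FROM addresses AS ab2
    WHERE ab2.LastName = ab1.LastName
      AND ab2.ZipCode  = ab1.ZipCode
      AND ab2.Ad_Id   <> ab1.Ad_Id
)
ORDER BY ab1.LastName;
```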

Why are you using things like "Ltrim(RTrim(ab1.LastName)) != ''"? Do you
mean to say that your LastName column might contain not only empty strings
but also strings of spaces? Why don't you use NULL to represent missing
data? That is exactly what NULL was invented for.

From your query, I get the impression that each row in your table has
exactly one of LastName and Company filled; the other column is always an
empty string or some spaces. If you had used NULLs, you could now simply
have written "ab2.LastName = ab1.LastName OR ab2.Company = ab1.Company".
Not necessarily faster (though certainly not slower), but a lot more
readable!
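
A minimal sketch of that approach, assuming LastName and Company allow NULL (the thread does not show the table definition) and that blank values really do mean "missing":

```sql
-- One-off clean-up: turn empty or all-space values into NULL.
-- Assumption: both columns are nullable.
UPDATE addresses SET LastName = NULL WHERE LTRIM(RTRIM(LastName)) = '';
UPDATE addresses SET Company  = NULL WHERE LTRIM(RTRIM(Company))  = '';

-- A comparison with NULL is never true, so a row with a missing LastName
-- or Company cannot match on that column and the trim tests disappear.
SELECT TOP 1000 ab1.*
FROM addresses AS ab1
WHERE EXISTS (
    SELECT *
    FROM addresses AS ab2
    WHERE (ab2.LastName = ab1.LastName OR ab2.Company = ab1.Company)
      AND (ab2.ZipCode = ab1.ZipCode OR ab1.ZipCode = '')
      AND ab2.Ad_Id <> ab1.Ad_Id
)
ORDER BY ab1.LastName, ab1.FirstName;
```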

This code:
And
(
ab2.ZipCode = ab1.ZipCode
Or
ab1.ZipCode = ''
)

will result in ANY zip code from ab2 being considered a match if the zip
code in ab1 is blank. Are you sure that is what you want? If you want a
blank zip code in ab1 to match only blank zip codes in ab2, reduce this to
AND ab2.ZipCode = ab1.ZipCode
Not only shorter and easier, but probably quicker as well.

For more help, you'll have to post more information: the structure of your
table (as CREATE TABLE statement, with irrelevant columns omitted, but all
constraints and properties included - and don't forget to include indexes
as well), some sample data (as INSERT statements) to illustrate how your
data looks and the output you expect to get from that sample data. Plus a
description of what you consider to be a duplicate, as your query
indicates that your definition is not trivial.
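
For example, a repro script along these lines would do. The column names are taken from your query; the data types, lengths, key and sample rows are made up purely for illustration.

```sql
-- Column names come from the posted query; types, lengths and the
-- primary key are assumptions.
CREATE TABLE addresses (
    Ad_Id     int          NOT NULL PRIMARY KEY,
    LastName  varchar(50)  NOT NULL,
    FirstName varchar(50)  NOT NULL,
    Company   varchar(100) NOT NULL,
    ZipCode   varchar(10)  NOT NULL
);

-- Invented sample rows, one INSERT per row.
INSERT INTO addresses (Ad_Id, LastName, FirstName, Company, ZipCode)
VALUES (1, 'Smith', 'Anna', '', '12345');
INSERT INTO addresses (Ad_Id, LastName, FirstName, Company, ZipCode)
VALUES (2, 'Smith', 'A.', '', '12345');
INSERT INTO addresses (Ad_Id, LastName, FirstName, Company, ZipCode)
VALUES (3, '', '', 'Acme Ltd', '');
INSERT INTO addresses (Ad_Id, LastName, FirstName, Company, ZipCode)
VALUES (4, '', '', 'Acme Ltd', '54321');
```

With the logic as posted, rows 1 and 2 match each other, and row 3 (blank zip code) matches row 4, but row 4 does not match row 3. That asymmetry is the kind of thing the expected output should make explicit.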

Best, Hugo
Jul 23 '05 #2

I'll have a go at it. Try this:

Select
Top 1000 *
From (
SELECT ab1.*
FROM addresses ab1
INNER JOIN (
SELECT LastName,ZipCode
FROM Addresses
WHERE LastName > Space(100)
AND Zipcode <> ''
GROUP BY LastName,ZipCode
HAVING COUNT(*)>1
) ab2
ON ab1.LastName=ab2.LastName
AND ab1.ZipCode =ab2.ZipCode

UNION ALL

SELECT ab1.*
FROM addresses ab1
INNER JOIN (
SELECT LastName
FROM Addresses
WHERE LastName > Space(100)
GROUP BY LastName
HAVING COUNT(*)>1
AND MIN(ZipCode)=''
) ab2
ON ab1.LastName=ab2.LastName
AND ab1.ZipCode =''

UNION ALL

SELECT ab1.*
FROM addresses ab1
INNER JOIN (
SELECT Company,ZipCode
FROM Addresses
WHERE Company > Space(100)
AND Zipcode <> ''
GROUP BY Company,ZipCode
HAVING COUNT(*)>1
AND MIN(LastName) < MAX(LastName)
) ab2
ON ab1.Company=ab2.Company
AND ab1.ZipCode=ab2.ZipCode

UNION ALL

SELECT ab1.*
FROM addresses ab1
INNER JOIN (
SELECT Company
FROM Addresses
WHERE Company > Space(100)
GROUP BY Company
HAVING COUNT(*)>1
AND MIN(LastName) < MAX(LastName)
AND MIN(ZipCode)=''
) ab2
ON ab1.Company=ab2.Company
AND ab1.ZipCode=''
) X
Order By
LastName, FirstName

Note that the predicate "AND MIN(LastName) < MAX(LastName)" tries to
avoid reporting the same rows twice, once as a LastName duplicate and once
as a Company duplicate. However, it may cause a Company duplicate to be
missed when LastName duplicates exist for the same ZipCode.

Of course, if you are using TOP 1000 just to get the first 1000
duplicates (and not all duplicates), then you can also do something like
this:

Declare @Count int, @Remaining int
Set @Count=0

SELECT TOP 1000 ab1.*
FROM addresses ab1
INNER JOIN (
SELECT LastName,ZipCode
FROM Addresses
WHERE LastName > Space(100)
AND Zipcode <> ''
GROUP BY LastName,ZipCode
HAVING COUNT(*)>1
) ab2
ON ab1.LastName=ab2.LastName
AND ab1.ZipCode =ab2.ZipCode
ORDER BY ab1.LastName, ab1.FirstName

Set @Count=@Count+@@rowcount
If @Count < 1000
Begin
-- SET ROWCOUNT takes a constant or a variable, not an expression
Set @Remaining=1000-@Count
SET ROWCOUNT @Remaining

SELECT TOP 1000 ab1.*
FROM addresses ab1
INNER JOIN (
SELECT LastName
FROM Addresses
WHERE LastName > Space(100)
GROUP BY LastName
HAVING COUNT(*)>1
AND MIN(ZipCode)=''
) ab2
ON ab1.LastName=ab2.LastName
AND ab1.ZipCode =''
ORDER BY ab1.LastName, ab1.FirstName

Set @Count=@Count+@@rowcount
End

If @Count < 1000
Begin
Set @Remaining=1000-@Count
SET ROWCOUNT @Remaining

SELECT TOP 1000 ab1.*
FROM addresses ab1
INNER JOIN (
SELECT Company,ZipCode
FROM Addresses
WHERE Company > Space(100)
AND Zipcode <> ''
GROUP BY Company,ZipCode
HAVING COUNT(*)>1
AND MIN(LastName) < MAX(LastName)
) ab2
ON ab1.Company=ab2.Company
AND ab1.ZipCode=ab2.ZipCode
ORDER BY ab1.LastName, ab1.FirstName

Set @Count=@Count+@@rowcount
End

If @Count < 1000
Begin
Set @Remaining=1000-@Count
SET ROWCOUNT @Remaining

SELECT TOP 1000 ab1.*
FROM addresses ab1
INNER JOIN (
SELECT Company
FROM Addresses
WHERE Company > Space(100)
GROUP BY Company
HAVING COUNT(*)>1
AND MIN(LastName) < MAX(LastName)
AND MIN(ZipCode)=''
) ab2
ON ab1.Company=ab2.Company
AND ab1.ZipCode=''
ORDER BY ab1.LastName, ab1.FirstName
End
-- Reset the row limit for the rest of the session
SET ROWCOUNT 0

Other notes:
- The predicate "ab2.Ad_Id != ab1.Ad_Id" uses proprietary syntax. The
ANSI-SQL syntax is "ab2.Ad_Id <> ab1.Ad_Id".
- When comparing with an empty string, there is no need to apply both
trim functions, so "Ltrim(RTrim(ab1.Company)) != ''" can be simplified to
"RTrim(ab1.Company) <> ''". In the queries above, it is written as
"Company > Space(100)" instead, because that form is a usable search
argument for the optimizer.
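
To make the last point concrete, the two forms can be compared side by side. This is only a sketch: it assumes an index on Company exists, and the two predicates are interchangeable only if no Company value starts with a character that sorts below a space.

```sql
-- The function call hides the column from the optimizer, so an index
-- on Company generally cannot be used for a seek here.
SELECT Ad_Id, Company
FROM addresses
WHERE RTRIM(Company) <> '';

-- The bare column on one side of the comparison can be used as a
-- search argument, so an index on Company (if there is one) can be considered.
SELECT Ad_Id, Company
FROM addresses
WHERE Company > SPACE(100);
```

Comparing the execution plans of both statements on your own table will show whether the rewrite makes a difference.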

Hope this helps,
Gert-Jan
Jul 23 '05 #3
