SQL to find short, fat and stupid people

Justin Hoffman

This is a question concerning query optimisation. Sorry if it's a bit long,
but thanks to anyone who has the patience to help - This is my first post
here...
If I have two tables: 'tblContact' and 'tblCategory' where categories are
like:
Code Name
010101 Short
010102 Fat
010103 Stupid

The junction table 'tblConCat' has fields CctConID, CctCatCode to tell me
which contacts have which category codes. These are nchar(6) fields, if it
makes any difference.

If I need to find all people who are short, fat and stupid I can see two
ways:

Solution One:
SELECT tblContact.* FROM tblContact WHERE
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010101') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010102') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010103')
Solution Two:
Build a helper table which contains the codes I'm looking for

SELECT tblContact.* FROM tblContact
WHERE ConID IN
(SELECT CctConID
FROM tblConCat INNER JOIN tblHelper
ON tblConCat.CctCatCode = tblHelper.HlpCatCode
GROUP BY CctConID HAVING Count(*)=3)
I have tried them both out and although I looked at the query analyzer it
provided more information than I knew what to do with. In practise they
both provide similar working times (I wait about a second) but I thought the
first looked rather inefficient and thought my helper table might help.
There are about 30,000 contact records, 180 category records and 120,000
junction table records.

All I am looking for comments on any pros and cons of these two approaches -
does one look that bad? The database was migrated from Access where the
helper table really did help, but using SQL Server I might not need it.

Thanks again, if you got this far!

Jul 23 '05 #1

Subscribe Post Reply

2052

Dan Guzman

> I have tried them both out and although I looked at the query analyzer it

provided more information than I knew what to do with. In practise they
both provide similar working times
You can compare the estimated costs by running your queries at the same time
in Query Analyzer with the show execution plan option on. However, I would
still choose the query with the lowest actual execution time. All things
being equal, select the simplest approach.

BTW, make sure you have a unique index on tblConCat (CctConID and
CctCatCode). A primary key, unique constraint or unique index will provide
this.

Untested Solution Three (there are many more):

SELECT tblContact.*
FROM tblContact c
JOIN tblConCat cc1 ON
c.ConID = cc1.CctConID AND
CctCatCode='010101'
JOIN tblConCat cc2 ON
c.ConID = cc2.CctConID AND
CctCatCode='010102'
JOIN tblConCat cc3 ON
c.ConID = cc3.CctConID AND
CctCatCode='010103'

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Justin Hoffman" <j@b.com> wrote in message
news:d4**********@nwrdmz03.dmz.ncs.ea.ibs-infra.bt.com...
This is a question concerning query optimisation. Sorry if it's a bit
long, but thanks to anyone who has the patience to help - This is my first
post here...
If I have two tables: 'tblContact' and 'tblCategory' where categories are
like:
Code Name
010101 Short
010102 Fat
010103 Stupid

The junction table 'tblConCat' has fields CctConID, CctCatCode to tell me
which contacts have which category codes. These are nchar(6) fields, if
it makes any difference.

If I need to find all people who are short, fat and stupid I can see two
ways:

Solution One:
SELECT tblContact.* FROM tblContact WHERE
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010101') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010102') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010103')
Solution Two:
Build a helper table which contains the codes I'm looking for

SELECT tblContact.* FROM tblContact
WHERE ConID IN
(SELECT CctConID
FROM tblConCat INNER JOIN tblHelper
ON tblConCat.CctCatCode = tblHelper.HlpCatCode
GROUP BY CctConID HAVING Count(*)=3)
I have tried them both out and although I looked at the query analyzer it
provided more information than I knew what to do with. In practise they
both provide similar working times (I wait about a second) but I thought
the first looked rather inefficient and thought my helper table might
help. There are about 30,000 contact records, 180 category records and
120,000 junction table records.

All I am looking for comments on any pros and cons of these two
approaches - does one look that bad? The database was migrated from
Access where the helper table really did help, but using SQL Server I
might not need it.

Thanks again, if you got this far!

Jul 23 '05 #2

Justin Hoffman

"Dan Guzman" <gu******@nospam-online.sbcglobal.net> wrote in message
news:vb****************@newssvr11.news.prodigy.com ...

I have tried them both out and although I looked at the query analyzer it
provided more information than I knew what to do with. In practise they
both provide similar working times

You can compare the estimated costs by running your queries at the same
time in Query Analyzer with the show execution plan option on. However, I
would still choose the query with the lowest actual execution time. All
things being equal, select the simplest approach.

BTW, make sure you have a unique index on tblConCat (CctConID and
CctCatCode). A primary key, unique constraint or unique index will
provide this.

Untested Solution Three (there are many more):

SELECT tblContact.*
FROM tblContact c
JOIN tblConCat cc1 ON
c.ConID = cc1.CctConID AND
CctCatCode='010101'
JOIN tblConCat cc2 ON
c.ConID = cc2.CctConID AND
CctCatCode='010102'
JOIN tblConCat cc3 ON
c.ConID = cc3.CctConID AND
CctCatCode='010103'

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Justin Hoffman" <j@b.com> wrote in message
news:d4**********@nwrdmz03.dmz.ncs.ea.ibs-infra.bt.com...

This is a question concerning query optimisation. Sorry if it's a bit
long, but thanks to anyone who has the patience to help - This is my
first post here...
If I have two tables: 'tblContact' and 'tblCategory' where categories are
like:
Code Name
010101 Short
010102 Fat
010103 Stupid

The junction table 'tblConCat' has fields CctConID, CctCatCode to tell me
which contacts have which category codes. These are nchar(6) fields, if
it makes any difference.

If I need to find all people who are short, fat and stupid I can see two
ways:

Solution One:
SELECT tblContact.* FROM tblContact WHERE
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010101') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010102') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010103')
Solution Two:
Build a helper table which contains the codes I'm looking for

SELECT tblContact.* FROM tblContact
WHERE ConID IN
(SELECT CctConID
FROM tblConCat INNER JOIN tblHelper
ON tblConCat.CctCatCode = tblHelper.HlpCatCode
GROUP BY CctConID HAVING Count(*)=3)
I have tried them both out and although I looked at the query analyzer it
provided more information than I knew what to do with. In practise they
both provide similar working times (I wait about a second) but I thought
the first looked rather inefficient and thought my helper table might
help. There are about 30,000 contact records, 180 category records and
120,000 junction table records.

All I am looking for comments on any pros and cons of these two
approaches - does one look that bad? The database was migrated from
Access where the helper table really did help, but using SQL Server I
might not need it.

Thanks again, if you got this far!

Hi Dan
Thank you for your comments. The SQL needed a couple of adjustments: the
CctCatCode fields needed a table prefix to make them unambiguous and also
(probable typo) I need (select c.*) rather than (select tblContact.*) So
the final version is:

SELECT c.*
FROM tblContact c
JOIN tblConCat cc1 ON
c.ConID = cc1.CctConID AND
cc1.CctCatCode='010101'
JOIN tblConCat cc2 ON
c.ConID = cc2.CctConID AND
cc2.CctCatCode='010102'
JOIN tblConCat cc3 ON
c.ConID = cc3.CctConID AND
cc3.CctCatCode='010103'

Anyway, the result again is pretty fast and I guess I am wasting my time
building a helper table if it doesn't noticeably increase the speed but does
increase the complexity (the helper tables need to be dynamically created in
a multi-user environment).
I looked at the execution plans, but it gives so much information (cost,
subtree cost, etc) that it leaves a simple person like me somewhat lost.
What I would like to know is 'which is better: A or B?' Even when I run
them, I am not sure how to compare the actual performance. Is there one key
number to look at? Perhaps 'Cumulative client processing time'? Perhaps I
should try to read up on this.
(PS I do have that index you mentioned)

Jul 23 '05 #3

Dan Guzman

> What I would like to know is 'which is better: A or B?' Even when I run

them, I am not sure how to compare the actual performance. Is there one
key number to look at?
I usually compare actual query duration in a controlled test environment
with no other activity and clean cache.

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
SELECT GETDATE() AS StartTime
GO
--execute first query
GO
SELECT GETDATE() AS EndTime
GO

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
SELECT GETDATE() AS StartTime
GO
--execute second query
GO
SELECT GETDATE() AS EndTime
GO

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Justin Hoffman" <j@b.com> wrote in message
news:d4**********@nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com...
"Dan Guzman" <gu******@nospam-online.sbcglobal.net> wrote in message
news:vb****************@newssvr11.news.prodigy.com ...
I have tried them both out and although I looked at the query analyzer
it provided more information than I knew what to do with. In practise
they both provide similar working times

You can compare the estimated costs by running your queries at the same
time in Query Analyzer with the show execution plan option on. However,
I would still choose the query with the lowest actual execution time.
All things being equal, select the simplest approach.

BTW, make sure you have a unique index on tblConCat (CctConID and
CctCatCode). A primary key, unique constraint or unique index will
provide this.

Untested Solution Three (there are many more):

SELECT tblContact.*
FROM tblContact c
JOIN tblConCat cc1 ON
c.ConID = cc1.CctConID AND
CctCatCode='010101'
JOIN tblConCat cc2 ON
c.ConID = cc2.CctConID AND
CctCatCode='010102'
JOIN tblConCat cc3 ON
c.ConID = cc3.CctConID AND
CctCatCode='010103'

--
Hope this helps.

Dan Guzman
SQL Server MVP

"Justin Hoffman" <j@b.com> wrote in message
news:d4**********@nwrdmz03.dmz.ncs.ea.ibs-infra.bt.com...

This is a question concerning query optimisation. Sorry if it's a bit
long, but thanks to anyone who has the patience to help - This is my
first post here...
If I have two tables: 'tblContact' and 'tblCategory' where categories
are like:
Code Name
010101 Short
010102 Fat
010103 Stupid

The junction table 'tblConCat' has fields CctConID, CctCatCode to tell
me which contacts have which category codes. These are nchar(6) fields,
if it makes any difference.

If I need to find all people who are short, fat and stupid I can see two
ways:

Solution One:
SELECT tblContact.* FROM tblContact WHERE
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010101') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010102') AND
ConID IN (SELECT CctConID FROM tblConCat WHERE CctCatCode='010103')
Solution Two:
Build a helper table which contains the codes I'm looking for

SELECT tblContact.* FROM tblContact
WHERE ConID IN
(SELECT CctConID
FROM tblConCat INNER JOIN tblHelper
ON tblConCat.CctCatCode = tblHelper.HlpCatCode
GROUP BY CctConID HAVING Count(*)=3)
I have tried them both out and although I looked at the query analyzer
it provided more information than I knew what to do with. In practise
they both provide similar working times (I wait about a second) but I
thought the first looked rather inefficient and thought my helper table
might help. There are about 30,000 contact records, 180 category records
and 120,000 junction table records.

All I am looking for comments on any pros and cons of these two
approaches - does one look that bad? The database was migrated from
Access where the helper table really did help, but using SQL Server I
might not need it.

Thanks again, if you got this far!

Hi Dan
Thank you for your comments. The SQL needed a couple of adjustments: the
CctCatCode fields needed a table prefix to make them unambiguous and also
(probable typo) I need (select c.*) rather than (select tblContact.*) So
the final version is:

SELECT c.*
FROM tblContact c
JOIN tblConCat cc1 ON
c.ConID = cc1.CctConID AND
cc1.CctCatCode='010101'
JOIN tblConCat cc2 ON
c.ConID = cc2.CctConID AND
cc2.CctCatCode='010102'
JOIN tblConCat cc3 ON
c.ConID = cc3.CctConID AND
cc3.CctCatCode='010103'

Anyway, the result again is pretty fast and I guess I am wasting my time
building a helper table if it doesn't noticeably increase the speed but
does increase the complexity (the helper tables need to be dynamically
created in a multi-user environment).
I looked at the execution plans, but it gives so much information (cost,
subtree cost, etc) that it leaves a simple person like me somewhat lost.
What I would like to know is 'which is better: A or B?' Even when I run
them, I am not sure how to compare the actual performance. Is there one
key number to look at? Perhaps 'Cumulative client processing time'?
Perhaps I should try to read up on this.
(PS I do have that index you mentioned)

Jul 23 '05 #4

Justin Hoffman

"Dan Guzman" <gu******@nospam-online.sbcglobal.net> wrote in message
news:Cc****************@newssvr11.news.prodigy.com ...

What I would like to know is 'which is better: A or B?' Even when I run
them, I am not sure how to compare the actual performance. Is there one
key number to look at?

I usually compare actual query duration in a controlled test environment
with no other activity and clean cache.

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
SELECT GETDATE() AS StartTime
GO
--execute first query
GO
SELECT GETDATE() AS EndTime
GO

CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
SELECT GETDATE() AS StartTime
GO
--execute second query
GO
SELECT GETDATE() AS EndTime
GO

--
Hope this helps.

Dan Guzman
SQL Server MVP

Thanks, Dan. I'll put that in my 'useful things to know' file.

Jul 23 '05 #5

--CELKO--

This is a "relational division"; it is one of Codd's original
operations. You might also want to look up the proper way to name data
elements, so you will not have those silly "tbl-" prefixes and multiple
suffixes. And rows are not records.

Relational division is one of the eight basic operations in Codd's
relational algebra. The idea is that a divisor table is used to
partition a dividend table and produce a quotient or results table.
The quotient table is made up of those values of one column for which a
second column had all of the values in the divisor.

This is easier to explain with an example. We have a table of pilots
and the planes they can fly (dividend); we have a table of planes in
the hangar (divisor); we want the names of the pilots who can fly every
plane (quotient) in the hangar. To get this result, we divide the
PilotSkills table by the planes in the hangar.

CREATE TABLE PilotSkills
(pilot CHAR(15) NOT NULL,
plane CHAR(15) NOT NULL,
PRIMARY KEY (pilot, plane));

PilotSkills
pilot plane
=========================
'Celko' 'Piper Cub'
'Higgins' 'B-52 Bomber'
'Higgins' 'F-14 Fighter'
'Higgins' 'Piper Cub'
'Jones' 'B-52 Bomber'
'Jones' 'F-14 Fighter'
'Smith' 'B-1 Bomber'
'Smith' 'B-52 Bomber'
'Smith' 'F-14 Fighter'
'Wilson' 'B-1 Bomber'
'Wilson' 'B-52 Bomber'
'Wilson' 'F-14 Fighter'
'Wilson' 'F-17 Fighter'

CREATE TABLE Hangar
(plane CHAR(15) NOT NULL PRIMARY KEY);

Hangar
plane
=============
'B-1 Bomber'
'B-52 Bomber'
'F-14 Fighter'

PilotSkills DIVIDED BY Hangar
pilot
=============================
'Smith'
'Wilson'

In this example, Smith and Wilson are the two pilots who can fly
everything in the hangar. Notice that Higgins and Celko know how to
fly a Piper Cub, but we don't have one right now. In Codd's original
definition of relational division, having more rows than are called for
is not a problem.

The important characteristic of a relational division is that the CROSS
JOIN (Cartesian product) of the divisor and the quotient produces a
valid subset of rows from the dividend. This is where the name comes
from, since the CROSS JOIN acts like a multiplication operator.

Division with a Remainder

There are two kinds of relational division. Division with a remainder
allows the dividend table to have more values than the divisor, which
was Codd's original definition. For example, if a pilot can fly more
planes than just those we have in the hangar, this is fine with us.
The query can be written in SQL-89 as

SELECT DISTINCT pilot
FROM PilotSkills AS PS1
WHERE NOT EXISTS
(SELECT *
FROM Hangar
WHERE NOT EXISTS
(SELECT *
FROM PilotSkills AS PS2
WHERE (PS1.pilot = PS2.pilot)
AND (PS2.plane = Hangar.plane)));

The quickest way to explain what is happening in this query is to
imagine an old World War II movie where a cocky pilot has just walked
into the hangar, looked over the fleet, and announced, "There ain't no
plane in this hangar that I can't fly!" We are finding the pilots for
whom there does not exist a plane in the hangar for which they have no
skills. The use of the NOT EXISTS() predicates is for speed. Most SQL
systems will look up a value in an index rather than scan the whole
table. The SELECT * clause lets the query optimizer choose the column
to use when looking for the index.

This query for relational division was made popular by Chris Date in
his textbooks, but it is not the only method nor always the fastest.
Another version of the division can be written so as to avoid three
levels of nesting. While it is not original with me, I have made it
popular in my books.

SELECT PS1.pilot
FROM PilotSkills AS PS1, Hangar AS H1
WHERE PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);

There is a serious difference in the two methods. Burn down the
hangar, so that the divisor is empty. Because of the NOT EXISTS()
predicates in Date's query, all pilots are returned from a division by
an empty set. Because of the COUNT() functions in my query, no pilots
are returned from a division by an empty set.

In the sixth edition of his book, INTRODUCTION TO DATABASE SYSTEMS
(Addison-Wesley; 1995 ;ISBN 0-201-82458-2), Chris Date defined another
operator (DIVIDEBY ... PER) which produces the same results as my
query, but with more complexity.

Exact Division

The second kind of relational division is exact relational division.
The dividend table must match exactly to the values of the divisor
without any extra values.

SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar);

This says that a pilot must have the same number of certificates as
there planes in the hangar and these certificates all match to a plane
in the hangar, not something else. The "something else" is shown by a
created NULL from the LEFT OUTER JOIN.

Please do not make the mistake of trying to reduce the HAVING clause
with a little algebra to:

HAVING COUNT(PS1.plane) = COUNT(H1.plane)

because it does not work; it will tell you that the hangar has (n)
planes in it and the pilot is certified for (n) planes, but not that
those two sets of planes are equal to each other.

Note on Performance

The nested EXISTS() predicates version of relational division was made
popular by Chris Date's textbooks, while the author is associated with
popularizing the COUNT(*) version of relational division. The Winter
1996 edition of DB2 ON-LINE MAGAZINE
(http://www.db2mag.com/96011ar:htm) had an article entitled "Powerful
SQL:Beyond the Basics" by Sheryl Larsen which gave the results of
testing both methods. Her conclusion for DB2 was that the nested
EXISTS() version is better when the quotient has less than 25% of the
dividend table's rows and the COUNT(*) version is better when the
quotient is more than 25% of the dividend table.

Jul 23 '05 #6

Justin Hoffman

"--CELKO--" <jc*******@earthlink.net> wrote in message
news:11**********************@g14g2000cwa.googlegr oups.com...

This is a "relational division"; it is one of Codd's original
operations. You might also want to look up the proper way to name data
elements, so you will not have those silly "tbl-" prefixes and multiple
suffixes. And rows are not records.

Relational division is one of the eight basic operations in Codd's
relational algebra. The idea is that a divisor table is used to
partition a dividend table and produce a quotient or results table.
The quotient table is made up of those values of one column for which a
second column had all of the values in the divisor.

This is easier to explain with an example. We have a table of pilots
and the planes they can fly (dividend); we have a table of planes in
the hangar (divisor); we want the names of the pilots who can fly every
plane (quotient) in the hangar. To get this result, we divide the
PilotSkills table by the planes in the hangar.

CREATE TABLE PilotSkills
(pilot CHAR(15) NOT NULL,
plane CHAR(15) NOT NULL,
PRIMARY KEY (pilot, plane));

PilotSkills
pilot plane
=========================
'Celko' 'Piper Cub'
'Higgins' 'B-52 Bomber'
'Higgins' 'F-14 Fighter'
'Higgins' 'Piper Cub'
'Jones' 'B-52 Bomber'
'Jones' 'F-14 Fighter'
'Smith' 'B-1 Bomber'
'Smith' 'B-52 Bomber'
'Smith' 'F-14 Fighter'
'Wilson' 'B-1 Bomber'
'Wilson' 'B-52 Bomber'
'Wilson' 'F-14 Fighter'
'Wilson' 'F-17 Fighter'

CREATE TABLE Hangar
(plane CHAR(15) NOT NULL PRIMARY KEY);

Hangar
plane
=============
'B-1 Bomber'
'B-52 Bomber'
'F-14 Fighter'

PilotSkills DIVIDED BY Hangar
pilot
=============================
'Smith'
'Wilson'

In this example, Smith and Wilson are the two pilots who can fly
everything in the hangar. Notice that Higgins and Celko know how to
fly a Piper Cub, but we don't have one right now. In Codd's original
definition of relational division, having more rows than are called for
is not a problem.

The important characteristic of a relational division is that the CROSS
JOIN (Cartesian product) of the divisor and the quotient produces a
valid subset of rows from the dividend. This is where the name comes
from, since the CROSS JOIN acts like a multiplication operator.

Division with a Remainder

There are two kinds of relational division. Division with a remainder
allows the dividend table to have more values than the divisor, which
was Codd's original definition. For example, if a pilot can fly more
planes than just those we have in the hangar, this is fine with us.
The query can be written in SQL-89 as

SELECT DISTINCT pilot
FROM PilotSkills AS PS1
WHERE NOT EXISTS
(SELECT *
FROM Hangar
WHERE NOT EXISTS
(SELECT *
FROM PilotSkills AS PS2
WHERE (PS1.pilot = PS2.pilot)
AND (PS2.plane = Hangar.plane)));

The quickest way to explain what is happening in this query is to
imagine an old World War II movie where a cocky pilot has just walked
into the hangar, looked over the fleet, and announced, "There ain't no
plane in this hangar that I can't fly!" We are finding the pilots for
whom there does not exist a plane in the hangar for which they have no
skills. The use of the NOT EXISTS() predicates is for speed. Most SQL
systems will look up a value in an index rather than scan the whole
table. The SELECT * clause lets the query optimizer choose the column
to use when looking for the index.

This query for relational division was made popular by Chris Date in
his textbooks, but it is not the only method nor always the fastest.
Another version of the division can be written so as to avoid three
levels of nesting. While it is not original with me, I have made it
popular in my books.

SELECT PS1.pilot
FROM PilotSkills AS PS1, Hangar AS H1
WHERE PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);

There is a serious difference in the two methods. Burn down the
hangar, so that the divisor is empty. Because of the NOT EXISTS()
predicates in Date's query, all pilots are returned from a division by
an empty set. Because of the COUNT() functions in my query, no pilots
are returned from a division by an empty set.

In the sixth edition of his book, INTRODUCTION TO DATABASE SYSTEMS
(Addison-Wesley; 1995 ;ISBN 0-201-82458-2), Chris Date defined another
operator (DIVIDEBY ... PER) which produces the same results as my
query, but with more complexity.

Exact Division

The second kind of relational division is exact relational division.
The dividend table must match exactly to the values of the divisor
without any extra values.

SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar);

This says that a pilot must have the same number of certificates as
there planes in the hangar and these certificates all match to a plane
in the hangar, not something else. The "something else" is shown by a
created NULL from the LEFT OUTER JOIN.

Please do not make the mistake of trying to reduce the HAVING clause
with a little algebra to:

HAVING COUNT(PS1.plane) = COUNT(H1.plane)

because it does not work; it will tell you that the hangar has (n)
planes in it and the pilot is certified for (n) planes, but not that
those two sets of planes are equal to each other.

Note on Performance

The nested EXISTS() predicates version of relational division was made
popular by Chris Date's textbooks, while the author is associated with
popularizing the COUNT(*) version of relational division. The Winter
1996 edition of DB2 ON-LINE MAGAZINE
(http://www.db2mag.com/96011ar:htm) had an article entitled "Powerful
SQL:Beyond the Basics" by Sheryl Larsen which gave the results of
testing both methods. Her conclusion for DB2 was that the nested
EXISTS() version is better when the quotient has less than 25% of the
dividend table's rows and the COUNT(*) version is better when the
quotient is more than 25% of the dividend table.

Thank you CELKO. Was this an answer specifically for me, or a cut and paste
from elsewhere? Apart from the specific advice that my naming convention is
silly, do you have any suggested SQL for the problem in hand? I can't quite
make the jump from 'which pilots can fly all the planes in the hanger' to
'which contacts are short, fat and stupid'. I do already have two solutions
and Dan Guzman suggested a third and all of these work. The question is
really about optimizing the SQL now that the database has moved from Access
to SQL Server.

Jul 23 '05 #7

Ross Presser

On Fri, 22 Apr 2005 15:54:31 +0000 (UTC), Justin Hoffman wrote:

Thank you CELKO. Was this an answer specifically for me, or a cut and paste
from elsewhere?
It is his standard Relational Division piece.
Apart from the specific advice that my naming convention is
silly, do you have any suggested SQL for the problem in hand? I can't quite
make the jump from 'which pilots can fly all the planes in the hanger' to
'which contacts are short, fat and stupid'. I do already have two solutions
and Dan Guzman suggested a third and all of these work. The question is
really about optimizing the SQL now that the database has moved from Access
to SQL Server.

In fact, Celko's preferred Relational Division style amounts to the same
thing as your second solution, except that no "helper table" is really
necessary. Here is my take on it:

SELECT C.* FROM tblContact C
INNER JOIN tblConCat CC
ON C.ConID = CC.CctConID
WHERE CC.CctCatCode IN ('010101', '010102', '010103')
GROUP BY C.ConID
HAVING COUNT(*)=3

Now say you don't want to limit it to just those three categories. Suppose
your tblCategory looks like this:

CREATE TABLE tblCategory (
CatCode char(6),
CatName varchar(20),
InsultLevel int
)

INSERT INTO tblCategory VALUES ('010101', 'Short', 1)
INSERT INTO tblCategory VALUES ('010102', 'Fat', 2)
INSERT INTO tblCategory VALUES ('010103', 'Stupid', 2)
INSERT INTO tblCategory VALUES ('010104', 'Spanish', 0)
INSERT INTO tblCategory VALUES ('010105', 'Unemployable', 1)

And you want to find all the contacts who match all categories that have an
InsultLevel of at least 1. You don't have to know ahead of time what
categories those are, or even how many categories those are:

SELECT C.* FROM tblContact C
INNER JOIN tblConCat CC
ON C.ConID = CC.CctConID
INNER JOIN tblCategory CG
ON CC.CctCatCode = CG.CatCode
WHERE CG.InsultLevel >= 1
GROUP BY C.ConID
HAVING COUNT(*)=(
SELECT COUNT(*) FROM Category
WHERE Category.InsultLevel >= 1
)

Make sense?

Jul 23 '05 #8

Hugo Kornelis

On 22 Apr 2005 08:05:29 -0700, --CELKO-- wrote:

(snip)

Exact Division (snip)SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar); (snip)Please do not make the mistake of trying to reduce the HAVING clause
with a little algebra to:

HAVING COUNT(PS1.plane) = COUNT(H1.plane)

because it does not work;

Hi Joe,

But (correct me if I'm wrong) there is nothing wrong with reducing the
HAVING clause to

HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = COUNT(PS1.plane);

Right?

A clever optimizer would of course not perform the same subquery twice,
but I'm not too convinced that all optimizers are that clever...

(In fact, I just tested both on your sample data, and the execution plan
shows that SQL Server 2000 will indeed perform the same COUNT subquery
twice - at least with this small amount of rows. Maybe it will cough up
a better plan when there are a few million rows involved?)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #9

--CELKO--

>> just tested both on your sample data, and the execution plan shows
that SQL Server 2000 will indeed perform the same COUNT subquery twice
- at least with this small amount of rows. <<

That is awful! That is a matter of a parser finding a deterministic
common subexpression, not something hard.

Maybe it will cough up a better plan when there are a few million

rows involved? <<

I don't know, but it looks like the optimizer is just plain stupid in
the HAVING clause.

Jul 23 '05 #10

Erland Sommarskog

Justin Hoffman (j@b.com) writes:

Anyway, the result again is pretty fast and I guess I am wasting my time
building a helper table if it doesn't noticeably increase the speed but
does increase the complexity (the helper tables need to be dynamically
created in a multi-user environment). I looked at the execution plans,
but it gives so much information (cost, subtree cost, etc) that it
leaves a simple person like me somewhat lost. What I would like to know
is 'which is better: A or B?'

You are right that looking at the execution plan is very confusing,
and in the general case it is very difficult to tell from the plans
which query is the best. In some cases where you see a big table being
table-scanned, and you know that this should not need to happen, the
query plan can help you to rule something out.

I'd say that the prime usage for query plans is to help diagnosing why a
query is running slow.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #11

Hugo Kornelis

On 23 Apr 2005 08:43:26 -0700, --CELKO-- wrote:

just tested both on your sample data, and the execution plan showsthat SQL Server 2000 will indeed perform the same COUNT subquery twice
- at least with this small amount of rows. <<

That is awful! That is a matter of a parser finding a deterministic
common subexpression, not something hard.
Maybe it will cough up a better plan when there are a few million

rows involved? <<

I don't know, but it looks like the optimizer is just plain stupid in
the HAVING clause.

Hi Joe,

Since the optimizer in SQL Server is cost based, I can't rule out the
possibility that it will find this optimization if the table is much
bigger - but I agree that this plan seems wrong to me.

FWIW, I've just sent a mail to sq*****@microsoft.com, requesting that
they teach the optimizer to recognize and eliminate duplicate execution
of repeated subqueries in the same query.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #12

--CELKO--

>> Since the optimizer in SQL Server is cost based, I can't rule out
the possibility that it will find this optimization if the table is
much bigger - but I agree that this plan seems wrong to me. <<

We have had parsers that find common sub-expressions since FORTRAN in
the 1950's. This is not anything new to SQL. And it is the kind of
thing that you do *before* you turn the intermediate code into an
execution plan.

Jul 23 '05 #13

Ross Presser

On 23 Apr 2005 20:15:24 -0700, --CELKO-- wrote:

Since the optimizer in SQL Server is cost based, I can't rule out

the possibility that it will find this optimization if the table is
much bigger - but I agree that this plan seems wrong to me. <<

We have had parsers that find common sub-expressions since FORTRAN in
the 1950's. This is not anything new to SQL. And it is the kind of
thing that you do *before* you turn the intermediate code into an
execution plan.

It's not a simple parsing question, is it? if the clause like

HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar WHERE
Hangar.plane >= PS1.plane)
AND COUNT(H1.plane) = COUNT(PS1.plane)

then the SELECT *should* be executed repeatedly.

Perhaps I don't understand fully what you mean by "find common
sub-expressions."

(Admittedly, my HAVING clause destroys the relational division being done,
so it's useless in this context; but clauses like that are still legal SQL
that the optimizer must accept.)

Jul 23 '05 #14

Hugo Kornelis

On Tue, 26 Apr 2005 17:36:15 -0400, Ross Presser wrote:

On 23 Apr 2005 20:15:24 -0700, --CELKO-- wrote:
Since the optimizer in SQL Server is cost based, I can't rule out

the possibility that it will find this optimization if the table is
much bigger - but I agree that this plan seems wrong to me. <<

We have had parsers that find common sub-expressions since FORTRAN in
the 1950's. This is not anything new to SQL. And it is the kind of
thing that you do *before* you turn the intermediate code into an
execution plan.

It's not a simple parsing question, is it? if the clause like

HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar WHERE
Hangar.plane >= PS1.plane)
AND COUNT(H1.plane) = COUNT(PS1.plane)

then the SELECT *should* be executed repeatedly.

Perhaps I don't understand fully what you mean by "find common
sub-expressions."

(snip)

Hi Ross,

What I think Joe means, and what I definitely would like to see in the
next version of SQL Server, is that the optimizer scans the query, finds
subqueries that are either equal or differ only in the SELECT clause, then
makes a plan that eliminates the duplicate execution of that subquery.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Jul 23 '05 #15

Erland Sommarskog

Hugo Kornelis (hugo@pe_NO_rFact.in_SPAM_fo) writes:

What I think Joe means, and what I definitely would like to see in the
next version of SQL Server, is that the optimizer scans the query, finds
subqueries that are either equal or differ only in the SELECT clause, then
makes a plan that eliminates the duplicate execution of that subquery.

It's more to it than that. SQL 2005 has Common Table Expression, so that
you can define an expression prior to the query, and use that expression
in the query. When I played with such a query, I found that the expression
was computed once for each coccurence. But I like to add here that 1) this
was time ago 2) SQL 2005 is still in beta.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 23 '05 #16

--CELKO--

>> It's not a simple parsing question, is it? <<

Not always, but a lot of the time, yes.

SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar);

Factor out the un-correlated subquery like this and do it once:

BEGIN ATOMIC
SET plane_cnt = AS (SELECT COUNT(plane) FROM Hangar);

SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = plane_cnt
AND COUNT(H1.plane) = plane_cnt;
END;

Jul 23 '05 #17

SQL to find short, fat and stupid people

Similar topics