Bytes IT Community

INNER JOIN/Index Threshold?

Guys I am really stuck on this one. Any help or suggestions would be
appreciated.
We have a large table which seemed to just hit some kind of threshold.
The query is somewhat responsive when there are NO indexes on the
table. However, when we index email the query takes forever.

FACTS
- The problem is very "data specific". I cannot recreate the
problem using different data.
- There is only a problem when I index email on the base table.
- The problem goes away when I add "AND b.email IS NOT NULL" to the
inner join condition. It does not help when I add the logic to the
"WHERE" clause.
DDL
CREATE TABLE base (bk char(25), email varchar(100))
create clustered index icx on base(bk)
create index ix_email on base(email)
CREATE TABLE filter (bk char(25), email varchar(100))
create clustered index icx on filter (bk)
create index ix_email on filter (email)

Query
SELECT b.bk, b.email
FROM base b WITH(NOLOCK)
INNER JOIN filter f ON f.email = b.email
--and f.email is not null

Data Profile
--35120500, 35120491, 14221553
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM base
--16796199, 16796192, 14221553
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM base
WHERE email IS NOT NULL

--250552, 250552, 250205
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM filter
--250208, 250208, 250205
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM filter
WHERE email IS NOT NULL

Jan 4 '07 #1
18 Replies


What does the estimated execution plan look like? Try adding the column
BK as a second key in your email index.
--
Tracy McKibben
MCDBA
http://www.realsqlguy.com
Jan 4 '07 #2

bk is clustered so there is no reason to add it to the other index.

Estimated plan does an index scan on each table, then a merge join.

When I remove the index on base.email, the plan does a table scan on
base and then performs a hash match.
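To pin down which physical join the data actually favors, one option is to force each join type with a query hint and compare timings. This is a diagnostic sketch only, not a production fix; it uses the table names from the original post:

```sql
-- Measure IO and elapsed time while forcing each physical join type in turn.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT b.bk, b.email
FROM base b
INNER JOIN filter f ON f.email = b.email
OPTION (HASH JOIN);   -- repeat with OPTION (MERGE JOIN) and OPTION (LOOP JOIN)
```

If the forced hash join runs quickly while the optimizer's chosen merge join crawls, the problem is plan choice rather than the index itself.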
Jan 4 '07 #3

Dave,

Do you have a lot of null emails on both tables? If so, that might be an
influencing factor.

-- Bill

"Dave" <da******@gmail.comwrote in message
news:11**********************@s80g2000cwa.googlegr oups.com...
Guys I am really stuck on this one. Any help or suggestions would be
appreciated.
We have a large table which seemed to just hit some kind of threshold.
They query is somewhat responsive when there are NO indexes on the
table. However, when we index email the query takes forever.

FACTS
- The problem is very "data specific". I can not recreate the
problem using different data.
- There is only a problem when I index email on the base table.
- The problem goes away when I add "AND b.email IS NOT NULL" to the
inner join condition. It does not help when I add the logic to the
"WHERE" clause.
DDL
CREATE TABLE base (bk char(25), email varchar(100))
create clustered index icx on base(bk)
create index ix_email on base(email)
CREATE TABLE filter (bk char(25), email varchar(100))
create clustered index icx on filter (bk)
create index ix_email on filter (email)

Query
SELECT b.bk, b.email
FROM base b WITH(NOLOCK)
INNER JOIN filter f ON f.email = b.email
--and f.email is not null

Data Profile
--35120500, 35120491, 14221553
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM base
--16796199, 16796192, 14221553
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM base
WHERE email IS NOT NULL

--250552, 250552, 250205
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM filter
--250208, 250208, 250205
SELECT COUNT(*) ,COUNT(DISTINCT bk), COUNT(DISTINCT email)
FROM filter
WHERE email IS NOT NULL

Jan 4 '07 #4

Dave (da******@gmail.com) writes:
bk is clustered so there is no reason to add it to the other index.

Estimated plan does an index scan on each table, then a merge join.

When I remove the index on base.email the plan does a table scan on
base then performs a Hash Match
And the merge-join plan takes forever? That's indeed very odd, because
that sounds like quite a decent plan. Your tables look odd because
they have no keys, but that shouldn't matter here.

What happens if you try:

SELECT b.bk, b.email
FROM base b
WHERE EXISTS (SELECT *
              FROM filter f
              WHERE f.email = b.email)
Which version of SQL Server?

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jan 4 '07 #5

Your correlated subquery works.
It added a stream aggregate step to the execution plan right before the
merge join.

This is really weird. Could it be a SQL Server bug? Like I said
earlier, it is very data specific. I can add 3 million random records
to the filter table and it normally works fine. It is only this
specific dataset that is causing problems.
Jan 4 '07 #6

Hi Dave,

I tried to create a situation similar to yours on SQL Server 2000. What
I noticed is:
- Without the extra predicate, the optimizer does not seem smart enough
to choose SEEK:([b].[email] IsNotNull) instead of a full Index Scan
- You are correct that the change in access method is not achieved if
the predicate is added to the WHERE clause. It seems it must be part of
the join ON clause
- If either "b.email IS NOT NULL" or "f.email IS NOT NULL" is added, the
join type remains MERGE JOIN. If both are added to the ON clause I get a
HASH JOIN
- I am not sure if the MERGE JOIN is faster or if the HASH JOIN is, but
in either case you may want to add the two "unnecessary" predicates

There is a threshold somewhere (depending on your specific situation)
where merge-joining or hashing the two partially scanned indexes is
faster. Up to that point, the fastest plan will be to partially scan the
index of filter and loop-join, seeking each matching row in base. In all
cases it will be faster if NULL rows are excluded from the index seeks.
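Spelling out the suggestion above (a sketch using the table names from the original post): put both NULL exclusions in the ON clause, since adding them to the WHERE clause reportedly did not change the plan.

```sql
SELECT b.bk, b.email
FROM base b WITH (NOLOCK)
INNER JOIN filter f
        ON f.email = b.email
       AND b.email IS NOT NULL   -- either predicate alone kept the merge join;
       AND f.email IS NOT NULL;  -- both together produced a hash join in my test
```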

HTH,
Gert-Jan
Jan 5 '07 #7

Dave (da******@gmail.com) writes:
Your correlated subquery works.
It added a stream aggregate step to the execution plan right before the
merge join.

This is really weird. Could it be an SQL Server Bug? Like I said
earlier, it is very data specific. I can add 3 million random records
to the filter table and it normally works fine. It is only this
specific dataset that is causing problems.

Note that the two queries are not semantically equivalent. If a certain
email address appears four times in the filter and four times in base, your
original query will return 16 rows, while mine will return four.

With the numbers you posted there are only four duplicates in the filter.
Provided, that is, you are running with ANSI_NULLS ON. If this is in a
stored procedure saved from Enterprise Manager, you are likely to be running
with ANSI_NULLS OFF, in which case NULL = NULL, and you will get an
enormous number of duplicates. Which are filtered away if you add the
IS NOT NULL filter.
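A quick way to see the setting's effect is to evaluate NULL = NULL under each mode. This is a sketch; the row arithmetic in the comment is derived from the data profile in the original post:

```sql
SET ANSI_NULLS OFF;
SELECT CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END;  -- 'true'

SET ANSI_NULLS ON;
SELECT CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END;  -- 'not true'

-- With NULL = NULL matching, every NULL email in base
-- (35120500 - 16796199 = 18324301 rows) would pair with every NULL email
-- in filter (250552 - 250208 = 344 rows): over 6 billion result rows.
```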
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Jan 5 '07 #8

We are using SQL 2K Enterprise SP4.

Erland, that was an excellent post! However, it was not my problem. The
code is run from a .NET application and I am testing using SQL Server
Management Studio.

I added this to the test script and it did not help:
SET ANSI_NULLS ON

I think there are two things going on here:
1. SQL Server is not picking the best plan.
2. There is an issue with my specific data set that I cannot reproduce
with any other dataset except for "some" subsets or supersets.

Thanks for the help, guys!

Jan 5 '07 #9

Dave (da******@gmail.com) writes:
Erland that was an excellent post! However it was not my problem. The
code is run from .NET application and I am testing using SQL Management
Studio.

I added this to the test script and it did not help.
SET ANSI_NULLS ON
As long as it's a loose script, ANSI_NULLS would be ON by default. I will
have to admit that I was clutching at straws. Without access to the
database, it is very difficult to analyse the problem accurately.
It's not always that easy even if you have full access.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Jan 5 '07 #10

I suspect the reason (...AND ... IS NOT NULL) is helping you out is
that it's forcing the optimizer to choose a table scan instead of
hitting that index you made on email.

Before you go changing minutiae like ANSI_NULLS, do realize that your
tables are, (relationally speaking), nonsense. Any table without a
PRIMARY KEY or otherwise unique key is likely going to cause problems
with the optimizer, and it often happens in bizarre ways like this.

You should do the following:
1) Make the clustered index on "bk" UNIQUE. If you can't do that...

2) Add email to your clustered index as a secondary key and recreate
the clustered index as UNIQUE, or make the combination of the two
columns a PRIMARY KEY. If you can't do that...

3) Analyze your data and fix your process to not allow wholly duplicate
rows, which are nonsensical. If you can't do THAT (ugh):

4) Add an IDENTITY column to your table. So as not to drastically
increase your index sizes, you could either add this column as a
secondary key after bk and make the clustered index UNIQUE, or you
could create a nonclustered PRIMARY KEY on that identity.
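A sketch of option 4 against the tables from the original post (the column and constraint names here are made up); the second variant avoids widening the clustered key at all:

```sql
-- Add a surrogate key without touching the existing single-column indexes.
ALTER TABLE base ADD id INT IDENTITY(1, 1) NOT NULL;

-- Variant A: make the clustered index unique by appending the identity.
CREATE UNIQUE CLUSTERED INDEX icx ON base (bk, id) WITH DROP_EXISTING;

-- Variant B: leave the clustered index alone; add a nonclustered PK instead.
-- ALTER TABLE base ADD CONSTRAINT pk_base PRIMARY KEY NONCLUSTERED (id);
```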

Let me know how it goes.
-Dave Markle
http://www.markleconsulting.com

Jan 8 '07 #11

I like your explanation!

The table does not currently have a primary key.

I basically have options 2 and 4 available; 1 and 3 are out of my
control.

I will lobby for adding an identity column to this table. It happens
to be very wide with 12 single-column indexes, so I really don't want
to make the clustered index super wide.

Just for clarity's sake: in order for the optimizer to properly
understand the tables, do I have to add a PK constraint to the identity
column?
>Any table without a
PRIMARY KEY or otherwise unique key is likely going to cause problems
with the optimizer, and it often happens in bizarre ways like this.
Do you know of any book or other reference to this (specifically how
the optimizer needs a key)? It would really help me convince everyone
to use keys on everything.

Thanks!

Jan 8 '07 #12

Well, it comes mostly from relational and set theory. I like to
explain the need for PK's like this:

If your wife makes you go to the grocery store and she asks you what
you bought, you'd give a set back as your answer:

"Peas, carrots, potatoes, and Coca-Cola"

You wouldn't say:

"Peas, carrots, potatoes, peas, and Coca-Cola".

The two answers aren't equivalent. One simply makes sense, and one
does not. For example, in set theory, you can express a set of numbers
as:

{1,2,3,4}

but not:

{1,2,2,3,4}. The second representation is invalid.

Like giving your grocery list or giving a set of numbers, the engine's
theoretical foundation is based on relational theory, which is closely
tied to set theory.

If you want a book recommendation, "Inside SQL Server" (as well as
a lot of other DB books out there) will tell you the same thing --
always have a PK on each table you create.

I'm sorry, I'll admit that I don't have the hard evidence to prove to
you that not having a PK is going to screw up the optimizer. I do try
to stay away from "cargo cult" type "I saw this once..." sort of
recommendations, but in this case, since not having a PK on each table
is such a no-no, I'd try this first. I like #2 best. If "email" is
distinct in the table by itself, that's going to be even better...
Maybe copy your data over to a development database and try these out
and see how it goes.

Good luck.
-Dave

Jan 8 '07 #13

Thanks for the explanation.

I copied the tables to a test db. I created an identity column on each
table and created a unique clustered index on the identity columns. I then
created a nonclustered index on bk and a nonclustered index on email.

This did not help the query performance.

I will keep checking back here to see if anyone else has any
suggestions on how to troubleshoot this. It is no longer a critical
issue due to the workaround. I just don't like the fact that we
cannot explain this behavior.
Jan 8 '07 #14

Dave (da******@gmail.com) writes:
Do you know of any book or other reference to this (specifically how
the optimizer needs a key)? It would really help me convince everyone
to use keys on everything.
It's not really a matter of the optimizer needing a key. Adding an
IDENTITY column is not going to change this particular problem. It's
more about database design in general. Every table in a database should
have a primary key. Preferably this should be a natural key, but this
is not always possible. In a relational database, you access data through
data; that is, through the primary key.

By defining good primary keys, you can avoid duplicate data. With further
normalisation, there are more anomalies you can avoid.
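As a concrete illustration (hypothetical constraint name; note that bk is not quite unique in the posted data profile -- 35120500 rows vs. 35120491 distinct values -- so the duplicates would have to be resolved first):

```sql
-- Declaring the key as a constraint gives the optimizer a guaranteed
-- uniqueness fact, unlike a plain non-unique index. bk must also be
-- made NOT NULL before a PRIMARY KEY can be created on it.
ALTER TABLE base
  ADD CONSTRAINT pk_base PRIMARY KEY CLUSTERED (bk);
```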
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Jan 8 '07 #15

Interesting problem. It seems that the optimizer likes the "not null"
hint and is treating it differently than other predicates. I see you
do have a lot of nulls in your database.

I've run into a number of situations with minimally indexed tables,
where adding additional indexes slows down execution (of selects).
This is always weird, since one would hope the optimizer could judge
and ignore the index in those cases!

The size of your data can count, too. In fact, I'm working on
something along those lines these days myself. Once the optimizer
realizes your data is a lot larger than your RAM, it has to switch to
multipass sorts and merges, which are MUCH slower. I'm not sure these
even show in the execution plans, estimated or actual. Maybe it
shows with SET STATISTICS PROFILE.

I don't have any specific suggestions. Keeping the null-valued
records in a separate table, or something along those lines, is
somewhat extreme but might help out. But you have your
workaround, so what the heck! Maybe SQL 2005 partitioned tables ...

Josh

On 8 Jan 2007 13:00:17 -0800, "Dave" <da******@gmail.comwrote:
>Thanks for the explanation.

I copied the tables to a test db. I created an identity column on each
table and created a unique clustered in the identity columns. I then
created a non-clustered on bk and a non clustered on email.

This did not help the query performance.

I will keep checking back here to see if anyone else has any
suggestions on how to trouble shoot this. It is no longer a critical
issue due to the work around. I just don't like the fact that we can
not explain this behavior.
dmarkle wrote:
>Well, it comes mostly from relational and set theory. I like to
explain the need for PK's like this:

If your wife makes you go to the grocery store and she asks you what
you bought, you'd give a set back as your answer:

"Peas, carrots, potatoes, and Coca-Cola"

You wouldn't say:

"Peas, carrots, potatoes, peas, and Coca-Cola".

The two answers aren't equivalent. One simply makes sense, and one
does not. For example, in set theory, you can express a set of numbers
as:

{1,2,3,4}

but not:

{1,2,2,3,4}. The second representation is invalid.

Like giving your grocery list or giving a set of numbers, the engine's
theoretical foundation is based on relational theory, which is closely
tied to set theory.

If you want a book with a recommendation, "Inside SQL Server" as well
as a lot of other DB books out there will tell you the same thing --
always have a PK on each table you create.

I'm sorry, I'll admit that I don't have the hard evidence to prove to
you that not having a PK is going to screw up the optimizer. I do try
to stay away from "cargo cult" type "I saw this once..." sort of
recommendations, but in this case, since not having a PK on each table
is such a no-no, I'd try this first. I like #2 best. If "email" is
distinct in the table by itself, that's going to be even better...
Maybe copy your data over to a development database and try these out
and see how it goes.

Good luck.
-Dave

Dave wrote:
I like your explanation!

The table does not currently have a primary key.

I basically have options 2 and 4 available; 1 and 3 are out of my
control.
I will lobby for adding an identity column to this table. It happens
to be very wide, with 12 single-column indexes, so I really don't want
to make the clustered index super wide.

Just for clarity's sake: in order for the optimizer to properly
understand the tables, do I have to add a PK constraint to the identity
column?

>Any table without a
PRIMARY KEY or otherwise unique key is likely going to cause problems
with the optimizer, and it often happens in bizarre ways like this.

Do you know of any book or other reference to this (specifically how
the optimizer needs a key)? It would really help me convince everyone
to use keys on everything.

Thanks!

dmarkle wrote:
I suspect the reason (...AND ... IS NOT NULL) is helping you out is
that it's forcing the optimizer to choose a table scan instead of
hitting up that index you made on email.

Before you go changing minutiae like ANSI_NULLS, do realize that your
tables are, (relationally speaking), nonsense. Any table without a
PRIMARY KEY or otherwise unique key is likely going to cause problems
with the optimizer, and it often happens in bizarre ways like this.

You should do the following:
1) Make the clustered index on "bk" UNIQUE. If you can't do that...

2) Add email to your clustered index as a secondary key and recreate
the clustered index as UNIQUE, or make the combination of the two
columns a PRIMARY KEY. If you can't do that...

3) Analyze your data and fix your process to not allow wholly duplicate
rows, which are nonsensical. If you can't do THAT (ugh):

4) Add an IDENTITY column to your table. So as not to drastically
increase your index sizes, you could either add this column as a
secondary key after bk and make the clustered index UNIQUE, or you
could create a nonclustered PRIMARY KEY on that identity.
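Option 4 might look something like this (a sketch; the column and constraint names are illustrative, not from the original schema):

```sql
-- Add a surrogate identity column and a nonclustered primary key,
-- leaving the existing clustered index on bk in place so the
-- clustering key stays narrow.
ALTER TABLE base ADD id INT IDENTITY(1,1) NOT NULL;
ALTER TABLE base ADD CONSTRAINT PK_base_id PRIMARY KEY NONCLUSTERED (id);
```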

Let me know how it goes.
-Dave Markle
http://www.markleconsulting.com

Erland Sommarskog wrote:
Dave (da******@gmail.com) writes:
Erland that was an excellent post! However it was not my problem. The
code is run from .NET application and I am testing using SQL Management
Studio.
>
I added this to the test script and it did not help.
SET ANSI_NULLS ON

As long as it's a loose script, ANSI_BULLS would be on by default. I will
have to admit that I was clutching at straws. Without access to the
database to see it, it is very difficult to analyse the problem accurately.
It's not always that easy even if you have full access.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jan 9 '07 #16

P: n/a
Hi,
I have modified your DDL a little bit and hope this will help.

DDL
CREATE TABLE base (bk char(25), email varchar(100))
create clustered index ix_email on base(email)
create index icx on base(bk)

CREATE TABLE filter (bk char(25), email varchar(100))
create clustered index ix_email on filter (email)
create index icx on filter (bk)

Query
SELECT b.bk, b.email
FROM base b WITH(NOLOCK)
INNER JOIN filter f ON f.email = b.email
--and f.email is not null
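If the clustered indexes are moved to email as suggested, it may be worth confirming that the plan switches to a merge join over the two clustered indexes. A sketch:

```sql
-- Show the estimated plan as text and look for a MERGE JOIN over the
-- clustered indexes on email (both inputs arrive pre-sorted on the
-- join column, so no sort or hash should be needed).
SET SHOWPLAN_TEXT ON;
GO
SELECT b.bk, b.email
FROM base b
INNER JOIN filter f ON f.email = b.email;
GO
SET SHOWPLAN_TEXT OFF;
GO
```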
Dave wrote:
Thanks for the explanation.

I copied the tables to a test db. I created an identity column on each
table and created a unique clustered index on the identity columns. I
then created a non-clustered index on bk and a non-clustered index on
email.

This did not help the query performance.

I will keep checking back here to see if anyone else has any
suggestions on how to troubleshoot this. It is no longer a critical
issue due to the work-around. I just don't like the fact that we
cannot explain this behavior.
dmarkle wrote:
Well, it comes mostly from relational and set theory. I like to
explain the need for PK's like this:

If your wife makes you go to the grocery store and she asks you what
you bought, you'd give a set back as your answer:

"Peas, carrots, potatoes, and Coca-Cola"

You wouldn't say:

"Peas, carrots, potatoes, peas, and Coca-Cola".
Jan 9 '07 #17

P: n/a
On Fri, 5 Jan 2007 23:18:30 +0000 (UTC), Erland Sommarskog wrote:
ANSI_BULLS
Hi Erland,

Is that your secret nickname for Joe Celko? <g>

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
Jan 9 '07 #18

P: n/a
Hugo Kornelis (hu**@perFact.REMOVETHIS.info.INVALID) writes:
On Fri, 5 Jan 2007 23:18:30 +0000 (UTC), Erland Sommarskog wrote:
>ANSI_BULLS

Hi Erland,

Is that your secret nickname for Joe Celko? <g>
Nah, I am too COWard for that.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Jan 9 '07 #19

This discussion thread is closed

Replies have been disabled for this discussion.