Bytes | Developer Community
Recursive Query

Bob
Hi there,

Need a little help with a certain query that's causing a lot of acid
in my stomach...

Have a table that stores sales measures for a given client. The sales
measures are stored per year and there could be multiple sales
measures every year per client. There is another field called last
update date. If there are multiple sales measures then I need to select
the one that's been entered last based on this field. Also, if there
is no sales measure data for current year then I need to return the
last year's data for which sales measure has been entered. For
example: if client #1 has sales measure value of $200 for 1999 and
nothing since, then I need to return $200 for any year following 1999.

So the query would look something like this:

SELECT client_name, sm_dollars FROM <tables>

Based on the DDL at the bottom I would expect to get back: c1, 100;
c2, 200

The way I am doing it now is with correlated subqueries (3 to be
exact) that each do an aggregate and join back to the original table.
It works, but it is notoriously slow. SQL Server is scanning the
index and does a merge join which in a large query takes 95% of the
time. Here is the part of the query plan for it:

|--Merge Join(Inner Join, MANY-TO-MANY
|       MERGE:([sales_measure].[client_id])=([sales_measure].[client_id]),
|       RESIDUAL:(([sales_measure].[client_id]=[sales_measure].[client_id]
|       AND [sales_measure].[tax_year]=[sales_measure].[tax_year])
|       AND [Expr1013]=[sales_measure].[last_update_date]))
     |--Stream Aggregate(GROUP BY:([sales_measure].[client_id], [sales_measure].[tax_year])
     |       DEFINE:([Expr1013]=MAX([sales_measure].[last_update_date])))
     |    |--Merge Join(Inner Join,
     |    |       MERGE:([sales_measure].[client_id], [Expr1010])=([sales_measure].[client_id], [sales_measure].[tax_year]),
     |    |       RESIDUAL:([sales_measure].[client_id]=[sales_measure].[client_id]
     |    |       AND [sales_measure].[tax_year]=[Expr1010]))
     |         |--Stream Aggregate(GROUP BY:([sales_measure].[client_id])
     |         |       DEFINE:([Expr1010]=MAX([sales_measure].[tax_year])))
     |         |    |--Index Scan(OBJECT:([stars_perftest].[dbo].[sales_measure].[sales_measure_idx1]), ORDERED FORWARD)
     |         |--Index Scan(OBJECT:([stars_perftest].[dbo].[sales_measure].[sales_measure_idx1]), ORDERED FORWARD)
     |--Index Scan(OBJECT:([stars_perftest].[dbo].[sales_measure].[sales_measure_idx1]), ORDERED FORWARD)
There are two indexes on the sales_measure table:

sales_measure_pk - sales_measure_id (primary key) clustered
sales_measure_idx1 - client_id, tax_year, last_update_date, sm_dollars

sales_measure table has 800,000 rows in it.

Here is the rest of the DDL:

IF OBJECT_ID('dbo.client') IS NOT NULL
DROP TABLE dbo.client
GO
create table dbo.client (
client_id int identity primary key
, client_name varchar(100) NOT NULL)
GO
IF OBJECT_ID('dbo.sales_measure') IS NOT NULL
DROP TABLE dbo.sales_measure
GO
create table dbo.sales_measure(
sales_measure_id int identity primary key
, client_id int NOT NULL
, tax_year smallint NOT NULL
, sm_dollars money NOT NULL
, last_update_date datetime NOT NULL)
GO
CREATE INDEX sales_measure_idx1 ON sales_measure (client_id, tax_year,
last_update_date, sm_dollars)
GO
INSERT dbo.client(client_name)
SELECT 'c1' UNION SELECT 'c2' UNION SELECT 'c3'
GO
INSERT dbo.sales_measure(client_id, tax_year, sm_dollars,
last_update_date)
SELECT 1, 2004, 100, '1/4/2004'
UNION
SELECT 2, 2003, 100, '1/3/2004'
UNION
SELECT 2, 2004, 150, '1/4/2004'
UNION
SELECT 2, 2004, 200, '1/5/2004'
The view that I use to calculate sales measures:

CREATE VIEW sales_measure_vw AS
SELECT sm.*
FROM sales_measure sm
INNER JOIN (SELECT sm2.client_id, sm2.tax_year,
                   MAX(sm2.last_update_date) AS last_update_date
            FROM sales_measure sm2
            INNER JOIN (SELECT sm4.client_id, MAX(sm4.tax_year) AS tax_year
                        FROM sales_measure sm4
                        GROUP BY sm4.client_id) sm3
              ON sm3.client_id = sm2.client_id
             AND sm3.tax_year = sm2.tax_year
            GROUP BY sm2.client_id, sm2.tax_year) sm1
  ON sm.client_id = sm1.client_id
 AND sm.tax_year = sm1.tax_year
 AND sm.last_update_date = sm1.last_update_date

Any advice on how to tame this would be appreciated. Also, any advice
on the indexes would help as well.

Thanks

Bob
Jul 20 '05 #1
14 Replies
Do you really have clients with 100 character names? Well, if you
don't take the time to design the columns in the tables properly, you
will. Why do you think that an IDENTITY can ever be used as a key? The
tax year is computed from the creation date, so you have a redundancy.
Isn't the real key for the sales measures (client_id,
creation_date)? You are missing DRI. Unless you have only one client
and one sales measurement, you ought to use plural or collective nouns
for the table names; tables are sets, not scalars.

You used the proprietary MONEY data type instead of a valid SQL
datatype. The ISO-8601 standard date format "yyyy-mm-dd" is the only
one allowed in Standard SQL. So cleaning up the DDL, we get something
like:

CREATE TABLE Clients
(client_id INTEGER NOT NULL PRIMARY KEY,
client_name VARCHAR(35) NOT NULL);

CREATE TABLE SalesMeasures
(client_id INTEGER NOT NULL
REFERENCES Clients (client_id),
creation_date DATETIME NOT NULL,
sm_dollars DECIMAL(12,4) NOT NULL,
PRIMARY KEY (client_id, creation_date));
>> Have a table that stores sales measures for a given client. The
sales measures are stored per year and there could be multiple sales
measures every year per client. There is another field [sic] called
last update date. If there are multiple sales measures then need to
select the one that's been entered last based on this field [sic]. <<

Fields and columns are totally different concepts. What you have is a
history of the Sales measures. You keep adding to it, not updating
it.
>> Also, if there is no sales measure data for current year then I
need to return the last year's data for which sales measure has been
entered. For example: if client #1 has sales measure value of $200 for
1999 and nothing since, then I need to return $200 for any year
following 1999. <<

How do you tell if the $200 in the year 2000 is a real or constructed
value? You are destroying information. Missing data is what a NULL
is used for. If you want to do this, then do it in the front end.

One of the standard programming tricks in SQL is to build a Calendar
table to hold all of the temporal data for your enterprise - fiscal
years, reporting periods, work days, holidays, etc. Let's assume that
you have a VIEW or a derived table on this Calendar that returns just
a list of years for which you want reports.

SELECT CL.cal_year, C1.client_id, S1.creation_date, S1.sm_dollars
  FROM ((SELECT DISTINCT EXTRACT(YEAR FROM cal_date) AS cal_year
           FROM Calendar
          WHERE cal_date BETWEEN @start_date AND @end_date)
        CROSS JOIN
        (SELECT client_name FROM Clients)) AS C1 (cal_year, client_name)
       LEFT OUTER JOIN
       SalesMeasures AS S1
       ON S1.creation_date
          = (SELECT MAX(S2.creation_date)
               FROM SalesMeasures AS S2
              WHERE S1.client_id = S2.client_id
                AND EXTRACT(YEAR FROM S2.creation_date) = C1.cal_year);
The cross join builds all combinations of clients and years. The
outer join will give you the latest creation date within each year for
each client.
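For readers on SQL Server specifically, a hypothetical T-SQL adaptation of the query above might look like the sketch below: EXTRACT becomes DATEPART, the derived tables get the aliases T-SQL requires, and the client correlation is stated explicitly in the ON clause. The Calendar table and the @start_date/@end_date variables are assumed to exist as described above.

```sql
-- Sketch only; assumes a Calendar table and @start_date/@end_date as above.
SELECT Y.cal_year, C.client_id, S1.creation_date, S1.sm_dollars
FROM (SELECT DISTINCT DATEPART(yyyy, cal_date) AS cal_year
      FROM Calendar
      WHERE cal_date BETWEEN @start_date AND @end_date) AS Y
CROSS JOIN Clients AS C
LEFT OUTER JOIN SalesMeasures AS S1
  ON S1.client_id = C.client_id
 AND S1.creation_date =
     (SELECT MAX(S2.creation_date)          -- latest entry within the year
      FROM SalesMeasures AS S2
      WHERE S2.client_id = C.client_id
        AND DATEPART(yyyy, S2.creation_date) = Y.cal_year)
```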
Jul 20 '05 #2
[posted and mailed, please reply in news]

Bob (bo*********@yahoo.com) writes:
Have a table that stores sales measures for a given client. The sales
measures are stored per year and there could be multiple sales
measures every year per client. There is another field called last
update date. If there are multiple sales measures then need to select
the one that's been entered last based on this field. Also, if there
is no sales measure data for current year then I need to return the
last year's data for which sales measure has been entered. For
example: if client #1 has sales measure value of $200 for 1999 and
nothing since, then I need to return $200 for any year following 1999.
Could you assume that the most recent LastUpdated also refers to the
most recent year? Or could the figures for 2002 be updated after those
for 2003? If not, you could take a shortcut.
The way I am doing it now is with correlated subqueries (3 to be
exact) that each do an aggregate and join back to the original table.
It works, but it is notoriously slow. SQL Server is scanning the
index and does a merge join which in a large query takes 95% of the
time. Here is the part of the query plan for it:


Actually, those are derived tables, not correlated subqueries. A correlated
subquery correlates to the outer query by referring to tables in the
outer query. Here, each derived table is independent of what is outside
of it.

If your query produces the right result, then I would not rewrite it.
Better look at the indexing. Here are some suggestions:

1) Drop sales_measure_id and make (client_id, tax_year, last_update_date)
   the primary key of the table. This saves four bytes per row from
   each query, so you can only win. If you cannot drop this key, then
   make it non-clustered.
2) Replace sales_measure_idx1 with two indexes: a non-clustered index on
   (client_id, tax_year) and a clustered index on (client_id, tax_year,
   last_update_date). Possibly also a non-clustered index on this triplet
   as well. Don't include sm_dollars in any index.
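Against the posted DDL, these suggestions might be sketched as below. The constraint and index names are made up, and it assumes the IDENTITY primary key has already been dropped as in suggestion 1.

```sql
-- Sketch only; names are hypothetical and the IDENTITY key is assumed gone.
DROP INDEX sales_measure.sales_measure_idx1
GO
ALTER TABLE dbo.sales_measure
  ADD CONSTRAINT sales_measure_pk
  PRIMARY KEY CLUSTERED (client_id, tax_year, last_update_date)
GO
CREATE NONCLUSTERED INDEX sales_measure_idx2
  ON dbo.sales_measure (client_id, tax_year)
GO
```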

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #3
Bob
Hi Bill,

Thanks for your reply. Here are few comments:
Why do you think that an IDENTITY can ever be used as a key? The
tax year is computed from the creation date, so you have a redundancy.
Actually, the tax year is not computed. A user can input sales
measure data for 5 years back or anytime for that matter. So the
fields need to be separated.
Isn't the real key for the sales measures (client_id,
creation_date)? You are missing DRI.
There could be multiple entries per client per creation_date so the
combination of these two fields is not unique.
How do you tell if the $200 in the year 2000 is a real or constructed
value? You are destroying information. Missing data is what a NULL
is used for. If you want to do this, then do it in the front end.
If I understand correctly, you are suggesting that I create at least
one entry for each client per year and enter NULL as a dollar amount.
I am not sure what you mean by "do it in the front end". I think I
would have to return a lot more data to the front end than I would
like.
One of the standard programming tricks in SQL is to build a Calendar
table to hold all of the temporal data for your enterprise - fiscal
years, reporting periods, work days, holidays, etc. Let's assume that
you have a VIEW or a derived table on this Calendar that returns just
a list of years for which you want reports.

SELECT CL.cal_year, C1.client_id, S1.creation_date, S1.sm_dollars
  FROM ((SELECT DISTINCT EXTRACT(YEAR FROM cal_date) AS cal_year
           FROM Calendar
          WHERE cal_date BETWEEN @start_date AND @end_date)
        CROSS JOIN
        (SELECT client_name FROM Clients)) AS C1 (cal_year, client_name)
       LEFT OUTER JOIN
       SalesMeasures AS S1
       ON S1.creation_date
          = (SELECT MAX(S2.creation_date)
               FROM SalesMeasures AS S2
              WHERE S1.client_id = S2.client_id
                AND EXTRACT(YEAR FROM S2.creation_date) = C1.cal_year);
The cross join builds all combinations of clients and years. The
outer join will give you the latest creation date within each year for
each client.


I am having difficulties running this SQL. There is no alias on the
first subquery in the cross join and also SQL Server is not
understanding some of the syntax.

Thanks

Bob
Jul 20 '05 #4
>> Actually, the tax year is not computed. A user can input sales
measure data for 5 years back or anytime for that matter. So the
fields [sic] need to be separated. <<

That does not make any sense. If I put in *any* date, I can extract a
year from it.
There could be multiple entries per client per creation_date so the combination of these two fields is not unique. <<

So what is the real key? If you don't have one, then this is not a
real table and you cannot expect to get a relational answer to the
query.
If I understand correctly, you are suggesting that I create at least one entry for each client per year and enter NULL as a dollar
amount. I am not sure what you mean by "do it in the front end". I
think I would have to return a lot more data to the front end than I
would like. <<

If you wanted one row per client per year, then the number of rows is
the same whether you use NULL or a bogus "carry-forward" value. But the
NULL shows that there was actually no value for that client that year.
The NULL can then be handled in the front end application to "destroy
the truth" for your report.
I am having difficulties running this SQL. <<


Since you cross-posted in the general database newsgroup, I gave you
Standard SQL-92 code (well, the at-sign -- aka "petite escarot" --
should have been colon). A little trick: you can test code for
conformance at this website:

http://developer.mimer.com/validator...dex.tml#parser
Jul 20 '05 #5
--CELKO-- (jc*******@earthlink.net) writes:
Actually, the tax year is not computed. A user can input sales
measure data for 5 years back or anytime for that metter. So the
fields [sic] need to be separated. <<

That does not make any sense. If I put in *any* date, I can extract a
year from it.


I understand Bob as saying that a user could, on 2004-05-13, enter data
for 2002, in which case you cannot extract the tax year from any date.
Since you cross-posted in the general database newsgroup, I gave you
Standard SQL-92 code (well, the at-sign -- aka "petite escarot" --
should have been colon). A little trick; you can test code for
conformance at this website:


So in SQL-92 you are not required to alias your derived tables?

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #6
Bob
Hi Erland,

Thanks for your reply. Your suggestion with indexes did not help.
The query plan did not change for the better with the suggested
indexes in place.

Thanks

Bob

Erland Sommarskog <so****@algonet.se> wrote in message news:<Xn**********************@127.0.0.1>...

Jul 20 '05 #7
Bob (bo*********@yahoo.com) writes:
Thanks for your reply. Your suggestion with indexes did not help.
The query plan did not change for the better with the suggested
indexes in place.


Too bad! I don't have any more suggestions right now, as I'm off to bed.
All I can say now is just try to fiddle around with the query and indexes.
I'll have another look at your problem later.

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #8
>> So in SQL-92 you are not required to alias your derived tables? <<

Nope, not required. The scoping rules can hide things. For example:

SELECT ..
FROM (SELECT x FROM Foo
LEFT OUTER JOIN
SELECT y FROM Bar
ON Foo.a = Bar.a)
WHERE ..;

exposes both Bar and Foo for use in the WHERE clause. However,

SELECT ..
FROM (SELECT x FROM Foo
LEFT OUTER JOIN
SELECT y FROM Bar
ON Foo.a = Bar.a) AS Foobar(x,y)
WHERE ..;

hides Foo and Bar, and you have to use the alias Foobar in the WHERE
clause. Pretty much what you would get from a VIEW.
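SQL Server, for what it's worth, does not allow the unaliased form at all, which is what Bob ran into above. A quick way to see this (the literal is arbitrary):

```sql
-- Accepted: the derived table carries an alias.
SELECT x FROM (SELECT 1 AS x) AS T

-- Rejected by T-SQL: derived tables must be aliased.
-- SELECT x FROM (SELECT 1 AS x)
```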

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 20 '05 #9
Bob (bo*********@yahoo.com) writes:
Thanks for your reply. Your suggestion with indexes did not help.
The query plan did not change for the better with the suggested
indexes in place.


But at least you should get slightly better performance with sm_dollars
out of the index.

I have looked a little more at your case, and I don't really have any
good suggestions right now. Query tuning when you don't have access
to the data is kind of difficult. (It can be difficult even if you have
access!)

Your query looks good; it is basically the kind of query you would write
for a situation like this one. What you could do is create a new table,
in which client_id is the primary key, and which would only hold the
most recent rows. You would maintain this table through a trigger. The
query would be faster, but updates would take a toll.
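A minimal sketch of that trigger-maintained table, against the posted schema: the table and trigger names here are made up, only INSERT is handled, and a batch that inserts several rows for the same client would need more care (UPDATE and DELETE likewise).

```sql
-- Sketch only; names are hypothetical, INSERT-only.
CREATE TABLE dbo.sales_measure_current (
  client_id int NOT NULL PRIMARY KEY
, tax_year smallint NOT NULL
, sm_dollars money NOT NULL
, last_update_date datetime NOT NULL)
GO
CREATE TRIGGER sales_measure_current_tri ON dbo.sales_measure
FOR INSERT AS
  -- Remove current rows that the new data supersedes.
  DELETE smc
  FROM dbo.sales_measure_current smc
  JOIN inserted i ON i.client_id = smc.client_id
  WHERE i.tax_year > smc.tax_year
     OR (i.tax_year = smc.tax_year
         AND i.last_update_date > smc.last_update_date)
  -- Insert new rows for clients that now have no current row.
  INSERT dbo.sales_measure_current
    (client_id, tax_year, sm_dollars, last_update_date)
  SELECT i.client_id, i.tax_year, i.sm_dollars, i.last_update_date
  FROM inserted i
  WHERE NOT EXISTS (SELECT * FROM dbo.sales_measure_current smc
                    WHERE smc.client_id = i.client_id)
GO
```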

While you did post CREATE TABLE scripts and the query and pieces of the
query plan, I notice that the query-plan extract mentions an object which
was not in the script. Maybe you could post a script and the complete
query plan that goes with it. Could you put the plan in a zip file or
something, so that the lines do not get broken?

Also, if you could give some indications of the volume and the distribution
that would be helpful.

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #10
Bob
Hi,

First, let me say that your time and replies are greatly appreciated.

Erland Sommarskog <so****@algonet.se> wrote in message news:<Xn**********************@127.0.0.1>...
Bob (bo*********@yahoo.com) writes:
Thanks for your reply. Your suggestion with indexes did not help.
The query plan did not change for the better with the suggested
indexes in place.
But at least you should get slightly better performance with sm_dollars
out from the index.


Wouldn't SQL Server have to do a bookmark lookup if I take the dollars
out of the index? My thought was to cover the query.
I have looked at little more at your case, and I don't really have any
good suggestions right now. Query tuning when you don't have access
to the data is kind of difficult. (It can be difficult even if you have
access!)

Your query looks good; it is basically the kind of query you would write
for a situation like this one. What you could do is create a new table,
in which client_id is the primary key, and which would only hold the
most recent rows. You would maintain this table through a trigger. The
query would be faster, but updates would take a toll.
We are looking at denormalizing the table and possibly adding a flag
that marks the "current" sales measure for each client. This is an
alternative to what you are suggesting.
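That flag alternative might be sketched as below; is_current is a hypothetical column name, and keeping it correct on every insert and update is the cost of the approach.

```sql
-- Sketch only; is_current is a made-up column.
ALTER TABLE dbo.sales_measure
  ADD is_current bit NOT NULL DEFAULT 0
GO
-- Reads then collapse to a simple join on the flag:
SELECT c.client_name, sm.sm_dollars
FROM dbo.client c
JOIN dbo.sales_measure sm
  ON sm.client_id = c.client_id
WHERE sm.is_current = 1
```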
While you did post CREATE TABLE scripts and the query and pieces of the
query plan, I notice that the query-plan extract mentions an object which
was not in the script. Maybe you could post a script and the complete
query plan that goes with it. Could you put the plan in a zip file or
something, so that the lines do not get broken?
Actually, the script is an extreme simplification of the actual
objects involved. The top level query is about three pages long and
involves two dozen tables. The problem is that the query is not
particularly selective and potentially could return a large number of
rows. The reason I felt compelled to post this is that in the large
query 95% of the time is spent executing this particular "small" part
of the query. If I take sales measure out of the query it comes back
in 1 second. With sales measure in, it takes 17 seconds.
Also, if you could give some indications of the volume and the distribution
that would be helpful.


There are ~700K clients.
~470K clients have sales measures; most (288K) have one, and a handful
have 9 (max).

Thanks

Bob
Jul 20 '05 #11
Bob (bo*********@yahoo.com) writes:
Wouldn't SQL Server have to do a bookmark lookup if I take the dollars
out of the index? My thought was to cover the query.
True. On the other hand it makes the index broader.
We are looking at denormalizing the table and possibly adding a flag
that marks the "current" sales measure for each client. This is an
alternative to what you are suggesting.
Yes, I was thinking along those lines too, but I figured that I should
stop at one suggestion, to get your own thoughts going.
Actually, the script is an extreme simplification of the actual
objects involved. The top level query is about three pages long and
involves two dozen tables. The problem is that the query is not
particularly selective and potentially could return a large number of
rows. The reason I felt compelled to post this is that in the large
query 95% of the time is spent executing this particular "small" part
of the query. If I take sales measure out of the query it comes back
in 1 second. With sales measure in, it takes 17 seconds.
But if you run this extracted query, it still runs slow? Or is it the
interaction between this part and the rest of the query that is causing
problems?
There are ~ 700K clients.
~470K clients have sales measures; most (288K) have one, and a handful
have 9 (max).


No small numbers, but nothing alarmingly high either.

--
Erland Sommarskog, SQL Server MVP, so****@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #12


jc*******@earthlink.net says...

Since you cross-posted in the general database newsgroup, I gave you
Standard SQL-92 code (well, the at-sign -- aka "petite escarot" --
should have been colon).

Eh? The at sign (@) is called "arobase" in French - where did you get
this "escarot" word from? My dictionary doesn't even have it!
Paul...
--
plinehan y_a_h_o_o and d_o_t com
C++ Builder 5 SP1, Interbase 6.0.1.6 IBX 5.04 W2K Pro
Please do not top-post.

"XML avoids the fundamental question of what we should do,
by focusing entirely on how we should do it."

quote from http://www.metatorial.com
Jul 20 '05 #13
A long time ago, in a galaxy far, far away, Paul <pa**@not.a.chance.ie> wrote:
jc*******@earthlink.net says...
Since you cross-posted in the general database newsgroup, I gave you
Standard SQL-92 code (well, the at-sign -- aka "petite escarot" --
should have been colon).

Eh? The at sign (@) is called "arobase" in French - where did you get
this "escarot" word from? My dictionary doesn't even have it!


If it had been spelled "escargot," I'd buy that the "@" looks somewhat
like a snail...
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','cbbrowne.com').
http://www.ntlug.org/~cbbrowne/x.html
"Now they show you how detergents take out bloodstains, a pretty
violent image there. I think if you've got a T-shirt with a bloodstain
all over it, maybe laundry isn't your biggest problem. Maybe you
should get rid of the body before you do the wash." --Jerry Seinfeld
Jul 20 '05 #14


cb******@acm.org says...

Paul <pa**@not.a.chance.ie> wrote:

Eh? The at sign (@) is called "robasse" in French - where did you get
this "escarot" word from? My dictionary doesn't even have it!
If it had been spelled "escargot," I'd buy that the "@" looks somewhat
like a snail...

Damn, but of course (doh). I had forgotten that M. Celko likes to
amuse himself from time to time!
Paul...
--
plinehan y_a_h_o_o and d_o_t com
C++ Builder 5 SP1, Interbase 6.0.1.6 IBX 5.04 W2K Pro
Please do not top-post.

"XML avoids the fundamental question of what we should do,
by focusing entirely on how we should do it."

quote from http://www.metatorial.com
Jul 20 '05 #15
