
Fastest way to execute aggegate functions on a table?

I've probably not given the best title to this topic, but that
reflects my relative "newbie" status.

I have a table that goes essentially

TS DATETIME
jobnumber VARCHAR
jobentry VARCHAR
...

the TS is a time stamp, and the other two fields are job number
and entries. There are (many) more fields, but this is the core of
it. The relationship is that there will be several entries per job, with
one row in the table per entry (i.e. several rows per job).

In constructing a web interface I want to create a list of recent
job numbers, and summarize it broadly as follows

max(ts)
jobnumber
count(jobentry)
...

I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than

select top 30 jobnumber
from Jobs
group by jobnumber
order by jobnumber desc

leading me to suspect that the aggregate functions are slowing this
down. I would guesstimate the difference in speed is around a factor
of 10-20.

As I type this I think I've just realised it's the

order by time desc

that is probably causing the problem (in that it will need to
calculate max(ts) for all jobnumber's in the table in order to
execute the "order by" clause).

I think I can see the solution now (amazing what typing out a long
question can do for you :-)

My question was going to be whether there were any approved methods/tricks
for avoiding this sort of performance hit.

It seemed to me that if I could first grab a decent number of recent
records (quickly) I could then execute the aggregate functions against
that smaller set of records. I know in broad terms how many entries
there can be per job, so I could select enough to make up at least 30
jobs, and then execute the real query against that. In my case there
will be probably less than 10 entries per job, so I could grab 300
records and execute against that, instead of against the whole
400,000.

That being the case, is this best done

a) as a subquery, and if so, how

b) by creating a temporary intermediate table (which is
in essence a more explicit version of (a), isn't it?)

Another solution that occurred was

c) to create and maintain a summary table, with just
one record per jobnumber with a view to having it
serve this particular (common) enquiry.

For (c) to work I would probably need triggers on the Jobs table to
maintain the summary table. This would probably represent more
overhead overall, but should certainly give a better response
when the query outlined above is executed on the web site.
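If it came to that, option (c) might be sketched roughly as below. The summary table layout, column sizes, and trigger name are illustrative assumptions only, and a real version would also need to handle deletes and updates to Jobs, not just inserts.

```sql
-- Hypothetical sketch of option (c): a summary table kept up to date
-- by an insert trigger on Jobs. Names and column sizes are made up.
CREATE TABLE JobSummary (
    jobnumber  VARCHAR(20) NOT NULL PRIMARY KEY,
    lastts     DATETIME    NOT NULL,
    entrycount INT         NOT NULL
)
GO
CREATE TRIGGER trg_Jobs_summary ON Jobs FOR INSERT
AS
    -- Refresh summary rows for jobs that already have one
    UPDATE s
    SET    lastts     = (SELECT MAX(j.ts) FROM Jobs j
                         WHERE  j.jobnumber = s.jobnumber),
           entrycount = (SELECT COUNT(*) FROM Jobs j
                         WHERE  j.jobnumber = s.jobnumber)
    FROM   JobSummary s
    WHERE  s.jobnumber IN (SELECT jobnumber FROM inserted)

    -- Add summary rows for jobs seen for the first time
    INSERT INTO JobSummary (jobnumber, lastts, entrycount)
    SELECT i.jobnumber, MAX(i.ts), COUNT(*)
    FROM   inserted i
    WHERE  NOT EXISTS (SELECT * FROM JobSummary s
                       WHERE  s.jobnumber = i.jobnumber)
    GROUP  BY i.jobnumber
GO
```

The web query then collapses to a trivial "SELECT TOP 30 ... FROM JobSummary ORDER BY lastts DESC", at the cost of the trigger overhead on every insert.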

The response time to this query is becoming critical on the web
interface to avoid timeouts.

Any suggestions or comments are welcome. If it's RTFM, then I'd
appreciate page numbers :-)

Thanks for reading this far :-)
--
HTML-to-text and markup removal with Detagger
http://www.jafsoft.com/detagger/
Jul 23 '05 #1
4 Replies


John A Fotheringham (ja*@jafsoft.com) writes:
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than


A non-clustered index on (jobnumber, ts) should do wonders. Also replace
count(jobentry) with count(*). The difference is that count(*) counts
all rows, while count(jobentry) only counts rows where jobentry is non-NULL.
Assuming that jobentry is NOT NULL, count(*) will give the same result.
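Concretely, that might look like the following (the index name is arbitrary, and ts is the underlying column; "time" above is only the alias):

```sql
-- The suggested composite index (name is arbitrary):
CREATE NONCLUSTERED INDEX ix_Jobs_jobnumber_ts ON Jobs (jobnumber, ts)

-- And the query with count(*) in place of count(jobentry):
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(*)
FROM   Jobs
GROUP  BY jobnumber
ORDER  BY time DESC
```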
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #2

Erland Sommarskog <es****@sommarskog.se> wrote:
John A Fotheringham (ja*@jafsoft.com) writes:
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than
A non-clustered index on (jobnumber, ts) should do wonders.


I have separate indexes on jobnumber, ts etc. In particular I have
added an index to each column mentioned in the "logic" of the select
but probably not on all the columns that are selected as data from the
table. Would that make a difference? It doesn't seem to have as big an
impact as using the aggregate function in the "order by" clause.

Also, what is the difference between creating

index (jobnumber, time)

and creating separate indexes on jobnumber and time? I should restate
that I'm a comparative newbie here, so still getting to grips with
these sorts of performance tuning issues, so any pointers are
appreciated.
Also replace
count(jobentry) with count(*). The difference is that count(*) counts
all rows, while count(jobentry) only counts rows where jobentry is non-NULL.
Assuming that jobentry is NOT NULL, count(*) will give the same result.


Probably not an issue here as jobentry is never NULL (although that's
more by implementation than table design :-)

Jul 23 '05 #3

John A Fotheringham (ja*@jafsoft.com) writes:
Also, what is the difference between creating

index (jobnumber, time)

and creating separate indexes on jobnumber and time? I should restate
that I'm a comparative newbie here, so still getting to grips with
these sorts of performance tuning issues, so any pointers are
appreciated.


Well, try this:

SELECT jobnumber FROM tbl ORDER BY jobnumber
SELECT time FROM tbl ORDER BY time

Run these queries and print the results. Then use the printouts to
find the MAX(time) for each job manually. (Or at least try to imagine
how you would do it!)

When you have grown tired of this, try:

SELECT jobnumber, time FROM tbl ORDER BY jobnumber, time

Print this as well, and do the same manual exercise. You will find that
what previously was a very laborious and boring task is now suddenly quite
simple.

There are two ways you gain with the index (jobnumber, time):
1) The index makes it easy for SQL Server to find the max time for each
job quickly.
2) The index covers the query, which means that the query can be
computed entirely from the index, no need to access the data pages.
The index is much smaller than the data pages, since there are fewer
columns in the index. And since the index is smaller, more rows fit on
one page, and SQL Server has to read fewer pages. This is why I suggested
removing jobentry: with jobentry in the query, SQL Server would have to
access the data pages, which could lead to it opting to scan
the entire table instead.
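To check whether a query is in fact covered by an index, one way (assuming the SQL Server 2000 era tooling being discussed here) is to inspect the query plan:

```sql
-- Display the estimated plan instead of executing the query:
SET SHOWPLAN_TEXT ON
GO
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(*)
FROM   Jobs
GROUP  BY jobnumber
ORDER  BY time DESC
GO
SET SHOWPLAN_TEXT OFF
GO
```

A seek or scan of the composite index with no Bookmark Lookup step indicates the query is being answered entirely from the index.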

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #4


I'm answering my own post to document the solution I adopted.
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc


I used a subquery as follows

select top 30 max(ts) as time, jobnumber, count(jobentry), ...
from Jobs
where jobnumber in
(select top 1000 jobnumber from Jobs
order by ts desc)
group by jobnumber
order by time desc

In essence I grab the last 1000 records from the Jobs table (which has
nearly 500,000), and then execute the aggregate query against the
results. This is much quicker than the original query, completing in
under a second compared to 20 seconds.

The only problem is that the choice of 1000 records is a little
arbitrary, as it's difficult to know what number is required to get
the 30 results wanted. In my situation I know the average number of
entries per job, and so can be confident that 1000 is ample. The
situation becomes more complex if I want to add a "where" clause to the
query (e.g. to limit to jobs allocated to a particular user), as the
number of records that need to be scanned to get 30 results will then
be larger, but in this limited and key situation this solution seems
to work for me.
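For the record, a per-user variant might look like the sketch below. The userid column and @userid parameter are hypothetical; the key point is that the filter has to go inside the subquery, or the inner TOP 1000 will mostly return other users' jobs.

```sql
-- Hypothetical variant with a per-user filter (userid is made up):
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(jobentry)
FROM   Jobs
WHERE  jobnumber IN
       (SELECT TOP 1000 jobnumber FROM Jobs
        WHERE  userid = @userid    -- filter applied inside the subquery
        ORDER  BY ts DESC)
GROUP  BY jobnumber
ORDER  BY time DESC
```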
Jul 23 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.