
Fastest way to execute aggegate functions on a table?

I've probably not given the best title to this topic, but that
reflects my relative "newbie" status.

I have a table that goes essentially

TS DATETIME
jobnumber VARCHAR
jobentry VARCHAR
...

the TS is a time stamp, and the other two fields are job number
and entries. There are (many) more fields, but this is the core of
it. The relationship is that there will be several entries per job, with
one row in the table per entry (i.e. several rows per job).

In constructing a web interface I want to create a list of recent
job numbers, and summarize it broadly as follows

max(ts)
jobnumber
count(jobentry)
...

I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than

select top 30 jobnumber
from Jobs
group by jobnumber
order by jobnumber desc

leading me to suspect that the aggregate functions are slowing this
down. I would guesstimate the difference in speed is around a factor
of 10-20.

As I type this I think I've just realised it's the

order by time desc

that is probably causing the problem (in that it will need to
calculate max(ts) for all jobnumber's in the table in order to
execute the "order by" clause).

I think I can see the solution now (amazing what typing out a long
question can do for you :-)

My question was going to be whether there were any approved methods/tricks
for avoiding this sort of performance hit.

It seemed to me that if I could first grab a decent number of recent
records (quickly) I could then execute the aggregate functions against
that smaller set of records. I know in broad terms how many entries
there can be per job, so I could select enough to make up at least 30
jobs, and then execute the real query against that. In my case there
will be probably less than 10 entries per job, so I could grab 300
records and execute against that, instead of against the whole
400,000.

That being the case, is this best done

a) as a subquery, and if so, how

b) by creating a temporary intermediate table (which is
in essence a more explicit version of (a), isn't it?)

Another solution that occurred was

c) to create and maintain a summary table, with just
one record per jobnumber with a view to having it
serve this particular (common) enquiry.

For (c) to work I would probably need triggers on the Jobs table to
maintain the summary table. This would probably represent more
overhead overall, but should certainly give a better response
when the query outlined above is executed on the web site.
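If it came to that, option (c) might be sketched roughly as below. The summary table layout, column sizes, and trigger name are illustrative assumptions only, and a real version would also need to handle deletes and updates to Jobs, not just inserts.

```sql
-- Hypothetical sketch of option (c): a summary table kept up to date
-- by an insert trigger on Jobs. Names and column sizes are made up.
CREATE TABLE JobSummary (
    jobnumber  VARCHAR(20) NOT NULL PRIMARY KEY,
    lastts     DATETIME    NOT NULL,
    entrycount INT         NOT NULL
)
GO
CREATE TRIGGER trg_Jobs_summary ON Jobs FOR INSERT
AS
    -- Refresh summary rows for jobs that already have one
    UPDATE s
    SET    lastts     = (SELECT MAX(j.ts) FROM Jobs j
                         WHERE  j.jobnumber = s.jobnumber),
           entrycount = (SELECT COUNT(*) FROM Jobs j
                         WHERE  j.jobnumber = s.jobnumber)
    FROM   JobSummary s
    WHERE  s.jobnumber IN (SELECT jobnumber FROM inserted)

    -- Add summary rows for jobs seen for the first time
    INSERT INTO JobSummary (jobnumber, lastts, entrycount)
    SELECT i.jobnumber, MAX(i.ts), COUNT(*)
    FROM   inserted i
    WHERE  NOT EXISTS (SELECT * FROM JobSummary s
                       WHERE  s.jobnumber = i.jobnumber)
    GROUP  BY i.jobnumber
GO
```

The web query then collapses to a trivial "SELECT TOP 30 ... FROM JobSummary ORDER BY lastts DESC", at the cost of the trigger overhead on every insert.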

The response time to this query is becoming critical on the web
interface to avoid timeouts.

Any suggestions or comments are welcome. If it's RTFM, then I'd
appreciate page numbers :-)

Thanks for reading this far :-)
--
HTML-to-text and markup removal with Detagger
http://www.jafsoft.com/detagger/
Jul 23 '05 #1
4 Replies


John A Fotheringham (ja*@jafsoft.com) writes:
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than


A non-clustered index on (jobnumber, ts) should do wonders. Also replace
count(jobentry) with count(*). The difference is that count(*) counts
all rows, while count(jobentry) only counts rows where jobentry is non-NULL.
Assuming that jobentry is NOT NULL, count(*) will give the same result.
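Concretely, that might look like the following (the index name is arbitrary, and ts is the underlying column; "time" above is only the alias):

```sql
-- The suggested composite index (name is arbitrary):
CREATE NONCLUSTERED INDEX ix_Jobs_jobnumber_ts ON Jobs (jobnumber, ts)

-- And the query with count(*) in place of count(jobentry):
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(*)
FROM   Jobs
GROUP  BY jobnumber
ORDER  BY time DESC
```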
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #2

Erland Sommarskog <es****@sommarskog.se> wrote:
John A Fotheringham (ja*@jafsoft.com) writes:
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc

However I'm now finding that this is quite slow now that my test table
has around 400,000 entries (even though I've added indexes to the most
relevant fields). In particular it's much slower than
A non-clustered index on (jobnumber, ts) should do wonders.


I have separate indexes on jobnumber, ts etc. In particular I have
added an index to each column mentioned in the "logic" of the select
but probably not on all the columns that are selected as data from the
table. Would that make a difference? It doesn't seem to have as big an
impact as using the aggregate function in the "order by" clause.

Also, what is the difference between creating

index (jobnumber, time)

and creating separate indexes on jobnumber and time? I should restate
that I'm a comparative newbie here, so still getting to grips with
these sorts of performance tuning issues, so any pointers are
appreciated.
Also replace
count(jobentry) with count(*). The difference is that count(*) counts
all rows, while count(jobentry) only counts rows where jobentry is non-NULL.
Assuming that jobentry is NOT NULL, count(*) will give the same result.


Probably not an issue here as jobentry is never NULL (although that's
more by implementation than table design :-)

Jul 23 '05 #3

John A Fotheringham (ja*@jafsoft.com) writes:
Also, what is the difference between creating

index (jobnumber, time)

and creating separate indexes on jobnumber and time? I should restate
that I'm a comparative newbie here, so still getting to grips with
these sorts of performance tuning issues, so any pointers are
appreciated.


Well, try this:

SELECT jobnumber FROM tbl ORDER BY jobnumber
SELECT time FROM tbl ORDER BY time

Run these queries and print the results. Then use the printouts to
find the MAX(time) for each job manually. (Or at least try to imagine
how you would do it!)

When you have grown tired of this, try:

SELECT jobnumber, time FROM tbl ORDER BY jobnumber, time

Print this as well, and do the same manual exercise. You will find that
what previously was a very laborious and boring task is now suddenly quite
simple.

There are two ways you gain with the index (jobnumber, time):
1) The index makes it easy for SQL Server to find the max time for each
job quickly.
2) The index covers the query, which means that the query can be
computed entirely from the index, no need to access the data pages.
The index is much smaller than the data pages, since there are fewer
columns in the index. And since the index is smaller, more rows fit on
one page, and SQL Server has to read fewer pages. This is why I suggested
removing jobentry: with jobentry in the query, SQL Server would have to
access the data pages, which could lead to it opting to scan
the entire table instead.
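To check whether a query is in fact covered by an index, one way (assuming the SQL Server 2000 era tooling being discussed here) is to inspect the query plan:

```sql
-- Display the estimated plan instead of executing the query:
SET SHOWPLAN_TEXT ON
GO
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(*)
FROM   Jobs
GROUP  BY jobnumber
ORDER  BY time DESC
GO
SET SHOWPLAN_TEXT OFF
GO
```

A seek or scan of the composite index with no Bookmark Lookup step indicates the query is being answered entirely from the index.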

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #4


I'm answering my own post to document the solution I adopted.
I can do this by a select command as follows

select top 30 max(ts) as time, jobnumber, count(jobentry)
from Jobs
group by jobnumber
order by time desc


I used a subquery as follows

select top 30 max(ts) as time, jobnumber, count(jobentry), ...
from Jobs
where jobnumber in
(select top 1000 jobnumber from Jobs
order by ts desc)
group by jobnumber
order by time desc

In essence I grab the last 1000 records from the Jobs table (which has
nearly 500,000), and then execute the aggregate query against the
results. This is much quicker than the original query, completing in
under a second compared to 20 seconds.

The only problem is that the choice of 1000 records is a little
arbitrary, as it's difficult to know what number is required to get
the 30 results wanted. In my situation I know the average number of
entries per job, and so can be confident that 1000 is ample. The
situation becomes more complex if I want to add a "where" clause to the
query (e.g. to limit to jobs allocated to a particular user), as the
number of records that need to be scanned to get 30 results will then
be larger, but in this limited and key situation this solution seems
to work for me.
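For the record, a per-user variant might look like the sketch below. The userid column and @userid parameter are hypothetical; the key point is that the filter has to go inside the subquery, or the inner TOP 1000 will mostly return other users' jobs.

```sql
-- Hypothetical variant with a per-user filter (userid is made up):
SELECT TOP 30 MAX(ts) AS time, jobnumber, COUNT(jobentry)
FROM   Jobs
WHERE  jobnumber IN
       (SELECT TOP 1000 jobnumber FROM Jobs
        WHERE  userid = @userid    -- filter applied inside the subquery
        ORDER  BY ts DESC)
GROUP  BY jobnumber
ORDER  BY time DESC
```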
Jul 23 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.