470,648 Members | 1,602 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Share your developer knowledge by writing an article on Bytes.

How To: Optimize SQL Queries

Rabbit
12,516 Expert Mod 8TB
Introduction
In this article, I discuss the basic methods I use to analyse and optimize SQL queries. A properly written SQL with appropriate indices can turn a query that runs in hours into one that runs in seconds. While the code in this article is rooted in SQL Server, most of the concepts will apply to other databases.

We will go over:
  1. Analyzing Performance of Queries
  2. Optimizing SQL Syntax
  3. Query Execution Plans
  4. Creating Proper Indices

Analyzing Performance of Queries
SQL Server comes with a few options to analyse the performance of a query. My basic barrage of tests that I use on a query is this:
Expand|Select|Wrap|Line Numbers
  1. CHECKPOINT
  2. DBCC DROPCLEANBUFFERS
  3. DBCC FREEPROCCACHE
  4.  
  5. SET STATISTICS IO ON
  6. SET STATISTICS TIME ON
  7. SET STATISTICS PROFILE ON
  8. GO
  9.  
  10. -- SQL GOES HERE
  11. GO
  12.  
  13. SET STATISTICS IO OFF
  14. SET STATISTICS TIME OFF
  15. SET STATISTICS PROFILE OFF
  16. GO
The first thing I do is to clear all the caches so that I can the SQL can run fresh without help from previous runs.

CHECKPOINT - writes any unwritten buffers.
DBCC DROPCLEANBUFFERS - clears out the buffer cache.
DBCC FREEPROCCACHE - clears out the plan cache.

After that, I turn on some statistical options.

SET STATISTICS IO - shows you information on the disk activity that occurs with a query. The lower the numbers, the faster it is.
SET STATISTICS TIME ON - shows you how long, in milliseconds, it takes a query to run.
SET STATISTICS PROFILE ON - shows you a text version of the query execution plan. This plan reveals every step the database uses to run the query and how long it takes to run that step.

Optimizing SQL Syntax
There many syntactical choices you can make that will speed up a query. I will discuss some of those choices here. The first eight are relatively self explanatory but the last three merit further discussion.
  • Avoid using views.
  • Use an INNER JOIN as opposed to one of the OUTER JOINs whenever possible.
  • Avoid the SELECT *, return only the fields that you need.
  • Don't SORT or ORDER BY unless it's needed.
  • Don't JOIN to tables that you don't need for the query.
  • Avoid NOT, i.e. NOT IN and NOT EXISTS.
  • EXISTS is quicker than IN.
  • Use UNION ALL rather than UNION if you don't need DISTINCT records.
  • Sometimes a UNION is quicker than OR. To know which one is quicker, try both and look at the statistics.
  • When filtering records in the WHERE clause, try to avoid calculations on fields. Move them to the constant side of the equation. If it's not possible to avoid, use an index that includes the calculation.
  • Ninety percent of the time, filtering using a subquery in the WHERE clause is slower than filtering by JOINing to a subquery. To know which one is quicker, try both and look at the statistics.

In the WHERE clause, using an OR condition can sometimes slow down a query by a lot. If this happens, try a UNION or UNION ALL instead.
Expand|Select|Wrap|Line Numbers
  1. -- OR syntax
  2. SELECT someField
  3. FROM someTable
  4. WHERE otherField = 5 OR otherField = 6;
  5.  
  6. -- UNION syntax
  7. SELECT someField
  8. FROM someTable
  9. WHERE otherField = 5
  10.  
  11. UNION ALL
  12.  
  13. SELECT someField
  14. FROM someTable
  15. WHERE otherField = 6;
  16.  
It makes for much longer code but the time savings could be huge. I once had a query that took hours using the OR syntax but under 5 seconds using the UNION. Test both and look at the statistics to see which would be faster in your case.

When you do calculations on a field in the WHERE clause, the query can no longer just scan an index because it now must perform the calculation on every record to see if it should be returned. For example, if I had a date field and I wanted to know if it was less than seven days old from the current date, I have a couple of syntactical choices:
Expand|Select|Wrap|Line Numbers
  1. DATEDIFF(D, dateField, GETDATE()) <= 5
  2.  
  3. dateField BETWEEN DATEADD(D, -5, GETDATE()) AND GETDATE()
The first is slower because an index on dateField can't be used since it must calculate for every record. And in a similar vein but with drastically different syntax:
Expand|Select|Wrap|Line Numbers
  1. Left(LastName, 1) = 'T'
  2.  
  3. LastName LIKE 'T%'
More often than not, if you need to filter records based on another table or the same table, using a subquery in a JOIN is quicker than using a subquery in the WHERE clause. For example, if I wanted all the fields for a record but only wanted the most current record based on a category and date field, I could do it one of two ways:
Expand|Select|Wrap|Line Numbers
  1. SELECT *
  2. FROM someTable AS t
  3. WHERE dateField = (
  4.    SELECT MAX(dateField)
  5.    FROM someTable
  6.    WHERE categoryField = t.categoryField
  7. );
  8.  
  9. SELECT *
  10. FROM someTable AS t1
  11. INNER JOIN (
  12.    SELECT categoryField, MAX(dateField) AS dateField
  13.    FROM someTable
  14.    GROUP BY categoryField
  15. ) AS t2
  16. ON t1.categoryField = t2.categoryField
  17.    AND t1.dateField = t2.dateField;
Test both and look at the statistics to see which would be faster in your case. I've had situations where a query is much faster using the WHERE clause and situations where a query is much faster in the JOIN. But the JOIN approach is correct most of the time.

Query Execution Plans
The query execution plan tells you the steps the database is going to take to run a query along with the estimated/actual time it took. If a query is running slowly, this will show which step it's getting hung up on. There are a lot of different operators that the database can use to run a query but the main culprits of a slow query are Table Scans and RID Lookups. Table scans read every row of a table to find the correct records to return. RID lookups are used when a non-clustered index is being utilized to find a matching row. It merely points to a location on a heap of data and has to go to the heap to look up the value rather than the data being stored in the index. If you see either of these, you should create the appropriate indices to get rid of them. However, there's a caveat regarding table scans, for tables with a small amount of rows, table scans are quicker than index scans.

Creating Proper Indices
Using a clustered index on a primary key field is preferable. Clustered indices significantly speed up searching on a field. A clustered index determines the physical order of data in a table. Therefore, you can only have one. However, you can have as many non-clustered indices as you want.

When creating an index, order matters. Put the fields in the following order: WHERE clause, JOIN clause, ORDER BY clause, SELECT clause. So if I wanted to optimize the following query:
Expand|Select|Wrap|Line Numbers
  1. SELECT t1.someField
  2. FROM someTable t1
  3. INNER JOIN otherTable t2
  4. ON t1.PK = t2.FK
  5. WHERE t1.otherField = 15
  6. ORDER BY t1.dateField
Then I would create an index on someTable with the fields in the following order: otherField, PK, dateField, someField. And an index on otherTable on the field FK.

Incidentally, the previous example was also a fully covered query. What that means is the index contains all fields required for the query. This means that the database does not have to refer back to the data in the table to find additional data for the query. When feasible, you should create an index to fully cover a query you want to run often.

However, it is important to note that an index can be "too large". Both in terms of technical limitations and usability. The more fields and the larger the fields included in an index, the longer it takes to run a query using the index.

Conclusion
A quick note about hardware. No matter how much optimization you do, sometimes you just need pure power. If you've optimized a query as much as you can but it's still too slow, you may have to upgrade your hardware.

I hope this article has armed you with the basic tools you need to take a query that runs in hours and make it run in seconds. Remember, optimization is a very deep subject and these are just the basic tools that you can use to speed up your query. The next time you have a long running query, analyse the statistics and execution plans, create the proper indices with help from the execution plan, and test different versions of the SQL syntax.
Mar 12 '12 #1
7 12493
r035198x
13,262 8TB
This is very useful. Good work.
Mar 13 '12 #2
ck9663
2,878 Expert 2GB
Rabbit,

Very nicely done and a very helpful article. I have some other ideas and will keep adding them on this post as I remember or found them.

For now here are some of them.

These are all based on my experience and could be different from other cases.

If OR or AND is necessary
If you you really need these logical operators, arrange your conditions properly.

If you are using AND, pick the most probably False condition first. On the other hand, if you are using OR, pick the most probable True condition first.

An AND operator will return False if any of the condition is false. While an OR operator will return True if any of the condition is True.

There is actually no guarantee that SQL Server will not perform the rest of the conditions, but in some cases it stops checking the other tests when the condition is or not satisfied.

Arrange the conditions in CASE WHEN
In a CASE WHEN statement/function, place the most probable TRUE condition first so the command does not have to parse through the entire statement only to find the most use condition at the bottom. This might be a small improvement if it's a simple condition on the WHEN clause. But if you have a lot of WHEN clauses or just a few but the condition on those WHENs requires complex calculation or even referring to columns on other tables, you will experience improvement if conditions are properly arranged.

Index Seek vs Index Scan
Remember that not because you have an index on your table, SQL Server will always use it. SQL Server will figure out for itself if it's better to use Index Seek or Index Scan. Because there are times that statistics are not updated in the database, the server will use of Index Scan instead of Index Seek.

Use the statistics and Execution Plan as describe by Rabbit above. As much as possible, try to eliminate Index Scan.

Use CURSORs and RBARs as the last option
RBAR (Row By Agonizing Row) is a process where the query is forcing the server to process the data by row and not by result set. Much like how a CURSOR is processed. Although these technique have their own use, you must, as much as possible, use a result based technique than a Row by Row process. If you have to, use temporary or permanent tables.

A table must have at least one index
Although it's not required, this mindset will help you make sure that all tables have an index before you use them. And if a table will only have one index, make it a CLUSTERED. Of course you still have to check how the table is being used and/or updated.

More to come, work for now...
Mar 13 '12 #3
Rabbit
12,516 Expert Mod 8TB
Thanks for those additional ideas ck. We could make this thread the go to place for optimizing queries.

If any one has additional optimizations, please feel free to add your two cents.
Mar 13 '12 #4
Rabbit, Great article. Anyone have any advice on how to improve performance on a non-deterministic computed column? I have some of those, and they usually are the most costly operation in the execution plan.
Mar 13 '12 #5
ck9663
2,878 Expert 2GB
Try to see if you can execute the function at run time and see how the execution plan looks. Depending on how you use the function, you might be able to refactor and restructure your query to utilize Index Seek during execution instead of Index Scan.

Most of the time a non-deterministic function uses an Index Scan and not Index Seek.


Good Luck!!!


~~ CK
Mar 14 '12 #6
Rabbit
12,516 Expert Mod 8TB
My first piece of advice would be to not use it. If we know what the calculation is doing, we may be able to substitute it with something else. If there's no way around it, make sure it's persistent and put it in an index.
Mar 14 '12 #7
mafaisal
142 100+
Very Good Article.
Thanks Rabbit, CK

By Faisal
Oct 20 '13 #8

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

4 posts views Thread by jy2003 | last post: by
1 post views Thread by Bruce MacDonald | last post: by
5 posts views Thread by PeteCresswell | last post: by
5 posts views Thread by Martin Lacoste | last post: by
13 posts views Thread by Frank Swarbrick | last post: by
9 posts views Thread by jehugaleahsa | last post: by
1 post views Thread by Korara | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.