468,507 Members | 1,558 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,507 developers. It's quick & easy.

Indexing and Queries

Hi everybody,

After days reading stuff about indexing, extents, pages, 8KB, 64 KB,
IGNORE_DUP_KEY, CREATE INDEX, bla bla, lalalala, lllllll, grrrrrrr and
testing with different kinds of queries, indexes, situations i'm
getting confused, irritated, etc.

SITUATION
I have the following situation. We have a fact table with 1,8 million
records (Testsituation) and i'm inserting 100000 records and i want to
make it faster. Records can only be inserted when it's different from
the one in the Fact table. Currently we don't have any index on the
table. So i thought that could be quicker when i build some indexes. So
i started experimenting, lalala, and some further more and more...

The facttable has Foreign keys and measures. The foreign keys are
referenced with the primary keys in the dimensions. Also i have a field
called Source_key which indicates an unique value (This could be an
integer, but also a varchar). In this case its a varchar. Also i have
an Checksum field. An example of the query is like this (based on the
northwind database):

INSERT INTO DYN.dbo.SAW_Fact_Order_Detail
SELECT 20051130, 62667,
Customer_Dim_Key =
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
Product_Dim_Key = ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
FK_Order_Date,
FK_Required_Date,
FK_Shipped_Date,
Shipped_Flag,
Order_Duration_Open_Days,
Order_Required_Exceed_Days,
Order_Detail_Unit_Price,
Order_Detail_Quantity,
Order_Detail_Discount,
ExtendedPrice,
Order_Number,
Order_Detail_Number,
Order_Detail_Count,
binary_checksum(
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Order_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Required_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Shipped_Date,
DYN.dbo.TR_Fact_Order_Detail_V.Shipped_Flag,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Duration_Open _Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Required_Exce ed_Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Unit_P rice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Quanti ty,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Discou nt,
DYN.dbo.TR_Fact_Order_Detail_V.ExtendedPrice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Number,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number ,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Count) ,
14,
'N',
getdate(),
'1/1/2049'
FROM DYN.dbo.TR_Fact_Order_Detail_V
LEFT OUTER JOIN DYN.dbo.SAW_B_LU_Customer ON
DYN.dbo.TR_Fact_Order_Detail_V.Customer_Code =
DYN.dbo.SAW_B_LU_Customer.Customer_Code
LEFT OUTER JOIN DYN.dbo.SAW_B_LU_Product ON
DYN.dbo.TR_Fact_Order_Detail_V.Product_Code =
DYN.dbo.SAW_B_LU_Product.Product_Code
WHERE NOT EXISTS
(Select *
From DYN.dbo.SAW_Fact_Order_Detail
Where Checksum_Field = binary_checksum(
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Order_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Required_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Shipped_Date,
DYN.dbo.TR_Fact_Order_Detail_V.Shipped_Flag,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Duration_Open _Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Required_Exce ed_Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Unit_P rice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Quanti ty,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Discou nt,
DYN.dbo.TR_Fact_Order_Detail_V.ExtendedPrice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Number,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number ,
AND DYN.dbo.TR_Fact_Order_Detail_V.Order_Number =
DYN.dbo.SAW_Fact_Order_Detail.Order_Number
AND DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number =
DYN.dbo.SAW_Fact_Order_Detail.Order_Detail_Number
AND Expired_Date = '1/1/2049')
EXPERIMENTS
So i focused on the NOT Exists clause. My thought was that when an
index is created on the fields in WHERE clause of the SELECT statement
in the NOT EXISTS part it would be quicker. Because SQL Server should
be quicker decisioning whether a record existed or not. So far theory.
So i started experimenting.

I No Index.
It took about 118 seconds.

II. I created an referencetable (took some time but not relevant here)
so the insert table and the comparetable were not the same anymore. The
fields in the reference table are Checksumfield, Dossier_Source_Code
(in query example Order_Number and Order_Detail_Number) and
expired_date. So the not exists clause rebuild to reference this newly
created table and now it took about 85 seconds.

III. The part i don't understand is this. So the prior step was a
perfomance gain so i decided to build indexes on the reference table. I
tried al types of different indexes:
* index on checksum (clustered), source_key (Non clustered, unique) and
on expired date (non clustered) --> 99 seconds
* Index on source_key (clustered, unique), checksum_field (non
clustered) and on expired date (non clustered) --> 91 seconds
* the Source key (in the example Order_Number and Order_Detail_Number)
was unique so i decided to build a checksum on these fields and build
an index on the checksum and of course on the Source_key and
expired_date. This took about 101 seconds. What?

So as you can see it took only longer when i build a index. So why? And
has someone any clues to make the query faster?
Thanx ,
Hennie

Nov 30 '05 #1
6 1435
Hennie,

I can't give you the ONE answer 'how to improve the query, but you
should keep in mind that while indexes (should) improve SELECT queries,
they can also slow down INSERTS, because the new records can cause page
splits.
So you have to find a compromise between good SELECT performance and
Insert performance.
By the way, did you try to analyze this query with the Index Tuning
wizard ?

Markus

Dec 1 '05 #2
Hennie,

I definitely hear the irritation in your posting. First it sounds like you
are using straight TSQL for your ETL. While this works, it's hard to
maintain and lacks some key features for performance. I encourage you to
look at some of the industry tools from Informatica, Ascential, or the new
SSIS in 2005. If using only TSQL is your goal here is a tip.

Transactions in SQL is your performance enemy. This is a data warehouse so
transactions aren't very essential. Put your DB in Bulk Load or Simple
recovery mode. If you know each day the number of inserts and deletes will
be high, try writing a new table with all the records out, drop the old
table and rename the new table. A select into should do the trick. You may
even abandon the idea of identifying what changed since writing the entire
table without transactions takes less time. Of course this approach has a
long term issue. As the source grows the time to write larger tables will
always increase.

As for adding an index to the reference table slowing things down. Adding
indexes can slow queries down. Since the optimizer puts more weight on IO
than CPU and memory. The optimizer chooses to use the new index which
reduces IO but in turn changes the plan to do a large hash join. The lowest
IO is not always the best plan.

"Hennie7863" <hd********@hotmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
Hi everybody,

After days reading stuff about indexing, extents, pages, 8KB, 64 KB,
IGNORE_DUP_KEY, CREATE INDEX, bla bla, lalalala, lllllll, grrrrrrr and
testing with different kinds of queries, indexes, situations i'm
getting confused, irritated, etc.

SITUATION
I have the following situation. We have a fact table with 1,8 million
records (Testsituation) and i'm inserting 100000 records and i want to
make it faster. Records can only be inserted when it's different from
the one in the Fact table. Currently we don't have any index on the
table. So i thought that could be quicker when i build some indexes. So
i started experimenting, lalala, and some further more and more...

The facttable has Foreign keys and measures. The foreign keys are
referenced with the primary keys in the dimensions. Also i have a field
called Source_key which indicates an unique value (This could be an
integer, but also a varchar). In this case its a varchar. Also i have
an Checksum field. An example of the query is like this (based on the
northwind database):

INSERT INTO DYN.dbo.SAW_Fact_Order_Detail
SELECT 20051130, 62667,
Customer_Dim_Key =
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
Product_Dim_Key = ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
FK_Order_Date,
FK_Required_Date,
FK_Shipped_Date,
Shipped_Flag,
Order_Duration_Open_Days,
Order_Required_Exceed_Days,
Order_Detail_Unit_Price,
Order_Detail_Quantity,
Order_Detail_Discount,
ExtendedPrice,
Order_Number,
Order_Detail_Number,
Order_Detail_Count,
binary_checksum(
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Order_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Required_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Shipped_Date,
DYN.dbo.TR_Fact_Order_Detail_V.Shipped_Flag,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Duration_Open _Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Required_Exce ed_Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Unit_P rice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Quanti ty,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Discou nt,
DYN.dbo.TR_Fact_Order_Detail_V.ExtendedPrice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Number,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number ,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Count) ,
14,
'N',
getdate(),
'1/1/2049'
FROM DYN.dbo.TR_Fact_Order_Detail_V
LEFT OUTER JOIN DYN.dbo.SAW_B_LU_Customer ON
DYN.dbo.TR_Fact_Order_Detail_V.Customer_Code =
DYN.dbo.SAW_B_LU_Customer.Customer_Code
LEFT OUTER JOIN DYN.dbo.SAW_B_LU_Product ON
DYN.dbo.TR_Fact_Order_Detail_V.Product_Code =
DYN.dbo.SAW_B_LU_Product.Product_Code
WHERE NOT EXISTS
(Select *
From DYN.dbo.SAW_Fact_Order_Detail
Where Checksum_Field = binary_checksum(
ISNULL(DYN.dbo.SAW_B_LU_Customer.Customer_Dim_Key, 0),
ISNULL(DYN.dbo.SAW_B_LU_Product.Product_Dim_Key,0) ,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Order_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Required_Date,
DYN.dbo.TR_Fact_Order_Detail_V.FK_Shipped_Date,
DYN.dbo.TR_Fact_Order_Detail_V.Shipped_Flag,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Duration_Open _Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Required_Exce ed_Days,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Unit_P rice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Quanti ty,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Discou nt,
DYN.dbo.TR_Fact_Order_Detail_V.ExtendedPrice,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Number,
DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number ,
AND DYN.dbo.TR_Fact_Order_Detail_V.Order_Number =
DYN.dbo.SAW_Fact_Order_Detail.Order_Number
AND DYN.dbo.TR_Fact_Order_Detail_V.Order_Detail_Number =
DYN.dbo.SAW_Fact_Order_Detail.Order_Detail_Number
AND Expired_Date = '1/1/2049')
EXPERIMENTS
So i focused on the NOT Exists clause. My thought was that when an
index is created on the fields in WHERE clause of the SELECT statement
in the NOT EXISTS part it would be quicker. Because SQL Server should
be quicker decisioning whether a record existed or not. So far theory.
So i started experimenting.

I No Index.
It took about 118 seconds.

II. I created an referencetable (took some time but not relevant here)
so the insert table and the comparetable were not the same anymore. The
fields in the reference table are Checksumfield, Dossier_Source_Code
(in query example Order_Number and Order_Detail_Number) and
expired_date. So the not exists clause rebuild to reference this newly
created table and now it took about 85 seconds.

III. The part i don't understand is this. So the prior step was a
perfomance gain so i decided to build indexes on the reference table. I
tried al types of different indexes:
* index on checksum (clustered), source_key (Non clustered, unique) and
on expired date (non clustered) --> 99 seconds
* Index on source_key (clustered, unique), checksum_field (non
clustered) and on expired date (non clustered) --> 91 seconds
* the Source key (in the example Order_Number and Order_Detail_Number)
was unique so i decided to build a checksum on these fields and build
an index on the checksum and of course on the Source_key and
expired_date. This took about 101 seconds. What?

So as you can see it took only longer when i build a index. So why? And
has someone any clues to make the query faster?
Thanx ,
Hennie

Dec 1 '05 #3
>tried al types of different indexes:
* index on checksum (clustered), source_key (Non clustered, unique) and
on expired date (non clustered) --> 99 seconds
* Index on source_key (clustered, unique), checksum_field (non
clustered) and on expired date (non clustered) --> 91 seconds
that is normal. accessing via a non-clustered index (thru a bookmark)
may be quite slow. Having all the columns in just one index might be
better.
* the Source key (in the example Order_Number and Order_Detail_Number)
was unique so i decided to build a checksum on these fields and build
an index on the checksum and of course on the Source_key and
expired_date. This took about 101 seconds. What?


are you speaking about one index containing all those columns?

Also note that if all the columns are not nullable, not exists()
queries may be rewritten as not in() ones.
Also try rewriting you not exists() query as an outer join.

try all 3 ways, they are equivalent if the columns you join on are not
nullable:

1.
select c.left_PK
from left_table c
where left_PK not in(select left_PK from right_table t
where c.left_PK=t.left_PK)

2.
select c.left_PK
from left_table c
where not exists(select 1 from right_table t
where c.left_PK=t.left_PK)

3.
select c.left_PK
from left_table c left outer join right_table t
on c.left_PK=t.left_PK
where t.left_PK is null

Dec 1 '05 #4
oops I meant

3.
select c.left_PK
from left_table c left outer join right_table t
on c.left_PK=t.left_PK
where t.RIGHT_PK is null

Dec 1 '05 #5
Hi Markus,

Thanx for your answer. Your answer says exactly what i wanted to do. I
wanted an insert table and a reference table so that the query would
function at its best: one table for inserting and one for referencing
and this would make it quicker (but it takes some to build this
reference table). The next step is building indexes on this referencing

I did use the index tuning wizard and its said that i should build an
index on the checksum field en the fields (ordernumber and order
detail) which is the unique identification of a record.

Hennie

Dec 6 '05 #6
put a single index on the main fact table on "Order_num" assuming this
is a somewhat unique piece of data.
Then, DELETE from your temp table everything that already exists in the
main fact table. This will allow the engine to do a single scan on the
temp table, efficiently look it up in the fact table using a FAST
index, and get rid of dups.

Then, just add what is left of the temp table to the main fact table.
Using this approach, you will never have to do a scan of the main fact
table, ESPECIALLY not a scan for EVERY record in the temp table.

Throwing indexes onto the fact table for non-unique items is generally
a bad thing, and slows things down.

Dec 12 '05 #7

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

1 post views Thread by Thomas Bartkus | last post: by
2 posts views Thread by Kent.Brooke | last post: by
7 posts views Thread by Ryan | last post: by
reply views Thread by NPC403 | last post: by
3 posts views Thread by gieforce | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.