Bytes | Software Development & Data Engineering Community

behavior of SQL on joined queries

Hi all,

Currently our product has a setup that stores information about
transactions in a transaction table. Additionally, certain transactions
pertain to specific people, and extra information is stored in another
table. So for good or ill, things look like this right now:

create table TransactionHistory (
TrnID int identity (1,1),
TrnDT datetime,
--other information about a basic transaction goes here.
--All transactions have this info
Primary Key Clustered (TrnID)
)

Create Index TrnDTIndex on TransactionHistory (TrnDT)

create table PersonTransactionHistory (
TrnID int,
PersonID int,
--extended data pertaining only to "person" transactions goes
--here. only Person transactions have this
Primary Key Clustered (TrnID),
Foreign Key (TrnID) references TransactionHistory (TrnID)
)

Create Index TrnPersonIDIndex on PersonTransactionHistory (PersonID)
A query about a group of people over a certain date range might fetch
information like so:

select * from TransactionHistory TH
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in some criteria
and TH.TrnDT between some date and some date

In my experience, this poses a real problem when trying to run queries
that use both date and PersonID criteria. If my guesses are correct, this
is because SQL Server is forced to do one of two things:

1 - Use TrnPersonIDIndex to find all transactions which match the person
criteria, then for each do a lookup in PersonTransactionHistory to
fetch the TrnID, subsequently look that TrnID up in the clustered
index of the TransactionHistory table, and finally determine whether a given
transaction also matches the datetime criteria.

2 - Use TrnDTIndex to find all transactions matching the date criteria,
and then perform lookups similar to the above, except on PersonID instead
of datetime.
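For what it's worth, the two access paths described above can be forced with index hints and compared under STATISTICS IO (a sketch only; the PersonID list and date literals are placeholders for the real criteria):

```sql
SET STATISTICS IO ON;

-- Plan 1: drive from the person index, then look up dates in TH.
select * from TransactionHistory TH
inner join PersonTransactionHistory PTH with (index (TrnPersonIDIndex))
on TH.TrnID = PTH.TrnID
where PTH.PersonID in (1, 2, 3)
and TH.TrnDT between '20040101' and '20040201';

-- Plan 2: drive from the date index, then look up persons via PTH.
select * from TransactionHistory TH with (index (TrnDTIndex))
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in (1, 2, 3)
and TH.TrnDT between '20040101' and '20040201';
```

The logical reads reported for each variant show which lookup pattern the actual data distribution favors.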

Compounding this is my suspicion (based on comparing performance when I
specify which indexes to use in the query versus when I let SQL Server
decide itself) that SQL Server sometimes chooses a very non-optimal course.
(Of course, sometimes it chooses a better course than I do - the point is
that I want it to always pick a good enough course on its own, so that I
don't have to bother specifying.) Perhaps the table layout is making it
difficult for SQL Server to find a good query plan in all cases.

Basically I'm trying to determine ways to improve our table design here to
make reporting easier, as this gets painful when running reports for
large groups of people over large date ranges. I see a few options based
on my hypothesis above, and am looking for comments and/or corrections.

1 - Add the TrnDT column to the PersonTransactionHistory table as
well. Then create a foreign key relationship of PersonTransactionHistory
(TrnID, TrnDT) references TransactionHistory (TrnID, TrnDT), and create
indexes on PersonTransactionHistory with (TrnDT, PersonID) and
(PersonID, TrnDT). This seems like it would let SQL Server make
much more efficient execution plans. However, I am unsure whether SQL Server
can leverage the FK on TrnDT to use those new indexes if I give it a query
like:

select * from TransactionHistory TH
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in some criteria
and TH.TrnDT between some date and some date

The trick being that SQL Server would know that it can use PTH.TrnDT and
TH.TrnDT interchangeably because of the foreign key (this would support all
the preexisting queries that explicitly named TH.TrnDT - any that
didn't explicitly specify the table would now have ambiguous column
names...)
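A sketch of what option 1 might look like in DDL. Note one wrinkle the proposal glosses over: SQL Server requires a unique constraint on the referenced columns before a composite foreign key can point at them. Constraint and index names here are made up, and the literals are placeholders:

```sql
-- Denormalize TrnDT into the child table.
alter table PersonTransactionHistory add TrnDT datetime;

-- The composite FK needs a unique key on the referenced column pair.
alter table TransactionHistory
add constraint UQ_TH_TrnID_TrnDT unique (TrnID, TrnDT);

alter table PersonTransactionHistory
add constraint FK_PTH_TH_TrnID_TrnDT
foreign key (TrnID, TrnDT) references TransactionHistory (TrnID, TrnDT);

create index TrnDTPersonIndex on PersonTransactionHistory (TrnDT, PersonID);
create index PersonTrnDTIndex on PersonTransactionHistory (PersonID, TrnDT);

-- Rather than relying on the optimizer to infer TH.TrnDT = PTH.TrnDT
-- from the constraint, the query can state the predicate against PTH
-- directly, which removes the guesswork:
select * from TransactionHistory TH
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in (1, 2, 3)
and PTH.TrnDT between '20040101' and '20040201';
```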

2 - Just coalesce the two tables into one. The original intent was to save
space by not requiring extra columns about Persons for all rows, many of
which did not have anything to do with a particular person (for instance a
contact point going active). In my experience with our product, the end
user's decisions about archiving and purging have a much bigger impact
than this, so in my opinion efficient querying is more important than
space. However I'm not sure if this is an elegant solution either. It also
might require more changes to existing code, although the use of views
might help.
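One way option 2 could limit client changes is a merged table behind compatibility views. This is only a sketch under assumptions: the merged table and view names are hypothetical, and the real person-specific columns would replace the comments:

```sql
-- Merged table: person columns are NULL for non-person transactions.
create table TransactionHistory2 (
TrnID int identity (1,1),
TrnDT datetime,
PersonID int null,  -- NULL when the transaction is not tied to a person
--other basic and person-specific columns go here
Primary Key Clustered (TrnID)
)
go

-- Views that mimic the old two-table layout for existing client code.
-- (They would need to take over the original table names to be fully
-- transparent to the clients.)
create view TransactionHistoryV as
select TrnID, TrnDT from TransactionHistory2
go
create view PersonTransactionHistoryV as
select TrnID, PersonID from TransactionHistory2
where PersonID is not null
go
```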

We also run reports based on other criteria (columns I replaced with
comments above) but none of them are as problematic as the situation
above. However, it seems that if I can understand the best way to solve
this, I will be able to leverage that approach if other types of reports
become problematic.

Any opinions would be greatly appreciated. Also any references to good
sources regarding table and index design would be helpful as well (online
or offline references...)

thanks,
Dave

Jul 20 '05 #1
Metal Dave (me***@spam.spam) writes:
create table TransactionHistory (
TrnID int identity (1,1),
TrnDT datetime,
--other information about a basic transaction goes here.
--All transactions have this info
Primary Key Clustered (TrnID)
)

Create Index TrnDTIndex on TransactionHistory (TrnDT)

create table PersonTransactionHistory (
TrnID int,
PersonID int,
--extended data pertaining only to "person" transactions goes
--here. only Person transactions have this
Primary Key Clustered (TrnID),
Foreign Key (TrnID) references TransactionHistory (TrnID)
)

Create Index TrnPersonIDIndex on PersonTransactionHistory (PersonID)
Given your query, it could be a good idea to have the clustered index
on TrnDT and PersonID instead. The main problem now with the queries
is that SQL Server will have to make a choice between Index Seek +
Bookmark Lookup on the one hand, and Clustered Index Scan on the other.
This is a guessing game that does not always end up the best way.

Of course, you may have other queries that are best off with clustering
on the Pkey, but this does not seem likely. (Insertion may however
benefit from a monotonically increasing index. A clustered index on
PersonID may cause fragmentation.)
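Re-clustering as suggested (TransactionHistory on TrnDT, PersonTransactionHistory on PersonID) might look like the following sketch. The primary-key constraint names are assumptions, and the foreign key referencing TransactionHistory (TrnID) would have to be dropped and re-created around the change; a rebuild like this on a large history table also belongs in a maintenance window:

```sql
-- TransactionHistory: cluster on the date, keep the PK nonclustered.
alter table TransactionHistory drop constraint PK_TransactionHistory;
create clustered index TrnDTClustered on TransactionHistory (TrnDT);
alter table TransactionHistory
add constraint PK_TransactionHistory primary key nonclustered (TrnID);

-- PersonTransactionHistory: cluster on PersonID likewise.
alter table PersonTransactionHistory
drop constraint PK_PersonTransactionHistory;
create clustered index PersonIDClustered
on PersonTransactionHistory (PersonID);
alter table PersonTransactionHistory
add constraint PK_PersonTransactionHistory primary key nonclustered (TrnID);
```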
1 - Add the TrnDT column to the PersonTransactionHistory table as
well. Then create a foreign key relationship of PersonTransactionHistory
(TrnID, TrnDT) references TransactionHistory (TrnID, TrnDT), and create
indexes on PersonTransactionHistory with (TrnDT, PersonID) and
(PersonID, TrnDT). This seems like it would let SQL Server make
much more efficient execution plans. However, I am unsure whether SQL Server
can leverage the FK on TrnDT to use those new indexes if I give it a query
like:

select * from TransactionHistory TH
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in some criteria
and TH.TrnDT between some date and some date
Well, take a copy of the database and try it!

(But first try changing the clustered index.)
2 - Just coalesce the two tables into one. The original intent was to save
space by not requiring extra columns about Persons for all rows, many of
which did not have anything to do with a particular person (for instance a
contact point going active).


Depends a little on the ratio. If PersonTransactionHistory holds 50%
of all rows in the main table, collapsing into one is probably best.
If it's 5%, I don't think it is.


--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #2
On Tue, 26 Oct 2004, Erland Sommarskog wrote:
Given your query, it could be a good idea to have the clustered index
on TrnDT and PersonID instead. The main problem now with the queries
is that SQL Server will have to make a choice between Index Seek +
Bookmark Lookup on the one hand, and Clustered Index Scan on the other.
This is a guessing game that does not always end up the best way.

Of course, you may have other queries that are best off with clustering
on the Pkey, but this does not seem likely. (Insertion may however
benefit from a monotonically increasing index. A clustered index on
PersonID may cause fragmentation.)
My intuition agrees with yours regarding the index in this case. I'm
pretty sure the bookmark lookups kill us on many reports. However,
I haven't looked with enough depth at the wide variety of queries we use
to know for sure where I should put the clustered index, so I'm reserving
judgement for now. I'd also like to study a bit more first so that I
don't replace one hasty decision with another - it might solve an
individual problem but exacerbate others.

For instance, I think

select * from PersonTransactionHistory PTH
inner join TransactionHistory TH on PTH.TrnID = TH.TrnID
where PTH.PersonID = 12345

would be harmed by moving the TH clustered index from TH.TrnID to
TH.TrnDT, as it would then have to make the same lookup-vs-scan choice in
order to perform the join. Does that sound reasonable? And since it's
rare for us to access PTH without the inner join to TH, there are probably
many queries like this.
1 - Add the TrnDT column to the PersonTransactionHistory table as
well. Then create a foreign key relationship of PersonTransactionHistory
(TrnID, TrnDT) references TransactionHistory (TrnID, TrnDT), and create
indexes on PersonTransactionHistory with (TrnDT, PersonID) and
(PersonID, TrnDT). This seems like it would let SQL Server make
much more efficient execution plans. However, I am unsure whether SQL Server
can leverage the FK on TrnDT to use those new indexes if I give it a query
like:

select * from TransactionHistory TH
inner join PersonTransactionHistory PTH
on TH.TrnID = PTH.TrnID
where PTH.PersonID in some criteria
and TH.TrnDT between some date and some date


Well, take a copy of the database and try it!


I appreciate the value of experimentation and normally would do that, but
if it didn't work, that wouldn't necessarily prove to me that I wasn't
simply doing something wrong, like not making the foreign key specific
enough or putting something in my query which made SQL Server ignore this
potentially valuable relationship. So I was basically wondering whether
there are any good docs regarding what types of information SQL Server
will and will not leverage in its choices, or whether someone familiar
with those rules had some feedback off the top of their head.

2 - Just coalesce the two tables into one. The original intent was to save
space by not requiring extra columns about Persons for all rows, many of
which did not have anything to do with a particular person (for instance a
contact point going active).


Depends a little on the ratio. If PersonTransactionHistory holds 50%
of all rows in the main table, collapsing into one is probably best.
If it's 5%, I don't think it is.


It's probably between 20% and 40%, depending on the particular
installation. Is your rationale that for 50% the space saved is
negligible, whereas for 5% it is not? For me it's more about limiting
the changes to the client software (definitely keeping the tables
separate) vs speeding up queries (possibly coalescing) rather than a space
consideration. I did a test once and recall discovering that we took up
nearly as much or more space with our indexes as with our tables, so
coalescing might make a big space difference anyway. (This amount of index
space surprised me, but I'm not sure if there is a good rule of thumb for
how much space indexes should take.)
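As an aside, the data-versus-index split can be checked per table with the standard sp_spaceused procedure rather than estimated:

```sql
-- Refresh the usage figures first, then report rows, reserved,
-- data, index_size and unused for each table.
dbcc updateusage (0) with count_rows;

exec sp_spaceused 'TransactionHistory';
exec sp_spaceused 'PersonTransactionHistory';
```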

Rereading the post I probably should have just asked for good table design
references right up front. Any takers?

Thanks for the feedback.

Dave

Jul 20 '05 #3
Metal Dave (me***@spam.spam) writes:
For instance, I think

select * from PersonTransactionHistory PTH
inner join TransactionHistory TH on PTH.TrnID = TH.TrnID
where PTH.PersonID = 12345

would be harmed by moving the TH clustered index from TH.TrnID to
TH.TrnDT, as it would then have to make the same lookup-vs-scan choice in
order to perform the join. Does that sound reasonable? And since it's
rare for us to access PTH without the inner join to TH, there are probably
many queries like this.
Let's assume for the example that the clustered index in PTH is on PersonID.
Then the join against TH on TrnID will be akin to Index Seek + Bookmark
Lookup, no matter whether the index on TrnID is clustered or not. In both
cases you would expect a plan with a Nested Loops join, which means that
for each row in PTH you look up a row in TH. The only difference if the
index on TrnID is non-clustered is that you get a few more reads for
each access. Which indeed is not negligible, since it multiplies with the
number of rows for the PersonID.

And just like "SELECT * FROM tbl WHERE nonclusteredcol = @val" has a
choice between index seek and scan, so does this query. Rather than
nested loops, the optimizer could go for a hash or merge join, which would
mean a single scan of TH. I would guess that the probability of this is
somewhat higher with a NC index on TrnID.

Of course, you can opt to change only PTH, if you like.
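Both join strategies described above can be observed directly by pinning the join type with a query hint (for comparison in testing only, not something to leave in production code):

```sql
-- Nested loops: one lookup in TH per qualifying row of PTH.
select * from PersonTransactionHistory PTH
inner join TransactionHistory TH on PTH.TrnID = TH.TrnID
where PTH.PersonID = 12345
option (loop join);

-- Hash join: a single scan of TH instead of repeated lookups.
select * from PersonTransactionHistory PTH
inner join TransactionHistory TH on PTH.TrnID = TH.TrnID
where PTH.PersonID = 12345
option (hash join);
```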
I appreciate the value of experimentation and normally would do that, but
if it didn't work, that wouldn't necessarily prove to me that I wasn't
simply doing something wrong, like not making the foreign key specific
enough or putting something in my query which made SQL Server ignore this
potentially valuable relationship. So I was basically wondering whether
there are any good docs regarding what types of information SQL Server
will and will not leverage in its choices, or whether someone familiar
with those rules had some feedback off the top of their head.
SQL Server does look at constraints, but just how intelligent it is about
them, I have not dug into. Thus my encouragement of experimentation.
It's probably between 20% and 40%, depending on the particular
installation. Is your rationale that for 50% the space saved is
negligible, whereas for 5% it is not?


Actually, I was thinking more in terms of performance, but space and
performance are related. My idea was that with 50%, the space saved is not
worth the extra complexity, and performance may suffer. With 5%, you save a
lot of space, since PTH would be a small table.

Your concern of having to change the client is certainly not one to be
neglected, and if this is costly in development time, I don't think it's
worth it.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #4

