473,698 Members | 2,409 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Why Cluster a Primary Key?

I'm probably going to get shot down with thousands of reasons for
this, but I've never really heard or read a convincing explanation, so
here goes ...

Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?

Also, if you are using a sequential key, clustering this will cause an
insert hotspot on the last page of the table, which can cause
concurrency problems if you aren't using row-level locking. If you're
using a random clustered key then inserts will generally be improved,
assuming you're using a sensible fillfactor, but you still lose the
advantage of using the clustered index for multi-record retrieval.

I'd be very interested to hear other peoples' views on this.

Phil
Jul 20 '05
17 49825
Philip Yale (ph********@bto penworld.com) writes:
Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?


This is an old discussion, and often goes along with "why is the
primary key by default clustered"?

Well, to give my opinion on the last question: I think it is better
if a table has a clustered index than it does not, so if the only
index you have on the table, it should be clustered. And if you only
have one index on the table, in many cases that will be the PK constraint.

As for clustering PKs at all, Dan gave the example that I immediately came
to think of. In the Orders table, OrderID is probably not the best column
to cluster on. Depending on your business, CustomerID or OrderDate are
better choices. But in OrderDetails, what on Earth would you cluster on,
if not the PK?

And this is the failure of people who question why it is good to cluster
the primary key: they forgot that many tables have multiple-column keys,
and range queries on such tables often relates to the top-most columns
in this index.

Daniel Morgan's observation that these tables are not as common they should
be, because some people stick in an artificial identity PK in all their
tables is of course relevant. And that is true, if you live by that design
pattern, *then* you rarely have reason to cluster your PK. Then you should
cluster what should have been your primary key.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #11
Steve Jorgensen (no****@nospam. nospam) writes:
The main reason I've found for clustering the primary key is that
clustering anything else will mess up front-end libraries including DAO
and ADO, and sometimes clustering the primary key seems to at least keep
records together that were entered close together in time, and those
happen to be the ones close tegether by date which reduces the number of
pages hit in date range queries.
But in such case you need a clustered index on the date. If you have a
non-clustered index on date, there will be one bookmark lookup for each
row. Sure, since they all go the same pages, most lookups will be to
cache, but it's still significantly less effecient than a clustered index
seek.
Personally, I almost always have something I'd rather cluster than the
primary key, but with DAO and ADO both assuming the clustered index is
the primary key even when something else actually is, it's just not
workable. Either the clustered index is unique and much larger than the
PK leading to unnecessary network traffic, or the clustered index is not
unique, and the front-end becomes confused that there seems to be more
than one record with the same key.


Huh? I don't know DAO, but that ADO should care about the physical
implementation is news to me. Care to elaborate this? Maybe a repro?

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #12
Daniel Morgan <da******@x.was hington.edu> wrote in message news:<107862212 4.796146@yasure >...
Philip Yale wrote:
My contention, though, is that this is often the
exception rather than the rule, contrary to the *default* action taken
when defining a primary key constraint or using a database design
package, both of which tend to assume that all primary keys will be
clustered.


It wouldn't be if more people paid attention to relational database
theory and Joe Celko rather than have a knee-jerk reaction that every
table needs a surrogate key.


I agree with you to a point. I hope I didn't create the impression
that I was having a "knee-jerk" reaction, and I do take serious issue
with the instruction in almost all SQLServer manuals that "every table
should have a primary key". I wasn't actually referring to surrogate
keys, but to real keys, and I'm sure you'll agree that they are valid
in very many cases due to the way that the SQLServer optimizer (and
Sybase, and I would guess Oracle, too, although I'm no Oracle expert)
is designed and built. All I was questioning was the automatic
tendency to make primary keys (even valid ones) clustered when they
would very often be just as well-served by a NC index, so leaving the
clustered index free for a more suitable key.

Incidentlly, I stumbled across another surrogate-key discussion,
featuring Joe, Dan and others, at
http://www.mcse.ms/message296826.html . Clearly people have very
different and strongly-held views on all this!
Jul 20 '05 #13
Erland Sommarskog <so****@algonet .se> wrote in message news:<Xn******* *************** @127.0.0.1>...
Philip Yale (ph********@bto penworld.com) writes:
Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?
This is an old discussion, and often goes along with "why is the
primary key by default clustered"?

Well, to give my opinion on the last question: I think it is better
if a table has a clustered index than it does not, so if the only
index you have on the table, it should be clustered.
And if you only
have one index on the table, in many cases that will be the PK constraint.

As for clustering PKs at all, Dan gave the example that I immediately came
to think of. In the Orders table, OrderID is probably not the best column
to cluster on. Depending on your business, CustomerID or OrderDate are
better choices. But in OrderDetails, what on Earth would you cluster on,
if not the PK?


I quite agree. In that example, clustering the PK seems fine. The
essential word in my original question was "generally" ; I question
whether it isn't generally better to use a NC index unless the
specific criteria of data and usage dictate otherwise.

And this is the failure of people who question why it is good to cluster
the primary key: they forgot that many tables have multiple-column keys,
and range queries on such tables often relates to the top-most columns
in this index.

Fair point. I've just spent some time composing a rebuttal of this,
but now I think I see what you mean:

e.g. unique PK_constraint (DateField datetime not null,
OtherField char(3) not null,
ID int not null)

Something like that? The only comment I'd make is that I was always
taught to make the leading column of any multi-column index the most
restrictive one, but since the overall combination here can be
declared UNIQUE I guess the optimizer will take that into account and
still use the index, even if the datefield itself is very
non-restrictive.
Daniel Morgan's observation that these tables are not as common they should
be, because some people stick in an artificial identity PK in all their
tables is of course relevant. And that is true, if you live by that design
pattern, *then* you rarely have reason to cluster your PK. Then you should
cluster what should have been your primary key.

Jul 20 '05 #14
Philip Yale (ph********@bto penworld.com) writes:
I quite agree. In that example, clustering the PK seems fine. The
essential word in my original question was "generally" ; I question
whether it isn't generally better to use a NC index unless the
specific criteria of data and usage dictate otherwise.
Then again, in many databases most of the tables only have one index and
that is the PK. And these tables does not need any more indexes, because
they are small lookup-tables.

I would like to put it in another way: if you can find another column to
cluster on it, do it. But if you can't think of one, let the PK be
clustered.

But it should certainly be part of the data-modelling phase to identify
good columns to cluster on.
Fair point. I've just spent some time composing a rebuttal of this,
but now I think I see what you mean:

e.g. unique PK_constraint (DateField datetime not null,
OtherField char(3) not null,
ID int not null)

Something like that? The only comment I'd make is that I was always
taught to make the leading column of any multi-column index the most
restrictive one, but since the overall combination here can be
declared UNIQUE I guess the optimizer will take that into account and
still use the index, even if the datefield itself is very
non-restrictive.


I don't really see where you are going here, but generally, if you need
to index on a non-restrictive column, a clustered index is probably better.

Say that you have a table with a status column, and a very common query
is to find all rows with status = 'N' (New). The table would have million
of rows, but normally only a few hundred at a time may have N. A non-
clustered index may work, but if you have no other index to cluster on,
this status column (possibly in combination with some other column) is
a very good candidate. Since the column is always updated during the
life-time of the row, there is a cost for moving the data. But for finding
all newly arrived rows, this is great.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #15
Philip Yale wrote:
Daniel Morgan <da******@x.was hington.edu> wrote in message news:<107862212 4.796146@yasure >...
Philip Yale wrote:

My contention, though, is that this is often the
exception rather than the rule, contrary to the *default* action taken
when defining a primary key constraint or using a database design
package, both of which tend to assume that all primary keys will be
clustered.


It wouldn't be if more people paid attention to relational database
theory and Joe Celko rather than have a knee-jerk reaction that every
table needs a surrogate key.

I agree with you to a point. I hope I didn't create the impression
that I was having a "knee-jerk" reaction, and I do take serious issue
with the instruction in almost all SQLServer manuals that "every table
should have a primary key". I wasn't actually referring to surrogate
keys, but to real keys, and I'm sure you'll agree that they are valid
in very many cases due to the way that the SQLServer optimizer (and
Sybase, and I would guess Oracle, too, although I'm no Oracle expert)
is designed and built. All I was questioning was the automatic
tendency to make primary keys (even valid ones) clustered when they
would very often be just as well-served by a NC index, so leaving the
clustered index free for a more suitable key.


Unless you are one of those people that throws a surrogate key at
everything then the discussion about single column vs. multi-column
primary keys is a discussion about nothing. The natural key to a
record is the natural key whether that is a single column or ten.

One example we deal with repeatedly is the payroll record that often
looks something like this:

employee_id
date
project
task
hours

It takes four, often more, columns to define the natural primary key.
That's just the reality of the data. So do you throw a surrogate at it?
I hope not. A surrogate allows duplicate entries and the only way to
avoid the possibility of duplicates is to create a unique constraint
that duplicates the natural key.

There are places where a surrogate key is the best solution. But
it should be chosen as Option B after considering the implications on
scalability, performance, and data integrity of Option A: A natural
key.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.wash ington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #16
Well, it seems I owe y'all an apology. I'm 100% certain that what I said here
was true not very long ago because I had to debug the problems it caused, but
I see that with current versions of SQL Server, MDAC, and DAO, this is no
longer a problem with either DAO nor ADO. I just ran tests both with an MDB
and ADP front-end and could not duplicate the issue.

That means you should ignore my earlier post, and it means I can go back and
add appropriate clustered indexes to all my tables that currently don't have
them and could use them.

On Sun, 7 Mar 2004 11:42:00 +0000 (UTC), Erland Sommarskog <so****@algonet .se>
wrote:
Steve Jorgensen (no****@nospam. nospam) writes:
The main reason I've found for clustering the primary key is that
clustering anything else will mess up front-end libraries including DAO
and ADO, and sometimes clustering the primary key seems to at least keep
records together that were entered close together in time, and those
happen to be the ones close tegether by date which reduces the number of
pages hit in date range queries.


But in such case you need a clustered index on the date. If you have a
non-clustered index on date, there will be one bookmark lookup for each
row. Sure, since they all go the same pages, most lookups will be to
cache, but it's still significantly less effecient than a clustered index
seek.
Personally, I almost always have something I'd rather cluster than the
primary key, but with DAO and ADO both assuming the clustered index is
the primary key even when something else actually is, it's just not
workable. Either the clustered index is unique and much larger than the
PK leading to unnecessary network traffic, or the clustered index is not
unique, and the front-end becomes confused that there seems to be more
than one record with the same key.


Huh? I don't know DAO, but that ADO should care about the physical
implementati on is news to me. Care to elaborate this? Maybe a repro?

Jul 20 '05 #17
Daniel Morgan (da******@x.was hington.edu) writes:
One example we deal with repeatedly is the payroll record that often
looks something like this:

employee_id
date
project
task
hours

It takes four, often more, columns to define the natural primary key.
That's just the reality of the data. So do you throw a surrogate at it?
I hope not. A surrogate allows duplicate entries and the only way to
avoid the possibility of duplicates is to create a unique constraint
that duplicates the natural key.


So in my database I have a table for collateral claims which has a four-
column key:

accno - the account.
depno - the depot for the account (most accounts have only 1)
currency - which currency the claim is in
place - the place that holds the claim.

So far so good, but the came the requirement to keep track of the positions
to which the claim related. That calls for two more key columns in the
sub-table:

insid - The instrument which is cause for the claim.
poseffect - Whether the claim is for held or written options, bought/sold
futures etc.

A six-column key looked frightening enough, so added a surrogate key to
the main table (with a UNIQUE constraint on the original PK) and used
this as a FK in the sub-table. But having programmed some against these
table recently, I'm starting to think that was a mistake. Often you
want to manipulate the claim for a certain place. But to do that for
the sub-table, you always to join with the main table, leading to more
complex SQL.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 20 '05 #18

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
4053
by: gotdough | last post by:
Cluster services gives the high availability needed - that is great. But I have never seen any discussion about what happens when a node fails - what do you do to get everything back to the active-passive tandem. I imagine there is not much difference in terms of recovery procedure for either active or passive node. So I'm just going to make up a scenario that we have encountered. The system hard drive (not the shared disk) on primary...
17
25492
by: Peter Ericsson | last post by:
Does anyone know if Microsoft plans to implement an C# (.net) API for MS Cluster like the one in C++? Or has someone else written a C# wrapper for the C++ API?
0
980
by: Neil | last post by:
Hello: The information on DB mirroring states that the databases are primary/secondary and that the secondary cannot be used except for limited reporting. I dont see anything about clustering that states that it is primary/secondary. Can we issue transactions (including updates) against both DBs in a
4
8594
by: ThunderMusic | last post by:
Hi, We have many servers setup as a cluster. When one server crashes, another one take the relay... We want to know if it's possible (I suppose it is) to make a Windows service developed with .NET 2.0 work in a cluster environment. I mean, how to make sure the service will take the relay if one server fails. Is there something special to do or it will work by itself? (I'm not personally used to clusters, we need it at my office and they...
1
2425
by: Mark D Powell | last post by:
I have a production failover cluster running SQL Server 2000 at SP3 that I want to upgrade to SP4. I do not have a test failover cluster to test with so I need the install on the primary server to work the first time. Per the information I have I just install the patch on the primary server and it will install both on the primary and on the secondary. However, I remember when I did the initial install and it failed. The first problem...
3
5420
by: Simon | last post by:
Hi All, I'm hoping someone will have some words of wisdom for me regarding MS Clustering on Windows 2003. I have a service that runs on a cluster. During invocation it's supposed to determine from the cluster which node is active (this is a active/standby configuration) and either proceed or sleep depending on the status. The interface to the cluster is that advertised by the standard interop layer built by Visual Studio for the...
1
1848
by: dunleav1 | last post by:
The table is a many row and many column table that is in a 16K page size. I am running row compression on the table. A have an index that over time have come to have a low index cluster ratio. The table does not have a primary key. I have 3 other indexes on the table and the index cluster ratio is 99% on the other three.
2
4777
by: dunleav1 | last post by:
I have a many row and many column table that is in a 16K page size. I have four indexes on the table. I am running row compression on the table. The table does not have a primary key. The table does not have a clustered index. I ran a reorg on the table and the indexes. I ran runstats on the table and the indexes after the reorg. Three indexes on the table have an index cluster ratios of 99,99,100 respectively. The fourth index has a...
10
3154
by: Ian | last post by:
Henry J. wrote: MDC *guarantees* clustering, whereas a table with a clustering index will eventually require maintenance (a.k.a. reorg) to maintain the cluster ratio. That's not to say that a clustering index isn't still valuable (especially for high cardinality columns that aren't a reasonable candidate as an MDC dimension).
0
8678
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8609
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9166
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8871
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7737
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6525
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
4371
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4621
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3052
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.