Why Cluster a Primary Key? - Page 2

Philip Yale

I'm probably going to get shot down with thousands of reasons for
this, but I've never really heard or read a convincing explanation, so
here goes ...

Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?

Also, if you are using a sequential key, clustering this will cause an
insert hotspot on the last page of the table, which can cause
concurrency problems if you aren't using row-level locking. If you're
using a random clustered key then inserts will generally be improved,
assuming you're using a sensible fillfactor, but you still lose the
advantage of using the clustered index for multi-record retrieval.

I'd be very interested to hear other peoples' views on this.

Phil

Jul 20 '05

Subscribe Reply

49831

Erland Sommarskog

Philip Yale (ph********@bto penworld.com) writes:

Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?

This is an old discussion, and often goes along with "why is the
primary key by default clustered"?

Well, to give my opinion on the last question: I think it is better
if a table has a clustered index than it does not, so if the only
index you have on the table, it should be clustered. And if you only
have one index on the table, in many cases that will be the PK constraint.

As for clustering PKs at all, Dan gave the example that I immediately came
to think of. In the Orders table, OrderID is probably not the best column
to cluster on. Depending on your business, CustomerID or OrderDate are
better choices. But in OrderDetails, what on Earth would you cluster on,
if not the PK?

And this is the failure of people who question why it is good to cluster
the primary key: they forgot that many tables have multiple-column keys,
and range queries on such tables often relates to the top-most columns
in this index.

Daniel Morgan's observation that these tables are not as common they should
be, because some people stick in an artificial identity PK in all their
tables is of course relevant. And that is true, if you live by that design
pattern, *then* you rarely have reason to cluster your PK. Then you should
cluster what should have been your primary key.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 20 '05 #11

Erland Sommarskog

Steve Jorgensen (no****@nospam. nospam) writes:

The main reason I've found for clustering the primary key is that
clustering anything else will mess up front-end libraries including DAO
and ADO, and sometimes clustering the primary key seems to at least keep
records together that were entered close together in time, and those
happen to be the ones close tegether by date which reduces the number of
pages hit in date range queries.
But in such case you need a clustered index on the date. If you have a
non-clustered index on date, there will be one bookmark lookup for each
row. Sure, since they all go the same pages, most lookups will be to
cache, but it's still significantly less effecient than a clustered index
seek.
Personally, I almost always have something I'd rather cluster than the
primary key, but with DAO and ADO both assuming the clustered index is
the primary key even when something else actually is, it's just not
workable. Either the clustered index is unique and much larger than the
PK leading to unnecessary network traffic, or the clustered index is not
unique, and the front-end becomes confused that there seems to be more
than one record with the same key.

Huh? I don't know DAO, but that ADO should care about the physical
implementation is news to me. Care to elaborate this? Maybe a repro?

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 20 '05 #12

Philip Yale

Daniel Morgan <da******@x.was hington.edu> wrote in message news:<107862212 4.796146@yasure >...

Philip Yale wrote:
My contention, though, is that this is often the
exception rather than the rule, contrary to the *default* action taken
when defining a primary key constraint or using a database design
package, both of which tend to assume that all primary keys will be
clustered.

It wouldn't be if more people paid attention to relational database
theory and Joe Celko rather than have a knee-jerk reaction that every
table needs a surrogate key.

I agree with you to a point. I hope I didn't create the impression
that I was having a "knee-jerk" reaction, and I do take serious issue
with the instruction in almost all SQLServer manuals that "every table
should have a primary key". I wasn't actually referring to surrogate
keys, but to real keys, and I'm sure you'll agree that they are valid
in very many cases due to the way that the SQLServer optimizer (and
Sybase, and I would guess Oracle, too, although I'm no Oracle expert)
is designed and built. All I was questioning was the automatic
tendency to make primary keys (even valid ones) clustered when they
would very often be just as well-served by a NC index, so leaving the
clustered index free for a more suitable key.

Incidentlly, I stumbled across another surrogate-key discussion,
featuring Joe, Dan and others, at
http://www.mcse.ms/message296826.html . Clearly people have very
different and strongly-held views on all this!

Jul 20 '05 #13

Philip Yale

Erland Sommarskog <so****@algonet .se> wrote in message news:<Xn******* *************** @127.0.0.1>...

Philip Yale (ph********@bto penworld.com) writes:
Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?
This is an old discussion, and often goes along with "why is the
primary key by default clustered"?

Well, to give my opinion on the last question: I think it is better
if a table has a clustered index than it does not, so if the only
index you have on the table, it should be clustered.
And if you only
have one index on the table, in many cases that will be the PK constraint.

As for clustering PKs at all, Dan gave the example that I immediately came
to think of. In the Orders table, OrderID is probably not the best column
to cluster on. Depending on your business, CustomerID or OrderDate are
better choices. But in OrderDetails, what on Earth would you cluster on,
if not the PK?

I quite agree. In that example, clustering the PK seems fine. The
essential word in my original question was "generally" ; I question
whether it isn't generally better to use a NC index unless the
specific criteria of data and usage dictate otherwise.

And this is the failure of people who question why it is good to cluster
the primary key: they forgot that many tables have multiple-column keys,
and range queries on such tables often relates to the top-most columns
in this index.

Fair point. I've just spent some time composing a rebuttal of this,
but now I think I see what you mean:

e.g. unique PK_constraint (DateField datetime not null,
OtherField char(3) not null,
ID int not null)

Something like that? The only comment I'd make is that I was always
taught to make the leading column of any multi-column index the most
restrictive one, but since the overall combination here can be
declared UNIQUE I guess the optimizer will take that into account and
still use the index, even if the datefield itself is very
non-restrictive.
Daniel Morgan's observation that these tables are not as common they should
be, because some people stick in an artificial identity PK in all their
tables is of course relevant. And that is true, if you live by that design
pattern, *then* you rarely have reason to cluster your PK. Then you should
cluster what should have been your primary key.

Jul 20 '05 #14

Erland Sommarskog

Philip Yale (ph********@bto penworld.com) writes:

I quite agree. In that example, clustering the PK seems fine. The
essential word in my original question was "generally" ; I question
whether it isn't generally better to use a NC index unless the
specific criteria of data and usage dictate otherwise.
Then again, in many databases most of the tables only have one index and
that is the PK. And these tables does not need any more indexes, because
they are small lookup-tables.

I would like to put it in another way: if you can find another column to
cluster on it, do it. But if you can't think of one, let the PK be
clustered.

But it should certainly be part of the data-modelling phase to identify
good columns to cluster on.
Fair point. I've just spent some time composing a rebuttal of this,
but now I think I see what you mean:

e.g. unique PK_constraint (DateField datetime not null,
OtherField char(3) not null,
ID int not null)

Something like that? The only comment I'd make is that I was always
taught to make the leading column of any multi-column index the most
restrictive one, but since the overall combination here can be
declared UNIQUE I guess the optimizer will take that into account and
still use the index, even if the datefield itself is very
non-restrictive.

I don't really see where you are going here, but generally, if you need
to index on a non-restrictive column, a clustered index is probably better.

Say that you have a table with a status column, and a very common query
is to find all rows with status = 'N' (New). The table would have million
of rows, but normally only a few hundred at a time may have N. A non-
clustered index may work, but if you have no other index to cluster on,
this status column (possibly in combination with some other column) is
a very good candidate. Since the column is always updated during the
life-time of the row, there is a cost for moving the data. But for finding
all newly arrived rows, this is great.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 20 '05 #15

Daniel Morgan

Philip Yale wrote:

Daniel Morgan <da******@x.was hington.edu> wrote in message news:<107862212 4.796146@yasure >...
Philip Yale wrote:

My contention, though, is that this is often the
exception rather than the rule, contrary to the *default* action taken
when defining a primary key constraint or using a database design
package, both of which tend to assume that all primary keys will be
clustered.

It wouldn't be if more people paid attention to relational database
theory and Joe Celko rather than have a knee-jerk reaction that every
table needs a surrogate key.

I agree with you to a point. I hope I didn't create the impression
that I was having a "knee-jerk" reaction, and I do take serious issue
with the instruction in almost all SQLServer manuals that "every table
should have a primary key". I wasn't actually referring to surrogate
keys, but to real keys, and I'm sure you'll agree that they are valid
in very many cases due to the way that the SQLServer optimizer (and
Sybase, and I would guess Oracle, too, although I'm no Oracle expert)
is designed and built. All I was questioning was the automatic
tendency to make primary keys (even valid ones) clustered when they
would very often be just as well-served by a NC index, so leaving the
clustered index free for a more suitable key.

Unless you are one of those people that throws a surrogate key at
everything then the discussion about single column vs. multi-column
primary keys is a discussion about nothing. The natural key to a
record is the natural key whether that is a single column or ten.

One example we deal with repeatedly is the payroll record that often
looks something like this:

employee_id
date
project
task
hours

It takes four, often more, columns to define the natural primary key.
That's just the reality of the data. So do you throw a surrogate at it?
I hope not. A surrogate allows duplicate entries and the only way to
avoid the possibility of duplicates is to create a unique constraint
that duplicates the natural key.

There are places where a surrogate key is the best solution. But
it should be chosen as Option B after considering the implications on
scalability, performance, and data integrity of Option A: A natural
key.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.wash ington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #16

Steve Jorgensen

Well, it seems I owe y'all an apology. I'm 100% certain that what I said here
was true not very long ago because I had to debug the problems it caused, but
I see that with current versions of SQL Server, MDAC, and DAO, this is no
longer a problem with either DAO nor ADO. I just ran tests both with an MDB
and ADP front-end and could not duplicate the issue.

That means you should ignore my earlier post, and it means I can go back and
add appropriate clustered indexes to all my tables that currently don't have
them and could use them.

On Sun, 7 Mar 2004 11:42:00 +0000 (UTC), Erland Sommarskog <so****@algonet .se>
wrote:

Steve Jorgensen (no****@nospam. nospam) writes:
The main reason I've found for clustering the primary key is that
clustering anything else will mess up front-end libraries including DAO
and ADO, and sometimes clustering the primary key seems to at least keep
records together that were entered close together in time, and those
happen to be the ones close tegether by date which reduces the number of
pages hit in date range queries.

But in such case you need a clustered index on the date. If you have a
non-clustered index on date, there will be one bookmark lookup for each
row. Sure, since they all go the same pages, most lookups will be to
cache, but it's still significantly less effecient than a clustered index
seek.
Personally, I almost always have something I'd rather cluster than the
primary key, but with DAO and ADO both assuming the clustered index is
the primary key even when something else actually is, it's just not
workable. Either the clustered index is unique and much larger than the
PK leading to unnecessary network traffic, or the clustered index is not
unique, and the front-end becomes confused that there seems to be more
than one record with the same key.

Huh? I don't know DAO, but that ADO should care about the physical
implementati on is news to me. Care to elaborate this? Maybe a repro?

Jul 20 '05 #17

Erland Sommarskog

Daniel Morgan (da******@x.was hington.edu) writes:

One example we deal with repeatedly is the payroll record that often
looks something like this:

employee_id
date
project
task
hours

It takes four, often more, columns to define the natural primary key.
That's just the reality of the data. So do you throw a surrogate at it?
I hope not. A surrogate allows duplicate entries and the only way to
avoid the possibility of duplicates is to create a unique constraint
that duplicates the natural key.

So in my database I have a table for collateral claims which has a four-
column key:

accno - the account.
depno - the depot for the account (most accounts have only 1)
currency - which currency the claim is in
place - the place that holds the claim.

So far so good, but the came the requirement to keep track of the positions
to which the claim related. That calls for two more key columns in the
sub-table:

insid - The instrument which is cause for the claim.
poseffect - Whether the claim is for held or written options, bought/sold
futures etc.

A six-column key looked frightening enough, so added a surrogate key to
the main table (with a UNIQUE constraint on the original PK) and used
this as a FK in the sub-table. But having programmed some against these
table recently, I'm starting to think that was a mistake. Often you
want to manipulate the claim for a certain place. But to do that for
the sub-table, you always to join with the main table, leading to more
complex SQL.

--
Erland Sommarskog, SQL Server MVP, so****@algonet. se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp

Jul 20 '05 #18

Similar topics

4053

Cluster recovery from node failure

by: gotdough | last post by:

Cluster services gives the high availability needed - that is great. But I have never seen any discussion about what happens when a node fails - what do you do to get everything back to the active-passive tandem. I imagine there is not much difference in terms of recovery procedure for either active or passive node. So I'm just going to make up a scenario that we have encountered. The system hard drive (not the shared disk) on primary...

Microsoft SQL Server

25501

Cluster API

by: Peter Ericsson | last post by:

Does anyone know if Microsoft plans to implement an C# (.net) API for MS Cluster like the one in C++? Or has someone else written a C# wrapper for the C++ API?

C# / C Sharp

981

Use both DBs in cluster?

by: Neil | last post by:

Hello: The information on DB mirroring states that the databases are primary/secondary and that the secondary cannot be used except for limited reporting. I dont see anything about clustering that states that it is primary/secondary. Can we issue transactions (including updates) against both DBs in a

Microsoft SQL Server

8594

Windows service on a cluster

by: ThunderMusic | last post by:

Hi, We have many servers setup as a cluster. When one server crashes, another one take the relay... We want to know if it's possible (I suppose it is) to make a Windows service developed with .NET 2.0 work in a cluster environment. I mean, how to make sure the service will take the relay if one server fails. Is there something special to do or it will work by itself? (I'm not personally used to clusters, we need it at my office and they...

.NET Framework

2428

Upgrade SQL Server 2000 Failover Cluster from SP3 to SP4

by: Mark D Powell | last post by:

I have a production failover cluster running SQL Server 2000 at SP3 that I want to upgrade to SP4. I do not have a test failover cluster to test with so I need the install on the primary server to work the first time. Per the information I have I just install the patch on the primary server and it will install both on the primary and on the secondary. However, I remember when I did the initial install and it failed. The first problem...

Microsoft SQL Server

5425

Cluster Automation Server: Access Denied for [NT AUTHORITY\SYSTEM]

by: Simon | last post by:

Hi All, I'm hoping someone will have some words of wisdom for me regarding MS Clustering on Windows 2003. I have a service that runs on a cluster. During invocation it's supposed to determine from the cluster which node is active (this is a active/standby configuration) and either proceed or sleep depending on the status. The interface to the cluster is that advertised by the standard interop layer built by Visual Studio for the...

C# / C Sharp

1850

Low cluster ratio - row compression - V9.1

by: dunleav1 | last post by:

The table is a many row and many column table that is in a 16K page size. I am running row compression on the table. A have an index that over time have come to have a low index cluster ratio. The table does not have a primary key. I have 3 other indexes on the table and the index cluster ratio is 99% on the other three.

DB2 Database

4783

Low cluster ratio - row compression - reorg - V9.1 question.

by: dunleav1 | last post by:

I have a many row and many column table that is in a 16K page size. I have four indexes on the table. I am running row compression on the table. The table does not have a primary key. The table does not have a clustered index. I ran a reorg on the table and the indexes. I ran runstats on the table and the indexes after the reorg. Three indexes on the table have an index cluster ratios of 99,99,100 respectively. The fourth index has a...

DB2 Database

3158

Re: MDC vs. cluster index: which to use?

by: Ian | last post by:

Henry J. wrote: MDC *guarantees* clustering, whereas a table with a clustering index will eventually require maintenance (a.k.a. reorg) to maintain the cluster ratio. That's not to say that a clustering index isn't still valuable (especially for high cardinality columns that aren't a reasonable candidate as an MDC dimension).

DB2 Database

8946

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9449

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

9310

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9236

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

9182

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

6735

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6031

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4809

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2180

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General