Why Cluster a Primary Key?

I'm probably going to get shot down with thousands of reasons for
this, but I've never really heard or read a convincing explanation, so
here goes ...

Clustered indexes are more efficient at returning large numbers of
records than non-clustered indexes. Agreed? (Assuming the NC index
doesn't cover the query, of course)

Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record?

Isn't it generally better to specify a non-clustered index for the
primary key, and reserve the clustered index for a column which will
most likely be used for queries that return multi-row data sets (e.g.
date columns)?
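
A toy model may make the page-count argument concrete. The sketch below is Python rather than SQL, and its numbers are assumptions that are not from this thread: 100 rows per page, 10,000 rows, and a `day` value unrelated to `id`. It counts how many data pages hold the rows for a one-week range when the table is physically ordered by `id` versus by `day`. The first case stands in for a non-covering non-clustered index, which has to visit the data page of every matching row.

```python
import random

ROWS_PER_PAGE = 100  # assumed page capacity, purely illustrative


def pages_touched(rows, cluster_key, predicate):
    """Count distinct pages holding rows that satisfy predicate,
    when rows are physically stored in cluster_key order."""
    stored = sorted(rows, key=cluster_key)
    return len({pos // ROWS_PER_PAGE
                for pos, row in enumerate(stored) if predicate(row)})


random.seed(42)
# Hypothetical table: 10,000 rows whose day number is unrelated to the id,
# i.e. ids were not assigned in date order.
rows = [{"id": i, "day": random.randrange(365)} for i in range(10_000)]


def one_week(row):
    return 100 <= row["day"] < 107


by_id = pages_touched(rows, lambda r: r["id"], one_week)
by_day = pages_touched(rows, lambda r: r["day"], one_week)
print(f"clustered on id:  {by_id} pages")
print(f"clustered on day: {by_day} pages")
```

Under these assumptions the week's rows sit in a few adjacent pages when the table is ordered by `day`, and are spread across many pages when it is ordered by `id`. Real page counts depend on row size, fragmentation and caching.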

Also, if you are using a sequential key, clustering this will cause an
insert hotspot on the last page of the table, which can cause
concurrency problems if you aren't using row-level locking. If you're
using a random clustered key then inserts will generally be improved,
assuming you're using a sensible fillfactor, but you still lose the
advantage of using the clustered index for multi-record retrieval.
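
The hotspot can be sketched with a toy model in Python. The page capacity, table size and batch size are assumptions, and pages are treated as fixed key ranges, so page splits (where fill factor comes in) are ignored. It only shows which pages a batch of near-simultaneous inserts would land on.

```python
import random

ROWS_PER_PAGE = 100     # assumed page capacity
EXISTING_ROWS = 10_000  # rows already in the table, keys 0..9999
BATCH = 50              # inserts arriving at about the same time


def target_page(key):
    """Page a key belongs to when pages are laid out in key order.
    Toy model: pages are fixed key ranges; splits are not modelled."""
    return key // ROWS_PER_PAGE


random.seed(1)
# Sequential keys continue after the current maximum key.
sequential_keys = [EXISTING_ROWS + n for n in range(BATCH)]
# Random keys fall anywhere in the existing key range.
random_keys = [random.randrange(EXISTING_ROWS) for _ in range(BATCH)]

seq_pages = {target_page(k) for k in sequential_keys}
rand_pages = {target_page(k) for k in random_keys}
print(f"sequential keys: {BATCH} inserts hit {len(seq_pages)} page(s)")
print(f"random keys:     {BATCH} inserts hit {len(rand_pages)} page(s)")
```

In this model every sequential insert lands on the same last page, which is the contention point described above, while random keys spread the writes at the cost of inserting into pages that may already be full.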

I'd be very interested to hear other people's views on this.

Phil
Jul 20 '05 #1
The main reason I've found for clustering the primary key is that clustering
anything else will mess up front-end libraries including DAO and ADO. Also,
clustering the primary key sometimes at least keeps together records that
were entered close together in time. Those tend to be the ones that are close
together by date, which reduces the number of pages hit in date range
queries.

Personally, I almost always have something I'd rather cluster than the primary
key, but with DAO and ADO both assuming the clustered index is the primary key
even when something else actually is, it's just not workable. Either the
clustered index is unique and much larger than the PK, leading to unnecessary
network traffic, or the clustered index is not unique and the front-end
becomes confused because there seems to be more than one record with the same
key.

On 5 Mar 2004 03:56:38 -0800, ph********@btopenworld.com (Philip Yale) wrote:
> Since it's only possible to have one clustered index, why is this
> almost always used for the primary key, when by definition a primary
> key will always return 1 record?
Jul 20 '05 #2
>> Since it's only possible to have one clustered index, why is this
almost always used for the primary key, when by definition a primary
key will always return 1 record [sic]? <<

Actually, you hit the nail on the head and did not know it. When SQL
was first implemented, the mental and physical models for data were
based on files (Rows are not records; fields are not columns; tables
are not files). Specifically, they were based on files with sequential,
contiguous storage: magnetic tape and punch cards in particular (there is
no sequential access or ordering in an RDBMS, so "first", "next" and
"last" are totally meaningless).

A Master mag tape file is sorted on a key, usually at the front of the
records, just after the "deleted" flag. This is so that you can merge
the transaction tapes, also sorted on the same key, into the Master.

Dr. Codd also fell for this and began with the PRIMARY KEY in his first
papers on the relational model. A bit later, he caught the error and
realized that a relational key is a key is a key and none of them is
"more equal" than the others. Unfortunately, SQL was based on Codd's
first papers and carried the error forward.

Sybase simply used what was there in Unix and the existing file
systems to build SQL Server and Microsoft followed suit.

Are you familiar with the story of how the Roman Empire determined the
size of the Space Shuttle boosters and therefore most of the design of
the shuttle?
Jul 20 '05 #3
jo*******@northface.edu (--CELKO--) wrote in message news:<a2**************************@posting.google.com>...
> Are you familiar with the story of how the Roman Empire determined the
> size of the Space Shuttle boosters and therefore most of the design of
> the shuttle?

Thanks for that, Celko. It's very interesting, although I must
confess that I'm not sure what it's got to do with my original
question. Whatever the background evolution of RDBMSs, in the
real world today what people refer to as a "primary key" returns 1
row, and I feel that it's a bit of a waste putting a clustered index
on this.

BTW - I've heard the Roman theory many times, but this really is just
an urban myth. Railway tracks, for example, in the UK, have a gauge
of 4' 8.5" because this was what resulted from a standard axle width
of 5'. There are many other gauges throughout the world, and there's
a very good paper at
http://www.vwl.uni-muenchen.de/ls_komlos/northam.pdf which details
their evolution.
Jul 20 '05 #4
Sorry, Joe - didn't mean to call you "Celko" in the previous reply; I
instinctively used your sign-on name!
Jul 20 '05 #5
> what people refer to as a "primary key" returns 1
> row, and I feel that it's a bit of a waste putting a clustered index
> on this.

Consider the "Orders" and "Order Details" tables in the sample Northwind
database. The primary keys are OrderID and OrderID/ProductID, respectively.
Assuming these tables are frequently joined on OrderID, the clustered primary
key index on Order Details reduces I/O and improves the performance of these
joins, because all the detail rows for one order are stored together.

Of course, there may be a better choice than the primary key for the
clustered index. It all depends on how the data are normally accessed, and
there are often trade-offs involved.
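
To illustrate the Order Details point, here is a rough Python sketch (not SQL Server code). The data are hypothetical: 1,000 orders of 20 products each, with an assumed 50 rows per page. It counts the pages holding one order's detail rows when the table is physically ordered by (OrderID, ProductID) versus by ProductID first.

```python
import random

ROWS_PER_PAGE = 50  # assumed page capacity, purely illustrative


def pages_for_order(details, cluster_key, order_id):
    """Count distinct pages holding the detail rows of one order,
    when rows are physically stored in cluster_key order."""
    stored = sorted(details, key=cluster_key)
    return len({pos // ROWS_PER_PAGE
                for pos, (oid, _pid) in enumerate(stored) if oid == order_id})


random.seed(7)
# Hypothetical data: 1,000 orders, each with 20 distinct products out of 500.
# Each row is an (order_id, product_id) pair.
details = [(order_id, product_id)
           for order_id in range(1000)
           for product_id in random.sample(range(500), 20)]

by_pk = pages_for_order(details, lambda d: (d[0], d[1]), order_id=123)
by_product = pages_for_order(details, lambda d: (d[1], d[0]), order_id=123)
print(f"clustered on (OrderID, ProductID): {by_pk} page(s)")
print(f"clustered on (ProductID, OrderID): {by_product} page(s)")
```

With OrderID as the leading column, one order's rows are adjacent, so a join on OrderID reads very few pages per order. With ProductID leading, the same rows are spread through the table.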

--
Hope this helps.

Dan Guzman
SQL Server MVP
Jul 20 '05 #6
>> .. what people refer to as a "primary key" returns 1
row, and I feel that it's a bit of a waste putting a clustered index on
this. <<

I agree. But this was the default action in the original Sybase product
for the reasons I mentioned and it was carried forward. Programmers are
lazy and don't think; you learn by copying from old code.

Why is "i" used for a loop control variable in procedural languages?
Because in FORTRAN II, variables whose names began with the letters I
through N were integers. Wouldn't it be better to come up with a meaningful
name for the control variable within the context of the loop? Sure!

Thanks for the railroad link! Another urban myth bites the dust! Want
to hear the ham bone parable instead :)?

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

Jul 20 '05 #7
>> Sorry, Joe - didn't mean to call you "Celko" in the previous reply; I
instinctively used your sign-on name! <<

That is what I go by; even my wife calls me "Celko" and my column in
INTELLIGENT ENTERPRISE is called "CELKO". My family was military and I
grew up in an environment where you used the last name. And thanks for
the link!

--CELKO--
Jul 20 '05 #8
"Dan Guzman" <da*******@nosp am-earthlink.net> wrote in message news:<%i******* ***********@new sread1.news.pas .earthlink.net> ...
> Of course, there may be a better choice than the primary key for the
> clustered index. It all depends on how the data are normally accessed and
> there are often trade-offs involved.

Thanks Dan.

I quite agree that there are occasions where a clustered primary key
is desirable, and that this is a decision which a DBA should take when
designing the physical database based on the data distribution and
access methods. My contention, though, is that this is often the
exception rather than the rule, contrary to the *default* action taken
when defining a primary key constraint or using a database design
package, both of which tend to assume that all primary keys will be
clustered.
Jul 20 '05 #9
Philip Yale wrote:
> My contention, though, is that this is often the
> exception rather than the rule, contrary to the *default* action taken
> when defining a primary key constraint or using a database design
> package, both of which tend to assume that all primary keys will be
> clustered.


It wouldn't be if more people paid attention to relational database
theory and Joe Celko rather than having a knee-jerk reaction that every
table needs a surrogate key.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #10
