UniqueIdentifier pros and cons ??

Oldgradgreg

Hi all
i am building a SQL 2000 database that it is proving a little
challenging, i have companies with multiple addresses, phone numbers,
owning mine sites etc and also joint ventures so maybe you get the
picture with a few design issues that i ma encountering

My queriy is about a primary key identity, and which one to use with
respect to either the identity data type or the unique identifier ,

I am aessentiall building an address table to hold all multiple
addresses as well as phone numbers etc, so my desire to have a unique
identity for each record is very important.

My view is i will run in to violation errors by just using the table
identity data type, i could i suppose use composit primary keys but
that may have a performance impact, although thiis will not be a high
transaction database.

Does anyone know about performance issues regarding each identity
solution, by using a generated 16 bit identifier there are going to be
huge numbers for the DB to verify. or am i worried about nothing?

any views greatly appreciated

regards
Greg

Jul 20 '05 #1

Subscribe Post Reply

10727

Joe Celko

I have several rants about IDENTITY, which cannot be a key be
definition. You can Google them.

building an address table to hold all multiple

addresses as well as phone numbers etc, so my desire to have a unique
identity for each record [sic] is very important. <<

Rows are not records. The addresses you are trying to model already
have a unique identity; you want a unique identifier. You want it to be
verifiable in the reality (which physical locators like proprietary
auto-numberings devices are not).

Have you done your research? Does your industry use a SAN (Standard
Address Number) like the book industry? Does your shipper use some kind
of Address Number (nice to be co-ordinated and let them maintain the
data for you)? Is ZIP+4 good enough for you (if you only deal with
larger commerical addresses, this often works)? Have you looked at
hashing schemes used by bulk mailers (you will need to buy adddress
scrubbing software, so see if they also have a hashing routine as an
option)? Etc.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #2

William Cleveland

Joe Celko wrote:

I have several rants about IDENTITY, which cannot be a key be
definition. You can Google them.

I found two, at:

http://www.intelligententerprise.com...celko1_1.jhtml
http://www.intelligententerprise.com/000626/celko.jhtml

Your complaints here seem to be mostly about using Identity
without explicitly declaring a primary key. If you properly
declare the TableID column as a primary key, what's the
problem?

Bill

Jul 20 '05 #3

Joe Celko

>> If you properly declare the TableID column as a primary key, what's
the problem? <<

Verification and data quality.

If I use a VIN for the key in a table of vehicles, I can verify by
phsycially looking at the engine block.

If I use a UPC code for the key in a table of retail goods, I can verify
it by calling the manufacturer (trusted source) and reading the bar
code.

What the heck is this universal, magic "TableID"? Is it a 17 letter
word in Hebrew which God put on the bottom of everything in creation at
the beginning of the Universe? Do you verify it with prayer?

Let's try this with a very simple table that assigns one driver to one
vehicle in the motorpool.

CREATE Drivers
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times. Your data integrity is trashed. The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));

To demonstrate, here is a typical id-iot newbie schema -- you will them
all over the news groups. I call them "id-iots" because they always
name the IDENTITY property column "id" in EVERY table. They don't
understand basic data modeling -- one and only name for an attribute.
About half the time they don't use any DRI, but let's show it.

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL UNIQUE,
..);

CREATE TABLE Vehicles
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
vin CHAR(17) NOT NULL UNIQUE,
..);

Now change the natural key in Personnel:

UPDATE Personnel
SET ssn = '666666666'
WHERE ssn = '000000000';

Nothing happened in Motorpool, did it? You can do the same thing with a
VIN.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM Transactions
on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #4

William Cleveland

Joe Celko wrote:

If you properly declare the TableID column as a primary key, what's
the problem? <<

Verification and data quality.

If I use a VIN for the key in a table of vehicles, I can verify by
phsycially looking at the engine block.

If you have a non-meaningful key, there's no need to verify the key,
at all. If it's just something you made up, it's correct. If you
need to check that the VIN you have stored is correct, you still can;
it just doesn't happen to be the key.

If I use a UPC code for the key in a table of retail goods, I can verify
it by calling the manufacturer (trusted source) and reading the bar
code.

And when the length of a UPC is increased (they are running out -- if I
recall correctly, they are already longer in Europe than the U.S.),
you're going to have to update every table that uses it as a foreign
key, and not just your product table.

Let's try this with a very simple table that assigns one driver to one
vehicle in the motorpool.

CREATE Drivers
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times. Your data integrity is trashed.
Put a unique constraint on it, then. You're going to have an index
on it, anyway, because the code behind your UI screens (at whatever
tier) will be looking it up.

The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));
How does VIN help identify a driver? Is this intended to support
a many-to-many relationship of some sort?

To demonstrate, here is a typical id-iot newbie schema -- you will them
all over the news groups. I call them "id-iots" because they always
name the IDENTITY property column "id" in EVERY table.
I generally use <table name>ID. That way, if I see a CustomerID field
in a table, I can be reasonably sure that it's a foreign key to a table
called Customer, which has a primary key called CustomerID. It seems
to work well.

They don't
understand basic data modeling -- one and only name for an attribute.
About half the time they don't use any DRI, but let's show it.

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL UNIQUE,
..);

CREATE TABLE Vehicles
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
vin CHAR(17) NOT NULL UNIQUE,
.);

There are too many errors here for me to really tell what you're
trying to do. Is "ssn... REFERENCES Personnel(id)" supposed to be
"ssn... REFERENCES Personnel(ssn)" or
"PersonnelID... REFERENCES Personnel(id)"?
Here's what I would normally do, anyway:

Create Table Driver
(
DriverID Int Identity Not Null Primary Key Clustered,
ssn Char(9) Not Null Unique
);

Create Table Vehicle
(
VehicleID Int Identity Not Null Primary Key Clustered,
vin Char(17) Not Null Unique
);

Create Table MotorPool
(
MotorPoolID Int Identity Not Null Primary Key Clustered,
DriverID Int Not Null References Driver (DriverID),
VehicleID Int Not Null References Vehicle (VehicleID)
);

Now change the natural key in Personnel:

UPDATE Personnel
SET ssn = '666666666'
WHERE ssn = '000000000';
Ick. Right justified code. Nothing happened in Motorpool, did it? You can do the same thing
with a VIN.
I'm not quite sure what you're getting at. If you were intending to
use the data model I did, that's the whole advantage. If someone's
ssn changes (it does happen), that one update fixes everything;
otherwise you'd have to drop every foreign key to that table, update
every table referencing it, then reinstate the foreign keys.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering.
It doesn't have to be sequential, or even numeric. Those just happen
to be handy because computers work well with integers.

Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.
If only one table has your "real key", that's not a problem.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
.."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM Transactions
on Database Systems, 4(4). pp. 397-434.

So the type used for surrogate keys should not be int, char, float or
anything else that can be used for regular values. I'll buy that, but
it would cause trouble with client code that didn't have a Key data
type. You'd still end up letting outside programs see the keys as
integers.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user.
I'm with you so far.
That means never
used in queries, DRI or anything else that a user does.
Where would you get that idea? If it's not used for queries and
DRI, what exactly is it used for?

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

The databases are going to have different schemas, anyway. I do run
into this sort of problem, occasionally, when copying data from a test
system into production; this is the only drawback I've ever seen to
Identity columns (and a built-in Key type, like above, could probably
solve it).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

"Doctor, it hurts when I do this."
"Well, don't do that."

But seriously, ssn is a meaningful key; if you're consistent about
using the IDs, you won't run into that problem.
(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

1. Assign the key as soon as you have something to store;
Identity takes care of this, and a hypothetical key type would, too.
2. Why would a retiree cease to have a key?

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Like Identity?
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

It depends what you mean by users, here. Front end users never will see
the Identity value, in most systems.

Bill

Jul 20 '05 #5

Joe Celko

>> If you have a non-meaningful key, there's no need to verify the key,
at all. If it's just something you made up, it's correct. <<

So if I assign a random number to entity X and someone else assigns a
different random number to entity X, we are both right? Both are
equally "made up", aren't they? Obviously, not. How do I know which
one, if either, is correct? I must have a way to verify it! This is
one of the problems with IDENTITY; load the same data twice and you get
new keys. Why? this is a locator, not a key; it is generated from the
physical state of the storage used for the data and not from the data
itself.

If you need to check that the VIN you have stored is correct, you still can; it just doesn't happen to be the key. <<

Then it is, at best, redundant. And I already demonstrated the dangers
of having more than one key which have no relationship to the entity.
And when the length of a UPC is increased (they are running out -- if I recall correctly, they are already longer in Europe than the U.S.),
you're going to have to update every table that uses it as a foreign
key, and not just your product table. <<

I know this one! This is the GTIN (Global Trade Identification Number)
problem. It will add five digits to the left side of the 10 digit UPC
code and the check digit will change. The migration path is already
defined in the industry (basically, add '00000' to the front and await
further instructions; integrating the ISBNs is a bit harder). You can
get consultants if you need them for the conversions. That is the nice
thing about not making up your own keys -- industry support and external
validation and verification can save your tail.

The way to do this in Standard SQL is to use the ALTER DOMAIN statement.
You are in trouble if your SQL did not support CREATE DOMAIN, but since
you have a data dictionary and have always used the same name for this
data element, you can trace all occurences of the UPC codes and write a
script to change it.

This is easy in a properly designed database; we used to go nuts with
file systems. You are probably not old enough to remember when zone
numbers became ZIP (Zone Improvement Program) codes, and then ZIP+4
codes.

But the people that really hurt in this kind of thing are the
application programmers who have to re-design forms and displays. In
the old days of pre-printed forms and physical printing type, you had to
scrap a lot of paper. Now, you just use a smaller font :)
CREATE TABLE Drivers (driverid INTEGER IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
ssn CHAR(9) NOT NULL UNIQUE); <<

Let's ignore which of the redundant keys should be clustered and the
cost of the extra index for now. Just put in two drivers:

INSERT INTO Drivers VALUES ('111111111');
INSERT INTO Drivers VALUES ('222222222');

Is not the same as:

INSERT INTO Drivers VALUES ('222222222');
INSERT INTO Drivers VALUES ('111111111');

And given a table with the two SSNs, this is not the same as the other
two statements:

INSERT INTO Drivers
SELECT ssn FROM Foobar;

Can you give me a LOGICAL explanation of why one SSN should be sequenced
before the other? Not a PHYSICAL explanation, a logical, verifiable
rule. In a proper data model all of these statements are equivalent.

What you doing is imitating a sequential tape file, then building
imitation pointer chains with these proprietary constructs.
If someone's ssn changes (it does happen), that one update fixes everything; otherwise you'd have to drop every foreign key to that
table, update every table referencing it, then reinstate the foreign
keys. <<

Bad example for two reasons. Talk to data quality people; the SSN is
one of the most stable pieces of data because it is used in so many
places. Birthdays are input wrong more often.

But ignoring that bit of trivia, this is why you have DRI actions :

CREATE TABLE Personnel
(ssn CHAR(9) NOT NULL PRIMARY KEY,
...);

CREATE TABLE Foobar
(...
ssn CHAR(9) NOT NULL
REFERENCES Personnel(ssn)
ON UPDATE CASCADE,
...);

One of many differences between a table and a file is that a table takes
care of itself because it is a part of a schema and not a "unit of work"
by itself.
So the type used for surrogate keys should not be int, char, float or anything else that can be used for regular values. I'll buy that, but it
would cause trouble with client code that didn't have a Key data type.
You'd still end up letting outside programs see the keys as integers. <<

Arrgh! you missed the whole point! A surrogate is not exposed at all;
you never know anything about the implementation at the database level.
If SQL-X wants to use and maintain pointer chains under the covers,
fine. If SQL-Y wants to use and maintain hashing under the covers,
fine. If SQL-Z wants to use and maintain compressed bit vectors under
the covers, fine. I ought to be able to take the same SQL code, put it
on SQL X, Y or Z and get the same results.
If it's [surrogate keys] not used for queries and DRI, what exactly

is it used for? <<

It is an implementation device to improve speed in the RDBMS at the
physical level and has nothing to do with the logical level. Do you
also want to expose the indexes in SQL? We used to have to do that in
file systems. Every index was explicitly created and maintained by
application programs, not by the system. You can fill in the horror
stories.

The worst multiple-key situation I did a little consulting on involved a
motorpool (which is why I used that example) and a bit of criminal
behavior. The commercial vehicles in the motorpool has VIN, tags,
company ids and permit numbers on them. The tags were expensive and the
permits for this class of vehicle were really expensive.

The guy running the motorpool noticed that of a fleet of (n) vehicles,
(k) of them were in the shop for maintenance at any time. If the
company gives you money for (n) tags and permits, you buy (n-k) tags and
permits and then pocket the difference. You just have to keep rotating
the (tag, permit) pairs to vehciles in use.

The scheme fell apart when a (tag, permit) pair was wrong during an
inspection and the authorities asked to see the motorpool records. The
records were obviously wrong, so the company called for an audit and the
police. The tags, permits and company id numbers were a scrambled mess
and could not be verified against the reality of the motorpool.

I got an email about how to use the VIN numbers in the audit, since I
had just posted a short piece on them and got Googled. Since the VIN is
a nautral key, they could decode. This sped up the audit quite a bit,
since each person could be given a list of VINs for one year, make and
model. My suggestion was that they replace the company sequence number
with the VIN and tag number printed in a scanable label as the key. I
do not know the outcome.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #6

Serge Rielau

Joe Celko wrote:

Bad example for two reasons. Talk to data quality people; the SSN is
one of the most stable pieces of data because it is used in so many
places. Birthdays are input wrong more often.

Nope. I've had 4 so far. One in Germany, one in the US, two in Canada.
The one in Canada actually changed on me having had some really
interesting effects on my credit rating.
Just my piece of trivia.

Anyway an interesting question is: How do VINs get generated?
Are they free of gaps (is each VIN accounted for)?
My guess is no. So for all intents and purposes VINs may be a
composition of some brand, model and ... an identity.

Once an object is "IDed" however it's probably a good thing to reuse the
ID if it is available.
For example I don't want to provide my SSN everywhere, just so I'm
uniquely identified.
Please give me an account-no, thank you very much and generate it any
which way you please.

Cheers
Serge

--
Serge Rielau
DB2 SQL Compiler Development
IBM Toronto Lab

Jul 20 '05 #7

Joe Celko

>> Nope. I've had 4 so far. One in Germany, one in the US, two in
Canada. <<

was that FOUR Birthdays or SSNs? :)

Anyway an interesting question is: How do VINs get generated? <<

NOTES ON VIN:

They are stamped on the engine block at the factory. In North America, a
system is used that is far more stringent than the ISO Standards but is
"backward compatible." Here, the VIN is divided into four sections:

The first three characters shall uniquely identify the manufacturer,
make and type of vehicle (with the same exception of manufacturers that
produce less than 500 vehicles). Effectively, this is the WMI. There are
indeed examples of manufacturers who have more than one WMI that use the
third character as a code for a vehicle category (for instance bus or
truck). Just as often however this is not the case;

The second section consists of five characters (VIN positions 4-8) and
identifies the attributes of the vehicle. For each type of vehicle
(passenger cars, MPV's, trucks, buses, trailers, motorcycles, incomplete
vehicles other than trailers), different information is required. For
cars, MPV's and light trucks it is required that the first two
characters of this section are alphabetic, the third and fourth shall be
numeric and the fifth alphanumeric. This section is the VDS in ISO 3779
but there it comprises another position of the VIN;

The third section consists of one character which is the check digit,
calculated over the other 16 characters of the VIN. This character can
be numeric or the letter X;
The fourth section consists of eight characters on positions 10-17 of
the VIN. The last five shall be numeric for cars, MPV's and light trucks
and the last four shall be numeric for all other vehicles. The first
character represents the vehicle model year, the second character
represents the plant of manufacture. The third through eighth characters
are a sequential production number (for manufacturers producing more
than 500 vehicles per year). For other manufacturers, the sixth, seventh
and eight positions represent the sequential production number.

This section confirms to the VIS in ISO 3779.

A portion of the VIN is the WMI (World Manufacturer Identifier) Code.
SAE assigns this code to U.S. vehicle manufacturers. If you are a U.S.
manufacturer, please contact:

Cathy Douds
WMI Coordinator
SAE International
400 Commonwealth Drive
Warrendale, PA 15096-0001
724.772.8511
724.776.4026 - fax
do***@sae.org

Related Standards:

There are several standards available on VINs and WMIs:

SAE - J187 - Truck Vehicle Identification Numbers
SAE - J218 - Passenger Car Identification Terminology
SAE - J272 - Vehicle Identification Number Systems
SAE - J273 - Passenger Car Vehicle Identification Number System
SAE - J853 - Vehicle Identification Numbers
SAE - J1108 - Truck and Truck Tractor Vehicle Identification Number
Systems
SAE - J1044 - World Manufacturer Identifier
SAE - J1229 - Truck Identification Terminology
SAE - J1877 - Recommended Practice for Bar-Coded Vehicle Identification
Number Labels

The trusted source for issued VIN is the manufacturer and there are no
missing values. Looking at the recall rate for automobiles, they do a
good job of accounting for them.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #8

William Cleveland

Joe Celko wrote:

If you have a non-meaningful key, there's no need to verify the key, at all. If it's just something you made up, it's correct. <<

So if I assign a random number to entity X and someone else assigns a
different random number to entity X, we are both right?
Whichever of you did it first is right. The second should not
have inserted a duplicate row.

You can
get consultants if you need them for the conversions. That is the nice
thing about not making up your own keys -- industry support and external
validation and verification can save your tail.

The nice thing about not using surrogates is that you can get outside
help in solving problems that don't exist when you use surrogates?

The way to do this in Standard SQL is to use the ALTER DOMAIN statement.
You are in trouble if your SQL did not support CREATE DOMAIN, but since
you have a data dictionary and have always used the same name for this
data element, you can trace all occurences of the UPC codes and write a
script to change it.

True, but you're going to be pretty much locked out of your
database while that runs, and this is a task that wouldn't be
necessary at all with surrogates.

CREATE TABLE Drivers
(driverid INTEGER IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
ssn CHAR(9) NOT NULL UNIQUE); <<

Let's ignore which of the redundant keys should be clustered and the
cost of the extra index for now.

Sure, that wasn't the point. I was just posting what DDL I would
start with.

Just put in two drivers:

INSERT INTO Drivers VALUES ('111111111');
INSERT INTO Drivers VALUES ('222222222');

Is not the same as:

INSERT INTO Drivers VALUES ('222222222');
INSERT INTO Drivers VALUES ('111111111');

Can you give me a LOGICAL explanation of why one SSN should be sequenced
before the other? Not a PHYSICAL explanation, a logical, verifiable
rule. In a proper data model all of these statements are equivalent.
Given that you don't care about the actual value of the ID,
they are equivilant.

If someone's ssn changes (it does happen), that one update fixes everything; otherwise you'd have to drop every foreign key to that
table, update every table referencing it, then reinstate the foreign
keys. <<

this is why you have DRI actions :

CREATE TABLE Personnel
(ssn CHAR(9) NOT NULL PRIMARY KEY,
...);

CREATE TABLE Foobar
(...
ssn CHAR(9) NOT NULL
REFERENCES Personnel(ssn)
ON UPDATE CASCADE,
...);

I have to admit, I wasn't aware that SQL could do that. Now that I
am, though, I'd have to say that SQL shouldn't be able to do that.
The fact that the DBMS can automatically take the necessary action
to correct the data anomolies does not mean that it's not a
violation of 2NF.

So the type used for surrogate keys should not be int, char, float or
anything else that can be used for regular values. I'll buy that, but it
would cause trouble with client code that didn't have a Key data type.
You'd still end up letting outside programs see the keys as integers. <<

Arrgh! you missed the whole point! A surrogate is not exposed at all;
you never know anything about the implementation at the database level.

No, actually I caught that. My point was that you're not going to
reach that ideal as long as you have client code that doesn't support
the key system.

If it's [surrogate keys] not used for queries and DRI, what exactly

is it used for? <<

It is an implementation device to improve speed in the RDBMS at the
physical level and has nothing to do with the logical level.

Like how SQL Server links indexes with the original table rows,
when the original table does not have a clustered primary key?

Bill

Jul 20 '05 #9

William Cleveland

William Cleveland wrote:

If someone's ssn changes (it does happen), that one update fixes

everything; otherwise you'd have to drop every foreign key to that
table, update every table referencing it, then reinstate the foreign
keys. <<

this is why you have DRI actions :

CREATE TABLE Personnel
(ssn CHAR(9) NOT NULL PRIMARY KEY,
...);

CREATE TABLE Foobar
(...
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn)
ON UPDATE CASCADE,
...);

I have to admit, I wasn't aware that SQL could do that. Now that I
am, though, I'd have to say that SQL shouldn't be able to do that.
The fact that the DBMS can automatically take the necessary action
to correct the data anomolies does not mean that it's not a
violation of 2NF.

Perhaps I should clarify, here. It's not a violation of 2NF in the
strict sense, as ssn is the key, but ssn here is really two pieces
of data. First, it records that the Foobar row belongs with the
personnel row, and second, it records with Foobar what the ssn is.

Viewed in that sense, it could be considered a 1NF violation, but
the data errors that come up are the errors you would get from a
2NF violation.

Bill

Jul 20 '05 #10

Joe Celko

>> Whichever of you did it first is right. The second should not have
inserted a duplicate row. <<

But my point is that this will not be a duplicate **row** since it has a
different locator value. The problem is that it is the same **entity**
and since such locators have nothing to do with that entity, they do not
model it correctly

I use "locator" or "physical locator" for exposed proprietary things
generated from the hardware configuration in which the database is
stored -- ROWID, GUID, autoincrement, IDENTITY, etc.-- having nothing to
do with the data itself that newbie mistake for a proper key. I then
use "surrogate key" like Dr. Codd says -for UNexposed proprietary
things generated from the hardware configuration in which the database
is stored that speed up access or maintain integerity under the covers.

The nice thing about not using surrogates is that you can get outside help in solving problems that don't exist when you use surrogates? <<

The trouble with locators is that you still have all of the problems of
real keys (ex. GTIN conversion) PLUS all the extra problems of
maintaining the fake pointer chains among the locators PLUS the data
integrity problems of matching real keys to locators correctly.
True, but you're going to be pretty much locked out of your database while that [ALTER DOMAIN] runs, and this is a task that wouldn't be
necessary at all with surrogates. <<

Hey, if you move from UPC to GTIN, then you still need to change every
UPC domain column, validate the new codes, etc. with or without
surrogates. A massive change is a massive change. Find some old
people: talk to an American about the ZIP and ZIP+4; talk to a Brit
about currency Decimalization and Metricifcation; talk to anyone about
Y2K.
Given that you don't care about the actual value of the ID, they are equivalent. <<

If two values, call them L1 and L2, are equivalent, then they can
substitute for each other by definition. Let's establish some PK-FK
linksages with the two locators. Now swap L1 and L2 in either just the
referenced or referencing table.

In the old days of network databases (we lived in trees, we ate our
children, we programmed in COBOL), there were utility programs to fix
broken pointer chains, find orphans, compress space, etc. I have no
desire to re-write them 30 years later in SQL myself.
I have to admit, I wasn't aware that SQL could do that. Now that I

am, though, I'd have to say that SQL shouldn't be able to do that. <<

DRI actions came into SQL-92, along with CREATE ASSERTION, CREATE DOMAIN
and some other goodies vendors are slow in giving us.

Unh? You want a relational database which cannot maintain
relationships? The model is that the DRI actions take place "all at
once" and that at the end of the statement that fired them the database
is in a state consistent with all constraints. We aimed for Domain-Key
Normal Form.

I have trouble understanding why you'd favor a proprietrary,
non-relational thing like IDENTITY, but balk at a declarative logical
constraint.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #11

Ellen K.

If I can add my two cents here:

In SQL Server (which is the RDBMS being discussed in this newsgroup),
the default behavior is for the primary key to be the clustered index.
You can only have one clustered index on a table. The clustered index
defines how the data are physically stored. So if you use an
artificial primary key which is an automatically-incremented number,
voilà, your data are stored in the order in which they were entered...
JUST LIKE FLAT FILES as Joe has pointed out numerous times.

All the other indexes are non-clustered... and when data is sought
based on a non-clustered index, it's a two-step process -> the
non-clustered index points to the clustered index, which points to the
data.

Suppose you have a table of something like contract status history.
Typically queries are going to be looking for the status history of a
particular contract, or all the contracts of a particular customer.
Using a compound primary key that looks like ContractNo, SeqNo means
first of all only ONE step and secondly, once you find the first row
for that customer, you've found all of them. Certainly this will
result in far better query performance than an artificial primary key
which is an automatically-incremented numeric value.

On 11 May 2004 15:26:54 GMT, Joe Celko <jc*******@earthlink.net>
wrote:

If you have a non-meaningful key, there's no need to verify the key,at all. If it's just something you made up, it's correct. <<

So if I assign a random number to entity X and someone else assigns a
different random number to entity X, we are both right? Both are
equally "made up", aren't they? Obviously, not. How do I know which
one, if either, is correct? I must have a way to verify it! This is
one of the problems with IDENTITY; load the same data twice and you get
new keys. Why? this is a locator, not a key; it is generated from the
physical state of the storage used for the data and not from the data
itself.
If you need to check that the VIN you have stored is correct, youstill can; it just doesn't happen to be the key. <<

Then it is, at best, redundant. And I already demonstrated the dangers
of having more than one key which have no relationship to the entity.
And when the length of a UPC is increased (they are running out -- ifI recall correctly, they are already longer in Europe than the U.S.),
you're going to have to update every table that uses it as a foreign
key, and not just your product table. <<

I know this one! This is the GTIN (Global Trade Identification Number)
problem. It will add five digits to the left side of the 10 digit UPC
code and the check digit will change. The migration path is already
defined in the industry (basically, add '00000' to the front and await
further instructions; integrating the ISBNs is a bit harder). You can
get consultants if you need them for the conversions. That is the nice
thing about not making up your own keys -- industry support and external
validation and verification can save your tail.

The way to do this in Standard SQL is to use the ALTER DOMAIN statement.
You are in trouble if your SQL did not support CREATE DOMAIN, but since
you have a data dictionary and have always used the same name for this
data element, you can trace all occurences of the UPC codes and write a
script to change it.

This is easy in a properly designed database; we used to go nuts with
file systems. You are probably not old enough to remember when zone
numbers became ZIP (Zone Improvement Program) codes, and then ZIP+4
codes.

But the people that really hurt in this kind of thing are the
application programmers who have to re-design forms and displays. In
the old days of pre-printed forms and physical printing type, you had to
scrap a lot of paper. Now, you just use a smaller font :)
CREATE TABLE Drivers(driverid INTEGER IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
ssn CHAR(9) NOT NULL UNIQUE); <<

Let's ignore which of the redundant keys should be clustered and the
cost of the extra index for now. Just put in two drivers:

INSERT INTO Drivers VALUES ('111111111');
INSERT INTO Drivers VALUES ('222222222');

Is not the same as:

INSERT INTO Drivers VALUES ('222222222');
INSERT INTO Drivers VALUES ('111111111');

And given a table with the two SSNs, this is not the same as the other
two statements:

INSERT INTO Drivers
SELECT ssn FROM Foobar;

Can you give me a LOGICAL explanation of why one SSN should be sequenced
before the other? Not a PHYSICAL explanation, a logical, verifiable
rule. In a proper data model all of these statements are equivalent.

What you doing is imitating a sequential tape file, then building
imitation pointer chains with these proprietary constructs.
If someone's ssn changes (it does happen), that one update fixeseverything; otherwise you'd have to drop every foreign key to that
table, update every table referencing it, then reinstate the foreign
keys. <<

Bad example for two reasons. Talk to data quality people; the SSN is
one of the most stable pieces of data because it is used in so many
places. Birthdays are input wrong more often.

But ignoring that bit of trivia, this is why you have DRI actions :

CREATE TABLE Personnel
(ssn CHAR(9) NOT NULL PRIMARY KEY,
...);

CREATE TABLE Foobar
(...
ssn CHAR(9) NOT NULL
REFERENCES Personnel(ssn)
ON UPDATE CASCADE,
...);

One of many differences between a table and a file is that a table takes
care of itself because it is a part of a schema and not a "unit of work"
by itself.
So the type used for surrogate keys should not be int, char, float oranything else that can be used for regular values. I'll buy that, but it
would cause trouble with client code that didn't have a Key data type.
You'd still end up letting outside programs see the keys as integers. <<

Arrgh! you missed the whole point! A surrogate is not exposed at all;
you never know anything about the implementation at the database level.
If SQL-X wants to use and maintain pointer chains under the covers,
fine. If SQL-Y wants to use and maintain hashing under the covers,
fine. If SQL-Z wants to use and maintain compressed bit vectors under
the covers, fine. I ought to be able to take the same SQL code, put it
on SQL X, Y or Z and get the same results.
If it's [surrogate keys] not used for queries and DRI, what exactly

is it used for? <<

It is an implementation device to improve speed in the RDBMS at the
physical level and has nothing to do with the logical level. Do you
also want to expose the indexes in SQL? We used to have to do that in
file systems. Every index was explicitly created and maintained by
application programs, not by the system. You can fill in the horror
stories.

The worst multiple-key situation I did a little consulting on involved a
motorpool (which is why I used that example) and a bit of criminal
behavior. The commercial vehicles in the motorpool has VIN, tags,
company ids and permit numbers on them. The tags were expensive and the
permits for this class of vehicle were really expensive.

The guy running the motorpool noticed that of a fleet of (n) vehicles,
(k) of them were in the shop for maintenance at any time. If the
company gives you money for (n) tags and permits, you buy (n-k) tags and
permits and then pocket the difference. You just have to keep rotating
the (tag, permit) pairs to vehciles in use.

The scheme fell apart when a (tag, permit) pair was wrong during an
inspection and the authorities asked to see the motorpool records. The
records were obviously wrong, so the company called for an audit and the
police. The tags, permits and company id numbers were a scrambled mess
and could not be verified against the reality of the motorpool.

I got an email about how to use the VIN numbers in the audit, since I
had just posted a short piece on them and got Googled. Since the VIN is
a nautral key, they could decode. This sped up the audit quite a bit,
since each person could be given a list of VINs for one year, make and
model. My suggestion was that they replace the company sequence number
with the VIN and tag number printed in a scanable label as the key. I
do not know the outcome.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #12

William Cleveland

Joe Celko wrote:

The nice thing about not using surrogates is that you can get outside help in solving problems that don't exist when you use surrogates? <<

The trouble with locators is that you still have all of the problems of
real keys (ex. GTIN conversion)
What trouble would that be that I'd still have? There's only one
place to update.

True, but you're going to be pretty much locked out of your database while that [ALTER DOMAIN] runs, and this is a task that wouldn't be
necessary at all with surrogates. <<

Hey, if you move from UPC to GTIN, then you still need to change every
UPC domain column, validate the new codes, etc. with or without
surrogates.

Yes, but "every column" is just one column.

A massive change is a massive change. Changing one column in one table is going to be a lot less massive
than changing that column, plus the equivilant column in every table
that the original table has a one-to-many relationship with.

Find some old
people: talk to an American about the ZIP and ZIP+4; talk to a Brit
about currency Decimalization and Metricifcation; talk to anyone about
Y2K.
I'm not sure why you're arguing my point for me, but I appreciate it.

Given that you don't care about the actual value of the ID, they are equivalent. <<

If two values, call them L1 and L2, are equivalent, then they can
substitute for each other by definition. Let's establish some PK-FK
linksages with the two locators. Now swap L1 and L2 in either just the
referenced or referencing table.

The values are only equivilant in that they are both meaningless.
If you use the "Cascade On Update" foreign keys that you were pushing,
then yes, you could swap the values, so long as you had one more
temporary value.
In the old days of network databases (we lived in trees, we ate our
children, we programmed in COBOL), there were utility programs to fix
broken pointer chains, find orphans, compress space, etc. I have no
desire to re-write them 30 years later in SQL myself.

Granted, you've probably been programming four times as long as I have,
but I've never once seen ID references break.

I have to admit, I wasn't aware that SQL could do that. Now that I

am, though, I'd have to say that SQL shouldn't be able to do that. <<

Unh? You want a relational database which cannot maintain
relationships?

Other than constraints, relationships do not need maintaining,
unless you use meaningful keys.

The model is that the DRI actions take place "all at
once" and that at the end of the statement that fired them the database
is in a state consistent with all constraints.
I kind-of figured that. I'm sure it works, but that doesn't mean
it's not a hack.

I have trouble understanding why you'd favor a proprietrary,
non-relational thing like IDENTITY, The Identity keyword is proprietary, but the concept certainly seems
to be a defacto standard. Is there any RDBMS that does not have an
equivilant?

Further, I maintain that it's meaningful keys that are
non-relational, as they violate both the 1NF
and additional 2NF rules, conceptially, if not physically.

but balk at a declarative logical constraint.

Refusing to put bad data in is a constraint. Automatic updates
to keep data in synch is not, unless you would consider an index
to be a constraint, as well.

Bill

Jul 20 '05 #13

William Cleveland

Ellen K. wrote:

If I can add my two cents here:

In SQL Server (which is the RDBMS being discussed in this newsgroup),
the default behavior is for the primary key to be the clustered index.
You can only have one clustered index on a table. The clustered index
defines how the data are physically stored. So if you use an
artificial primary key which is an automatically-incremented number,
voilà, your data are stored in the order in which they were entered...
JUST LIKE FLAT FILES as Joe has pointed out numerous times.

Except that it's probably a B-Tree under there somewhere. Also,
you're not dependant on the storage order.

All the other indexes are non-clustered... and when data is sought
based on a non-clustered index, it's a two-step process -> the
non-clustered index points to the clustered index, which points to the
data.

Yes, and when you join on the ID field you usually are using that
clustered index.
Suppose you have a table of something like contract status history.
Typically queries are going to be looking for the status history of a
particular contract, or all the contracts of a particular customer.
Using a compound primary key that looks like ContractNo, SeqNo means
first of all only ONE step and secondly, once you find the first row
for that customer, you've found all of them. Certainly this will
result in far better query performance than an artificial primary key
which is an automatically-incremented numeric value.

For select in that situation, probably. That is not always going to
be what you're dealing with, and Inserts are faster with a clustered
ID primary key.

It depends on what queries you think will happen frequently. I usually
end up clustering something other than the primary key about 20% of
the time.

Bill

Jul 20 '05 #14

Ellen K.

You want orphan rows?

On Thu, 13 May 2004 00:54:46 GMT, William Cleveland
<WC********@Ameritech.Net> wrote:

Other than constraints, relationships do not need maintaining,
unless you use meaningful keys.

Jul 20 '05 #15

William Cleveland

Orphan rows come from meaningful keys.

Bill
Ellen K. wrote:

You want orphan rows?

On Thu, 13 May 2004 00:54:46 GMT, William Cleveland
<WC********@Ameritech.Net> wrote:

Other than constraints, relationships do not need maintaining,
unless you use meaningful keys.

Jul 20 '05 #16

Ellen K.

So if you use an artificial primary key and you delete a row in the
parent table, you don't care if the related rows in the child table
remain?

That's not what I consider a relational database.

On Thu, 13 May 2004 22:36:32 GMT, William Cleveland
<WC********@Ameritech.Net> wrote:

Orphan rows come from meaningful keys.

Bill
Ellen K. wrote:
You want orphan rows?

On Thu, 13 May 2004 00:54:46 GMT, William Cleveland
<WC********@Ameritech.Net> wrote:

Other than constraints, relationships do not need maintaining,
unless you use meaningful keys.

Jul 20 '05 #17

William Cleveland

Um, no. Foreign keys prevent the row from being deleted.

Bill
Ellen K. wrote:

So if you use an artificial primary key and you delete a row in the
parent table, you don't care if the related rows in the child table
remain?

That's not what I consider a relational database.

On Thu, 13 May 2004 22:36:32 GMT, William Cleveland
<WC********@Ameritech.Net> wrote:

Orphan rows come from meaningful keys.

Bill

Jul 20 '05 #18

Similar topics