Question for --CELKO--

Guinness Mann

Greetings,

I think I saw an article yesterday in which you told someone not to make
an Identity column a primary key. I seem to remember you saying
something like "It undermines the entire foundation of the relational
model."

I've been thinking about that all night and I think I understand what
you're saying, but I have some questions.

I think you're saying that if the real key in a 3 attribute (heh, heh.
didn't say fields. heh heh) tuple (heh heh) is a combination of the
three attributes, then that's what you should use as a primary key.

Do you then advocate never using an Identity attribute? Or is it
acceptable (in the relational model) to have an Identity attribute to
use as a handle to the row, and for attributes in other tables to use as
the target for a foreign key?

Thank you,

-- Rick

Jul 20 '05 #1

Subscribe Post Reply

3178

Daniel Morgan

Guinness Mann wrote:

Greetings,

I think I saw an article yesterday in which you told someone not to make
an Identity column a primary key. I seem to remember you saying
something like "It undermines the entire foundation of the relational
model."

I've been thinking about that all night and I think I understand what
you're saying, but I have some questions.

I think you're saying that if the real key in a 3 attribute (heh, heh.
didn't say fields. heh heh) tuple (heh heh) is a combination of the
three attributes, then that's what you should use as a primary key.

Do you then advocate never using an Identity attribute? Or is it
acceptable (in the relational model) to have an Identity attribute to
use as a handle to the row, and for attributes in other tables to use as
the target for a foreign key?

Thank you,

-- Rick

I'll let Joe answer for himself. But from where I teach at the
University of Washington my
advice would be to use natural keys whenever practicable: That it is a
judgement call.

Some people seem to have knee-jerk reaction to CREATE TABLE that
involves using
a surrogate key without ever applying a single synapse to examining the
data that will be
stored, the methods that may be applied for inserting, updating, and
deleting. And think
that writing reports to get the information back out is "someone else's
problem." The result
is that quite often they make it just that ... someone else's problem.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #2

Daniel Morgan

Guinness Mann wrote:

Greetings,

I think I saw an article yesterday in which you told someone not to make
an Identity column a primary key. I seem to remember you saying
something like "It undermines the entire foundation of the relational
model."

I've been thinking about that all night and I think I understand what
you're saying, but I have some questions.

I think you're saying that if the real key in a 3 attribute (heh, heh.
didn't say fields. heh heh) tuple (heh heh) is a combination of the
three attributes, then that's what you should use as a primary key.

Do you then advocate never using an Identity attribute? Or is it
acceptable (in the relational model) to have an Identity attribute to
use as a handle to the row, and for attributes in other tables to use as
the target for a foreign key?

Thank you,

-- Rick

Jul 20 '05 #3

Joe Celko

>> I think I saw an article yesterday in which you told someone not to
make an Identity column a primary key. I seem to remember you saying
something like "It undermines the entire foundation of the relational
model." <<

Actually that was Dr. Codd, not me. I just happen to agree.

I think you're saying that if the real key in a 3 attribute (heh, heh. didn't say fields. heh heh) tuple (heh heh) is a combination of the
three attributes, then that's what you should use as a primary key. <<

Rats! I love to jump on people about (file/table), (row, record) and
(column/field) :)

I don't know if I'd use it as the PRIMARY KEY, because I have not seen
the rest of the table, but it would certainly be a candidate key and
enforced with at least a UNIQUE(a,b,c) constraint.

Let me go ahead one more step and play Q&A with the direction I think
you are going:

Q: Couldn't that make a very long key?

A1: So what? This is the 2000's century and we have much better
computers than we did in the 1950's when key size was a real physical
issue. What is funny to me is the number of idiots who replace a
natural two or three integer compound key with a hugh GUID that no human
being or other system can possibly understand because they think it will
be faster and easy to program.

A2: This is an implementation problem that the SQL engine can handle.
For example, Teradata is an SQL designed for VLDB apps that uses hashing
instead of B-tree or other indexes. They guarantee that no search
requires more than two probes, no matter how large the database. A tree
index requires more and more probes as the size of the database
increases.
Do you then advocate never using an Identity attribute? Or is it

acceptable (in the relational model) to have an Identity attribute to
use as a handle to the row, and for attributes in other tables to use as
the target for a foreign key? <<

A handle to the row? Oh, you mean faking a sequential file's positional
record number, so I can refernce the physical storage location? Sure,
if I want to lose all the advantages of an asbstract data model, SQL set
oriented programming, carry extra data and destroy the portability of
code!

More and more programmers who have absolutely no database training are
being told to design a database. They are using GUIDs, IDENTITY, ROWID
and other proprietary auto-numbering "features" in SQL products to
imitate either a record number (sequential file system mindset) or OID
(OO mindset) since they don't know anything else.

Experienced database designers tend toward intelligent keys they find in
industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They know
that they need to verify the data against the reality they are modeling.
A trusted external source is a good thing to have.

The IDENTITY column is a holdover from the early programming languages
which were very close to the hardware. For example, the fields [not
columns! Big difference] in a COBOL or FORTRAN program were assumed to
be physically located in main storage in the order they were declared in
the program. The languages have constructs using that model -- logical
and physical implementation are practically one! The data has meaning
BECAUSE of the program reading it (i.e the same bits could be a
character in one program and be an integer in another)

The early SQLs were based on existing file systems. The data was kept
in physically contiguous disk pages, in physically contiguous rows, made
up of physically contiguous columns. In short, just like a deck of
punch cards or a magnetic tape. Most programmer still carry that mental
model, which is why I keep doing that rant about file vs. table, row vs.
record and column vs. field.

But physically contiguous storage is only one way of building a
relational database and it is not the best one. The basic idea of a
relational database is that user is not supposed to know *how* or
*where* things are stored at all, much less write code that depends on
the particular physical representation in a particular release of a
particular product on particular hardware at a particular time.

One of the biggest errors is the IDENTITY column (actually property, not
a column at all) in the Sybase/SQL Server family. People actually
program with this "feature" and even use it as the primary key for the
table! Now, let's go into painful details as to why this thing is bad.

The first practical consideration is that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products. Newbies
actually think they will never port code! Perhaps they only work for
companies that are failing and will be gone. Perhaps their code is such
crap nobody else want their application.

But let's look at the logical problems. First try to create a table
with two columns and try to make them both IDENTITY. If you cannot
declare more than one column to be of a certain data type, then that
thing is not a datatype at all, by definition. It is a property which
belongs to the PHYSICAL table, not the LOGICAL data in the table.

Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.

Finally create a simple table with one IDENTITY and a few other columns.
Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that column
in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged as
deleted or unused to move the empty space to the physical end of the
physical file.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different IDENTITY numbers? In the relational model, they
should be treated the same if all the values of all the attributes are
identical.

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the curren relase of a particular databse
product, how do you verify that an entity has such a number in the
reality you are modeling?

You will see newbies who design tables like this:

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times. Your data integrity is trashed. The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));

If you want to enforce a rule that a car can have one driver:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL PRIMARY KEY
REFERENCES Motorpool(vin));

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure waht time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM Transactions
on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #4

Guinness Mann

In article <1067473255.185683@yasure>, da******@x.washington.edu says...

I'll let Joe answer for himself. But from where I teach at the
University of Washington my advice would be to use natural keys
whenever practicable: That it is a judgement call.

What's your position on this part? (I mean, in *addition* to having the
natural key in the table.)

...is it acceptable (in the relational model) to have an
Identity attribute to use as a handle to the row, and for
attributes in other tables to use as the target for a
foreign key?

--Rick

Jul 20 '05 #5

Guinness Mann

In article <3f*********************@news.frii.net>,
jo*******@northface.edu says...

Actually that was Dr. Codd, not me. I just happen to agree.

Joe, thank you for the thoughtful (and long!) reply.

I guess I have to go think some more. It seems to me that by
eliminating part numbers, say, and making a parts database key on the
attributes of the parts (short, fat, screw) then I would end up
duplicating a lot of information somewhere down the line. Likewise with
eliminating, say, invoice numbers. But perhaps that's a symptom of my
incomplete understanding of database design.

I went to the bookstore the other day to buy a copy of the most recent
edition of Codd/Date, and found out it costs $90 now. Sheesh. Now I've
got to start a savings account so I can buy a book.

Thanks again,

-- Rick

Jul 20 '05 #6

Joe Celko

>> It seems to me that by eliminating part numbers, say, and making a
parts database key on the attributes of the parts (short, fat, screw)
then I would end up duplicating a lot of information somewhere down the
line <<

Bad example. Read the part about researching for a trusted, industry
standard data source to verify your data. ANSI and ISO have complete
definitions of screws
(http://www.industrialpress.com/Handb...ents/Sect9.pdf for
starters).

Some standards are based on physically verifiable attributes. Got a
sextant? Bet you can get (longtitude, latitude) for a location -- which
is a long compound key. Got a tire? The ISO code is made up of "<width
in cm> <material> <diameter in inches>" such as "155R13"; Invariant over
time and testable in reality are good things.

... invoice numbers <<
You might have an idustry standard for them; look at the Ariba Supplier
Network for e-commerce. But if you have to make up your own invoice
numbers, then at least use a check digit or syntax rules to help human
beings get it right.
I went to the bookstore the other day to buy a copy of the most

recent edition of Codd/Date, and found out it costs $90 now. Sheesh. Now
I've got to start a savings account so I can buy a book. <<

Nah! Just look behind the coffee shop at Barnes & Noble; that is where
their loan office is usually located. Bring your car title and deed to
your house.

Seriously, I know the feeling I got when I saw my first $100+ calculus
book. Since I am both an author and a professor, I have mixed feeling
about the cost of textbooks.

I am trying as hard as I can to set up the math classes I will be
teaching using Dover paperbound editions and Schaum's Outlines. But good
Comp Sci books seem to stay high.

I tell people my retiement dream, if I cannot write a run-away best
seller like HARRY POTTER, is to write a High Algebra book that gets
adopted by several large school systems.

1) High cover price means high royalties
2) The content never significantly changes and never goes out of date --
write once, read many in practice!
3) School systems order copies by the truckload
4) Thousands of teenagers will hate me

What more could a middle-aged man want in his retirement?

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #7

xAvailx

Joe Celko:

Fictitious Problem: Company X is a multinational corporation that
would like to give a little back to their employees. Company X is
holding an open event in Vegas to thank everyone for their hard work.
They would like to know who will be attending the event for
hotel/dinner/gambling purposes. Employees are allowed to bring their
spouses and kids.

Company X develops a website in their intranet so employees can
register at their convenience. The information collected ends up in a
database where they will later run reports.

Some of the information collected as follows. R means required.

R-First Name,
R-Last Name,
SSN,
Meal Preference,
Date of birth,
Credit Card Number,
Date Registered,
etc....

Note that SSN is not required because the employer feels they would be
invading the employees privacy for asking this kind of information
about their family.

What would you suggest the natural key is here or in a similar
situation where very little will be known about the entity because X
reason?

Thanks for your time.

BZ

Joe Celko <jo*******@northface.edu> wrote in message news:<3f*********************@news.frii.net>...

I think I saw an article yesterday in which you told someone not to make an Identity column a primary key. I seem to remember you saying
something like "It undermines the entire foundation of the relational
model." <<

Actually that was Dr. Codd, not me. I just happen to agree.
I think you're saying that if the real key in a 3 attribute (heh, heh. didn't say fields. heh heh) tuple (heh heh) is a combination of the
three attributes, then that's what you should use as a primary key. <<

Rats! I love to jump on people about (file/table), (row, record) and
(column/field) :)

I don't know if I'd use it as the PRIMARY KEY, because I have not seen
the rest of the table, but it would certainly be a candidate key and
enforced with at least a UNIQUE(a,b,c) constraint.

Let me go ahead one more step and play Q&A with the direction I think
you are going:

Q: Couldn't that make a very long key?

A1: So what? This is the 2000's century and we have much better
computers than we did in the 1950's when key size was a real physical
issue. What is funny to me is the number of idiots who replace a
natural two or three integer compound key with a hugh GUID that no human
being or other system can possibly understand because they think it will
be faster and easy to program.

A2: This is an implementation problem that the SQL engine can handle.
For example, Teradata is an SQL designed for VLDB apps that uses hashing
instead of B-tree or other indexes. They guarantee that no search
requires more than two probes, no matter how large the database. A tree
index requires more and more probes as the size of the database
increases.
Do you then advocate never using an Identity attribute? Or is it

acceptable (in the relational model) to have an Identity attribute to
use as a handle to the row, and for attributes in other tables to use as
the target for a foreign key? <<

A handle to the row? Oh, you mean faking a sequential file's positional
record number, so I can refernce the physical storage location? Sure,
if I want to lose all the advantages of an asbstract data model, SQL set
oriented programming, carry extra data and destroy the portability of
code!

More and more programmers who have absolutely no database training are
being told to design a database. They are using GUIDs, IDENTITY, ROWID
and other proprietary auto-numbering "features" in SQL products to
imitate either a record number (sequential file system mindset) or OID
(OO mindset) since they don't know anything else.

Experienced database designers tend toward intelligent keys they find in
industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They know
that they need to verify the data against the reality they are modeling.
A trusted external source is a good thing to have.

The IDENTITY column is a holdover from the early programming languages
which were very close to the hardware. For example, the fields [not
columns! Big difference] in a COBOL or FORTRAN program were assumed to
be physically located in main storage in the order they were declared in
the program. The languages have constructs using that model -- logical
and physical implementation are practically one! The data has meaning
BECAUSE of the program reading it (i.e the same bits could be a
character in one program and be an integer in another)

The early SQLs were based on existing file systems. The data was kept
in physically contiguous disk pages, in physically contiguous rows, made
up of physically contiguous columns. In short, just like a deck of
punch cards or a magnetic tape. Most programmer still carry that mental
model, which is why I keep doing that rant about file vs. table, row vs.
record and column vs. field.

But physically contiguous storage is only one way of building a
relational database and it is not the best one. The basic idea of a
relational database is that user is not supposed to know *how* or
*where* things are stored at all, much less write code that depends on
the particular physical representation in a particular release of a
particular product on particular hardware at a particular time.

One of the biggest errors is the IDENTITY column (actually property, not
a column at all) in the Sybase/SQL Server family. People actually
program with this "feature" and even use it as the primary key for the
table! Now, let's go into painful details as to why this thing is bad.

The first practical consideration is that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products. Newbies
actually think they will never port code! Perhaps they only work for
companies that are failing and will be gone. Perhaps their code is such
crap nobody else want their application.

But let's look at the logical problems. First try to create a table
with two columns and try to make them both IDENTITY. If you cannot
declare more than one column to be of a certain data type, then that
thing is not a datatype at all, by definition. It is a property which
belongs to the PHYSICAL table, not the LOGICAL data in the table.

Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.

Finally create a simple table with one IDENTITY and a few other columns.
Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that column
in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged as
deleted or unused to move the empty space to the physical end of the
physical file.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different IDENTITY numbers? In the relational model, they
should be treated the same if all the values of all the attributes are
identical.

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the curren relase of a particular databse
product, how do you verify that an entity has such a number in the
reality you are modeling?

You will see newbies who design tables like this:

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times. Your data integrity is trashed. The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));

If you want to enforce a rule that a car can have one driver:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL PRIMARY KEY
REFERENCES Motorpool(vin));

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure waht time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
.."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM Transactions
on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #8

Joe Celko

>> Company X develops a website in their intranet so employees can
register at their convenience .. <<

So, how do they identify their employees now? They have to be doing
somehow.

What can we deduce? They already know SSN for tax purposes, so that is
not a privacy issue. They know the employee's company email address, so
that would be a very good way to identify them (have you noticed that is
popular becuase it can be easily verified externally?).

Date of birth is already known, but meals and credit card info would be
collected by the hotel registration system, not the company's site.

What would you suggest the natural key is here or in a similar

situation where very little will be known about the entity because X
reason? <<

Make a list of what you know and then learn more, if you need to.
Making a data model of vague "stuff" or "thingies" is a waste of time.
This is the part about RDBMS that newbies hate -- there is no "Silver
Bullet" that solves all problems with the push of a function key or the
click of a mouse button. You have to work! And it can be very hard
work.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #9

Daniel Morgan

Guinness Mann wrote:

In article <1067473255.185683@yasure>, da******@x.washington.edu says...

I'll let Joe answer for himself. But from where I teach at the
University of Washington my advice would be to use natural keys
whenever practicable: That it is a judgement call.

What's your position on this part? (I mean, in *addition* to having the
natural key in the table.)

...is it acceptable (in the relational model) to have an
Identity attribute to use as a handle to the row, and for
attributes in other tables to use as the target for a
foreign key?

--Rick

If I have a natural key that works I'd never build a surrogate key in
addition unless there was
a compelling reason to do so. Last thing I'd want is another index with
its overhead, another
constraint to maintain, a potential maintenance headache, and the sorry
fact that different
developers would make different decisions to accomplish the very same
act making long term
maintainability a nightmare.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #10

Jason Berkan

On 30 Oct 2003 17:51:06 GMT, Joe Celko <jo*******@northface.edu>
wrote:

Some standards are based on physically verifiable attributes. Got a
sextant? Bet you can get (longtitude, latitude) for a location -- which
is a long compound key. Got a tire? The ISO code is made up of "<width
in cm> <material> <diameter in inches>" such as "155R13"; Invariant over
time and testable in reality are good things.

Just for my own curiosity, as I am trying to drag myself away from
flat file databases - what happens when you need to track multiple
tires? Say I have four tires that are all "155R13", and I need to
store each one separately in the database. Obviously the ISO code
will need to have something else added to it, as it does not uniquely
identify each tire I own.

Jason

Jul 20 '05 #11

Daniel Morgan

Jason Berkan wrote:

On 30 Oct 2003 17:51:06 GMT, Joe Celko <jo*******@northface.edu>
wrote:

Some standards are based on physically verifiable attributes. Got a
sextant? Bet you can get (longtitude, latitude) for a location -- which
is a long compound key. Got a tire? The ISO code is made up of "<width
in cm> <material> <diameter in inches>" such as "155R13"; Invariant over
time and testable in reality are good things.

Just for my own curiosity, as I am trying to drag myself away from
flat file databases - what happens when you need to track multiple
tires? Say I have four tires that are all "155R13", and I need to
store each one separately in the database. Obviously the ISO code
will need to have something else added to it, as it does not uniquely
identify each tire I own.

Jason

How about an integer such as 1, 2, 3, 4 or letters such as FL, FR, RL, RR?

Then, of course, there is that reason why bar codes were invented.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #12

Anith Sen

>> Actually that was Dr. Codd, not me. I just happen to agree. <<

OK, I browsed through the 1979 ACM paper and with all due respect, you
simply agreed to him without any further analysis of his positions. A couple
of observations:

1. To quote "...but they have no control over its value, nor is its value
ever displayed to them" by Codd is a blatant violation of the definition of
the relational database in the same paper. To quote "... is a time-varying
collection of data, all of which can be accessed and updated as if they were
organized as a collection of time-varying tabular (nonheirarchic) relations
of assorted degrees defined on a given set of simple domains". Don't you
find this contradictory?

2. As a matter of fact hiding surrogates constitutes a clear violation of
Codd's own Information Principle which states that all information in a
database must be expressed as values in relations and in no other way. So,
you support the information principle and hiding surrogates at the same
time?

3. There is little relevance to the three points paraphrasing the paper
regarding the difficulties in employing user-controlled keys, primarily
because surrogates are, by default, DBMS generated. Nobody, as I see, in
this discussion is in confusion whether a surrogate should be
user-controlled or system-assigned.

This means that a surrogate ought to act like an index; created by the

user, managed by the system and NEVER seen by a user. That means never used
in queries, DRI or anything else that a user does. <<

This is a clear misunderstanding of the concept of data independence. First,
Codd suggested surrogate in RM/T for reasons of semantic extensions,
primarily to overcome the 3 difficulties paraphrased in your post. How on
earth can a surrogate behaving like an index, hidden from the user, solve a
complexity of a logical operation?

I see surrogates, perfectly valid as logical values in a relational
database. The notion of "hidden" surrogate is a misguided concept to me,
unless you can explain what I am missing above.

Now, the problem with value generated by identity column is different from
not being a surrogate by itself. Identity column values in SQL Server simply
do not qualify as valid surrogate primarily on grounds of explicit violation
of physical data independence.

--
-- Anith
( Please reply to newsgroups only )

Jul 20 '05 #13

xAvailx

Joe Celko:

Thanks for your response. Again this is a ficticious problem but I
will propose answers to your questions.

So, how do they identify their employees now? They have to be doing
somehow.
Since this is a global company, each country has their own scheme of
identifying their division employees. This may result in
identifications collisions in between countries/divisions but I guess
then adding the division as part of the key would solve the problem.
What can we deduce? They already know SSN for tax purposes, so that is not a privacy issue. They know the employee's company email address, so that would be a very good way to identify them (have you noticed that is popular becuase it can be easily verified externally?).
The employees SSN you are correct. But how about their spouses and
children?
Also, since this is a multinational, SSN wouldn't be a valid method of
identification for everybody.
They know the employee's company email address, so
that would be a very good way to identify them (have you noticed that is popular becuase it can be easily verified externally?).
Yes, email address could indeed be a very good candidate key. However,
I have come accross several organizations where some of the employees
don't have a business email address because they plain don't need it.
there is no "Silver
Bullet" that solves all problems with the push of a function key or the click of a mouse button. You have to work! And it can be very hard work.

You are right. But I am not suggesting there is a silver bullet. That
is why I am giving a specific scenario. I agree that natural keys
should be used when there is an industry standard but sometimes it may
not be a realistic solution.

Thanks,

BZ

Joe Celko <jo*******@northface.edu> wrote in message news:<3f*********************@news.frii.net>...
Company X develops a website in their intranet so employees can register at their convenience .. <<

So, how do they identify their employees now? They have to be doing
somehow.

What can we deduce? They already know SSN for tax purposes, so that is
not a privacy issue. They know the employee's company email address, so
that would be a very good way to identify them (have you noticed that is
popular becuase it can be easily verified externally?).

Date of birth is already known, but meals and credit card info would be
collected by the hotel registration system, not the company's site.
What would you suggest the natural key is here or in a similar

situation where very little will be known about the entity because X
reason? <<

Make a list of what you know and then learn more, if you need to.
Making a data model of vague "stuff" or "thingies" is a waste of time.
This is the part about RDBMS that newbies hate -- there is no "Silver
Bullet" that solves all problems with the push of a function key or the
click of a mouse button. You have to work! And it can be very hard
work.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #14

Daniel Morgan

xAvailx wrote:

Joe Celko:

<snipped>

Since this is a global company, each country has their own scheme of
identifying their division employees. This may result in
identifications collisions in between countries/divisions but I guess
then adding the division as part of the key would solve the problem.

What can we deduce? They already know SSN for tax purposes, so that

is

not a privacy issue. They know the employee's company email

address, so

that would be a very good way to identify them (have you noticed

that is

popular becuase it can be easily verified externally?).

The employees SSN you are correct. But how about their spouses and
children?
Also, since this is a multinational, SSN wouldn't be a valid method of
identification for everybody.

Based on your assumptions SSN would be the worst possible key. I can
only think of a single
country that uses SSNs in the format ###-##-####. And there is no
guarantee that even some
one in the US will have one, or that a person will have only one, or
that every person has a
different one.

A natural key must be a key ... not a contrivance. And then one must
consider Federal
privacy legislation that might make such usage illegal.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #15

xAvailx

Daniel:

Based on your assumptions SSN would be the worst possible key. I can
only think of a single
country that uses SSNs in the format ###-##-####. And there is no
guarantee that even some
one in the US will have one, or that a person will have only one, or
that every person has a
different one
Not sure if this was meant for me...but I said SSN would not be a good key :)
And then one must
consider Federal
privacy legislation that might make such usage illegal.

I was not aware of that. Any good references where I can learn more about this?

Thanks,

BZ

Daniel Morgan <da******@x.washington.edu> wrote in message news:<1067634277.582361@yasure>... xAvailx wrote:
Joe Celko:

<snipped>

Since this is a global company, each country has their own scheme of
identifying their division employees. This may result in
identifications collisions in between countries/divisions but I guess
then adding the division as part of the key would solve the problem.

What can we deduce? They already know SSN for tax purposes, so that

is

not a privacy issue. They know the employee's company email

address, so

that would be a very good way to identify them (have you noticed

that is

popular becuase it can be easily verified externally?).

The employees SSN you are correct. But how about their spouses and
children?
Also, since this is a multinational, SSN wouldn't be a valid method of
identification for everybody.

Based on your assumptions SSN would be the worst possible key. I can
only think of a single
country that uses SSNs in the format ###-##-####. And there is no
guarantee that even some
one in the US will have one, or that a person will have only one, or
that every person has a
different one.

A natural key must be a key ... not a contrivance. And then one must
consider Federal
privacy legislation that might make such usage illegal.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)
--

Jul 20 '05 #16

Joe Celko

>> have four tires that are all "155R13", and I need to
store each one separately in the database. Obviously <<

How about a million gains of sand instead? Tirs area commodity item,
you have t hav esome way of making them different if you want (for Ghod
knows what reason) track each one. What the heck, bar code them with a
scanable tag that has a check digit. Make sure the tag is not removable
so it becomes a physical part of the tire.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Jul 20 '05 #17

Daniel Morgan

xAvailx wrote:

Daniel:

Based on your assumptions SSN would be the worst possible key. I can
only think of a single
country that uses SSNs in the format ###-##-####. And there is no
guarantee that even some
one in the US will have one, or that a person will have only one, or
that every person has a
different one
Not sure if this was meant for me...but I said SSN would not be a good key :)
And then one must
consider Federal
privacy legislation that might make such usage illegal.

I was not aware of that. Any good references where I can learn more about this?

Thanks,

BZ

Daniel Morgan <da******@x.washington.edu> wrote in message news:<1067634277.582361@yasure>...

xAvailx wrote:

Joe Celko:

<snipped>

Since this is a global company, each country has their own scheme of
identifying their division employees. This may result in
identifications collisions in between countries/divisions but I guess
then adding the division as part of the key would solve the problem.

>What can we deduce? They already know SSN for tax purposes, so that
>
>
>
>
is

>not a privacy issue. They know the employee's company email
>
>
>
>
address, so

>that would be a very good way to identify them (have you noticed
>
>
>
>
that is

>popular becuase it can be easily verified externally?).
>
>
>
>
The employees SSN you are correct. But how about their spouses and
children?
Also, since this is a multinational, SSN wouldn't be a valid method of
identification for everybody.

Based on your assumptions SSN would be the worst possible key. I can
only think of a single
country that uses SSNs in the format ###-##-####. And there is no
guarantee that even some
one in the US will have one, or that a person will have only one, or
that every person has a
different one.

A natural key must be a key ... not a contrivance. And then one must
consider Federal
privacy legislation that might make such usage illegal.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)
--

You'd need to contact your attorney or the Social Security
Administration as I can't give legal
advice here. But recent changes to privacy rules related to identity
theft are creating situations
where it is illegal to use it and it is anticipated that further
restrictions will be put into place in the
future.

So even if you could legally use it now ... there is no guarantee that
your usage wouldn't be a
violation in the future.

--
Daniel Morgan
http://www.outreach.washington.edu/e...ad/oad_crs.asp
http://www.outreach.washington.edu/e...oa/aoa_crs.asp
da******@x.washington.edu
(replace 'x' with a 'u' to reply)

Jul 20 '05 #18

John Palmer

On Thu, 30 Oct 2003 14:36:36 -0600, Jason Berkan
<ja**********@canada.com> wrote:

On 30 Oct 2003 17:51:06 GMT, Joe Celko <jo*******@northface.edu>
wrote:
Some standards are based on physically verifiable attributes. Got a
sextant? Bet you can get (longtitude, latitude) for a location -- which
is a long compound key. Got a tire? The ISO code is made up of "<width
in cm> <material> <diameter in inches>" such as "155R13"; Invariant over
time and testable in reality are good things.

Just for my own curiosity, as I am trying to drag myself away from
flat file databases - what happens when you need to track multiple
tires? Say I have four tires that are all "155R13", and I need to
store each one separately in the database. Obviously the ISO code
will need to have something else added to it, as it does not uniquely
identify each tire I own.

Well, there's certainly brand name and product line ("Goodyear
Aquatread 155R13"), which ought to be as close as you can get. Since
this is a permanant, meaningful part of the tire, it sounds like it'll
satisfy the conditions the theorists are proposing.
--
Everything I needed to know in life I learned in Kindergarten. Like:
Once you pull the pin on Mr. Hand Grenade, he is no longer your friend.

Jul 20 '05 #19

Similar topics