Modeling/Constraint question

Thomas R. Hummel

I'm going to try to describe this situation as best as I can, but if
anything is unclear please let me know. Some of this database was
already in place before I arrived on the scene, and I don't really
have the buyin to change it.

There is a table of certificates that a person can hold. A person may
hold certificates of various types. I need to further group a person's
certificates together so that (for example) renewal dates can be
synchronized. There are only certain types of certificates that can be
linked together, which is defined in another set of tables. Here's a
first-go of the model. I've eliminated irrelevant columns from some of
the tables.

-- This table already exists as-is
CREATE TABLE dbo.Person (
ID INT NOT NULL IDENTITY,
FirstName VARCHAR(30) NULL,
LastName VARCHAR(30) NULL)
GO
ALTER TABLE dbo.Person
ADD CONSTRAINT PK_Person PRIMARY KEY (ID)
GO

-- This table already exists as-is
CREATE TABLE dbo.Certificate (
ID INT NOT NULL IDENTITY,
PersonID INT NOT NULL,
CertificateTypeID INT NOT NULL) -- The table this FKs to is
just an ID and description. Not included here for brevity reasons
GO
ALTER TABLE dbo.Certificate
ADD CONSTRAINT PK_Certificate PRIMARY KEY (ID)
GO
ALTER TABLE dbo.Certificate
ADD CONSTRAINT FK_CertificatePerson FOREIGN KEY (PersonID) REFERENCES
dbo.Person (ID)
GO

-- This table is new and represents which types of certificates can be
linked and synchronized
CREATE TABLE dbo.CCRTypeCertificateType (
CCRTypeID INT NOT NULL, -- This FKs to a table that is
mostly just an ID and description
CertificateTypeID INT NOT NULL)
GO
ALTER TABLE dbo.CCRTypeCertificateType
ADD CONSTRAINT PK_CCRTypeCertificateType PRIMARY KEY (CCRTypeID,
CertificateTypeID)
GO

-- This table defines which actual certificates are linked together
CREATE TABLE dbo.CCRCertificate (
CCRTypeID INT NOT NULL,
PersonID INT NOT NULL,
PrimaryCertificationID INT NOT NULL, -- This is necessary
because a person can let a certification lapse and then start a new
one of the same type as the old
CertificationID INT NOT NULL )
GO
ALTER TABLE dbo.CCRCertificate
ADD CONSTRAINT PK_CCRCertificate PRIMARY KEY (CCRTypeID, PersonID,
PrimaryCertificationID, CertificationID)
GO

Other tables hang off of the CCRCertificate table. The PersonID and
CertificationID in the same table seems to violate Boyce-Codd Normal
Form, so I've been asked to remove the PersonID (not that BCNF was the
reason given of course, but it's "duplicate data" since the
PrimaryCertificationID should already determine the PersonID).

What I'd like to do is:

1. Constrain the CCRCertificate table (or use additional table(s) if
necessary) so that all certifications linked as CertificationID are
from the same person.

2. Require that at most 1 certificate of each type in
CCRTypeCertificateType for a CCRType is added to CCRCertificate. In
other words, if the CCRTypeCertificateType table indicates that a
CCRType 1 is made up of CertificationType 1 and CertificationType3,
the CCRCertificate table should only include at most one certification
that has a CertificationTypeID of 1 and one that has an ID of 3 if the
CCRTypeID is 1.

I hope this isn't too confusing. Let me know if there's anything that
I can make clearer.

Thanks!
-Tom.

P.S. The SQL here is for illustrative purposes. I'm on a laptop that
doesn't have SQL Server installed, so I can't test the SQL at the
moment. Since I'm not asking for any code that might need to be tested
I hope this isn't an issue.

Jun 27 '08 #1

Subscribe Reply

2332

Eric Isaacs

What I find a bit odd is that the database has PrimaryCertificationID
and CertificationID as foreignkeys in the CCRCert table. I don't
pretend to understand the business side here, but if a cert can be
primary to other certs, then there should be a relationship between
rows in the cert table instead, where the cert relates back to
itself. This model isn't particularly invalid, but when you start
adding other foreign keys to that table, it's more difficult to
maintain the integrity, as it looks like you're experiencing.

You mentioned that you don't have buyin to change the design, but a
possible relationship within Cert to it's primary Cert record might
make things a bit easier???

Hope that helps.

Jun 27 '08 #2

Thomas R. Hummel

On May 22, 4:46*pm, Eric Isaacs <eisa...@gmail.comwrote:

What I find a bit odd is that the database has PrimaryCertificationID
and CertificationID as foreignkeys in the CCRCert table. *I don't
pretend to understand the business side here, but if a cert can be
primary to other certs, then there should be a relationship between
rows in the cert table instead, where the cert relates back to
itself. *This model isn't particularly invalid, but when you start
adding other foreign keys to that table, it's more difficult to
maintain the integrity, as it looks like you're experiencing.

You mentioned that you don't have buyin to change the design, but a
possible relationship within Cert to it's primary Cert record might
make things a bit easier???

Hope that helps.

Thanks for the advice. The PrimaryCertificationID was added on after
it was realized that a Person could hold more than one CCRType
historically. Originally there was another table (CCR) which had a
primary key of (CCRTypeID, PersonID). The idea was that a person could
only have one CCR at most for each possible type. Then we determined
that a person could have a certification expire, apply for a new
certification, and have each certification joined to a separate CCR.
The CCR table had an identifying FK into CCRCertification. The
PrimaryCertificationID was added to get around that issue, but now I'm
trying to think through whether or not that is truly descriptive of
the real world situation.

Thanks to your advice, I'm going to spend some time today
concentrating on that relationship between certifications.

Thanks,
-Tom.

Jun 27 '08 #3

Thomas R. Hummel

Just in case people see these responses and figure that the issue is a
solved matter... I'm still interested in whether or not people tend to
be concerned about using BCNF and how one might add a constraint like
the one that I described in question #2.

Thanks!
-Tom.

Jun 27 '08 #4

--CELKO--

The original design stinks on several levels
1) singular table names -- only one row?
2) IDENTITY and that silly magically universal "id"
3) too many NULL-able columns
4) data element changing names among tables
5) absurd names like "type_id" -- which is it? They are totally
different kinds of attributes
6) Vague specs like "certificates can be linked and synchronized" ; I
will guess you mean that there are prerequisites
7) You talk expiration of certifications, but have no such data in the
schema.

CREATE TABLE Persons
(person_id INTEGER NOT NULL PRIMARY KEY, -- find a real key!!
first_name VARCHAR(30) NOT NULL,
last_name VARCHAR(30) NOT NULL);

-- a person is not an attribute of a certificate!
CREATE TABLE Certificates
(certificate_nbr INTEGER NOT NULL, -- needs actual work and probably
should be a string
certificate_type INTEGER NOT NULL
CHECK (certificate_type IN (..)), -- use check() when list is
short and stable
prereq_certificate_nbr INTEGER, -- null means no prerequisite
);

--this is a relationship among persons and certificates
CREATE TABLE Certifications
(certificate_nbr INTEGER NOT NULL
REFERENCES Certificates(certificate_nbr)
ON UPDATE CASCADE,
person_id INTEGER NOT NULL
REFERENCES Persons (person_id)
ON UPDATE CASCADE,
PRIMARY KEY (person_id, certificate_nbr),
start_date DATE NOT NULL,
expiration_date DATE NOT NULL,
CHECK (start_date < expiration_date));

-- this is where I lost you, so I did this
CREATE TABLE Prerequisites
(certificate_nbr INTEGER NOT NULL
REFERENCES Certificates(certificate_nbr)
ON UPDATE CASCADE,
prereq_certificate_nbr INTEGER NOT NULL
REFERENCES Certificates(certificate_nbr)
ON UPDATE CASCADE,
PRIMARY KEY (certificate_nbr, prereq_certificate_nbr));

Jun 27 '08 #5

Thomas R. Hummel

On May 23, 12:54*pm, --CELKO-- <jcelko...@earthlink.netwrote:

The original design stinks on several levels
1) singular table names -- only one row?
2) IDENTITY and that silly magically universal "id"
3) too many NULL-able columns
4) data element changing names among tables

I fully agree with you on all four of these, but as explained they
already exist in the database that I was handed. If I had the ok to
design it as I'd like or change the existing design then I certainly
would. In fact, I've spent much of this week trying to convince our
"Enterprise Architecture" team that using IDENTITY for PKs is a
mistake. I've even used many of your own arguments to do so Joe.
Unfortunately, I may lose that battle as well. The reason given (this
still makes me sick to my stomach) is, "I'm prejudiced that way" When
pressed for a reason as to *WHY*, no reason was given. But I
digress...

5) absurd names like "type_id" -- which is it? They are totally
different kinds of attributes

Again, I agree with you on this one, but it was already the
terminology being used and I don't consider this enough of an issue to
warrant the inconsistencies of doing it another way.

6) Vague specs like "certificates can be linked and synchronized" ; I
will guess you mean that there are prerequisites

Obviously what I've written out aren't the full specs. I hoped to
provide enough information without writing a 300-page document, but I
obviously erred on the side of not providing enough information.

7) You talk expiration of certifications, but have no such data in the
schema.

This isn't a full data model, only what I thought would be relevant to
the specific questions that I had. There is a table called CCRCycle
which includes start and end dates. The DDL for that is something
along the lines of:

CREATE TABLE dbo.CCR (
CCRTypeID INT NOT NULL,
PersonID INT NOT NULL,
PrimaryCertificationID INT NOT NULL )
GO
ALTER TABLE dbo.CCR
ADD CONSTRAINT PK_CCR PRIMARY KEY (CCRTypeID, PersonID,
PrimaryCertificationID)
GO

CREATE TABLE dbo.CCRCycle (
CCRTypeID INT NOT NULL,
PersonID INT NOT NULL,
PrimaryCertificationID INT NOT NULL,
CycleNumber INT NOT NULL,
EffectiveStartDate DATETIME NOT NULL,
EffectiveEndDate DATETIME NOT NULL )
GO
ALTER TABLE dbo.CCRCycle
ADD CONSTRAINT PK_CCRCycle PRIMARY KEY (CCRTypeID, PersonID,
PrimaryCertificationID, CycleNumber)
GO
ALTER TABLE dbo.CCRCycle
ADD CONSTRAINT FK_CCRCycle FOREIGN KEY (CCRTypeID, PersonID,
PrimaryCertificationID) REFERENCES CCR (CCRTypeID, PersonID,
PrimaryCertificationID)
GO

There isn't a prerequisite between certificates, but once a person has
two certificates of particular types (defined by the
CCRTypeCertificationType) they have to synchronize the start and end
dates of the final cycle for the two certifications. The
synchronization can go in either direction, so neither certificate
type is necessarily a primary one (PrimaryCertificationID was a poor
name choice).

I hope this makes things a little clearer. I'm trying to type this out
in a meeting, so I'll try to add additional pertinent information
later if I think of anything.

Thanks,
-Tom.

Jun 27 '08 #6

Eric Isaacs

Tom,

I'm still shooting in the dark here a little, but are all the
certificates grouped together in one group per person, or are there
multiple groups of certificates for each person? Certificates have
primary certs now, but you said that was a bad name because a single
cert is not really the "primary" and that they're just groups.

If the person has only one "group" of certificates, consider putting
the data for the "group" in the person table. If a person can have
more than one certificate "group", you'll need a table between people
and certs to logically model the "group" notion.

HTH

Jun 27 '08 #7

Thomas R. Hummel

On May 23, 2:58*pm, Eric Isaacs <eisa...@gmail.comwrote:

Tom,

I'm still shooting in the dark here a little, but are all the
certificates grouped together in one group per person, or are there
multiple groups of certificates for each person? *Certificates have
primary certs now, but you said that was a bad name because a single
cert is not really the "primary" and that they're just groups.

If the person has only one "group" of certificates, consider putting
the data for the "group" in the person table. *If a person can have
more than one certificate "group", you'll need a table between people
and certs to logically model the "group" notion.

HTH

They can definitely have more than one group. As an example, let's say
that there are three types of certificates, A, B, and C. A person
could get a certification of type A and B which are not tied in any
way to each other. Later, they might get a certification for type C
which is synchronized with type A. At that point they would have two
groups: (A, C) and (B). Each group would have its own set of cycles.

Thanks,
-Tom.

Jun 27 '08 #8

--CELKO--

> I fully agree with you on all four of these, ... If I had the ok to design it as I'd like or change the existing design then I certainly would. In fact, I've spent much of this week trying to convince our "Enterprise Architecture" team that using IDENTITY for PKs is a mistake. <<

If you want some more ammo; here is a reply to an old article on mine
that just got posted. Every year, I get a lot (at least 2 since I
started writing in the trade press and an all-time high of 15 in 2002)
of private emails that say the same thing with the painful details of
dealing with an "ID-iot" design; I cannot publish them because they
are private emails, but you might want to Google this public one.
Honest, it is real and typical, very typical:

=============
Re: Celko on SQL: Natural, Artificial and Surrogate Keys Explained

Joe hits it dead on here. I've been specialized in DB design and
programming for a dozen years now as a consultant, and every time I've
gone into a shop that uses identity columns as PKs there have been
data issues because of it.

My current client used them extensively and now they have a complex
process to merge entities because duplicates always turn up.

On top of that, they have two systems with two databases which both
include data for the same entity. They have another complex system set
up to transfer identities across the databases so that they can try to
keep them in sync. The result is a mess and there are constantly
problems with the syncing. Problems that wouldn't be there if the keys
were natural instead of derived by the system.

People complain that Joe's outlook is overly theoretical, but those
"theories" result in real-world consequences. Gravity is just a
theory, but that doesn't mean that jumping off a 10-story building is
a good idea.

http://www.informationweek.com/showA...leID=201806814

================

> The reason given (this still makes me sick to my stomach) is, "I'm prejudiced that way" When pressed for a reason as to *WHY*, no reason was given. But I digress...<<

We both know that is not a reason, but an excuse :)

>but I obviously erred on the side of not providing enough information. <<

Safer in a newsgroup posting ..

>There isn't a prerequisite between certificates, but once a person has two certificates of particular types (defined by the CCRTypeCertificationType) they have to synchronize the start and end dates of the final cycle for the two certifications. The synchronization can go in either direction, so neither certificate type is necessarily a primary one PrimaryCertificationID was a poor name choice). <<

It sounds like a fun problem; can you catch me off line? I was out of
town for a week in the NYC area working with a small hedge fund (under
NDA just after an internal audit) and I just back for the 3-day
weekend. Junk mail, emails, junk emails, and dog poop piled up while
I was gone :)

Jun 27 '08 #9

Thomas R. Hummel

On May 25, 6:56*pm, --CELKO-- <jcelko...@earthlink.netwrote:

*I fully agree with you on all four of these, ... If I had the ok to design it as I'd like or change the existing design then I certainly would. In fact, I've spent much of this week trying to convince our "Enterprise Architecture" team that using IDENTITY for PKs is a mistake. <<

If you want some more ammo; here is a reply to an old article on mine
that just got posted. *Every year, I get a lot (at least 2 since I
started writing in the trade press and an all-time high of 15 in 2002)
of private emails that say the same thing with the painful details of
dealing with an "ID-iot" design; I cannot publish them because they
are private emails, but you might want to Google this public one.
Honest, it is real and typical, very typical:

Heheh... it's funny that you should post that response to your
article. I don't seem to see the comments on the link that you
provided, but I'm familiar with it from looking for "ammo" a few days
ago. You'll notice that the name on that response is "Tom". That's not
coincidence :)

There isn't a prerequisite between certificates, but once a person has two certificates of particular types (defined by the CCRTypeCertificationType) they have to synchronize the start and end dates of the final cycle for the two certifications. The synchronization can go in either direction, so neither certificate type is necessarily a primary one PrimaryCertificationID was a poor *name choice). *<<

It sounds like a fun problem; can you catch me off line? *I was out of
town for a week in the NYC area working with a small hedge fund (under
NDA just after an internal audit) and I just back for the 3-day
weekend. *Junk mail, emails, junk emails, and dog poop piled up while
I was gone :)

I'll try to get an email out later today. Thanks for the help.

-Tom.

Jun 27 '08 #10

Eric Isaacs

On May 23, 12:03*pm, "Thomas R. Hummel" <tom_hum...@hotmail.com>
wrote:

They can definitely have more than one group. As an example, let's say
that there are three types of certificates, A, B, and C. A person
could get a certification of type A and B which are not tied in any
way to each other. Later, they might get a certification for type C
which is synchronized with type A. At that point they would have two
groups: (A, C) and (B). Each group would have its own set of cycles.

There isn't a prerequisite between certificates, but once a person has two certificates of particular types (defined by the CCRTypeCertificationType) they have to synchronize the start and end dates of the final cycle for the two certifications. The synchronization can go in either direction, so neither certificate type is necessarily a primary one PrimaryCertificationID was a poor name choice). <<

It seems to me the missing entity could be the
PersonCCRTypeCertificationType table that would contain the FKs for
each of those tables and the start and end dates for those
certification groups, (unless those dates are calculated based on the
individual certification dates.)

Jun 27 '08 #11

Iain Sharp

On Fri, 23 May 2008 09:54:56 -0700 (PDT), --CELKO--
<jc*******@earthlink.netwrote:

>The original design stinks on several levels
2) IDENTITY and that silly magically universal "id"

>
CREATE TABLE Persons
(person_id INTEGER NOT NULL PRIMARY KEY, -- find a real key!!
first_name VARCHAR(30) NOT NULL,
last_name VARCHAR(30) NOT NULL);

Just curious here Joe, what would be your preferred primary key for a
person?

Jun 27 '08 #12

--CELKO--

>Just curious here Joe, what would be your preferred primary key for a person? <<

What would be your preferred primary key for a thing? See how silly
that question is now? Now, if you beleive in Kabbalah mysticism then
God gave everything a number and the true name of God is a 216-digit
number. Better than a GUID, unh?

The answer is that you pick a key based on the *role* that a person
plays in your data model. If there are options, look at the degree of
trust/risk you need versus validation and verification effort/cost.

1) When I worked for the prison system, we used a "ten-card" (full set
of finger prints) to move inmates around. We needed a high degree of
certainty. Same thing for DNA in courts, hospitals, etc.

2) When I cash a check at the grocery store, my driver's license
number is good enough. They can live with a 3% bad check rate and
don't have to fingerprint me (tho Kroger's was doing that in Ohio at
one time).

3) Amazon.com, Google groups, etc. use my email address

4) I use a passport when I am a tourist overseas.

This question needs a name, since newbies ask it so often -- how about
"The magic, generic person question" , "The Kabbalah Kwestion" until
we have something better?

Jun 27 '08 #13

Erland Sommarskog

--CELKO-- (jc*******@earthlink.net) writes:

>>Just curious here Joe, what would be your preferred primary key for a

person? <<

1) When I worked for the prison system, we used a "ten-card" (full set
of finger prints) to move inmates around. We needed a high degree of
certainty. Same thing for DNA in courts, hospitals, etc.

Such a solution may work when the persons to put in the database are
all physical persons, and you actually can get their fingerprints. It's
not going to fly in a web shop.

2) When I cash a check at the grocery store, my driver's license
number is good enough. They can live with a 3% bad check rate and
don't have to fingerprint me (tho Kroger's was doing that in Ohio at
one time).

Yeah, but can a relational database live with 3% duplicates in its
primary key?

3) Amazon.com, Google groups, etc. use my email address

Then they are not modelling persons, they are modelling e-mail
addresses. If that is meant to be persons, they are better off using
IDENTITY, which is easier to work with than strings. You can shop as
jo*******@hotmail.com, as jo******@hotmail.com, as jc********@earthlink.net,
and they get a lot of duplicates in their database. On the other
hand, if you and your wife share e-mail address, you are two persons,
and legally two customers.

4) I use a passport when I am a tourist overseas.

Not all persons have a passport. Specifically, companies never have any.

This question needs a name, since newbies ask it so often -- how about
"The magic, generic person question" , "The Kabbalah Kwestion" until
we have something better?

This is not a newbie question. Any one who has worked with data modelling
in real-life system will this to persons who only have done data
modelling in theory.

In the general case, there is no natural usable key for persons. For a
specific application there may be. But it is not always the case. Trying to
sweep it under the carpet by saying silly with things like
"The Kabbalah Kwestion" is not going to change that.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx

Jun 27 '08 #14

--CELKO--

>Such a solution may work when the persons to put in the database are all physical persons, and you actually can get their fingerprints. It's not going to fly in a web shop. <<

Yes. That is why websites ask for your email address as a verifiable
identifier with easy validation. We are back to the degree of trust
in the data again, too.

>Yeah, but can a relational database live with 3% duplicates in its primary key? <<

By definition, there are no duplicates in the keys; they are in the
data. Most mailing lists I have worked with would be happy if the
error rate was <3% (six sigma and all that jazz); read the standard
contract for buying a mailing list -- the upper limit is 10% before
you can get your money back.

Every application has to set its own degree of trust/risk parameter
for their data quality. We are back to the idea of trust and risk.

>Then they are not modeling persons, they are modeling e-mail addresses. <<

No, the email is the way to get to a person (in the ISO sense of a
lawful person, human or otherwise) who plays the role of "customer"
for them. We are back to the idea of a role.

>If that is meant to be persons, they are better off using IDENTITY, which is easier to work with than strings. <<

No, it isn't. Besides the design fallacy of confusing the
representation with the fact being modeled, it is a hell of lot easier
to screw up an integer than an email address that has validation and
verification rules. How did you confirm that Cindy Lou Who was really
number 42? (Pardon the Dr. Seuss reference)

>You can shop as joe.ce...@hotmail.com, as joece...@hotmail.com, as jcelko4...@earthlink.net, and they get a lot of duplicates in their database.<<

Gee, they don't use one of those address scrubbing pieces of software
that would show them the duplication? This was a problem that was
solved for postal addresses decades ago; it is super easy with IP
Addresses.

But more than the obvious fix for personal email via software,
remember the idea of roles. The role of "jc*******@earthink.net" is
as a private citizen who buys math and CS books while the role of
"jcelko@some_company.com" is as an employee who buys office supplies.
So I buy my porno at one address and never the other.n They are
different roles.

>On the other hand, if you and your wife share e-mail address, you are two persons, and legally two customers. <<

That depends on how we pay. If we use the same credit card, then are
not legally two customers; we are both liable. A company credit card
would not make ALL the stockholders separate customers. This is basic
business law!

>Not all persons have a passport. Specifically, companies never have any. <<

Nor do those lawful person travel much. Ever been on a cruise ship
with IBM sitting at your table? You seem to be missing the point
about roles..

>Any one who has worked with data modeling in real-life system will this to persons who only have done data modeling in theory. <<

Sorry, but I have a fair number of implemented data models under my
belt, AND I teach the theory. Knowing what I am doing got me the
jobs :) The last complete data model was for a company that does
software for disaster relief (very fuzzy data coming is rapidly). The
last fix was for a small company with less than 100 employees and more
than $8 billion in assets this year.

Expecting a Universal, Magical identifier for "objects",
"relationships", "persons", etc. is very much a "newbie question" that
expects a Kabbalah number to appear for them. The "relationships"
version of this is an EAV design, while exposed physical locators like
GUIDs, IDENTITY and other proprietary features are the error committed
for entities.

Sit down and re-word your posting as favoring EAV; it maps very
nicely.

>In the general case, there is no natural usable key for persons. For a specific application there may be. <<

That is what I have been saying all this time! That is why you look
for roles.

>But it is not always the case. <<

How hard did you look? Given the concept of a role played by a lawful
person, you can most often find an identifier that gives you the role;
all you care about is that role.

But it is easier to just start writing code than actually designing.
They don't have the role concept or much ability to abstract, so they
want a physical human being.

>Trying to sweep it under the carpet by saying silly with things like "The Kabbalah Kwestion" is not going to change that. <<

No, just the opposite; I am trying to expose the design fallacy of a
magical Kabbalah number for people or other entities.

Jun 27 '08 #15

Ed Murphy

--CELKO-- wrote:

>>Holy crap... you don't want to deal with the "overhead" associated with using an IDENTITY, but you want him to write fuzzy logic and/or buy a social network application to "figure out" which person he's looking at? <<

No, I want him to keep this contact list on a "Roladex" or something
that allows for him to do all the work by hand while the problem is
small, and risk is minimal (calling the wrong Bruce is not a
disaster). When the problem gets too big for humans, such as doing
intelligence work, and the data is fuzzy, then move on to special
tools (bombing the wrong town is a disaster).

The original questioner didn't specify (at least I don't remember him
doing so) whether he was actually using IDENTITY, or a human-selected
key. To provide a reference point, the main SQL-based software package
that I work with does the latter; contact names must be unique within a
(customer, vendor, etc.), and if you have multiple Bruce Smiths then you
have to call the second one "Bruce Smith (AP department)" or something;
which, while technically part of the RDBMS, ends up working like a
Rolodex in practice.

Jun 27 '08 #16

Erland Sommarskog

--CELKO-- (jc*******@earthlink.net) writes:

>>Yeah, but can a relational database live with 3% duplicates in its
primary key? <<

By definition, there are no duplicates in the keys; they are in the
data.

So why then asked about the natural key for a person, do you start
yiddling about 3% acceptable error rate?

We are in a relational database, and for technical reason we don't
accept any duplicates. But as you notice, in the real world they are
unavoidable in practice. Enter: the surrogate key.

No, the email is the way to get to a person (in the ISO sense of a
lawful person, human or otherwise) who plays the role of "customer"
for them. We are back to the idea of a role.

The question was about persons, not roles. When you asked about the key
for a person, you suggested "e-mail address". You now admit that it is
not the key for a person, but for something else.

No, it isn't. Besides the design fallacy of confusing the
representation with the fact being modeled, it is a hell of lot easier
to screw up an integer than an email address that has validation and
verification rules. How did you confirm that Cindy Lou Who was really
number 42? (Pardon the Dr. Seuss reference)

On most invoices I get, there is a customer number. Apparently many
organisations think that assigning numbers to their customers is a good
idea.

Sure, someone may confuse a number, but why would e-mail addresswa be
less prone to confusion? They are typically longer, often include
people's name that may be difficult to spell, so there are more
chances for things to go wrong. Systems with case-sensitive collations
have their own problems.

And even if an application may use e-mail address or whatever as the
exposed key, you can be fairly sure that they also have a surrogate
that they do not expose. That makes things a little easier when a
customer changes his e-mail address.

>>Any one who has worked with data modeling in real-life system will this
to persons who only have done data modeling in theory. <<

Sorry, but I have a fair number of implemented data models under my
belt, AND I teach the theory.

This isn't about classroom theory, Joe. This is about real world.

Expecting a Universal, Magical identifier for "objects",
"relationships", "persons", etc. is very much a "newbie question"

And it's very much a newbie answer to suggest that a surrogate such
an IDENTITY is not an acceptable key for a person.

while exposed physical locators like GUIDs, IDENTITY

None of them are physical locators, and you know that very well. Since
you blantantly repeat this lie which is very easily demasked by the
SQL Server community, why should believe anything else you say?

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx

Jun 27 '08 #17

--CELKO--

> We are in a relational database, and for technical reason we don't accept any duplicates. But as you notice, in the real world they are unavoidable in practice. Enter: the surrogate key. <<

And it does not help because duplication, missing values, errors and
other data quality problems are in the data itself -- not the physical
storage. There is a whole set of tools in CRM packages for trying to
find duplicates which has nothing to do with data access methods.

>The question was about persons, not roles. When you asked about the key for a person, you suggested "e-mail address". You now admit that it is not the key for a person, but for something else. <<

Again, with the magical universal "person" rather than the role they
play in the data model! The only reasonable answer for a person has
to be DNA, then. It can be validated and verified. It will not
change over time. It will not be any different in another DB (ISO/IEC
19794 Biometric Data Interchange Format for DNA). I can even tell
identical twins apart with it (http://www.nytimes.com/2008/03/11/
health/11real.html).

Of course, it is useless until there is a cheap testing procedure that
everyone agrees to use. I believe that in ~10 years you will have a
smart card with a ton of personal data on it. When you want to do
business, you stick the card in one slot and your finger in another,
and the machine matches you to the card and pulls out the data for the
role that you play in that system. At the bank, I am an account; at
the hospital, I am a patient; at the airport, I am a fleeing felony;
etc.

>On most invoices I get, there is a customer number. Apparently many organisations think that assigning numbers to their customers is a good idea. <<

They are finding that it is a better idea to make the commercial
customers get a DUNS number rather than crate their own.

> Sure, someone may confuse a number, but why would e-mail addresses be less prone to confusion? They are typically longer, often include people's name that may be difficult to spell, so there are more chances for things to go wrong. Systems with case-sensitive collations have their own problems. <<

Being longer does not make something more error-prone. For example,
Compuserve uses an <adjective><nounphrase for passwords. You are
less apt to mess up typing "purple bagpipes" than a 15-digit number.
And they are not easy to guess because of the large domain space. In
fact, because numeric strings are so easy to mess up, we add check
digits to them.

> And even if an application may use e-mail address or whatever as the exposed key, you can be fairly sure that they also have a surrogate that they do not expose. That makes things a little easier when a customer changes his e-mail address. <<

Why not just cascade the change?

>None of them are physical locators, and you know that very well. Since you blatantly repeat this lie which is very easily unmasked by the SQL Server community, why should believe anything else you say? <<

Okay, tell me what attribute of the entity they model? Tell me how to
verify them in the real world without looking at one particular piece
of software and hardware? Why are they different from a pointer or a
link (the only difference I see is that pointers are usually hidden
locators and the system, rather than the programmer, does the
housekeeping). If they are attributes, why don't they have the same
value for the same entity everywhere (i.e. my VIN on my automobile
does not change in the DMV database, my insurance company database,
CarFax, etc. )?

If they come from the physical storage and not the data model or the
real world, and their purpose is to locate a particular record (not
row; a row would have a proper key built from attributes) then what
are they?

Jun 27 '08 #18

Shuurai

INSERT INTO Contacts (name) VALUES ('Bruce');
INSERT INTO Contacts (name) VALUES ('Bruce'); -- opps, hit the submit
key!
INSERT INTO Contacts (name) VALUES ('Bruce'); -- hit the submit key,
again!
INSERT INTO Contacts (name) VALUES ('Bruce'); -- hit the submit key,
because I had to be sure!

He will have to clean up the mess by hand. *That is fine, but let's
not pretend that this is RDBMS or that it can scale or that it has
strong data quality.

There are a number of ways to avoid this issue; for example, you can
clear the form upon a successful submit. Press submit twice - you get
an error asking you to fill in the form. Again this comes down to a
lack of industry experience on your part.

Validation, verification and trust seem to be very strange concepts to
you, so I would assume that your work experience has pretty much been
at the code level and never got into DB design, data quality, data
stewardship, MDM, etc.

As as is often the case, your assumption would be wrong.

As for verification, you have no problem advocating email address and/
or phone number as a key to identify a PERSON, despite the fact that
many people maintain multiple email addresses, or the fact that phone
numbers can be reassigned. If there are ten people named Bruce Smith
working at Company XYZ in the Purchasing Dept, and they all use
pu******@XYZ.com, and all use 555-555-1234 as their phone number -
what is the key?

Even if we go with your rolodex idea, how exactly are you
differentiating these people? Your solution seems to be "just wing
it" - but what if there are 10,000+ contacts?

What overhead, specifically? *Either way you're making sure that Mr. Smith has the same identifier throughout your system. The only real difference here is that we're not asking Mr. Smith to provide it himself. <<

Cleaning up redundant duplicate data by hand. *Manually maintaining
"fake pointer chains". *Removing orphan rows.
Converting from the local exposed locators to external idustry
standards. *Porting data and code to another RDBMS.

This is the "code monkey" view of the world -- only the local server
exists, so I can generate my own local identifiers. *Unfortunately,
that does not work even inside the same company. *Ever look at a POS
receipt? *It has the store id, the POS unit id, and a finally a
sequence number from that unit. *Why? *Because the stores are not on
just one server with a locally generate auto-increment number.

Ever notice that recipts ALSO generally contain a Customer Number?
Where exactly do they get that Customer Number? I have one from a
local bookstore that says I am customer number 23234... I don't
recall ever seeing that number on my birth certificate, social
security card, drivers license, or any other official document. Seems
to me that number must be a surrogate key. Something their system
came up with to uniquely identify me as a customer.

Can they link that customer number to some other database? Probably
not. Do they need to? Probably not.

Which is a moot point if we're stipulating that the external identifiers don't exist. <<

That is also a reason this is not RDBMS problem. *Identifiers are
attributes of the entity, so they are not external by their nature.
The verification is external to the database.

If we have Bruce Smith with a contact_id of 999, all we're saying is
that Bruce Smith is the 999th contact we entered into the database.
We have uniquely identified this Bruce Smith. What would you like to
verify?

Holy crap... *you don't want to deal with the "overhead" associated with using an IDENTITY, but you want him to write fuzzy logic and/or buy a social network application to "figure out" which person he's looking at? <<

No, I want him to keep this contact list on a "Roladex" or something
that allows for him to do all the work by hand while the problem is
small, and risk is minimal (calling the wrong Bruce is not a
disaster). *

And if he has 1,000 contacts? 10,000 contacts? You actually think
keeping track of people - assuming the possibility of duplicate names,
email, phone, etc - is going to be *less* overhead? And how exactly
is calling the wrong Bruce Smith more of a disaster if we found his
name in a table, versus finding his name on a rolodex?

When the problem gets too big for humans, such as doing
intelligence work, and the data is fuzzy, then move on to special
tools (bombing the wrong town is a disaster).

Except in this case we're simply storing vendor contact information,
not planning bombing runs.

Jun 27 '08 #19

Erland Sommarskog

--CELKO-- (jc*******@earthlink.net) writes:

Again, with the magical universal "person" rather than the role they
play in the data model! The only reasonable answer for a person has
to be DNA, then.

Companies do not have DNA.

It can be validated and verified.

Yeah, you can validate and verify whatever you want, but the primary
key in a relational database must be unique and must no be nullable.

Being longer does not make something more error-prone. For example,
Compuserve uses an <adjective><nounphrase for passwords. You are
less apt to mess up typing "purple bagpipes" than a 15-digit number.
And they are not easy to guess because of the large domain space.

From what I have learnt, passwords that consists of letters a-z and digits
only, are easily cracked. Whatever passwords have to do with this.
Passwords are not keys. And the discussion is about keys. You suggested
that an e-mail address were less likely to be screwed up than an a
number. Now was that jo******@somewhere.net, jo*******@somewhere.net or
was it .com? Didn't he spell it celco?

Numbers are less error-prone in that all digits are or more or less neutral,
and you don't have to worry abour dots, captials and that.

In fact, because numeric strings are so easy to mess up, we add check
digits to them.

And that is another advantage with numbers. Try adding check-digits to mail
addresses!

>> And even if an application may use e-mail address or whatever as the

exposed key, you can be fairly sure that they also have a surrogate that
they do not expose. That makes things a little easier when a customer
changes his e-mail address. <<

Why not just cascade the change?

This is perfect question to illustrate that you are a classroom expert.

You get an account with gmail, and inform your stock broker that from
now on you address is now jc****@gmail.com. Off and on through the years
you have been day-trading, and the broker has over 10000 contract notes
stored for you, there are even more transactions, and there are references
to your account id in historic positions and lots of places. Maybe fifty
tables. And every day there are some 100 customers reporting new e-mail
addresses. Guess how good all those cascading updates are for the system.
It's particular fun where the account id is part of the primary key, and
there are triggers that look at both the inserted and deleted tables. Since
there is no way to correlate the rows when the PK is updated, the update has
to be done row by row in those table.

To add insult to injury, there is a business problem as well. Looking at
a contract note from 1998, how could jc****@gmail.com been involved in
that deal? There wasn't any Gmail then!

>>None of them are physical locators, and you know that very well. Since
you blatantly repeat this lie which is very easily unmasked by the SQL
Server community, why should believe anything else you say?

Okay, tell me what attribute of the entity they model? Tell me how to
verify them in the real world without looking at one particular piece
of software and hardware?

You don't. And if you think they should, then you will have to answer
that question again: what is the natural key for a person?

Why are they different from a pointer or a link

Just like they are different from apples or oranges. That is, they
don't have anything in common with pointer or links. It's a machine-
generated identification in lieu of something better. Whether the
programmer drew a number from his own pool, or used a feature with the
product is irrelevant.

If they are attributes, why don't they have the same
value for the same entity everywhere

So that Big Brother can not come and get you!

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx

Jun 27 '08 #20

--CELKO--

>There are a number of ways to avoid this issue; for example, you can clear the form upon a successful submit. Press submit twice - you get an error asking you to fill in the form. Again this comes down to a lack of industry experience on your part. <<

Forms? There are no forms in SQL. You are looking for a **kludge**
in the front end, which can be circumvented in the database or with
another front end. If you were an SQL programmer, you would be
looking a **solution** in the Database that would protect the data
from any screw ups, present and future.

Hey, there is nothing wrong with being a careful front end or
applications programmer! I even applaud the forms designer who is
watching out for this kind of thing. In an on-line web app, it is
often faster to check some of the data before it gets to the database
-- date formats and ranges, credit card numbers, phone numbers, etc.
come to mind.

This comes down to a lack of data and database experience on your
part; you have never seen "the big picture" and had to worry about the
enterprise level.

>As for verification, you have no problem advocating email address and/or phone number as a key to identify a PERSON, despite the fact that many people maintain multiple email addresses, or the fact that phone numbers can be reassigned. If there are ten people named Bruce Smith working at Company XYZ in the Purchasing Dept, and they all use purch...@XYZ.com, and all use 555-555-1234 as their phone number - what is the key? <<

What is the role they play in my data model? From this description, I
don't care about the people, just that I am dealing with the
Purchasing Dept of Company XYZ (a corporate entity). Have you ever
called a help desk? The names are false; the important things is the
help ticket number.

>Ever notice that receipts ALSO generally contain a Customer Number? Where exactly do they get that Customer Number? I have one from a local bookstore that says I am customer number 23234... that number must be a surrogate key. Something their system came up with to uniquely identify me as a customer. Can they link that customer number to some other database? Probably not. Do they need to? Probably not. <<

Agreed. Having owned two small used bookstores, we never bothered to
identify a customer inside the accounting system; we were a cash
business who never expected. That was 30 years ago; today, we would
have gotten an email address so we could do mailings. If you deal
with a larger company, then the customer ids will be DUNS or other
industry standards.

>If we have Bruce Smith with a contact_id of 999, all we're saying is that Bruce Smith is the 999th contact we entered into the database. We have uniquely identified this Bruce Smith. What would you like to verify? <<

I want to be able to actually find Bruce Smith in the real world.
What 999 gives me is a locator for a record in physical storage; it
does not uniquely identify Mr. Smith at all. Now if I can get his
credit card number, driver's license number, or even a description of
our last contact then I can verify that 999 is that particular Mr.
Smith.

>And if he has 1,000 contacts? 10,000 contacts? You actually think keeping track of people - assuming the possibility of duplicate names, email, phone, etc - is going to be *less* overhead? And how exactly is calling the wrong Bruce Smith more of a disaster if we found his name in a table, versus finding his name on a Rolodex? <<

If you want to implement a Roladex on a computer,that is fine. But it
is not an RDBMS, and does not scale well at all. Perhaps this is why
identity management is becoming a hot topic in CRM.

Jun 27 '08 #21

Shuurai

On Jun 2, 6:30*pm, --CELKO-- <jcelko...@earthlink.netwrote:

There are a number of ways to avoid this issue; for example, you can clear the form upon a successful submit. *Press submit twice - you get an error asking you to fill in the form. *Again this comes down to a lack of industry experience on your part. <<

Forms? *There are no forms in SQL. *

Where else is the end-user pressing the Submit button multiple times
in your example?

You are looking for a **kludge**
in the front end, which can be circumvented in the database or with
another front end. If you were an SQL programmer, you would be
looking a **solution** in the Database that would protect the data
from any screw ups, present and future.

In this case we have already established that multiple persons in our
database can have the same name, same company, same contact
information. How are you going to deal with this on the database
side? Bear in mind, you're not allowed to use or even ask for SSN,
drivers license, or anything along those lines.

And no, simply saying "use a rolodex" isn't an answer.

Hey, there is nothing wrong with being a careful front end or
applications programmer! *I even applaud the forms designer who is
watching out for this kind of thing. *In an on-line web app, it is
often faster to check some of the data before it gets to the database
-- date formats and ranges, credit card numbers, phone numbers, etc.
come to mind.

LOL! You applaud careful application development, unless it happens
to conflict with one of your examples, in which case it is kludge.
Aside from being outdated and arrogant, you're also a hypocrite.

Jun 27 '08 #22

Erland Sommarskog

--CELKO-- (jc*******@earthlink.net) writes:

No, the email is enough and has some nice hashing properties for
access.

And in their giant data warehouse they really want to cascading updates
when people change their e-mail addresses? Seems very unlikely to me. Or
just foolish.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx

Jun 27 '08 #23

jhofmeyr

<quote>
No, Option #2 is the almost universal choice. If you use #1, you
lose
your shopping cart, your recommendations list, targeted e-mails,
notification about advance orders, etc.
</quote>

Hmm .. I must be in the minority. But then I only use Amazon to buy
specific items that I already know that I want/need. I don't care
about shopping carts and recommendations and I usually can't remember
the password (or email address) that I used the last time I bought
something there. Basically if the site doesn't still "remember" me
from my last visit, I'll have 1 or 2 attempts at logging in, and then
just create a new account :)

<quote>
And in their giant data warehouse they really want to cascading
updates
when people change their e-mail addresses? Seems very unlikely to me.
Or
just foolish.
</quote>

Indeed ...

<quote>
No not for a simple contact, but there are contacts I have much more
information on. Particularly those which are agents of mine. I.e.
they
behave as contacts and potentially as staff members.
</quote>

Correct me if I'm wrong, but this sounds a lot like the Roles model
that Celko has been banging on about ...

<quote>
Any of this information can be used to match for the contact, and is
therefore useful in some class of indexing structure. But none of it
is classifiable as 'non null, unique' and therefore not succeptible
to
becoming a primary key.
</quote>

Exactly - which is Celko's reasoning behind saying that these things
should not be thought of as regular RDBMS entities. That is not to
say that you shouldn't store them in a database where you can make use
of the DB engine to assist in querying the data. How do you currently
match your Mr. Smith from XYZ.com's to Mr. Smith from ABC.net's?
Chances are, someone needs to dig around the data a bit, probably
narrowing down a list (I'm sure I've seen that somewhere in this
thread ...), and once they find the correct Mr. Smith they update his
details. Or do they just say "Hi there Mr Smith ... you don't happen
to know the arbitrary number that I assigned to you when I first met
you, do you?"

Again - I am not saying that surrogate keys are "bad" or "stupid".
All I'm saying is that these issues are so common that maybe the "real
world"'ers haven't thought them through for a while. Disregarding
everything Celko says out-of-hand because it's not always practical,
while tempting is probably not a good idea.

J

Jun 27 '08 #24

--CELKO--

>Hmm .. I must be in the minority. But then I only use Amazon to buy specific items that I already know that I want/need. I don't care about shopping carts and recommendations and I usually can't remember the password (or email address) that I used the last time I bought something there. Basically if the site doesn't still "remember" me from my last visit, I'll have 1 or 2 attempts at logging in, and then just create a new account :) <<

That is the minority for Amazon. They have really great repeat
business, which is why they added all those customer services. The
demographics on internet shopping are really interesting.

1) People tend to keep one email address these days. More Americans
are getting internet, telephone and TV from one providers in package
deals, so they don't switch like they used.

2) People tend to pick one password and use it all over the place. Or
worse, hit the "Remember me on this computer" button every time. Yes,
this is bad security, but we do it.

3) Even sites that you visit on an irregular basis have some kind of
history on you. It is cheap to do and pays off. In particular,
clothing or other seasonal businesses can run a clearance sale via
email on quantities of goods that would be too small to justify a
printed sales flyer. Overstock.com is the extreme example of this
model.

Jun 27 '08 #25

Iain Sharp

On Wed, 4 Jun 2008 03:04:03 -0700 (PDT), jh******@googlemail.com
wrote:

>
<quote>
No not for a simple contact, but there are contacts I have much more
information on. Particularly those which are agents of mine. I.e.
they
behave as contacts and potentially as staff members.
</quote>

Correct me if I'm wrong, but this sounds a lot like the Roles model
that Celko has been banging on about ...

So the data should be stored in two unrelated entities? If not, how do
you relate the two roles entities in a simple manner?
I currently have 3 entities all for the same physical thing, person,
staff and enduser and many, one to many relationships between them and
other entities. Several many to many relationships as well.

So I need something easy to include, incredibly unlikely to change and
(to satisfy my old programming instincts) small to be stored
repeatedly.

I choose to use an integer.

Jun 27 '08 #26

Similar topics