The choice of primary key is largely a matter of convenience and what is easiest to use. We’ll
discuss primary keys later in this chapter in the context of relationships. The important thing to
remember is that when you have values that should exist only once in the database, you need to
protect against duplicates.
Choosing Keys
While keys can consist of any number of columns, it is best to try to limit the number of columns in
a key as much as possible. For example, you may have a
Book table with the columns
Publisher_Name, Publisher_City, ISBN_Number, Book_Name, and Edition. From these attributes, the
following three keys might be defined:
•
Publisher_Name, Book_Name, Edition: A publisher will likely publish more than one book.
Also, it is safe to assume that book names are not unique across all books. However, it is
probably true that the same publisher will not publish two books with the same title and the
same edition (at least, we assume that this is true!).
•
ISBN_Number: The ISBN number is the unique identification number assigned to a book when
it is published.
•
Publisher_City, ISBN_Number: Because ISBN_Number is unique, it follows that Publisher_City
and ISBN_Number combined is also unique.
The choice of (
Publisher_Name, Book_Name) as a composite candidate key seems valid, but the
(
Publisher_City, ISBN_Number) key requires more thought. The implication of this key is that in
every city,
ISBN_Number can be used again, a conclusion that is obviously not appropriate. This is a
common problem with composite keys, which are often not thought out properly. In this case, you
might choose
ISBN_Number as the PK and (Publisher_Name,
Book_Name
al keys
ar
e as follo
ws:
•
F
or people
: D
r
iv
er
’
s license numbers (including the state of issue), company identification
number, or other assigned IDs (e.g., customer numbers or employee numbers).
•
For transactional documents (e.g., invoices, bills, and computer-generated notices): These usu-
ally have some sort of number assigned when they are printed.
•
For products for sale: These could be product numbers (product names are likely not unique).
CHAPTER 1
■
INTRODUCTION TO DATABASE CONCEPTS18
8662Ch01.qxp 7/28/08 3:37 PM Page 18
• For companies that clients deal with: These are commonly assigned a customer/client number
for tracking.
•
For buildings: This is usually the complete address, including the postal code.
•
F
or mail
these people has hit the big time, but watch out—it could be happening soon!)
■
Tip
We tend to think of names in most systems as a kind of semiunique natural key. This isn’t good enough for
identifying a single row, but it’s great for a human to find a value. The phone book is a good example of this. Say
you need to find Ray Janakowski in the phone book. There might be more than one person with this name, but it
might be a “good enough” way to look up a person’s phone number. This semiuniqueness is a very interesting
attribute of a table and should be documented for later use, but only in rare cases would you use the semiunique
values and make a key from them using a uniquifier.
Smart Keys
A commonly occurring type of natural key in computer systems is a smart or intelligent key. Some
identifiers will have additional information embedded in them, often as an easy way to build a
unique value for helping a human identify some real-world thing. In most cases, the smart key can
be disassembled into its par
ts
. I
n some cases
, ho
wever, the data will probably not jump out at you.
Take the following example of the fictitious product serial number XJV102329392000123:
•
X: Type of product (LCD television)
•
JV: S
ubtype of pr
oduct (32-inch console)
CHAPTER 1
■
INTRODUCTION TO DATABASE CONCEPTS 19
8662Ch01.qxp 7/28/08 3:37 PM Page 19
Smart keys are useful tools to communicate a lot of information to the user in a small package. However,
all the bits of information that make up the smart key need to be identified, documented, and implemented in a
straightforward manner. Optimum SQL code expects the data to all be stored in individual columns, and as such, it
is of great importance that you needn’t ever base computing decisions on decoding the value. We will talk more
about the subject of choosing implementation keys in Chapter 5.
Surrogate Keys
Surrogate keys (sometimes described as artificial keys) are kind of the opposite of natural keys. The
word
surrogate means “
something that substitutes for
,” and in this case, a surrogate key substitutes
for a natural key. Sometimes there may not be a natural key that you think is stable or reliable
enough to use, in which case you may decide to use a surrogate key. In reality, many of our exam-
ples of natur
al keys w
ere actually surrogate keys in their original database but were elevated to a
natural status by usage in the “real” world.
A surrogate key can uniquely identify each instance of an entity, but it has no actual meaning
with regard to that entity other than to represent existence. Surrogate keys are usually maintained
by the system. Common methods for creating surrogate key values are using a monotonically
increasing number (e.g., an
Identity column), some form of hash function, or even a globally
unique identifier (GUID), which is a v
er
y long identifier that is unique on all machines in the world.
The concept of a surrogate key can be troubling to purists. Since the surrogate key does not
describe the row at all, can it really be an attribute of the row? Nevertheless, an exceptionally nice
aspect of a surr
ogate key is that the value of the key should never change. This, coupled with the fact
that surrogate keys are always a single column, makes several aspects of implementation far easier.
two surrogate keys in a table: one that is the unchanging “address” of a value, the other that is built for user con-
sumption (that is compact, readable, and changeable if it somehow offends your user).
Just as the driver’s license number probably has no meaning to the police officer other than a
means to quickly call up and check your records, the surrogate is used to make working with the
data programmatically easier. Since the source of the value for the surrogate key does not have any
correspondence to something a user might care about, once a value has been associated with a row,
there is not ever a reason to change the value. This is an exceptionally nice aspect of surrogate keys.
The fact that the value of the key does not change, coupled with the fact that it is always a single col-
umn, makes several aspects of implementation far easier. This will be made clearer later in the book
when choosing a pr
imary key.
Thinking back to the driver’s license analogy, if the driver’s license card has just a single value
(the surr
ogate key) on it, how would Officer Uberter Sloudoun determine whether you were actually
the person identified? He couldn’t, so there are other attributes listed, such as name, birth date, and
usually your pictur
e
, which is an excellent unique key for a human to deal with (except possibly for
identical twins, of course). In this very same way, a table ought to have other keys defined as well, or
it is not a proper table.
C
onsider the earlier example of a pr
oduct identifier consisting of seven parts:
•
X:
T
ype of product (LCD television)
•
JV: Subtype of product (32-inch console)
•
a surrogate key on the table that has a value of 3384038483. If the only key defined on the rows is the
surrogate, the following situation might occur:
SurrogateKey ProductSerialNumber
–––––––––––– –––––––––––––––––––
10 XJV102329392000123
3384038483 XJV102329392000123
3384434222 ZJV104329382043534
The first two rows are not duplicates, but since the surrogate key values have no real meaning,
in essence these are duplicate rows, since the user could not effectively tell them apart.
This sort of problem is common, because most people using surrogate keys do not understand
that only having a surrogate key opens them up to having rows with duplicate data in the columns
where the data has some logical relationship to each other. A user looking at the preceding table would
have no clue which row actually represented the product he or she was after, or if both rows did.
■
Note
When doing logical design, I tend to model each table with a surrogate key, since during the design
process I may not yet know what the final keys will in fact turn out to be. This approach will become obvious
throughout the book, especially in the case study presented throughout much of the book.
Missing Values (NULLs)
If you look up the definition of a “loaded subject” in a computer dictionary, you will likely find the
word
NULL. In the database, there must exist some way to say that the value of a given column is not
known or that the value is irrelevant. Often, a value outside of legitimate actual range (sometimes
referred to as a
sentinel value) is used to denote this value. For decades, programmers have used
ancient dates in a date column to indicate that a certain value does not matter, they use a negative
value where it does not make sense in the context of a column, or they simply use a text string of
'UNKNOWN' or 'N/A'. These approaches are fine, but special coding is required to deal with these val-
ues, for example:
IF (value<>'UNKNOWN') THEN ...
• Logical comparisons can get tricky when
NULL is introduced.
CHAPTER 1
■
INTRODUCTION TO DATABASE CONCEPTS22
8662Ch01.qxp 7/28/08 3:37 PM Page 22