Surrogate Keys Considered Harmful

One of the issues that comes up in database design is whether to use part of the data for each record as a primary key on the data, or whether automatically to generate unique identifiers that have nothing to do with the data. Automatically generated primary keys are called surrogate keys because they acts as a surrogate for the data in the database. The other option is to see if there is a way to represent each record with a unique identifier that is derived from the data itself. There is a decent introduction to the matter at http://en.wikipedia.org/wiki/Surrogate_key, with a summarize of the advantages and disadvantages of each approach.

Now, I am not really going to provide a satisfying rant in the “Considered Harmful” tradition. But I am in a position to state a  clear preference for meaningful, rather than surrogate, keys. Here are a few of my reasons, coming from the perspective of a web application developer:

  1. Surrogate keys always require a join whenever I want a piece of meaningful information. But in many instances, I only need just one piece of information from the record. If this information were the primary key, no join would be required. Joins always require resources and should be avoided, wherever possible, in a web application. Surrogate keys slow things down. They limit speed.
  2. Surrogate keys make it difficult for human beings quickly and easily to understand the data in a join table, requiring the construction of additional views. For example, if you have a people table and a books table, and a join table (people_books, say) indicating which people have which books, surrogate keys make the join table meaningless to a human reader. Whereas using meaningful keys makes it very easy to see at a glance which people have which books. Surrogate keys make things hard to understand. They limit discoverability.
  3. One of the arguments in favor of surrogate keys is that it means that changes to the data or the model do not have to result in a difficult-to-implement change to the meaningful key. For example, many systems do not allow cascading updates to fields that are referenced by foreign keys. That’s not a problem for people who use an enterprise-class database like PostgreSQL. In cases when the actual schema of the data changes, then it is not too hard to figure out how to transform the data to match the new schema. Surrogate keys attempt to solve a problem that doesn’t really amount to much. They have limited utility.

I’ll stop there. Perhaps later I’ll provide a couple of examples of situations in which we started with surrogate keys but ended up ignoring them and creating meaningful, primary keys that we used throughout the application.

In the meantime, what are your thoughts about surrogate vs. primary keys?

Advertisements
  1. #1 by phillip on March 8, 2009 - 7:12 pm

    This is actually the first time I’ve heard the term “surrogate keys” so maybe I’ve no right to comment. In working with software and some database work and working with people it seems to make sense to use keys that have some meaning whenever possible. The more steps you add to any process the more points of failure you have, so you’d want to keep the steps down to the number really necessary and no more. That’s my two cents.

    • #2 by Sean Harrison on March 9, 2009 - 1:36 am

      Hi, Phillip! The most common kind of “surrogate key” is an auto-generated integer, which is easy to produce. You’ve probably seen databases that have an integer id for every row of the tables. My experience is that, when you use these ids to join tables, it makes things harder, and it really isn’t that much more work to generate something meaningful to use as a key.

  2. #3 by David Daniels on March 10, 2009 - 10:05 am

    OK, sure. But, just to play Devil’s advocate, using “meaningful keys” places a restriction on those keys that they uniquely reference elements of the database, doesn’t it? In a well-constructed and limited DB that never accretes new fields, perhaps this is possible. But, it is not to hard to imagine a case where a database is changed in such a way that the meaningful keys also need to be changed. And updating all the joins, or all the join tables, after they’ve been initially constructed, could be problematic. A simple, maybe too trivial example, is Lastname_Firstname as a meaningful key to a people table. It only works until the table has two people with the same first and last names.

    Email addresses are often constructed in this way. How many corporate emails begin (when the company is small) with the “standard construction” first initial, last name @ company.com? It isn’t long before there is a jsmith2@company.com, is it? OK, that’s a dumb example, because it’s easily fixed with a little foresight, but more complex examples suffer from the same limitations.

    It isn’t sufficient to have a smart database designer, either. Because, especially if the database proves useful, people (ok, non-IT managers, let’s be honest here) will want frequent and seemingly-arbitrary changes made to the database, that will inevitably mess up the schema used to create the meaningful key. (“Let’s put my pets’ names in there.”)

    Now, I get your point about speed, and I get your point about meaningful join tables, and they’re good points. (I once worked with a company’s customer order database, in which they wanted to know the number of customers they had. The only way to identify unique customers was by name – it was effectively the meaningful key. When we sorted by unique customer name, we counted 120 different permutations of IBM: “IBM”, “Ibm”, “I B M”, “I.B.M.”, “International Business Machines”, “IBM – Brazil”, etc., etc., etc. Using a surrogate key in place of customer name would not have yielded such insight.) But, you don’t _need_ meaningful keys in order to find such insight. (In the example above, the customer table _did_ have surrogate keys, we just ignored them.) I think the only real reason to require meaningful keys is when speed is a priority.

    At least, that’s what the Devil’s advocate would say.

    • #4 by Sean Harrison on March 10, 2009 - 8:28 pm

      Hi, David! Thanks for the good interaction.

      When I started at the company I work for (Tyndale House Publishers in Carol Stream, IL), email addresses were initials@…. My address is (still) that way, but they’ve changed schemes a couple of times since then (currently it’s FirstnameLastname@), so now there’s a certain cachet in having an initials@ address. 🙂

      I work primarily in web-based applications, so speed is one of the most important factors.

      Good point about changing schemas and cascading updates. Not all RDBMSs can handle cascading updates, and even for those that can, schema changes can still be a problem. But in my experience, it just takes a little longer to figure out what the new “unique” info is going to be, and then to construct it from the existing data. If you’re in an environment in which schemas are changing a _lot_, that might slow things down a bit. Otherwise, I would think that having to think about creating a new unique key would force the db designer to keep the db design “sensible.”

      In fact, that might be an argument against using surrogate keys, even in an environment of fluctuation. Would surrogate keys encourage DBAs to choose expediency over good design, resulting in a database schema that becomes more and more Baroque? I don’t really know, but it’s a hypothesis I’d be interested in exploring. Do you (or others reading) have any experience with this kind of situation?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: