One of the issues that comes up in database design is whether to use part of the data for each record as a primary key on the data, or whether automatically to generate unique identifiers that have nothing to do with the data. Automatically generated primary keys are called surrogate keys because they acts as a surrogate for the data in the database. The other option is to see if there is a way to represent each record with a unique identifier that is derived from the data itself. There is a decent introduction to the matter at http://en.wikipedia.org/wiki/Surrogate_key, with a summarize of the advantages and disadvantages of each approach.
Now, I am not really going to provide a satisfying rant in the “Considered Harmful” tradition. But I am in a position to state a clear preference for meaningful, rather than surrogate, keys. Here are a few of my reasons, coming from the perspective of a web application developer:
- Surrogate keys always require a join whenever I want a piece of meaningful information. But in many instances, I only need just one piece of information from the record. If this information were the primary key, no join would be required. Joins always require resources and should be avoided, wherever possible, in a web application. Surrogate keys slow things down. They limit speed.
- Surrogate keys make it difficult for human beings quickly and easily to understand the data in a join table, requiring the construction of additional views. For example, if you have a people table and a books table, and a join table (people_books, say) indicating which people have which books, surrogate keys make the join table meaningless to a human reader. Whereas using meaningful keys makes it very easy to see at a glance which people have which books. Surrogate keys make things hard to understand. They limit discoverability.
- One of the arguments in favor of surrogate keys is that it means that changes to the data or the model do not have to result in a difficult-to-implement change to the meaningful key. For example, many systems do not allow cascading updates to fields that are referenced by foreign keys. That’s not a problem for people who use an enterprise-class database like PostgreSQL. In cases when the actual schema of the data changes, then it is not too hard to figure out how to transform the data to match the new schema. Surrogate keys attempt to solve a problem that doesn’t really amount to much. They have limited utility.
I’ll stop there. Perhaps later I’ll provide a couple of examples of situations in which we started with surrogate keys but ended up ignoring them and creating meaningful, primary keys that we used throughout the application.
In the meantime, what are your thoughts about surrogate vs. primary keys?