Monday, April 30, 2012

Denormalizing the real world to normalize your database

When writing database schemas, we sometimes envision a 1-to-1 correlation with what we are trying to model from the real world and assume it must be a good design because it seems to fit so perfectly. This is not always, and perhaps not even often, the case.

Double Entry Ledger
What do I mean by this? Let's use an accounting ledger as an example. It looks like a nice clean table on paper: columns and rows that translate nicely to a database table. There is also a well-defined process for making entries called double entry bookkeeping. You insert one entry for the debit and one for the credit. They must both be entered together in one transaction. They even call these transactions in accounting! Seems pretty cut and dried, right?


 tx_id |            tx_tsz             | acct_id | debit | credit | tx_type |    ref     |       dscr       
-------+-------------------------------+---------+-------+--------+---------+------------+-------------------
     1 | 2012-04-30 17:25:58.003826-04 |       1 |   100 |        |       1 | order_id 1 | Sale to Alice
     1 | 2012-04-30 17:25:58.003826-04 |       2 |       |    100 |       1 | order_id 1 | Sale to Alice
     2 | 2012-04-30 17:31:58.953455-04 |       2 |   100 |        |       2 | order_id 1 | Refund from Alice
     2 | 2012-04-30 17:31:58.953455-04 |       1 |       |    100 |       2 | order_id 1 | Refund from Alice
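
For reference, a straight translation of that paper ledger into Postgres might look roughly like the sketch below. The table name (ledger_entries) and the column types are just my assumptions for illustration; the account IDs would normally be foreign keys to an accounts table.

CREATE TABLE ledger_entries (
    tx_id   bigint NOT NULL,        -- shared by the debit row and the credit row
    tx_tsz  timestamptz NOT NULL DEFAULT now(),
    acct_id bigint NOT NULL,        -- would normally reference an accounts table
    debit   numeric(12,2),
    credit  numeric(12,2),
    tx_type integer NOT NULL,
    ref     text,
    dscr    text,
    -- exactly one of debit or credit is filled in on each row
    CHECK ((debit IS NULL) <> (credit IS NULL))
);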

The problem is that this whole system was designed for a human to enter the data manually using pen or pencil on paper. It was never meant to scale. One of its main benefits is error checking for error-prone humans. You make separate entries for both debits and credits. Then you add both columns up. They should be identical. If they aren't, you made a mistake in the entry, or in the summing. This is how you keep your books "balanced."
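
In SQL that balancing check boils down to a single query. Against the hypothetical ledger_entries table sketched above, it might look like this:

-- The books "balance" when total debits equal total credits.
-- sum() skips NULLs, so each column is totalled independently.
SELECT sum(debit)  AS total_debits,
       sum(credit) AS total_credits,
       sum(debit) = sum(credit) AS balanced
  FROM ledger_entries;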

Computers, on the other hand, don't make mistakes when programmed properly, and when programmed improperly they often make the same mistake repeatedly. So if the program made an error on the debit entry, it likely made the same error on the credit entry. And if it made an error summing the debits, it most likely made the same error summing the credits. If you have an actual human making ledger entries into a computer and you want them to use double entry, you should probably be doing the error checking in your interface instead. As for cosmic-ray data corruption, there are usually better methods for error checking than this.

Implementing strict double entry accounting in a database limits the benefits and introduces new problems, like requiring twice as many rows in a table. You are also "linking" the rows by a "unique" transaction ID. Except it's not unique, since you are using it in two rows. This means you can't put a unique constraint on it, and thus you are limiting the benefits of putting it in a database in the first place. It's starting to sound like the worst of both worlds.
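
To make that concrete, the constraint you would want is exactly the one the two-row layout forbids. The best you can do is a composite constraint, which is a much weaker guarantee (again using the hypothetical ledger_entries table from above):

-- What you'd like, but it fails as soon as the second row of any
-- transaction is inserted:
--   ALTER TABLE ledger_entries ADD CONSTRAINT ledger_entries_tx_id_key
--       UNIQUE (tx_id);

-- The best you can do: uniqueness per (transaction, account), which no
-- longer guarantees exactly one debit row and one credit row per tx_id.
ALTER TABLE ledger_entries ADD CONSTRAINT ledger_entries_tx_acct_key
    UNIQUE (tx_id, acct_id);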

You might even have had accountants tell you that you should store it this way in the name of best practices. What they mean is that you should be able to retrieve it back out of the system in that format. In their world, presentation and implementation are probably the same thing. I am sure your accountant is fine with (if they are even aware of) the fact that all the data is stored in binary and spread across the disk based on the inode allocation heuristics of your file system. They also aren't complaining if you are compressing the data. Why, then, would they care whether you normalize it? All they really care about is that you don't lose any information and that you can present it to them in a format they are comfortable with. You could implement that with a view or perhaps just in your reporting package.

Here's what I would propose:
 tx_id |            tx_tsz             | debit_acct_id | credit_acct_id | amt | tx_type
-------+-------------------------------+---------------+----------------+-----+---------
     1 | 2012-04-30 17:25:58.003826-04 |             1 |              2 | 100 |       1
     2 | 2012-04-30 17:31:58.953455-04 |             2 |              1 | 100 |       2

This is the minimal amount of information you would need. It would be trivial to create a view based on this table that mimics double entry accounting. You still have a debit and a credit account, which keeps the feature of double entry accounting that you can track where the money came from and where it is going.

You may also notice that 'ref' and 'dscr' are gone. I would replace 'ref' with a real relational table: one that is basically two foreign keys, to the ledger and to the other table, in this case an orders table. I would also argue that 'dscr' (short for description) would be better served as a separate table with a foreign key as well. Most times I see this column it contains redundant data. In the few cases when it doesn't, for example when it contains a short note from someone who did an override, I think it is best to keep that external to the ledger. This way you don't have a lot of useless information or NULLs. If you have a good use case for a column of that type, by all means add it.

Also, I used integers for tx_id and the *_acct_id columns. You may want to use UUIDs for one or all of these. If you do, be sure to use Postgres' awesome UUID type.
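
Here is a rough sketch of how all of that could hang together in Postgres. The names (ledger, ledger_orders, ledger_notes, the double_entry view) and the column types are my own assumptions, and the bare orders table is only a stand-in for whatever you already have:

CREATE TABLE orders (
    order_id bigserial PRIMARY KEY          -- stand-in for your real orders table
);

CREATE TABLE ledger (
    tx_id          bigserial PRIMARY KEY,   -- or a UUID if you prefer
    tx_tsz         timestamptz NOT NULL DEFAULT now(),
    debit_acct_id  bigint NOT NULL,         -- would reference an accounts table
    credit_acct_id bigint NOT NULL,
    amt            numeric(12,2) NOT NULL CHECK (amt > 0),
    tx_type        integer NOT NULL,
    CHECK (debit_acct_id <> credit_acct_id)
);

-- 'ref' as a real relational table: just two foreign keys.
CREATE TABLE ledger_orders (
    tx_id    bigint NOT NULL REFERENCES ledger (tx_id),
    order_id bigint NOT NULL REFERENCES orders (order_id),
    PRIMARY KEY (tx_id, order_id)
);

-- 'dscr' only for the rare transactions that actually need a note.
CREATE TABLE ledger_notes (
    tx_id bigint PRIMARY KEY REFERENCES ledger (tx_id),
    note  text NOT NULL
);

-- A view that presents the classic double entry format for reporting.
CREATE VIEW double_entry AS
    SELECT tx_id, tx_tsz, debit_acct_id AS acct_id,
           amt AS debit, NULL::numeric AS credit, tx_type
      FROM ledger
    UNION ALL
    SELECT tx_id, tx_tsz, credit_acct_id AS acct_id,
           NULL::numeric AS debit, amt AS credit, tx_type
      FROM ledger;
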

For the record I have seen a denormalized ledger in the wild several times. That's why I decided it was worth blogging about. I don't think people are dumb if they have implemented it that way. It's very easy to find something in another discipline to use as a model for something in a database, but sometimes it's just sub-optimal and will cause you problems down the road when your ledger is 2-4 times as large as your next largest table and it's becoming unwieldy. It's a pretty important table to a business so I just want to see people not fall into the easy to avoid traps.