Monday, April 30, 2012

Denormalizing the real world to normalize your database

When writing database schemas, we sometimes envision a 1-to-1 correlation with whatever we are trying to model from the real world and assume it must be a good design because it seems to fit so perfectly. This is not always, and perhaps not even often, the case.

Double Entry Ledger
What do I mean by this? Let's use an accounting ledger as an example. It looks like a nice clean table on paper: columns and rows that translate nicely to a database table. There is also a well-defined process for making entries called double-entry bookkeeping. You insert one entry for the debit and one for the credit, and they must both be entered together in one transaction. They even call these transactions in accounting! Seems pretty cut and dried, right?


 tx_id |            tx_tsz             | acct_id | debit | credit | tx_type |    ref     |       dscr       
-------+-------------------------------+---------+-------+--------+---------+------------+-------------------
     1 | 2012-04-30 17:25:58.003826-04 |       1 |   100 |        |       1 | order_id 1 | Sale to Alice
     1 | 2012-04-30 17:25:58.003826-04 |       2 |       |    100 |       1 | order_id 1 | Sale to Alice
     2 | 2012-04-30 17:31:58.953455-04 |       2 |   100 |        |       2 | order_id 1 | Refund from Alice
     2 | 2012-04-30 17:31:58.953455-04 |       1 |       |    100 |       2 | order_id 1 | Refund from Alice
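
For concreteness, here is roughly what that table looks like as Postgres DDL. Consider it a sketch; the types and the CHECK constraint are my assumptions based on the sample rows above.

CREATE TABLE ledger (
    tx_id   bigint        NOT NULL,  -- shared by the debit row and the credit row
    tx_tsz  timestamptz   NOT NULL DEFAULT now(),
    acct_id bigint        NOT NULL,
    debit   numeric(12,2),           -- types are guesses from the sample output
    credit  numeric(12,2),
    tx_type integer       NOT NULL,
    ref     text,
    dscr    text,
    -- each row carries exactly one side of the entry
    CHECK ((debit IS NULL) <> (credit IS NULL))
);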

The problem is that this whole system was designed for a human entering data manually with pen or pencil on paper. It was never meant to scale. One of its main benefits is error checking for error-prone humans: you make separate entries for both debits and credits, then you add both columns up. The sums should be identical. If they aren't, you made a mistake in an entry or in the summing. This is how you keep your books "balanced."
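
In SQL terms, that paper-era balance check is just a comparison of two sums, something like:

-- the books "balance" when the two totals are equal
SELECT sum(debit)  AS total_debits,
       sum(credit) AS total_credits
FROM ledger;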

Computers, on the other hand, don't make mistakes when programmed properly, and when programmed improperly they often make the same mistake repeatedly. If the program made an error on the debit entry, it likely made the same error on the credit entry; if it made an error summing the debits, it most likely made the same error summing the credits. If you have an actual human making ledger entries into a computer and you want them to use double entry, you should probably be doing that error checking in your interface instead. And for cosmic-ray data corruption, there are usually better error-checking methods than this.

Implementing strict double-entry accounting in a database limits those benefits and introduces new problems, like requiring twice as many rows in a table. You are also "linking" the rows by a "unique" transaction ID, except it's not unique, since you are using it in two rows. That means you can't put a unique constraint on it, and thus you are limiting the benefits of putting it in a database in the first place. It's starting to sound like the worst of both worlds.
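
To make that concrete, the obvious constraint is simply unavailable once two rows share each transaction ID:

-- rejected as soon as any transaction has both of its rows in the table
ALTER TABLE ledger ADD CONSTRAINT ledger_tx_id_key UNIQUE (tx_id);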

You might even have had accountants tell you that you should store it this way in the name of best practices. What they mean is that you should be able to retrieve it back out of the system in that format. In their world presentation and implementation are probably the same thing. I am sure that your accountant is fine with (if they are even aware of) the fact that all the data is stored in binary and that the data is spread across the disk based on the inode allocation heuristics of your file system. They also aren't complaining if you are compressing the data. Why would they really care if you are normalizing it then? All they really care about is that you don't lose any information and that you can present it to them in a format they are comfortable with. You could implement this with a view or perhaps just in your reporting package.

Here's what I would propose:
 tx_id |            tx_tsz             | debit_acct_id | credit_acct_id | amt | tx_type
-------+-------------------------------+---------------+----------------+-----+---------
     1 | 2012-04-30 17:25:58.003826-04 |             1 |              2 | 100 |       1
     2 | 2012-04-30 17:31:58.953455-04 |             2 |              1 | 100 |       2
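
As DDL this might look something like the following; again a sketch, where the account table and the types are my assumptions:

CREATE TABLE ledger (
    tx_id          bigserial     PRIMARY KEY,  -- one row per transaction, so finally unique
    tx_tsz         timestamptz   NOT NULL DEFAULT now(),
    debit_acct_id  bigint        NOT NULL REFERENCES account (acct_id),  -- account table assumed
    credit_acct_id bigint        NOT NULL REFERENCES account (acct_id),
    amt            numeric(12,2) NOT NULL CHECK (amt > 0),
    tx_type        integer       NOT NULL
);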

This is the minimal amount of information you need, and it would be trivial to create a view on this table that mimics double-entry accounting (see the sketch below). You still have a debit account and a credit account, which keeps the double-entry feature of tracking where the money came from and where it is going.

You may also notice that 'ref' and 'dscr' are gone. I would replace 'ref' with a real relational table: one that is basically two foreign keys, one to the ledger and one to the other table, in this case an orders table. I would also argue that 'dscr' (short for description) would be better served as a separate table with a foreign key as well. Most of the time I see this column holding redundant data. In the few cases when it's not, for example when it contains a short note from someone who did an override, I think it's best to keep that external to the ledger. This way you don't carry a lot of useless information or NULLs. If you have a good use case for a column of that type, by all means add it.

Also, I used integers for tx_id and the *_acct_ids. You may want to use UUIDs for some or all of these. If you do, be sure to use Postgres' awesome UUID type.
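
Here is one way that view and the 'ref' replacement might look; the names are mine and the orders table is assumed:

-- present the classic two-row format without storing it
CREATE VIEW ledger_double_entry AS
SELECT tx_id, tx_tsz, debit_acct_id AS acct_id,
       amt AS debit, NULL::numeric AS credit, tx_type
FROM ledger
UNION ALL
SELECT tx_id, tx_tsz, credit_acct_id AS acct_id,
       NULL::numeric AS debit, amt AS credit, tx_type
FROM ledger;

-- 'ref' as a real relational table: two foreign keys and nothing else
CREATE TABLE ledger_order (
    tx_id    bigint NOT NULL REFERENCES ledger (tx_id),
    order_id bigint NOT NULL REFERENCES orders (order_id),  -- orders table assumed
    PRIMARY KEY (tx_id, order_id)
);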

For the record, I have seen a denormalized ledger in the wild several times; that's why I decided it was worth blogging about. I don't think people are dumb for implementing it that way. It's very easy to find something in another discipline to use as a model for something in a database, but sometimes it's just sub-optimal and will cause you problems down the road, when your ledger is 2-4 times as large as your next largest table and becoming unwieldy. It's a pretty important table for a business, so I just want to see people avoid the easy-to-avoid traps.

8 comments:

  1. I think you can even avoid the two account id columns. You can separate debit and credit accounts in a view using the transaction type column.

  2. You are assuming that every entry will have just two lines with equal amounts. This is not always the case. Even a simple invoice for a $100 item with 10% tax will have 3 lines:

    Account     | Dr  | Cr
    ------------+-----+-----
    Receivables | 110 |
    Income      |     | 100
    Tax         |     |  10

    Also, most accounting systems have entry header tables with transaction information like id, date, journal, user, period, etc., and a lines table (details) for the ledger lines.

    Some systems use one column for Dr/Cr where positive values are credit and negative are debit or the opposite.

  3. I have to agree with Anonymous #2. Your scheme doesn't support split amounts, which are an important part of accounting reality. In my experience, accounting databases don't typically implement accounting entries based on what you're calling an "accounting ledger". Most accountants would call that a general journal or simply a journal. In manual systems, you first make entries into a journal (which used to be a bound book) and then you post them to account ledgers (one page for each account), which are more detailed or sophisticated versions of the famous T-accounts. The essential elements of an entry (manual or automated) are therefore: account number, entry date, amount, and a debit or credit indicator. In an automated system, since we do away with the journal, we have to tie the two or more entries together with a transaction identifier, which also becomes essential. Everything else, e.g., foreign currency code, foreign currency amount, transaction type, description, is a refinement.

    Replies
    1. I believe I goofed on some of the terminology. I apologize. I am thinking of specific implementations that I have worked with in the past, and they used just two entries per transaction. I think you are right that when I said "accounting ledger" I meant "general ledger".

    2. I'm using the same model as Phil is proposing (or something very similar) and it certainly does support split amounts. See the example below.


      Old model
      =========
      Transaction | Debit | Credit
      ------------+-------+--------
             1001 |       |    100   -- in the "Owner Equity" book
             1001 |    90 |          -- in the "Checking" book
             1001 |    10 |          -- in the "VAT" book

      would translate into

      New model
      =========
      Transaction | Credit_Account | Debit_Account | Amount
      ------------+----------------+---------------+--------
             1001 | Owner Equity   | Checking      |     90
             1001 | Owner Equity   | VAT           |     10

      --
      Martin

    3. @Martin: Then, again, tx_id can't be a primary key, because you have two entries with id 1001. You could call the "VAT" transaction 1002, having TWO separate transactions. Or you use 1001 as tx_id for both, but then you effectively group two transactions together - and get an implicit journal entry 1001 describing some unknown business activity.

  4. In addition to the other comments about accounting reality it seems as if your schema makes it harder to summarize the balance of accounts since you have to do a union of the debits and credits. This seems like an example where standard normalization leads to the optimal design. If anything I would consider adding a summary field to avoid having to recompute the balance, especially after a close when account balances cannot change. That implies denormalizing to improve performance. Accounting systems typically don't generate a lot of data compared to other types of applications, so storage is not necessarily the thing you want to minimize.
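
    For example, one account's balance would come from a query along these lines (account id 1 and the credits-positive sign convention are just for illustration):

    SELECT sum(signed_amt) AS balance
    FROM (
        SELECT  amt AS signed_amt FROM ledger WHERE credit_acct_id = 1
        UNION ALL
        SELECT -amt FROM ledger WHERE debit_acct_id = 1
    ) entries;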
