Low Sodium

Tuesday, February 5, 2013

ApacheCon NA 2013

Come to ApacheCon NA and see my tutorial on writing plugins for ATS. Maybe even stay and check out some of the other speakers too!

Monday, September 3, 2012

Stale-while-revalidate support back in ATS!

The kind people over at OHSU graciously sponsored the creation of an RFC 5861 plugin for Apache Traffic Server. I was lucky enough to get to write it. This is not the first stale-while-revalidate plugin for ATS, but the original did not work with ATS 3.x anymore. In addition, the new version also supports the stale-if-error portion of the RFC.

It's currently in the experimental directory, and I invite you to test it out and provide some feedback. It's currently being used in production in a few places but we'd like to hear a few more success stories before it gets moved to the stable plugin directory.

Thursday, May 31, 2012

How I learned to stop worrying and love Postgres hooks

Recently, Theo added a statsd interface to reconnoiter. He and I talked about the possibility of getting Postgres query stats (duration and number of tuples returned) into reconnoiter through this interface. His original idea was to get some logging hooks added to Postgres and pull the info that way. It just so happens that I had just seen Guillaume's talk on hooks in Postgres at PGCon. The hooks in Postgres are among the more poorly documented features. I knew there were hooks, but really didn't know much more until I attended this very informative talk.

I did some hacking and the solution I came up with is now known as pg_query_statsd. It uses the ExecutorStart and ExecutorEnd hooks to turn on and pull instrumentation data and then ships that data over to statsd (really reconnoiter posing as statsd) via UDP so there is no blocking. There are some buffer stats that I can enable and may add that feature in the future. It also has a hostname and port GUC so you can change then reload the config to send the data to a new host.

A couple notes about compatibility. First, I currently send two metrics in one packet. The official statsd from Etsy does not support this. I sent them a pull request to fix this. Second, since the documentation is non-existent, I had to make some guesses as to how I was supposed to handle taking the hooks and calling default functions. It may not play nice with other modules. If someone happens to see a problem in this area, or any area really, I take pull requests.

Monday, April 30, 2012

Denormalizing the real world to normalize your database

When writing database schema's sometimes we envision a 1-to-1 correlation with what we are trying to model from the real world and assume it must be a good design because it seems to fit so perfectly. This is not always, and perhaps not even often, the case.

Double Entry Ledger

What do I mean by this? Let's use an accounting ledger as an example. It looks like a nice clean table on paper. Columns and rows that translate nicely to a database table. There is also a well defined process for making entries called double entry bookkeeping. You insert one entry for the debit and one for the credit. They must both be entered together in one transaction. They even call these transactions in accounting! Seems pretty cut and dry, right?

tx_id |            tx_tsz             | acct_id | debit | credit | tx_type |    ref     |       dscr
-------+-------------------------------+---------+-------+--------+---------+------------+-------------------
     1 | 2012-04-30 17:25:58.003826-04 |       1 |   100 |        |       1 | order_id 1 | Sale to Alice
     1 | 2012-04-30 17:25:58.003826-04 |       2 |       |    100 |       1 | order_id 1 | Sale to Alice
     2 | 2012-04-30 17:31:58.953455-04 |       2 |   100 |        |       2 | order_id 1 | Refund from Alice
     2 | 2012-04-30 17:31:58.953455-04 |       1 |       |    100 |       2 | order_id 1 | Refund from Alice

The problem is this whole system is designed for a human to enter the data manually using pen or pencil on paper. It was never meant to scale. One of it's main benefits being error checking for error prone humans. You make separate entries for both debits and credits. Then you add both columns up. They should be identical. If they aren't, you made a mistake in the entry, or in the summing. This is how you keep your books "balanced."

Computers on the other hand don't make mistakes when programmed properly, and when programmed improperly often make the same mistake repeatedly. So if it made an error on the debit entry it likely made the same error on the credit entry. Or if it made an error summing the debits it most likely made the same error summing the credits. If you have an actual human making ledger entries into a computer and you want to have them use double entry, you probably should be doing the error checking in your interface instead. For your cosmic ray data corruption, there are usually better methods for error checking than this.

Implementing strict double entry accounting in a database limits the benefits and introduces new problems like requiring twice as many rows in a table. You are also "linking" the rows by a "unique" transaction ID. Except it's not unique since you are using it in two rows. This means you can't put a unique constraint on it and thus you are limiting the benefits of putting it in a database in the first place. Starting to sound like the worst of both worlds.

You might even have had accountants tell you that you should store it this way in the name of best practices. What they mean is that you should be able to retrieve it back out of the system in that format. In their world presentation and implementation are probably the same thing. I am sure that your accountant is fine with (if they are even aware of) the fact that all the data is stored in binary and that the data is spread across the disk based on the inode allocation heuristics of your file system. They also aren't complaining if you are compressing the data. Why would they really care if you are normalizing it then? All they really care about is that you don't lose any information and that you can present it to them in a format they are comfortable with. You could implement this with a view or perhaps just in your reporting package.

Here's what I would propose:

tx_id |            tx_tsz             | debit_acct_id | credit_acct_id | amt | tx_type
-------+-------------------------------+---------------+----------------+-----+---------
     1 | 2012-04-30 17:25:58.003826-04 |             1 |              2 | 100 |       1
     2 | 2012-04-30 17:31:58.953455-04 |             2 |              1 | 100 |       2

This is the minimal amount of information you would need. It would be trivial to create a view based on this table that mimics double entry accounting. You still have a debit and a credit account. This keeps the feature of double entry accounting that you can track where the money came from and where it is going. You may also notice that 'ref' and 'dscr' are gone. I would replace 'ref' with a real relational table. One that is basically two foreign keys to the ledger and the other table, in this case an orders table. I would also argue that 'dscr' (short for description) would be better served as a separate table with a foreign key as well. Most times I see this column as redundant data. In the few cases when it's not, for example it contains a short note from someone who did an override, I think it would be best to keep that external to the ledger. This way you don't have a lot of useless information or NULL's. If you have a good use case for a column of that type, by all means add it. Also I used integers for the tx_id and *_acct_id's. You may want to use UUID's for one or all of these. If you do, be sure to use Postgres' awesome UUID type.

For the record I have seen a denormalized ledger in the wild several times. That's why I decided it was worth blogging about. I don't think people are dumb if they have implemented it that way. It's very easy to find something in another discipline to use as a model for something in a database, but sometimes it's just sub-optimal and will cause you problems down the road when your ledger is 2-4 times as large as your next largest table and it's becoming unwieldy. It's a pretty important table to a business so I just want to see people not fall into the easy to avoid traps.

Friday, March 23, 2012

Readin' Dirty

They see me rollin'

A client recently had a data integrity issue when they deployed some ORM code that inadvertently NULL'd some columns during updates. They needed to recover the data quickly and came to me. The first thing that came to mind was to restore from backup and then replay xlog's to right before the deployment happened and see how many would be recoverable that way. They asked me to think outside the box and see if I could come up with a solution to do this more quickly.

Unfortunately I wasn't able to. I looked at pg_filedump and xlogdump, but I didn't really get out of them what I wanted. Ideally I would have liked to say something like 'CREATE TABLE bar AS SELECT * FROM foo WHERE corrupt_col IS NOT NULL USING SNAPSHOT ANY'. For those of you that are unfamiliar, there are a handful of global snapshots internal to Postgres that allow you to peek under the covers of the MVCC. 'SnapshotAny' is one that allows you to see "any" row in the relation, including deleted ones that haven't been vacuumed yet. As far as I could tell, there was no better solution then restoring from a backup.

So after being unable to meet the client's expectations, and some prodding by co-workers, I decided to see what I could come up with that would meet my needs. So today I made publicly available the first version of pg_dirtyread. It's an extension that provides a simple function for reading heap tuples from relations using SnapshotAny. It's not ideal as, among other drawbacks, it requires a full sequential scan of the relation, but at least it's a start.

Example usage:

SELECT * FROM pg_dirtyread('foo'::regclass) as t(bar bigint, baz text);

Where the schema of foo is (bar bigint, baz text)

I'd be interested in any feedback on it as there is plenty of room for improvement. Maybe even some similar projects out there that my google-fu did not find for me. Also, I know the name is fallacious. A dirty read is reading rows that aren't committed. There is a different snapshot for that even. Maybe I should have called it pg_readany, but that name didn't sound as cool. Maybe in the future I will call it something like pg_readsnapshot and allow you to specify which snapshot you would like to use.

Wednesday, March 21, 2012

What to do when your timeline isn't right

A lot of us are experienced with a single primary and single secondary replication setup in Postgres. Failover is straightforward. Your primary goes away, and the secondary is promoted to a primary. Then you create a new secondary in some manner. Maybe there is some floating IP magic in there to keep the clients happy in a transparent fashion. No problem.

The truth is though, since hot standby became available in 9.0, a lot of us are using more complicated setups these days. We'll have a primary, and then a failover secondary. Then maybe a secondary that is used for generating reports. Maybe one that we make backups on. Maybe even several others that we use for load balancing read-only queries. What happens in a primary failure situation now?

If you think about this for a little bit, the outlook seems dismal. Your failover machine switches timelines and now all your secondaries have a timeline mismatch with the new primary. Time to refresh all those databases from the primary? That doesn't sound fun. Especially if you have a few that are load balancing read-only queries in production. Ugh.

Good news, everyone, you can migrate your secondaries to a new primary and timeline!

First setup your primary and secondary for failover. Use the hybrid method (Streaming + WAL shipping) for all replication. Have a floating IP ready to be moved over to the failover database. Have the failover configured so that it is ready to be a primary. This means have your streaming options, like wal_senders, already set up. Have a replication entry in the pg_hba.conf. Have an archive_command already in place to start archiving WAL segments out to all the other secondaries. The only big difference between the primary and failover secondary config's should be the archive_command not sending WAL segments to itself. Make sure you have a trigger_file specified in the recovery.conf as well.

So far this is probably similar to what you already have. The key here is to use the WAL shipping in addition to streaming (hybrid method). This makes sure that all the other secondaries will get the new history file. It's also important that your archive_mode is 'on' and your archive_command will work as soon as failover is triggered. The first two files it will archive are critical to the process. I use OmniPITR for all the WAL shipping as well as the WAL restore and cleanup on the secondary side.

Next you need to have all your other secondaries point to the floating IP. This is so that at failover time they will be able to seamlessly connect to the new primary. They will also need an additional line in their recovery.conf:

recovery_target_timeline = 'latest'

This tells them to follow whatever timeline change they see. So now when your newly promoted primary pushes them a history file via WAL shipping, they will honor it and switch timelines too. This feature isn't documented in 9.0 (documentation bug?) but it still has an effect.

Now you should be ready to test your new setup. You do test, right? Good.

Unplug your primary.

Touch the trigger_file on the failover secondary.

Bring up the floating IP on your failover secondary.

You should now have a newly promoted primary in a new timeline. This is the part where the differences in 9.0 and 9.1 come out. In 9.0 the streaming protocol is one way, primary to secondary. In my testing the connections in 9.0 hung indefinitely. I even adjusted tcp_keepalive settings. Nothing seemed to help short of restarting the database. Ultimately, this is still easier then refreshing from the new primary, so I figure it still counts. In 9.1 there are some new feedback options. The option wal_receiver_status_interval is particularly useful. This option is enabled by default with a value of 10 seconds. It is meant to update the primary about the status of the secondary with regards to replication. In our scenario it lets the secondary know that the primary has disappeared. This causes it to try running the restore_command and finds the history file and then changes timelines. After which it connects back to the floating IP which now points to the new primary.

In this post I have been a little light on the technical details. I have made some assumptions about your level of knowledge. If you want some more in depth information, here are some links to the documentation:

Wednesday, February 22, 2012

You keep using that query. I do not think it does what you think it does.

A client recently came to us with a performance issue. They had multiple threads pulling rows from a table using FOR UPDATE to lock the row then doing some external work that accounts for a large portion of the overall time and then updating the row to show it had been processed. Essentially a queue. The work flow was something like this:

BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Do external work here
UPDATE foo SET processed_at=NOW() WHERE id=?;
END;

The problem with this pattern is that the other threads running that same SELECT will block until the first transaction commits. Basically this locking pattern forces a single threaded model. Pretty much the exact opposite of what the client wanted. I think the assumption that they made was that it would give them the first unlocked row. Something like SKIP LOCKED DATA behavior. At one point in my career I am sure I held the same assumption.

There are probably several ways to skin this cat, but I wanted to go the route that I thought would hold locks for the least amount of time because that majority was spent on the external task. Postgres has a clever feature called RETURNING that you can use with UPDATE and DELETE statements that allows you to see the rows affected as if you had queried them with a standard SELECT effectively combining the two into one atomic statement. I used this feature to place a "hold" on the row at the application level by setting a value in a column to show that it was being worked on.

UPDATE foo SET processing_started_at=NOW() WHERE processing_started_at IS NULL AND id=(SELECT id FROM foo WHERE processing_started_at IS NULL ORDER BY created_at LIMIT 1) RETURNING id;
-- Do external work here
UPDATE foo SET processing_finished_at=NOW() WHERE id=?;

I needed the sub-query because I can't limit in an update. Also, because the sub-query might return the same row as another query run at the same time, I re-check the NULL timestamp in the update which is atomic. If the application gets back 0 rows from the query then it knows that there was a conflict and it just tries again. Looking back now, I think I could have used FOR UPDATE in the sub-query to avoid the conflict and returning 0 rows, but at the time I was trying to move away from that syntax altogether.

There are some cons to this approach. I have an extra update which causes more table bloat. I could have used advisory locks but I think that approach would have been more complex. This method gave the client much better throughput and reduced the lock contention significantly and in the end that was what counted.

Locks before and after