Friday, March 23, 2012

Readin' Dirty

They see me rollin'
A client recently had a data integrity issue when they deployed some ORM code that inadvertently NULL'd some columns during updates. They needed to recover the data quickly and came to me. The first thing that came to mind was to restore from backup and then replay xlog's to right before the deployment happened and see how many would be recoverable that way. They asked me to think outside the box and see if I could come up with a solution to do this more quickly.

Unfortunately I wasn't able to. I looked at pg_filedump and xlogdump, but I didn't really get out of them what I wanted. Ideally I would have liked to say something like 'CREATE TABLE bar AS SELECT * FROM foo WHERE corrupt_col IS NOT NULL USING SNAPSHOT ANY'. For those of you who are unfamiliar, there are a handful of global snapshots internal to Postgres that allow you to peek under the covers of MVCC. 'SnapshotAny' is one that allows you to see "any" row in the relation, including deleted ones that haven't been vacuumed yet. As far as I could tell, there was no better solution than restoring from a backup.

So after being unable to meet the client's expectations, and some prodding by co-workers, I decided to see what I could come up with that would meet my needs. So today I made publicly available the first version of pg_dirtyread. It's an extension that provides a simple function for reading heap tuples from relations using SnapshotAny. It's not ideal as, among other drawbacks, it requires a full sequential scan of the relation, but at least it's a start.

Example usage:
SELECT * FROM pg_dirtyread('foo'::regclass) AS t(bar bigint, baz text);
where the schema of foo is (bar bigint, baz text).
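For the kind of recovery described above, a query along these lines could pull the surviving row versions into a scratch table. This is just a sketch: foo_recovered is a made-up name, and it assumes baz is the column that got NULL'd.

CREATE TABLE foo_recovered AS
SELECT * FROM pg_dirtyread('foo'::regclass) AS t(bar bigint, baz text)
WHERE baz IS NOT NULL;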

I'd be interested in any feedback on it, as there is plenty of room for improvement. Maybe there are even some similar projects out there that my google-fu did not find for me. Also, I know the name is a misnomer. A dirty read means reading rows that haven't been committed, and there's even a different snapshot for that. Maybe I should have called it pg_readany, but that name didn't sound as cool. Maybe in the future I will call it something like pg_readsnapshot and allow you to specify which snapshot you would like to use.

Wednesday, March 21, 2012

What to do when your timeline isn't right

A lot of us are experienced with a single primary and single secondary replication setup in Postgres. Failover is straightforward. Your primary goes away, and the secondary is promoted to a primary. Then you create a new secondary in some manner. Maybe there is some floating IP magic in there to keep the clients happy in a transparent fashion. No problem.

The truth is though, since hot standby became available in 9.0, a lot of us are using more complicated setups these days. We'll have a primary, and then a failover secondary. Then maybe a secondary that is used for generating reports. Maybe one that we make backups on. Maybe even several others that we use for load balancing read-only queries. What happens in a primary failure situation now?

If you think about this for a little bit, the outlook seems dismal. Your failover machine switches timelines and now all your secondaries have a timeline mismatch with the new primary. Time to refresh all those databases from the primary? That doesn't sound fun. Especially if you have a few that are load balancing read-only queries in production. Ugh.

Good news, everyone: you can migrate your secondaries to a new primary and timeline!

First, set up your primary and secondary for failover. Use the hybrid method (streaming + WAL shipping) for all replication. Have a floating IP ready to be moved over to the failover database. Have the failover configured so that it is ready to be a primary. This means having your streaming options, like max_wal_senders, already set up. Have a replication entry in pg_hba.conf. Have an archive_command already in place to start archiving WAL segments out to all the other secondaries. The only big difference between the primary and failover secondary configs should be that the failover's archive_command doesn't send WAL segments to itself. Make sure you have a trigger_file specified in recovery.conf as well.
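As a rough sketch of what that looks like on the failover secondary (the addresses, user, and paths below are placeholders, not a drop-in config):

postgresql.conf:
wal_level = hot_standby
max_wal_senders = 5
archive_mode = on
archive_command = '...'   # ships WAL out to the other secondaries, see below

pg_hba.conf:
host  replication  replicator  10.0.0.0/24  md5

recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator'
restore_command = '...'   # pulls WAL shipped from the current primary
trigger_file = '/var/lib/postgresql/failover.trigger'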

So far this is probably similar to what you already have. The key here is to use the WAL shipping in addition to streaming (hybrid method). This makes sure that all the other secondaries will get the new history file. It's also important that your archive_mode is 'on' and your archive_command will work as soon as failover is triggered. The first two files it will archive are critical to the process. I use OmniPITR for all the WAL shipping as well as the WAL restore and cleanup on the secondary side.
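Just to illustrate the shape of the archive and restore sides, here are plain-shell stand-ins rather than the actual OmniPITR invocations; substitute your own tooling, and note the hosts and paths are made up:

archive_command = 'rsync -a %p standby2:/var/lib/postgresql/walarchive/%f && rsync -a %p standby3:/var/lib/postgresql/walarchive/%f'
restore_command = 'cp /var/lib/postgresql/walarchive/%f %p'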

Next you need to have all your other secondaries point to the floating IP. This is so that at failover time they will be able to seamlessly connect to the new primary. They will also need an additional line in their recovery.conf:
recovery_target_timeline = 'latest'
This tells them to follow whatever timeline change they see. So now when your newly promoted primary pushes them a history file via WAL shipping, they will honor it and switch timelines too. This setting isn't documented in 9.0 (documentation bug?), but it still works there.
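Putting that together, the recovery.conf on each of the other secondaries might look roughly like this (again, connection details and paths are placeholders, with 10.0.0.100 standing in for the floating IP):

standby_mode = 'on'
primary_conninfo = 'host=10.0.0.100 port=5432 user=replicator'
restore_command = 'cp /var/lib/postgresql/walarchive/%f %p'
recovery_target_timeline = 'latest'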

Now you should be ready to test your new setup. You do test, right? Good.

Unplug your primary.

Touch the trigger_file on the failover secondary.

Bring up the floating IP on your failover secondary.
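On a typical Linux box, those last two steps might look something like this (trigger path, interface, and address are placeholders matching the sketches above):

touch /var/lib/postgresql/failover.trigger
ip addr add 10.0.0.100/24 dev eth0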

You should now have a newly promoted primary on a new timeline. This is the part where the differences between 9.0 and 9.1 come out. In 9.0 the streaming protocol is one way, primary to secondary. In my testing, the connections in 9.0 hung indefinitely. I even adjusted the tcp_keepalive settings. Nothing seemed to help short of restarting the database. Ultimately, this is still easier than refreshing from the new primary, so I figure it still counts. In 9.1 there are some new feedback options. The option wal_receiver_status_interval is particularly useful. It is enabled by default with a value of 10 seconds and is meant to keep the primary updated about the secondary's replication status. In our scenario it lets the secondary notice that the primary has disappeared, which causes it to fall back to the restore_command, find the new history file, and switch timelines. After that it reconnects to the floating IP, which now points to the new primary.
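On the 9.1 secondaries there is nothing extra to configure for that; the default is equivalent to having this in postgresql.conf:

wal_receiver_status_interval = 10s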

In this post I have been a little light on the technical details. I have made some assumptions about your level of knowledge. If you want some more in depth information, here are some links to the documentation: