The truth is though, since hot standby became available in 9.0, a lot of us are using more complicated setups these days. We'll have a primary, and then a failover secondary. Then maybe a secondary that is used for generating reports. Maybe one that we make backups on. Maybe even several others that we use for load balancing read-only queries. What happens in a primary failure situation now?
If you think about this for a little bit, the outlook seems dismal. Your failover machine switches timelines and now all your secondaries have a timeline mismatch with the new primary. Time to refresh all those databases from the primary? That doesn't sound fun. Especially if you have a few that are load balancing read-only queries in production. Ugh.
Good news, everyone, you can migrate your secondaries to a new primary and timeline!
First, set up your primary and secondary for failover. Use the hybrid method (streaming + WAL shipping) for all replication. Have a floating IP ready to be moved over to the failover database. Have the failover secondary configured so that it is ready to be a primary. This means having your streaming options, like max_wal_senders, already set up. Have a replication entry in pg_hba.conf. Have an archive_command already in place to start archiving WAL segments out to all the other secondaries. The only big difference between the primary and failover secondary configs should be that the failover's archive_command does not send WAL segments to itself. Make sure you have a trigger_file specified in recovery.conf as well.
So far this is probably similar to what you already have. The key here is to use the WAL shipping in addition to streaming (hybrid method). This makes sure that all the other secondaries will get the new history file. It's also important that your archive_mode is 'on' and your archive_command will work as soon as failover is triggered. The first two files it will archive are critical to the process. I use OmniPITR for all the WAL shipping as well as the WAL restore and cleanup on the secondary side.
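To make that concrete, here is a rough sketch of what the relevant settings on the failover secondary might look like. Treat it as an illustration rather than a drop-in config: the replication user "replicator", the 10.0.0.x addresses, the file paths, and the ship_wal.sh / restore_wal.sh wrapper scripts are all made-up placeholders (I use OmniPITR for those jobs, and its invocation is not shown here).

```
# --- postgresql.conf on the failover secondary (sketch) ---
wal_level = hot_standby          # same as on the primary
hot_standby = on                 # allow read-only queries while in recovery
max_wal_senders = 5              # ready to stream to the other secondaries after promotion
archive_mode = on                # must already be on so archiving starts right after failover
# Placeholder script: ships each segment (and the new history file) to every
# OTHER secondary, but not to this machine itself.
archive_command = '/usr/local/bin/ship_wal.sh %p %f'

# --- pg_hba.conf on the failover secondary (sketch) ---
# Placeholder user and network; lets the other secondaries stream from this host.
host    replication    replicator    10.0.0.0/24    md5

# --- recovery.conf on the failover secondary (sketch) ---
standby_mode = 'on'
primary_conninfo = 'host=10.0.0.100 port=5432 user=replicator'   # placeholder floating IP
restore_command = '/usr/local/bin/restore_wal.sh %f %p'          # placeholder restore script
trigger_file = '/var/lib/postgresql/failover.trigger'            # placeholder path
```

The point of the sketch is only that everything a primary needs is already in place before the trigger file is ever touched.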
Next you need to have all your other secondaries point to the floating IP. This is so that at failover time they will be able to seamlessly connect to the new primary. They will also need an additional line in their recovery.conf:
recovery_target_timeline = 'latest'
This tells them to follow whatever timeline change they see. So now when your newly promoted primary pushes them a history file via WAL shipping, they will honor it and switch timelines too. This feature isn't documented in 9.0 (documentation bug?) but it still has an effect.
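Put together, a recovery.conf on one of the other secondaries might look something like the sketch below. The floating IP 10.0.0.100, the "replicator" user, and the restore_wal.sh script are placeholders for illustration only; substitute whatever your own setup uses for connecting and for restoring shipped WAL.

```
# recovery.conf on a reporting / backup / load-balancing secondary (sketch)
standby_mode = 'on'
# Connect to the floating IP, not to a specific host, so the same setting
# still works after the IP moves to the newly promoted primary.
primary_conninfo = 'host=10.0.0.100 port=5432 user=replicator'
# Placeholder script that fetches shipped WAL segments and history files.
restore_command = '/usr/local/bin/restore_wal.sh %f %p'
# Follow the newest timeline found in the shipped history files.
recovery_target_timeline = 'latest'
```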
Now you should be ready to test your new setup. You do test, right? Good.
Unplug your primary.
Touch the trigger_file on the failover secondary.
Bring up the floating IP on your failover secondary.
You should now have a newly promoted primary in a new timeline. This is the part where the differences between 9.0 and 9.1 come out.
In 9.0 the streaming protocol is one way, primary to secondary. In my testing, the secondaries' replication connections in 9.0 hung indefinitely. I even adjusted the tcp_keepalive settings. Nothing seemed to help short of restarting the database on each secondary. Ultimately, this is still easier than refreshing from the new primary, so I figure it still counts.
In 9.1 there are some new feedback options. The option wal_receiver_status_interval is particularly useful. It is enabled by default with a value of 10 seconds, and it is meant to update the primary about the status of the secondary with regard to replication. In our scenario it also lets the secondary find out that the primary has disappeared. The secondary then runs its restore_command, finds the history file, and changes timelines. After that it connects back to the floating IP, which now points to the new primary.
In this post I have been a little light on the technical details. I have made some assumptions about your level of knowledge. If you want some more in-depth information, here are some links to the documentation: