Wednesday, March 21, 2012

What to do when your timeline isn't right

A lot of us are experienced with a single primary and single secondary replication setup in Postgres. Failover is straightforward. Your primary goes away, and the secondary is promoted to a primary. Then you create a new secondary in some manner. Maybe there is some floating IP magic in there to keep the clients happy in a transparent fashion. No problem.

The truth is though, since hot standby became available in 9.0, a lot of us are using more complicated setups these days. We'll have a primary, and then a failover secondary. Then maybe a secondary that is used for generating reports. Maybe one that we make backups on. Maybe even several others that we use for load balancing read-only queries. What happens in a primary failure situation now?

If you think about this for a little bit, the outlook seems dismal. Your failover machine switches timelines and now all your secondaries have a timeline mismatch with the new primary. Time to refresh all those databases from the primary? That doesn't sound fun. Especially if you have a few that are load balancing read-only queries in production. Ugh.

Good news, everyone, you can migrate your secondaries to a new primary and timeline!

First setup your primary and secondary for failover. Use the hybrid method (Streaming + WAL shipping) for all replication. Have a floating IP ready to be moved over to the failover database. Have the failover configured so that it is ready to be a primary. This means have your streaming options, like wal_senders, already set up. Have a replication entry in the pg_hba.conf. Have an archive_command already in place to start archiving WAL segments out to all the other secondaries. The only big difference between the primary and failover secondary config's should be the archive_command not sending WAL segments to itself. Make sure you have a trigger_file specified in the recovery.conf as well.

So far this is probably similar to what you already have. The key here is to use the WAL shipping in addition to streaming (hybrid method). This makes sure that all the other secondaries will get the new history file. It's also important that your archive_mode is 'on' and your archive_command will work as soon as failover is triggered. The first two files it will archive are critical to the process. I use OmniPITR for all the WAL shipping as well as the WAL restore and cleanup on the secondary side.

Next you need to have all your other secondaries point to the floating IP. This is so that at failover time they will be able to seamlessly connect to the new primary. They will also need an additional line in their recovery.conf:
recovery_target_timeline = 'latest'
This tells them to follow whatever timeline change they see. So now when your newly promoted primary pushes them a history file via WAL shipping, they will honor it and switch timelines too. This feature isn't documented in 9.0 (documentation bug?) but it still has an effect.

Now you should be ready to test your new setup. You do test, right? Good.

Unplug your primary.

Touch the trigger_file on the failover secondary.

Bring up the floating IP on your failover secondary.

You should now have a newly promoted primary in a new timeline. This is the part where the differences in 9.0 and 9.1 come out. In 9.0 the streaming protocol is one way, primary to secondary. In my testing the connections in 9.0 hung indefinitely. I even adjusted tcp_keepalive settings. Nothing seemed to help short of restarting the database. Ultimately, this is still easier then refreshing from the new primary, so I figure it still counts. In 9.1 there are some new feedback options. The option wal_receiver_status_interval is particularly useful. This option is enabled by default with a value of 10 seconds. It is meant to update the primary about the status of the secondary with regards to replication. In our scenario it lets the secondary know that the primary has disappeared. This causes it to try running the restore_command and finds the history file and then changes timelines. After which it connects back to the floating IP which now points to the new primary.

In this post I have been a little light on the technical details. I have made some assumptions about your level of knowledge. If you want some more in depth information, here are some links to the documentation:

9 comments:

  1. Thank you for the article, it is very informative.

    BTW, setting up the hybrid method (streaming + archive_command) is not necessary to solve this task. If one have a pure streaming replication then to switch a replica to a new master he/her just needs to add

    recovery_target_timeline = 'latest'

    in the recovery.conf on the replica. Then stop the service on the replica, rsync pg_xlog to the replica and start the service.

    Probably we can even replace the last 3 actions with just "rsync pg_xlog to the replica and restart the service" but I have not tested it.

    ReplyDelete
    Replies
    1. That's essentially the same thing as WAL shipping, but you also get the added benefit of not having replication lag too far behind what streaming can provide from the master to catch up. And in a failover situation there is less to do. I recommend hybrid replication even if switching timelines isn't your goal.

      Delete
    2. Sure, the hybrid one has more benefits.

      Delete
  2. Yeah, this method is helpful, but there's a warning: the slave you promote MUST be the one that was the most up-to-date with the master before it failed.

    If you promote the wrong one, slaves that had a smaller lag with the master might be corrupted, but will not complain.

    Checking this is pretty sure with 9.1, but IIRC there was some issue with 9.0 where pg_current_xlog_location was able to step back in some situations...but my memory might be wrong.

    About the tcp_keepalive, how/where did you set it up ? Did you try to set it up on the system side ?

    ReplyDelete
    Replies
    1. This is interesting. I had thought about the possibility of something like that potentially happening. When you say "corrupted" is it actual corruption, or just loss of updates? I suppose setting your designated failover to synchronous replication would solve this, but that might be too high of a cost.

      As for the tcp_keepalive stuff, I just played with those settings in postgresql.conf.

      Delete
  3. Might be both loss of updates and corruption.

    Worst case: you promote the most lagged slave. It starts receiving writes and stream them to other slaves that had some more up-to-date datas.

    If I understand it correctly, either thoses datas will be erased by new writes, or worst, some will stay visible. Think about clog as well, a txid which commit some data in t1 might have commited some other datas in t2 in previous the timeline... Anyway, it just look like a big and dangerous mess.

    ReplyDelete
    Replies
    1. I didn't talk about this in the post, but I usually don't allow queries to block updates on the failover, but do allow that on the "other" secondaries. This isn't an assurance that it is the most up to date, but in practice it almost always is.

      Also, if the other secondaries are too far ahead, they will be past the timeline branch point in the history file. I think if there is a problem it involves a checkpoint after applying the changes from the WAL segment of the branch point but going no further. Honestly I am not familiar enough with these internals to say there is or is not a problem here, but I think I need to find the answer now.

      Delete
    2. I'm not sure to understand what you exactly mean here: «I think if there is a problem it involves a checkpoint after applying the changes from the WAL segment of the branch point but going no further.»

      Slaves don't do checkpoints, but restart points, which are not related anyway. Slaves apply WAL entries about relations and clogs directly on the datas. Checkpoints occurs on the master and are not related to the WAL writes.

      As WAL entries are copy-on-write. That means you can replay them again and again in the correct order from the good starting point (during multi slave reboot as instance) and getting back to the same situation.

      AFAIK, if a slave already applied WAL entries from the old master in the old timeline, it has no mechanism to detect that the new WAL entries from the new timeline is branched in the past and remove invalid pages it already applied from the old master.

      That was my understanding last time I did the same tests than yours, leading to this big important pre-requisit (promoting the most up-to-date slave). I'm obviously interested about you own conclusion as soon as you are ready to publish them :)

      Thank you !

      Delete
    3. So these seems much worse than I first thought:

      https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L4354

      So you are correct, you can't roll back to a restart point before the timeline switch. I think what should at least be done here though is throw an error and don't switch timelines. I don't see any case where switching to the new timeline after the switch point makes any sense.

      Delete