Wednesday, February 22, 2012

You keep using that query. I do not think it does what you think it does.

A client recently came to us with a performance issue. They had multiple threads pulling rows from a table: each thread used FOR UPDATE to lock a row, did some external work that accounted for a large portion of the overall time, and then updated the row to show it had been processed. Essentially a queue. The workflow was something like this:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Do external work here
UPDATE foo SET processed_at=NOW() WHERE id=?;
END;
The problem with this pattern is that the other threads running that same SELECT will block until the first transaction commits. Basically this locking pattern forces a single-threaded model, pretty much the exact opposite of what the client wanted. I think the assumption they made was that it would give them the first unlocked row, something like SKIP LOCKED DATA behavior. At one point in my career I am sure I held the same assumption.
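
To make the blocking concrete, here is a minimal sketch of two competing workers hitting the same table as above (the interleaving comments are mine):

-- Worker 1:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Worker 1 now holds the row lock while it does its slow external work.

-- Worker 2, at the same time:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Both queries order by created_at, so worker 2 tries to lock the same
-- first row and blocks right here until worker 1 commits or rolls back.
-- It does not skip ahead to the next unlocked row.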

There are probably several ways to skin this cat, but I wanted to go the route that I thought would hold locks for the least amount of time, since the majority of the time was spent on the external task. Postgres has a clever feature called RETURNING that you can use with UPDATE and DELETE statements; it lets you see the rows affected as if you had queried them with a standard SELECT, effectively combining the two into one atomic statement. I used this feature to place a "hold" on the row at the application level by setting a value in a column to show that it was being worked on.
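
As a quick illustration of RETURNING on its own (the id value here is made up for the example):

UPDATE foo SET processed_at=NOW() WHERE id=42 RETURNING id, processed_at;
-- Hands back the updated row's id and new timestamp in one statement,
-- as if you had run a SELECT right after the UPDATE.

Applied to the queue, the claim-and-return statement I ended up with was: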

UPDATE foo
   SET processing_started_at=NOW()
 WHERE processing_started_at IS NULL
   AND id=(SELECT id FROM foo
            WHERE processing_started_at IS NULL
            ORDER BY created_at
            LIMIT 1)
RETURNING id;
-- Do external work here
UPDATE foo SET processing_finished_at=NOW() WHERE id=?;
I needed the sub-query because you can't use LIMIT directly in an UPDATE. Also, because the sub-query might return the same row as another query running at the same time, I re-check the NULL timestamp in the UPDATE itself, which is atomic. If the application gets back 0 rows from the query, it knows there was a conflict and it just tries again. Looking back now, I think I could have used FOR UPDATE in the sub-query to avoid the conflict and the 0-row result, but at the time I was trying to move away from that syntax altogether.
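
For reference, that variant might have looked something like the sketch below. This is untested speculation rather than what shipped, and I have kept the atomic re-check anyway, since under READ COMMITTED a worker that waits on the lock can still see the sub-query come back empty once the winner commits, so the retry logic stays useful:

UPDATE foo
   SET processing_started_at=NOW()
 WHERE processing_started_at IS NULL
   AND id=(SELECT id FROM foo
            WHERE processing_started_at IS NULL
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE)
RETURNING id;
-- Competing workers now queue on the row lock inside the sub-query
-- instead of racing to the outer UPDATE, so they wait only for the
-- duration of this statement rather than the whole external task.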

There are some cons to this approach. I have an extra update, which causes more table bloat. I could have used advisory locks, but I think that approach would have been more complex. This method gave the client much better throughput and reduced the lock contention significantly, and in the end that was what counted.
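
For the curious, a rough sketch of the advisory lock route (pg_try_advisory_lock and pg_advisory_unlock are real Postgres functions, but this claim logic is simplified and untested):

SELECT id FROM foo
 WHERE processed_at IS NULL
   AND pg_try_advisory_lock(id)
 ORDER BY created_at
 LIMIT 1;
-- Do external work here, then mark the row and release the lock:
UPDATE foo SET processed_at=NOW() WHERE id=?;
SELECT pg_advisory_unlock(?);
-- Part of the extra complexity: the planner may call
-- pg_try_advisory_lock() on rows it never returns, leaving stray
-- session-level locks to clean up.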

[Graph: Locks before and after]

6 comments:

  1. Hi Phil,

    do you mind telling us which tool you use to track and visualize the locks held?

    Thx,
    Jan

    1. Jan,

      Sure, it's done with Circonus. http://circonus.com/

  2. Hi Phil

    Thanks, this is an interesting write-up. I've seen lots of similar approaches used to distribute work, and I'm wondering if someone has written about and named them. It would be interesting to know how long is spent picking up jobs, how often you need to retry, and how those numbers change as you add more competing workers, with new jobs being inserted at different rates (i.e. keeping up vs. not keeping up).

    FWIW, since writing about that SKIP LOCKED DATA idea you mentioned, I went on to try implementing it, and have posted an early proof-of-concept patch to run up the flagpole and see which way the wind blows:

    http://archives.postgresql.org/pgsql-hackers/2012-02/msg00151.php

    The idea is that many competing workers can each get a row to chew on, without ever having to wait for a lock OR retry.

    Thanks
    Thomas Munro

    1. You will definitely be increasing the chances of having to retry as you increase workers, but how well it works in general is going to be workload dependent. The smaller the percentage of time you spend running the first update, the better it will scale. Unfortunately I don't have much I can share on performance metrics outside of the included graph.

      I am very interested in a SKIP LOCKED DATA patch. I think it would be an ideal solution for many cases such as the one in my post. It definitely would seem more intuitive to the end user, as many people already assume similar behavior exists. I will be keeping an eye on it to see how it is received and what progress it makes toward getting committed.

  3. I've written a lot about this in the MySQL world, where SELECT FOR UPDATE is equally evil. Different takes on a similar problem:

    http://www.engineyard.com/blog/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you/

    http://www.percona.com/about-us/mysql-white-paper/mysql-performance-analysis-with-percona-toolkit-and-tcp-ip-network-traffic/

  4. I see that people in the MySQL world have also been known to ask for a SKIP LOCKED [DATA] feature to help with this usage pattern:

    http://bugs.mysql.com/bug.php?id=17426

    http://forge.mysql.com/worklog/task.php?id=3597
