Wednesday, February 22, 2012

You keep using that query. I do not think it does what you think it does.

A client recently came to us with a performance issue. They had multiple threads pulling rows from a table, using FOR UPDATE to lock a row, doing some external work that accounted for most of the overall time, and then updating the row to show it had been processed. Essentially a queue. The workflow was something like this:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Do external work here
UPDATE foo SET processed_at=NOW() WHERE id=?;
END;
The problem with this pattern is that every other thread running that same SELECT will block until the first transaction commits, because they all pick the same oldest row. This locking pattern effectively forces a single-threaded model, pretty much the exact opposite of what the client wanted. I think they assumed it would hand each thread the first unlocked row, something like SKIP LOCKED DATA behavior. At one point in my career I am sure I held the same assumption.
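For reference, the table in these examples looks roughly like this. This is a sketch; the exact types and defaults are guesses on my part, but every column appears in the queries:
CREATE TABLE foo (
    id                     serial PRIMARY KEY,
    created_at             timestamptz NOT NULL DEFAULT NOW(),
    processed_at           timestamptz,  -- set once a row has been handled
    processing_started_at  timestamptz,  -- the "hold" column used by the reworked approach below
    processing_finished_at timestamptz
);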

There are probably several ways to skin this cat, but I wanted to go the route that held locks for the least amount of time, since the majority of the time was spent on the external task. Postgres has a clever feature called RETURNING that you can use with UPDATE and DELETE statements: it lets you see the affected rows as if you had queried them with a standard SELECT, effectively combining the two into one atomic statement. I used this feature to place a "hold" on the row at the application level by setting a column value to show that it was being worked on.

UPDATE foo SET processing_started_at=NOW()
WHERE processing_started_at IS NULL
  AND id=(SELECT id FROM foo
          WHERE processing_started_at IS NULL
          ORDER BY created_at LIMIT 1)
RETURNING id;
-- Do external work here
UPDATE foo SET processing_finished_at=NOW() WHERE id=?;
I needed the sub-query because you can't use LIMIT in an UPDATE. And because the sub-query might return the same row to two queries running at the same time, I re-check the NULL timestamp in the UPDATE itself, which is atomic: the losing UPDATE sees the timestamp already set and matches zero rows. If the application gets back 0 rows from the query, it knows there was a conflict and it just tries again. Looking back now, I think I could have used FOR UPDATE in the sub-query to avoid the conflict and the 0-row retry, but at the time I was trying to move away from that syntax altogether.
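To make the retry logic concrete, here is a rough sketch of what one worker's loop might look like in Python with psycopg2. The names and connection handling are illustrative, not the client's actual code:
import time
import psycopg2

CLAIM_SQL = """
UPDATE foo SET processing_started_at=NOW()
WHERE processing_started_at IS NULL
  AND id=(SELECT id FROM foo
          WHERE processing_started_at IS NULL
          ORDER BY created_at LIMIT 1)
RETURNING id;
"""

def do_external_work(row_id):
    pass  # placeholder for the slow external task

def work(dsn):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # each statement commits on its own; no long transactions
    cur = conn.cursor()
    while True:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        if row is None:
            # Queue is empty, or another worker claimed the same candidate
            # row first; back off briefly and try again.
            time.sleep(0.1)
            continue
        do_external_work(row[0])  # the slow part, done while holding no row locks
        cur.execute("UPDATE foo SET processing_finished_at=NOW() WHERE id=%s",
                    (row[0],))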

There are some cons to this approach. There's an extra UPDATE per row, which causes more table bloat. I could have used advisory locks instead, but I think that approach would have been more complex, as sketched below. This method gave the client much better throughput and significantly less lock contention, and in the end that was what counted.
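For the curious, an advisory-lock version might have looked roughly like this, using pg_try_advisory_lock() and pg_advisory_unlock(). The candidate selection here is a sketch, and the caveat in the comment is part of why I judged it more complex:
-- Sketch: claim the oldest unprocessed row whose advisory lock is free.
-- Caveat: pg_try_advisory_lock() is evaluated per candidate row, so it can
-- acquire locks on rows that LIMIT then throws away, and those locks stick
-- around until the session releases them.
SELECT id FROM foo
WHERE processed_at IS NULL AND pg_try_advisory_lock(id)
ORDER BY created_at LIMIT 1;
-- Do external work here
UPDATE foo SET processed_at=NOW() WHERE id=?;
SELECT pg_advisory_unlock(?);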

[Chart: locks before and after the change]