Wednesday, February 22, 2012

You keep using that query. I do not think it does what you think it does.

A client recently came to us with a performance issue. They had multiple threads pulling rows from a table: each thread used FOR UPDATE to lock a row, did some external work that accounted for a large portion of the overall time, and then updated the row to show it had been processed. Essentially a queue. The workflow was something like this:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Do external work here
UPDATE foo SET processed_at=NOW() WHERE id=?;
END;
The problem with this pattern is that the other threads running that same SELECT will block until the first transaction commits. Basically this locking pattern forces a single-threaded model, pretty much the exact opposite of what the client wanted. I think the assumption they made was that it would give them the first unlocked row, something like SKIP LOCKED DATA behavior. At one point in my career I am sure I held the same assumption.
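
To make the blocking concrete, here is a minimal sketch of two competing workers hitting the same table as above (the interleaving comments are mine):

-- Worker 1:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Worker 1 now holds the row lock while it does its slow external work.

-- Worker 2, at the same time:
BEGIN;
SELECT id FROM foo WHERE processed_at IS NULL ORDER BY created_at LIMIT 1 FOR UPDATE;
-- Both queries order by created_at, so worker 2 tries to lock the same
-- first row and blocks right here until worker 1 commits or rolls back.
-- It does not skip ahead to the next unlocked row.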

There are probably several ways to skin this cat, but I wanted to go the route that I thought would hold locks for the least amount of time, since the majority of the time was spent on the external task. Postgres has a clever feature called RETURNING that you can use with UPDATE and DELETE statements; it lets you see the rows affected as if you had queried them with a standard SELECT, effectively combining the two into one atomic statement. I used this feature to place a "hold" on the row at the application level by setting a value in a column to show that it was being worked on.
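
As a quick illustration of RETURNING on its own (the id value here is made up for the example):

UPDATE foo SET processed_at=NOW() WHERE id=42 RETURNING id, processed_at;
-- Hands back the updated row's id and new timestamp in one statement,
-- as if you had run a SELECT right after the UPDATE.

Applied to the queue, the claim-and-return statement I ended up with was: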

UPDATE foo
   SET processing_started_at=NOW()
 WHERE processing_started_at IS NULL
   AND id=(SELECT id FROM foo
            WHERE processing_started_at IS NULL
            ORDER BY created_at
            LIMIT 1)
RETURNING id;
-- Do external work here
UPDATE foo SET processing_finished_at=NOW() WHERE id=?;
I needed the sub-query because you can't use LIMIT directly in an UPDATE. Also, because the sub-query might return the same row as another query running at the same time, I re-check the NULL timestamp in the UPDATE itself, which is atomic. If the application gets back 0 rows from the query, it knows there was a conflict and it just tries again. Looking back now, I think I could have used FOR UPDATE in the sub-query to avoid the conflict and the 0-row result, but at the time I was trying to move away from that syntax altogether.
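
For reference, that variant might have looked something like the sketch below. This is untested speculation rather than what shipped, and I have kept the atomic re-check anyway, since under READ COMMITTED a worker that waits on the lock can still see the sub-query come back empty once the winner commits, so the retry logic stays useful:

UPDATE foo
   SET processing_started_at=NOW()
 WHERE processing_started_at IS NULL
   AND id=(SELECT id FROM foo
            WHERE processing_started_at IS NULL
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE)
RETURNING id;
-- Competing workers now queue on the row lock inside the sub-query
-- instead of racing to the outer UPDATE, so they wait only for the
-- duration of this statement rather than the whole external task.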

There are some cons to this approach. I have an extra update, which causes more table bloat. I could have used advisory locks, but I think that approach would have been more complex. This method gave the client much better throughput and reduced the lock contention significantly, and in the end that was what counted.
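
For the curious, a rough sketch of the advisory lock route (pg_try_advisory_lock and pg_advisory_unlock are real Postgres functions, but this claim logic is simplified and untested):

SELECT id FROM foo
 WHERE processed_at IS NULL
   AND pg_try_advisory_lock(id)
 ORDER BY created_at
 LIMIT 1;
-- Do external work here, then mark the row and release the lock:
UPDATE foo SET processed_at=NOW() WHERE id=?;
SELECT pg_advisory_unlock(?);
-- Part of the extra complexity: the planner may call
-- pg_try_advisory_lock() on rows it never returns, leaving stray
-- session-level locks to clean up.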

[Graph: Locks before and after]

6 comments:

  1. Hi Phil,

    do you mind telling us which tool you use to track and visualize the locks held?

    Thx,
    Jan

    1. Jan,

      Sure, it's done with Circonus. http://circonus.com/

  2. Hi Phil

    Thanks, this is an interesting write-up. I've seen lots of similar approaches used to distribute work, and I'm wondering if someone has written about and named them. It would be interesting to know how long is spent picking up jobs, how often you need to retry, and how those numbers change as you add more competing workers, with new jobs being inserted at different rates (i.e. keeping up vs. not keeping up).

    FWIW, since writing about that SKIP LOCKED DATA idea you mentioned, I went on to try implementing it, and have posted an early proof-of-concept patch to run up the flagpole and see which way the wind blows:

    http://archives.postgresql.org/pgsql-hackers/2012-02/msg00151.php

    The idea is that many competing workers can each get a row to chew on, without ever having to wait for a lock OR retry.

    Thanks
    Thomas Munro

    1. You will definitely be increasing the chances of having to retry as you increase workers, but how well it works in general is going to be workload dependent. The smaller the percentage of time you spend running the first update, the better it will scale. Unfortunately I don't have much I can share on performance metrics outside of the included graph.

      I am very interested in a SKIP LOCKED DATA patch. I think it would be an ideal solution for many cases such as the one in my post. It definitely would seem more intuitive to the end user, as many people already assume similar behavior exists. I will be keeping an eye on it to see how it is received and what progress it makes toward getting committed.

  3. I've written a lot about this in the MySQL world, where SELECT FOR UPDATE is equally evil. Different takes on a similar problem:

    http://www.engineyard.com/blog/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you/

    http://www.percona.com/about-us/mysql-white-paper/mysql-performance-analysis-with-percona-toolkit-and-tcp-ip-network-traffic/

  4. I see that people in the MySQL world have also been known to ask for a SKIP LOCKED [DATA] feature to help with this usage pattern:

    http://bugs.mysql.com/bug.php?id=17426

    http://forge.mysql.com/worklog/task.php?id=3597
