Ruby concurrency: in praise of condition variables

Jul 17, 2017

In a previous post, we talked about the benefits conferred by Ruby mutexes. While a programmer’s familiarity with mutexes is likely to depend on what kind of programs she usually writes, most developers tend to be at least somewhat familiar with these particular synchronization primitives. This article, however, is going to focus on a much lesser-known synchronization construct: the condition variable.

Condition variables are used for putting threads to sleep and waking them back up once a certain condition is met. Don’t worry if this sounds a bit vague; we’ll go into a lot more detail later. As condition variables always need to be used in conjunction with mutexes, we’ll lead with a quick mutex recap. Next, we’ll introduce consumer-producer problems and how to elegantly solve them with the aid of condition variables. Then, we’ll have a look at how to use these synchronization primitives for implementing blocking method calls. Finishing up, we’ll describe some curious condition variable behavior and how to safeguard against it.

A mutex recap

A mutex is a data structure for protecting shared state between multiple threads. When a piece of code is wrapped inside a mutex, the mutex guarantees that only one thread at a time can execute this code. If another thread wants to start executing this code, it’ll have to wait until our first thread is done with it. I realize this may all sound a bit abstract, so now is probably a good time to bring in some example code.

Writing to shared state

In this first example, we’ll have a look at what happens when two threads try to modify the same shared variable. The snippet below shows two methods: counters_with_mutex and counters_without_mutex. Both methods start by creating a zero-initialized counters array before spawning 5 threads. Each thread will perform 100,000 loops, with every iteration incrementing all elements of the counters array by one. Both methods are the same in every way except for one thing: only one of them uses a mutex.

def counters_with_mutex
  mutex = Mutex.new
  counters = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

  5.times.map do
    Thread.new do
      100000.times do
        mutex.synchronize do
          counters.map! { |counter| counter + 1 }
        end
      end
    end
  end.each(&:join)

  counters.inspect
end

def counters_without_mutex
  counters = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

  5.times.map do
    Thread.new do
      100000.times do
        counters.map! { |counter| counter + 1 }
      end
    end
  end.each(&:join)

  counters.inspect
end

puts counters_with_mutex
# => [500000, 500000, 500000, 500000, 500000, 500000, 500000, 500000, 500000, 500000]

puts counters_without_mutex
# => [500000, 447205, 500000, 500000, 500000, 500000, 203656, 500000, 500000, 500000]
# note that we seem to have lost some increments here due to not using a mutex

As you can see, only the method that uses a mutex ends up producing the correct result. The method without a mutex seems to have lost some increments. This is because the lack of a mutex makes it possible for our second thread to interrupt our first thread at any point during its execution. This can lead to some serious problems.

For example, imagine that our first thread has just read the first entry of the counters array, incremented it by one, and is now getting ready to write this incremented value back to our array. However, before our first thread can write this incremented value, it gets interrupted by the second thread. This second thread then goes on to read the current value of the first entry, increments it by one, and succeeds in writing the result back to our counters array. Now we have a problem!

We have a problem because the first thread got interrupted before it had a chance to write its incremented value to the array. When the first thread resumes, it will end up overwriting the value that the second thread just placed in the array. This will cause us to essentially lose an increment operation, which explains why our program output has entries in it that are less than 500,000.

All these problems can be avoided by using a mutex. Remember that a thread executing code wrapped by a mutex cannot be interleaved with another thread wanting to execute this same code. Therefore, our second thread would never have gotten interleaved with the first thread, thereby avoiding the possibility of results getting overwritten.

Reading from shared state

There’s a common misconception that a mutex is only required when writing to a shared variable, and not when reading from it. The snippet below shows 50 threads flipping the boolean values in the flags array over and over again. Many developers think this snippet is without error as the code responsible for changing these values was wrapped inside a mutex. If that were true, then every line of the output of puts flags.to_s should consist of 10 repetitions of either true or false. As we can see below, this is not the case.

mutex = Mutex.new
flags = [false, false, false, false, false, false, false, false, false, false]

threads = 50.times.map do
  Thread.new do
    100000.times do
      # don't do this! Reading from shared state requires a mutex!
      puts flags.to_s

      mutex.synchronize do
        flags.map! { |f| !f }
      end
    end
  end
end
threads.each(&:join)

$ ruby flags.rb > output.log
$ grep 'true, false' output.log | wc -l
    30

What’s happening here is that our mutex only guarantees that no two threads can modify the flags array at the same time. However, it is perfectly possible for one thread to start reading from this array while another thread is busy modifying it, thereby causing the first thread to read an array that contains both true and false entries. Luckily, all of this can be easily avoided by wrapping puts flags.to_s inside our mutex. This will guarantee that only one thread at a time can read from or write to the flags array.

Before moving on, I would just like to mention that even very experienced people have gotten tripped up by not using a mutex when accessing shared state. In fact, at one point there even was a Java design pattern that assumed it was safe to not always use a mutex to do so. Needless to say, this pattern has since been amended.

Consumer-producer problems

With that mutex refresher out of the way, we can now start looking at condition variables. Condition variables are best explained by trying to come up with a practical solution to the consumer-producer problem. In fact, consumer-producer problems are so common that Ruby already has a data structure aimed at solving these: the Queue class. This class uses a condition variable to implement the blocking variant of its shift() method. In this article, we made a conscious decision not to use the Queue class. Instead, we’re going to write everything from scratch with the help of condition variables.

Let’s have a look at the problem that we’re going to solve. Imagine that we have a website where users can generate tasks of varying complexity, e.g. a service that allows users to convert uploaded jpg images to pdf. We can think of these users as producers of a steady stream of tasks of random complexity. These tasks will get stored on a backend server that has several worker processes running on it. Each worker process will grab a task, process it, and then grab the next one. These workers are our task consumers.

With what we know about mutexes, it shouldn’t be too hard to write a piece of code that mimics the above scenario. It’ll probably end up looking something like this.

tasks = []
mutex = Mutex.new
threads = []

class Task
  def initialize
    @duration = rand()
  end

  def execute
    sleep @duration
  end
end

# producer threads
threads += 2.times.map do
  Thread.new do
    while true
      mutex.synchronize do
        tasks << Task.new
        puts "Added task: #{tasks.last.inspect}"
      end
      # limit task production speed
      sleep 0.5
    end
  end
end

# consumer threads
threads += 5.times.map do
  Thread.new do
    while true
      task = nil
      mutex.synchronize do
        if tasks.count > 0
          task = tasks.shift
          puts "Removed task: #{task.inspect}"
        end
      end
      # execute task outside of mutex so we don't unnecessarily
      # block other consumer threads
      task.execute unless task.nil?
    end
  end
end

threads.each(&:join)

The above code should be fairly straightforward. There is a Task class for creating tasks that take between 0 and 1 seconds to run. We have 2 producer threads, each running an endless while loop that safely appends a new task to the tasks array every 0.5 seconds with the help of a mutex. Our 5 consumer threads are also running an endless while loop, each iteration grabbing the mutex so as to safely check the tasks array for available tasks. If a consumer thread finds an available task, it removes the task from the array and starts processing it. Once the task had been processed, the thread moves on to its next iteration, thereby repeating the cycle anew.

While the above implementation seems to work just fine, it is not optimal as it requires all consumer threads to constantly poll the tasks array for available work. This polling does not come for free. The Ruby interpreter has to constantly schedule the consumer threads to run, thereby preempting threads that may have actual important work to do. To give an example, the above code will interleave consumer threads that are executing a task with consumer threads that just want to check for newly available tasks. This can become a real problem when there is a large number of consumer threads and only a few tasks.

If you want to see for yourself just how inefficient this approach is, you only need to modify the original code for consumer threads with the code shown below. This modified program prints well over a thousand lines of This thread has nothing to do for every single line of Removed task. Hopefully, this gives you an indication of the general wastefulness of having consumer threads constantly poll the tasks array.

# modified consumer threads code
threads += 5.times.map do
  Thread.new do
    while true
      task = nil
      mutex.synchronize do
        if tasks.count > 0
          task = tasks.shift
          puts "Removed task: #{task.inspect}"
        else
          puts 'This thread has nothing to do'
        end
      end
      # execute task outside of mutex so we don't unnecessarily
      # block other consumer threads
      task.execute unless task.nil?
    end
  end
end

Condition variables to the rescue

So how we can create a more efficient solution to the consumer-producer problem? That is where condition variables come into play. Condition variables are used for putting threads to sleep and waking them only once a certain condition is met. Remember that our current solution to the producer-consumer problem is far from ideal because consumer threads need to constantly poll for new tasks to arrive. Things would be much more efficient if our consumer threads could go to sleep and be woken up only when a new task has arrived.

Shown below is a solution to the consumer-producer problem that makes use of condition variables. We’ll talk about how this works in a second. For now though, just have a look at the code and perhaps have a go at running it. If you were to run it, you would probably see that This thread has nothing to do does not show up anymore. Our new approach has completely gotten rid of consumer threads busy polling the tasks array.

The use of a condition variable will now cause our consumer threads to wait for a task to be available in the tasks array before proceeding. As a result of this, we can now remove some of the checks we had to have in place in our original consumer code. I’ve added some comments to the code below to help highlight these removals.

tasks = []
mutex = Mutex.new
cond_var = ConditionVariable.new
threads = []

class Task
  def initialize
    @duration = rand()
  end

  def execute
    sleep @duration
  end
end

# producer threads
threads += 2.times.map do
  Thread.new do
    while true
      mutex.synchronize do
        tasks << Task.new
        cond_var.signal
        puts "Added task: #{tasks.last.inspect}"
      end
      # limit task production speed
      sleep 0.5
    end
  end
end

# consumer threads
threads += 5.times.map do
  Thread.new do
    while true
      task = nil
      mutex.synchronize do
        while tasks.empty?
          cond_var.wait(mutex)
        end

        # the `if tasks.count == 0` statement will never be true as the thread
        # will now only reach this line if the tasks array is not empty
        puts 'This thread has nothing to do' if tasks.count == 0

        # similarly, we can now remove the `if tasks.count > 0` check that
        # used to surround this code. We no longer need it as this code will
        # now only get executed if the tasks array is not empty.
        task = tasks.shift
        puts "Removed task: #{task.inspect}"
      end
      # Note that we have now removed `unless task.nil?` from this line as
      # our thread can only arrive here if there is indeed a task available.
      task.execute
    end
  end
end

threads.each(&:join)

Aside from us removing some if statements, our new code is essentially identical to our previous solution. The only exception to this are the five new lines shown below. Don’t worry if some of the accompanying comments don’t quite make sense yet. Now is also a good time to point out that the new code for both the producer and consumer threads was added inside the existing mutex synchronization blocks. Condition variables are not thread-safe and therefore always need to be used in conjunction with a mutex!

# declaring the condition variable
cond_var = ConditionVariable.new

# a producer thread now signals the condition variable
# after adding a new task to the tasks array
cond_var.signal

# a consumer thread now goes to sleep when it sees that
# the tasks array is empty. It can get woken up again
# when a producer thread signals the condition variable.
while tasks.empty?
  cond_var.wait(mutex)
end

Let’s talk about the new code now. We’ll start with the consumer threads snippet. There’s actually so much going on in these three lines that we’ll limit ourselves to covering what cond_var.wait(mutex) does for now. We’ll explain the need for the while tasks.empty? loop later. The first thing to notice about the wait method is the parameter that’s being passed to it. Remember how a condition variable is not thread-safe and therefore should only have its methods called inside a mutex synchronization block? It is that mutex that needs to be passed as a parameter to the wait method.

Calling wait on a condition variable causes two things to happen. First of all, it causes the thread that calls wait to go to sleep. That is to say, the thread will tell the interpreter that it no longer wants to be scheduled. However, this thread still has ownership of the mutex as it’s going to sleep. We need to ensure that the thread relinquishes this mutex because otherwise all other threads waiting for this mutex will be blocked. By passing this mutex to the wait method, the wait method internals will ensure that the mutex gets released as the thread goes to sleep.

Let’s move on to the producer threads. These threads are now calling cond_var.signal. The signal method is pretty straightforward in that it wakes up exactly one of the threads that were put to sleep by the wait method. This newly awoken thread will indicate to the interpreter that it is ready to start getting scheduled again and then wait for its turn.

So what code does our newly awoken thread start executing once it gets scheduled again? It starts executing from where it left off. Essentially, a newly awoken thread will return from its call to cond_var.wait(mutex) and resume from there. Personally, I like to think of calling wait as creating a save point inside a thread from which work can resume once the thread gets woken up and rescheduled again. Please note that since the thread wants to resume from where it originally left off, it’ll need to reacquire the mutex in order to get scheduled. This mutex reacquisition is very important, so be sure to remember it.

This segues nicely into why we need to use while tasks.empty? when calling wait in a consumer thread. When our newly awoken thread resumes execution by returning from cond_var.wait, the first thing it’ll do is complete its previously interrupted iteration through the while loop, thereby evaluating while tasks.empty? again. This actually causes us to neatly avoid a possible race condition.

Let’s say we don’t use a while loop and use an if statement instead. The resulting code would then look like shown below. Unfortunately, there is a very hard to find problem with this code. Note how we now need to re-add the previously removed if tasks.count > 0 and unless task.nil? statements to our code below in order to ensure its safe execution.

# consumer threads
threads += 5.times.map do
  Thread.new do
    while true
      task = nil
      mutex.synchronize do
        cond_var.wait(mutex) if tasks.empty?

        # using `if tasks.empty?` forces us to once again add this
        # `if tasks.count > 0` check. We need this check to protect
        # ourselves against a nasty race condition.
        if tasks.count > 0
          task = tasks.shift
          puts "Removed task: #{task.inspect}"
        else
          puts 'This thread has nothing to do'
        end
      end
      # using `if tasks.empty?` forces us to re-add `unless task.nil?`
      # in order to safeguard ourselves against a now newly introduced
      # race condition
      task.execute unless task.nil?
    end
  end
end

Imagine a scenario where we have:

two producer threads
one consumer thread that’s awake
four consumer threads that are asleep

A consumer thread that’s awake will go back to sleep only when there are no more tasks in the tasks array. That is to say, a single consumer thread will keep processing tasks until no more tasks are available. Now, let’s say one of our producer threads adds a new task to the currently empty tasks array before calling cond_var.signal at roughly the same time as our active consumer thread is finishing its current task. This signal call will awaken one of our sleeping consumer threads, which will then try to get itself scheduled. This is where a race condition is likely to happen!

We’re now in a position where two consumer threads are competing for ownership of the mutex in order to get scheduled. Let’s say our first consumer thread wins this competition. This thread will now go and grab the task from the tasks array before relinquishing the mutex. Our second consumer thread then grabs the mutex and gets to run. However, as the tasks array is empty now, there is nothing for this second consumer thread to work on. So this second consumer thread now has to do an entire iteration of its while true loop for no real purpose at all.

We now find ourselves in a situation where a complete iteration of the while true loop can occur even when the tasks array is empty. This is a not unlike the position we were in when our program was just busy polling the tasks array. Sure, our current program will be more efficient than busy polling, but we will still need to safeguard our code against the possibility of an iteration occurring when there is no task available. This is why we needed to re-add the if tasks.count > 0 and unless task.nil? statements. Especially the latter of these two is important, as otherwise our program might crash with a NilException.

Luckily, we can safely get rid of these easily overlooked safeguards by forcing each newly awakened consumer thread to check for available tasks and having it put itself to sleep again if no tasks are available. This behavior can be accomplished by replacing the if tasks.empty? statement with a while tasks.empty? loop. If tasks are available, a newly awoken thread will exit the loop and execute the rest of its code. However, if no tasks are found, then the loop is repeated, thereby causing the thread to put itself to sleep again by executing cond_var.wait. We’ll see in a later section that there is yet another benefit to using this while loop.

Building our own Queue class

At the beginning of a previous section, we touched on how condition variables are used by the Queue class to implement blocking behavior. The previous section taught us enough about condition variables for us to go and implement a basic Queue class ourselves. We’re going to create a thread-safe SimpleQueue class that is capable of:

having data appended to it with the << operator
having data retrieved from it with a non-blocking shift method
having data retrieved from it with a blocking shift method

It’s easy enough to write code that meets these first two criteria. It will probably end up looking something like the code shown below. Note that our SimpleQueue class is using a mutex as we want this class to be thread-safe, just like the original Queue class.

class SimpleQueue
  def initialize
    @elems = []
    @mutex = Mutex.new
  end

  def <<(elem)
    @mutex.synchronize do
      @elems << elem
    end
  end

  def shift(blocking = true)
    @mutex.synchronize do
      if blocking
        raise 'yet to be implemented'
      end
      @elems.shift
    end
  end
end

simple_queue = SimpleQueue.new
simple_queue << 'foo'

simple_queue.shift(false)
# => "foo"

simple_queue.shift(false)
# => nil

Now let’s have a look at what’s needed to implement the blocking shift behavior. As it turns out, this is actually very easy. We only want the thread to block if the shift method is called when the @elems array is empty. This is all the information we need to determine where we need to place our condition variable’s call to wait. Similarly, we want the thread to stop blocking once the << operator appends a new element, thereby causing @elems to no longer be empty. This tells us exactly where we need to place our call to signal.

In the end, we just need to create a condition variable that makes the thread go to sleep when a blocking shift is called on an empty SimpleQueue. Likewise, the << operator just needs to signal the condition variable when a new element is added, thereby causing the sleeping thread to be woken up. The takeaway from this is that blocking methods work by causing their calling thread to fall asleep. Also, please note that the call to @cond_var.wait takes place inside a while @elems.empty? loop. Always use a while loop when calling wait on a condition variable! Never use an if statement!

class SimpleQueue
  def initialize
    @elems = []
    @mutex = Mutex.new
    @cond_var = ConditionVariable.new
  end

  def <<(elem)
    @mutex.synchronize do
      @elems << elem
      @cond_var.signal
    end
  end

  def shift(blocking = true)
    @mutex.synchronize do
      if blocking
        while @elems.empty?
          @cond_var.wait(@mutex)
        end
      end
      @elems.shift
    end
  end
end

simple_queue = SimpleQueue.new

# this will print "blocking shift returned with: foo" after 5 seconds
# that is to say, the first thread will go to sleep until the second
# thread adds an element to the queue, thereby causing the first thread
# to be woken up again
threads = []
threads << Thread.new { puts "blocking shift returned with: #{simple_queue.shift}" }
threads << Thread.new { sleep 5; simple_queue << 'foo' }
threads.each(&:join)

One thing to point out in the above code is that @cond_var.signal can get called even when there are no sleeping threads around. This is a perfectly okay thing to do. In these types of scenarios calling @cond_var.signal will just do nothing.

Spurious wakeups

A “spurious wakeup” refers to a sleeping thread getting woken up without any signal call having been made. This is an impossible to avoid edge-case in condition variables. It’s important to point out that this is not being caused by a bug in the Ruby interpreter or anything like that. Instead, the designers of the threading libraries used by your OS found that allowing for the occasional spurious wakeup greatly improves the speed of condition variable operations. As such, any code that uses condition variables needs to take spurious wakeups into account.

So does this mean that we need to rewrite all the code that we’ve written in this article in an attempt to make it resistant to possible bugs introduced by spurious wakeups? You’ll be glad to know that this isn’t the case as all code snippets in this article have always wrapped the cond_var.wait statement inside a while loop!

We covered earlier how using a while loop makes our code more efficient when dealing with certain race conditions as it causes a newly awakened thread to check whether there is actually anything to do for it, and if not, the thread goes back to sleep. This same while loop helps us deal with spurious wakeups as well.

When a thread gets woken up by a spurious wakeup and there is nothing for it to do, our usage of a while loop will cause the thread to detect this and go back to sleep. From the thread’s point of view, being awakened by a spurious wakeup isn’t any different than being woken up with no available tasks to do. So the same mechanism that helps us deal with race conditions solves our spurious wakeup problem as well. It should be obvious by now that while loops play a very important role when working with condition variables.

Conclusion

Ruby’s condition variables are somewhat notorious for their poor documentation. That’s a shame, because they are wonderful data structures for efficiently solving a very specific set of problems. Although, as we’ve seen, using them isn’t without pitfalls. I hope that this post will go some way towards making them (and their pitfalls) a bit better understood in the wider Ruby community.

I also feel like I should point out that while everything mentioned above is correct to the best of my knowledge, I’m unable to guarantee that absolutely no mistakes snuck in while writing this. As always, please feel free to contact me if you think I got anything wrong, or even if you just want to say hello.

vaneyckt