Memory ordering, barriers, acquire and release semantics

The problem

Multithreading is difficult, and multithreading without locks is even more difficult. This post is about one of the problems: out-of-order memory accesses.

Here's an example of the unexpected behaviour caused by out-of-order processors (assume variables are initialized to 0 in all the following examples):

Example 1
Thread 1              Thread 2
a = 1                 x = b
b = 2                 y = a

After running this code, if x is 2, is y guaranteed to be 1? Unfortunately the answer is no: x==2 and y==0 is a possible result. This can happen because the writes in thread 1 can be re-ordered:

Example 1, re-ordered execution
Thread 1              Thread 2
b = 2                 ...
...                   x = b
...                   y = a
a = 1                 ...

Here's another example:

Example 2
Thread 1              Thread 2
x = 1                 y = 1
a = y                 b = x

We would assume that after running this code, either a or b must be set to 1. Again, this is not the case, as the instructions can be re-ordered:

Example 2, re-ordered execution
Thread 1              Thread 2
a = y                 b = x
x = 1                 y = 1

What on earth is actually going on here?

So why are we seeing all these instructions being re-ordered? We don't see this when writing a single-threaded program, so why does it start happening in multithreaded code? The answer is that it always happens - it's just that the processor is nice enough to hide it from us when we're running in a single thread of execution. Processors have separate caches, memory writes can be buffered, and instructions can be executed out of order, all of which affect when the memory accesses actually happen. The processor hides this from single-threaded code using careful cache-coherency protocols, tracking of dependent instructions, and so on.

For these reasons, memory ordering is not an issue when running multiple threads on a single core; these problems only show up on machines with multiple processors or multicore processors. (This is not totally true - the compiler can re-order instructions as well, which can break things even on single-core machines. See below.)

Traditional locking methods for writing multithreaded applications handle this automatically. The locks include the necessary memory barriers for the platform (see acquire and release semantics below) to ensure that memory appears consistent to the other processors. Sometimes programmers writing multithreaded code try to be clever and avoid locking when doing something simple like setting a flag:

Example 3
Thread 1              Thread 2
a = 100               while( !isReady ) { }  //wait for flag
isReady = true        x = a                  //do something with 'a'

And this code will then fail horribly when run on a multicore CPU. I've written code like this many times, before I knew better. When writing lock-based code, the rule is simple - always protect shared data with a lock. Trying to be cute is a bad idea.

Compiler re-ordering

Unfortunately it's not just the processor which can re-order memory accesses. The compiler will also re-order instructions as part of its optimization passes. This means we can still see memory ordering issues on single-core machines, and on machines which would otherwise have strong memory consistency guarantees (e.g. x86). Compiler intrinsics such as _ReadWriteBarrier can be used to stop the compiler from re-ordering instructions, and the volatile keyword can have a similar effect. Sometimes the barrier is implicit: most compilers treat an inline assembly block as a compiler barrier, so a memory barrier implemented as an inline assembly block containing the fence opcode also acts as a compiler barrier.

Acquire and release semantics

A refinement of the full memory barrier model is acquire and release semantics. This is often referred to as 'load-with-acquire' and 'store-with-release'. The naming comes from the types of barriers needed when implementing locks. Acquiring a lock is always associated with a load instruction, to confirm that the lock has been successfully acquired (often a compare-and-swap). Releasing a lock is likewise always associated with a store instruction.

The fences required for a lock to be consistent must ensure that all memory accesses between the acquire and the release of the lock stay inside the lock: they cannot be moved before the acquire or after the release. Moving instructions from outside the lock to the inside is allowed, however, as this does not affect the lock's correctness. Therefore a load-with-acquire is a one-way fence which allows movement downwards but not upwards, and a store-with-release is a one-way fence which allows instructions to move upwards but not downwards.

In C# and Microsoft's Visual C++ 2005, volatile reads have acquire semantics and volatile writes have release semantics. This actually makes volatile useful again in C++ (see the first reference below for why it is otherwise nearly useless). However, it looks like the C++ standard will take a different approach: C++0x defines atomic types which have acquire and release semantics.

References

  1. Volatile: Almost Useless for Multi-Threaded Programming
    A discussion of why the C++ volatile keyword is insufficient for multi-threaded programs, and why memory fences are needed instead.
  2. Memory Ordering in Modern Microprocessors, Part 1 and Part 2
    Explanation of re-ordering and memory barriers, from a Linux viewpoint. Part 2 describes the memory models of many processor types.
  3. Blog post about the .NET memory model
    Explains acquire/release, and how the .NET memory model compares to some real processors. Somewhat out of date now, the .NET framework 2.0 has a stronger model.
  4. Lockless Programming Considerations for Xbox 360 and Microsoft Windows
    Good article on lockless programming, with comparisons between Xbox 360 and x86.
  5. Trip Report: Ad-Hoc Meeting on Threads in C++
    The memory model and threading library for the future C++ standard.
  6. Understand the Impact of Low-Lock Techniques in Multithreaded Apps
    Good article from Vance Morrison explaining memory models, also in relation to the .NET memory model. See also an older blog post from him.
  7. Blog post from Kang Su
    Explains why explicit memory fence instructions are not usually needed on x86.