multithreading

Peterson's Algorithm

Peterson's Algorithm is a simple mutual exclusion algorithm for two threads. It uses only reads and writes, so I figured that if there were only two threads, maybe it would be faster than using a regular lock with a compare-and-swap or other interlocked memory operations.

class PetersonLock
{
public:
   PetersonLock()
   {
      m_flag[0] = m_flag[1] = 0;
   }
   void lock(int id)
   {
      assert((id==0) || (id==1));
      int otherId = 1 - id;

      m_flag[id] = true;
      m_victim = id;
      while( m_flag[otherId] && (m_victim == id) )      { }
   }
   void unlock(int id)
   {
      assert((id==0) || (id==1));
      m_flag[id] = false;
   }
private:
   Atomic<bool> m_flag[2];
   Atomic<int> m_victim;
};

The Atomic and Atomic classes implement reads and writes with acquire and release semantics respectively, which I thought would give the correct memory ordering. But it turns out that this is insufficient, as in the middle of the lock function we have this sequence:

   m_victim = id;      //store with release
   while( m_flag[otherId] ... )      //load with acquire

There is nothing to prevent these being re-ordered, doing the load before the store does not violate the acquire and release semantics. And it turns out that this does happen, even on the relatively strong memory model of x86 processors (stores are ordered with respect to other stores, and loads are ordered with respect to other loads, but unfortunately there is no ordering between loads and stores) which breaks the mutual exclusion property of the lock.

To fix it we need to add a memory fence between these instructions, which is just as slow as an interlocked instruction, so there goes any speed advantage.

Memory ordering, barriers, acquire and release semantics

The problem

Multithreading is difficult, and multithreading without locks is even more difficult. This post is about one of the problems: out-of-order memory accesses.

Here's an example of the unexpected behaviour caused by out-of-order processors (assume variables are initialized to 0 in all the following examples):

Example 1
Thread 1 Thread 2
a = 1
b = 2
x = b
y = a

After running this code, if x is 2, then is y guaranteed to be 1? Unfortunately the answer is no, x==2 and y==0 is a possible result. This can happen because the writes in thread 1 can be re-ordered:

Example 1, re-ordered execution
Thread 1 Thread 2
b = 2
...
...
a = 1
...
x = b
y = a
...

Here's another example:

Example 2
Thread 1 Thread 2
x = 1
a = y
y = 1
b = x

We would assume that after running this code then either a or b must be set to 1. Again, this is not the case, as the instructions can be re-ordered:

Example 2, re-ordered execution
Thread 1 Thread 2
a = y
x = 1
b = x
y = 1

What on earth is actually going on here?

So why are we seeing all these instructions being re-ordered? We don't see this when writing a program without multithreading, so why does it start happening in multithreaded code? The answer is that it always happens - it's just that the processor is nice enough to hide it from us when we're running in a single thread of execution. Processors have separate caches, memory writes can be buffered, the processor can re-order instructions, all of which will affect when the memory accesses actually happen. The processor hides this from single threaded code using careful cache protocols, tracking dependent instructions, etc.

For these reasons, memory ordering is not an issue when running multiple threads on a single core, and these problems only show up when running on a machine with multiple processors or multicore processors. (Not totally true, the compiler can re-order stuff too, which can break it on single core machines too, see below).

Traditional locking methods for writing multithreaded applications handle this automatically. The locks will include the necessary memory barriers for the platform (see acquire and release semantics below) to ensure that memory appears consistent to the other processors. Sometimes programmers writing multithreaded code try to be clever, and avoid locking when doing something simple like just setting a flag:

Example 3
Thread 1 Thread 2
a = 100
isReady = true
while( !isReady ) { } //wait for flag
x = a //do something with 'a'

And this code will then fail horribly when ran on a multicore CPU. I've written code like this many times, before I knew better. When writing lock based code, the rule is simple - always protect shared data using a lock. Trying to be cute is a bad idea.

Compiler re-ordering

Unfortunately it's not just the processor which can re-order memory accesses. The compiler will also re-order instructions as part of its optimizing pass. This means we can still see memory ordering issues on single core machines, and machines which would otherwise have strong memory consistency guarantees (e.g. x86). Compiler intrinsics such as _ReadWriteBarrier can be used to block the compiler from re-ordering instructions, or the volatile keyword can have the same effect. Sometimes the barrier can be implicit - an inline assembly block will also function as a barrier on most compilers, so the implementation of memory barriers using an inline assembly block for the barrier opcode will also function as a compiler barrier.

Acquire and release semantics

A refinement of the full memory barrier model is acquire and release semantics. This is often referred to as 'load-with-acquire' and 'store-with-release'. The naming comes from the types of barriers needed when implementing locks. Acquiring a lock is always associated with a load instruction, to confirm that the lock has been successfully acquired (often a compare-and-swap). Releasing a lock is likewise always associated with a store instruction.

The fences required for a lock to be consistent must ensure that all memory accesses between the acquire and release of the lock must remain inside the lock, they can not be moved either before the acquire or after the release. It is allowed to move instructions outside the lock to the inside however, this will not affect the lock property. Therefore a load-with-acquire is a one-way fence which allows movement downwards but not upwards, and a store-with-release is a fence which allows instructions to move upward but not downward.

In C# and Microsofts Visual C++ 2005, volatile reads have acquire semantics and volatile writes have release semantics. This actually makes volatile useful again in C++ (see first reference below for why it is currently useless). However it looks like the C++ standard will take a different approach, C++0x has defined atomic types which will have acquire and release semantics.

References

  1. Volatile: Almost Useless for Multi-Threaded Programming
    A discussion of why the C++ volatile keyword is insufficient for multi-threaded programs, and why memory fences are needed instead.
  2. Memory Ordering in Modern Microprocessors, Part 1 and Part 2
    Explanation of re-ordering and memory barriers, from a Linux viewpoint. Part 2 describes the memory models of many processor types.
  3. Blog post about the .NET memory model
    Explains acquire/release, and how the .NET memory model compares to some real processors. Somewhat out of date now, the .NET framework 2.0 has a stronger model.
  4. Lockless Programming Considerations for Xbox 360 and Microsoft Windows
    Good article on lockless programming, with comparisons between Xbox 360 and x86.
  5. Trip Report: Ad-Hoc Meeting on Threads in C++
    The memory model and threading library for the future C++ standard.
  6. Understand the Impact of Low-Lock Techniques in Multithreaded Apps
    Good article from Vance Morrison explaining memory models, also in relation to the .NET memory model. See also an older blog post from him.
  7. Blog post from Kang Su
    Explains why explicit memory fence instructions are not usually needed on x86.

64-bit atomic reads and writes on x86

I recently needed to implement a 64 bit read and write on x86 hardware. This is harder than it seems at first, since the 32-bit x86 instruction set only has one instruction designed for doing 64 bit atomic operations: cmpxchg8b. This is a compare and swap instruction, it atomically compares the value in edx:eax to the memory location, and if equal sets the memory location to ecx:ebx. The previous contents of the memory location are returned in edx:eax.

Here's how an atomic read is implemented using cmpxchg8b:

__int64 atomicRead64(void* p)
{
   __asm
   {
      mov edi, p
      xor eax, eax
      xor edx, edx
      xor ebx, ebx
      xor ecx, ecx
      lock cmpxchg8b [edi]
   }
}

We set edx:eax and ecx:ebx to 0. If the compare succeeds, then we write 0 to the memory location, which actually does nothing since the memory already contained zero. Whether the compare succeeds or not, the contents of the memory location are stored in edx:eax and returned from the function.

And here's an atomic write:

void atomicWrite64(void* p, __int64 value)
{
   int valueLow = (int)value;
   int valueHigh = (int)(value>>32);
   __asm
   {
      mov edi, p
      mov ebx, valueLow
      mov ecx, valueHigh
      mov eax, [edi]
      mov edx, [edi+4]
   tryAgain:
      lock cmpxchg8b [edi]
      jnz tryAgain
   }
}

Notice that to do a write we must keep trying until the compare succeeds.

Unfortunately the cmpxchg8b instruction is not very fast, due to the (implicit) lock prefix. The lock prefix ensures that only this processor is accessing the memory location for the duration of the instruction. It is this which gives the cmpxchg8b it's atomic nature.

However, there is an even better way, if we use a trick:

void atomicWrite64(void* p, __int64 value)
{
   __asm
   {
      mov edi, p
      fild qword ptr [value]
      fistp qword ptr [edi]
   }
}

This code uses the FPU to write the value atomically. The value is first converted to a floating point number and read into a floating point register. It is then converted back to an integer and stored to memory. Luckily the FPU registers are all 80-bit internally, with exactly 64 bits for the mantissa, which means they have just enough precision to do the conversions with no loss!

Note that the conversion to and from an integer is necessary. If we had attempted to read the 64 bit value and as a floating point type directly, with 'fld', then invalid representations could cause floating point exceptions.

Syndicate content