Peterson's Algorithm

Peterson's Algorithm is a simple mutual exclusion algorithm for two threads. It uses only reads and writes, so I figured that if there were only two threads, maybe it would be faster than using a regular lock with a compare-and-swap or other interlocked memory operations.

class PetersonLock
{
public:
   PetersonLock()
   {
      m_flag[0] = m_flag[1] = 0;
   }
   void lock(int id)
   {
      assert((id==0) || (id==1));
      int otherId = 1 - id;

      m_flag[id] = true;
      m_victim = id;
      while( m_flag[otherId] && (m_victim == id) )      { }
   }
   void unlock(int id)
   {
      assert((id==0) || (id==1));
      m_flag[id] = false;
   }
private:
   Atomic<bool> m_flag[2];
   Atomic<int> m_victim;
};

The Atomic and Atomic classes implement reads and writes with acquire and release semantics respectively, which I thought would give the correct memory ordering. But it turns out that this is insufficient, as in the middle of the lock function we have this sequence:

   m_victim = id;      //store with release
   while( m_flag[otherId] ... )      //load with acquire

There is nothing to prevent these being re-ordered, doing the load before the store does not violate the acquire and release semantics. And it turns out that this does happen, even on the relatively strong memory model of x86 processors (stores are ordered with respect to other stores, and loads are ordered with respect to other loads, but unfortunately there is no ordering between loads and stores) which breaks the mutual exclusion property of the lock.

To fix it we need to add a memory fence between these instructions, which is just as slow as an interlocked instruction, so there goes any speed advantage.