64-bit atomic reads and writes on x86

I recently needed to implement atomic 64-bit reads and writes on 32-bit x86 hardware. This is harder than it seems at first, since the 32-bit x86 instruction set has only one instruction designed for 64-bit atomic operations: cmpxchg8b. This is a compare-and-swap instruction: it atomically compares the value in edx:eax with the memory location and, if they are equal, sets the memory location to ecx:ebx and sets the zero flag. If they differ, it clears the zero flag and loads the memory location into edx:eax. Either way, edx:eax ends up holding the previous contents of the memory location.
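To make those semantics concrete, here is a rough plain-C model of the instruction. It is only a sketch: the function name and parameters are made up for illustration, and unlike the real instruction it is not atomic.

```c
#include <stdint.h>

/* Illustrative model of cmpxchg8b (hypothetical helper, NOT atomic).
   edx_eax models the edx:eax register pair, ecx_ebx models ecx:ebx.
   Returns the resulting zero flag: 1 if the swap happened, 0 if not. */
int model_cmpxchg8b(uint64_t *mem, uint64_t *edx_eax, uint64_t ecx_ebx)
{
    if (*mem == *edx_eax) {
        *mem = ecx_ebx;      /* equal: store ecx:ebx to memory, ZF = 1 */
        return 1;
    }
    *edx_eax = *mem;         /* not equal: load memory into edx:eax, ZF = 0 */
    return 0;
}
```

In both branches the caller's edx_eax ends up equal to what the memory held before the call, which is the property the read and write routines below rely on.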

Here's how an atomic read is implemented using cmpxchg8b:

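__int64 atomicRead64(void* p)
{
   __asm
   {
      mov edi, p
      xor eax, eax          // edx:eax = 0 (value to compare against)
      xor edx, edx
      xor ebx, ebx          // ecx:ebx = 0 (value stored if the compare succeeds)
      xor ecx, ecx
      lock cmpxchg8b [edi]  // edx:eax now holds the contents of *p
   }
   // No return statement: the result is already in edx:eax,
   // which is where a 64-bit return value is expected.
}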

We set edx:eax and ecx:ebx to 0. If the compare succeeds, then we write 0 to the memory location, which changes nothing since the memory already contained zero. Whether the compare succeeds or not, the contents of the memory location end up in edx:eax and are returned from the function. One caveat: the instruction always performs a write cycle, so the memory location must be writable even though we only want to read it.
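For comparison, the same compare-with-zero trick can be sketched in portable C11 using <stdatomic.h>. This is not the code from this post: the function name is made up, and it assumes the compiler turns a 64-bit compare-exchange into a lock cmpxchg8b (or an equivalent) on the target.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch: atomic 64-bit read via compare-and-swap, with both the
   expected and the desired value set to 0 (hypothetical helper). */
int64_t atomic_read64_cas(_Atomic int64_t *p)
{
    int64_t expected = 0;
    /* If *p == 0, this stores 0 again (no visible change). Otherwise the
       exchange fails and copies the current contents of *p into expected. */
    atomic_compare_exchange_strong(p, &expected, 0);
    return expected;
}
```

Either way expected holds the value that was in memory, just as edx:eax does after the instruction.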

And here's an atomic write:

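void atomicWrite64(void* p, __int64 value)
{
   int valueLow = (int)value;
   int valueHigh = (int)(value>>32);
   __asm
   {
      mov edi, p
      mov ebx, valueLow     // ecx:ebx = value to store
      mov ecx, valueHigh
      mov eax, [edi]        // edx:eax = initial guess at the current contents
      mov edx, [edi+4]      // (these two loads need not be atomic)
   tryAgain:
      lock cmpxchg8b [edi]  // on failure, edx:eax is reloaded from *p
      jnz tryAgain          // zero flag clear: compare failed, retry
   }
}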

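Notice that to do a write we must keep trying until the compare succeeds. Each failed compare loads the current contents of the memory location into edx:eax, so the next attempt compares against an up-to-date value.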
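The retry loop can also be sketched in portable C11. Again this is only an illustration under the assumption that the compiler maps the compare-exchange onto cmpxchg8b; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch: atomic 64-bit write as a compare-and-swap retry loop
   (hypothetical helper, mirroring the assembly version above). */
void atomic_write64_cas(_Atomic int64_t *p, int64_t value)
{
    /* Initial guess at the current contents. It need not be exact:
       a wrong guess only costs one extra trip around the loop. */
    int64_t expected = atomic_load_explicit(p, memory_order_relaxed);

    /* On failure, expected is refreshed with the current contents of *p,
       much as cmpxchg8b reloads edx:eax, so we simply try again. */
    while (!atomic_compare_exchange_weak(p, &expected, value)) {
        /* retry */
    }
}
```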

Unfortunately the cmpxchg8b instruction is not very fast when used with the lock prefix. The lock prefix ensures that only this processor is accessing the memory location for the duration of the instruction. It is this which gives cmpxchg8b its atomic nature.

However, there is a faster way to do the write, if we use a trick:

void atomicWrite64(void* p, __int64 value)
{
   __asm
   {
      mov edi, p
      fild qword ptr [value]   // load the 64-bit integer onto the FPU stack
      fistp qword ptr [edi]    // store it as a 64-bit integer and pop the stack
   }
}

This code uses the FPU to write the value atomically. The value is first loaded into a floating point register, which converts it to a floating point number. It is then converted back to an integer and stored to memory with a single 64-bit store. Luckily the FPU registers are all 80 bits wide, with exactly 64 bits for the mantissa, which means they have just enough precision to do the conversions with no loss! Note that the store is only guaranteed to be atomic if the memory location is aligned on an 8-byte boundary, and only on the Pentium and later processors.
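The precision claim is easy to check in C on a compiler whose long double is the 80-bit x87 format. The sketch below uses hypothetical helper names and a double version for contrast; on compilers where long double is narrower than that, the first function simply reports failure.

```c
#include <float.h>
#include <stdint.h>

/* Returns 1 if v survives int64 -> long double -> int64 unchanged.
   Assumes long double has at least a 64-bit mantissa (x87 extended). */
int survives_long_double(int64_t v)
{
#if LDBL_MANT_DIG >= 64
    volatile long double x = (long double)v;  /* analogous to fild  */
    return (int64_t)x == v;                    /* analogous to fistp */
#else
    (void)v;
    return 0;  /* long double is too narrow on this platform */
#endif
}

/* Same check through a double, which has only a 53-bit mantissa. */
int survives_double(int64_t v)
{
    volatile double x = (double)v;
    /* Reject values that rounded up to 2^63, which would overflow int64. */
    if (x >= 9223372036854775808.0) return 0;
    return (int64_t)x == v;
}
```

With a 64-bit mantissa every int64 value round-trips, whereas a double already fails for a value such as 2^53 + 1.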

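Note that the conversion to and from an integer is necessary. If we had attempted to load the 64-bit value as a floating point type directly, with 'fld', then bit patterns that do not represent ordinary numbers (signaling NaNs, for example) could cause floating point exceptions or, with the exception masked, be stored back with different bits.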
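To give an idea of which integers would be affected, the sketch below tests whether a 64-bit pattern would be a signaling NaN if it were interpreted as a double. The function name is made up, and it assumes the x86 convention that the top mantissa bit distinguishes quiet NaNs from signaling ones.

```c
#include <stdint.h>

/* Returns 1 if the 64-bit pattern, viewed as an IEEE 754 double, is a
   signaling NaN: exponent all ones, quiet bit (bit 51) clear, and the
   mantissa nonzero. Hypothetical helper, for illustration only. */
int is_signaling_nan_pattern(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FF;
    uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFULL;
    uint64_t quiet    = bits & 0x0008000000000000ULL;
    return exponent == 0x7FF && mantissa != 0 && quiet == 0;
}
```

An integer such as 0x7FF0000000000001 is perfectly ordinary as an __int64, but falls into this category as a double, which is why the integer load and store instructions are used instead.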