64-bit atomic reads and writes on x86

I recently needed to implement atomic 64-bit reads and writes on 32-bit x86 hardware. This is harder than it first seems, since the 32-bit x86 instruction set has only one instruction designed for 64-bit atomic operations: cmpxchg8b. This is a compare-and-swap instruction: it atomically compares the value in edx:eax with the memory location and, if they are equal, sets the memory location to ecx:ebx. If they differ, the current contents of the memory location are loaded into edx:eax. Either way, edx:eax ends up holding the value that was in memory before the instruction.

Here's how an atomic read is implemented using cmpxchg8b:

__int64 atomicRead64(void* p)
{
   __asm
   {
      mov edi, p
      xor eax, eax
      xor edx, edx
      xor ebx, ebx
      xor ecx, ecx
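      lock cmpxchg8b [edi]   // edx:eax now holds the 64-bit value at [edi]
   }
   // no return statement: MSVC returns an __int64 in edx:eax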
}

We set edx:eax and ecx:ebx to 0. If the compare succeeds, then we write 0 to the memory location, which actually does nothing since the memory already contained zero. Whether the compare succeeds or not, the contents of the memory location are stored in edx:eax and returned from the function.
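
For readers without MSVC inline assembly, the same idea can be sketched with std::atomic. This is only an illustration of the compare-with-zero trick, not the code above: the function name atomicReadViaCas is made up for this example, and std::atomic would normally give you a plain load() anyway.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical portable counterpart of atomicRead64: read a 64-bit value
// using a compare-and-swap whose expected and desired values are both 0.
std::int64_t atomicReadViaCas(std::atomic<std::int64_t>& target)
{
    std::int64_t expected = 0;  // plays the role of edx:eax
    // If target holds 0, the CAS succeeds and stores 0 again (no visible change).
    // Otherwise the CAS fails and copies the current value into expected.
    target.compare_exchange_strong(expected, 0);
    return expected;            // either way, the value that was in memory
}
```

As with the assembly version, the return value is correct whether or not the compare succeeds.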

And here's an atomic write:

void atomicWrite64(void* p, __int64 value)
{
   int valueLow = (int)value;
   int valueHigh = (int)(value>>32);
   __asm
   {
      mov edi, p
      mov ebx, valueLow
      mov ecx, valueHigh
      mov eax, [edi]
      mov edx, [edi+4]
   tryAgain:
      lock cmpxchg8b [edi]
      jnz tryAgain
   }
}

Notice that to do a write we must keep trying until the compare succeeds.
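
The retry loop can be sketched portably as well. The name atomicWriteViaCas is an assumption for this example. One difference from the assembly: the initial guess here is itself read atomically, whereas the assembly reads it as two 32-bit halves and relies on the compare to reject a torn guess.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical portable counterpart of the cmpxchg8b-based atomicWrite64.
void atomicWriteViaCas(std::atomic<std::int64_t>& target, std::int64_t value)
{
    // First guess at the current contents. A stale guess is harmless:
    // the compare fails and we simply try again.
    std::int64_t expected = target.load(std::memory_order_relaxed);

    // On failure, compare_exchange_weak refreshes expected with the value
    // it found, just as cmpxchg8b reloads edx:eax.
    while (!target.compare_exchange_weak(expected, value)) {
        // retry with the refreshed guess
    }
}
```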

Unfortunately the cmpxchg8b instruction is not very fast, due to the lock prefix. The lock prefix ensures that only this processor can access the memory location for the duration of the instruction. It is this which gives cmpxchg8b its atomic nature.

However, there is an even better way, if we use a trick:

void atomicWrite64(void* p, __int64 value)
{
   __asm
   {
      mov edi, p
      fild qword ptr [value]
      fistp qword ptr [edi]
   }
}

This code uses the FPU to write the value atomically. The fild instruction reads the 64-bit integer into a floating point register, converting it to a floating point number on the way. The fistp instruction then converts it back to an integer and stores it to memory with a single 64-bit write. Luckily the FPU registers are all 80-bit internally, with exactly 64 bits for the mantissa, which means they have just enough precision to do the conversions with no loss!

Note that the conversion to and from an integer is necessary. If we had instead loaded the 64-bit value directly as a floating point type with 'fld', bit patterns that are not valid floating point values could cause floating point exceptions.
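
The "just enough precision" claim can be checked with a small sketch. It assumes long double maps to the 80-bit x87 format (true for GCC and Clang on x86, but not for MSVC, where long double is a 64-bit double), so the code exposes a flag for that assumption. The helper names are made up for this example.

```cpp
#include <cstdint>
#include <limits>

// True when long double has at least a 64-bit significand, as the 80-bit
// x87 format does. This depends on the compiler and target.
constexpr bool kLongDoubleHas64BitSignificand =
    std::numeric_limits<long double>::digits >= 64;

// Round-trips v through long double. Only meaningful (and only safe for
// values near the int64 limits) when the flag above is true.
bool survivesLongDouble(std::int64_t v)
{
    return static_cast<std::int64_t>(static_cast<long double>(v)) == v;
}

// Round-trips v through a 64-bit double, which has a 53-bit significand.
// Keep |v| well below 2^63 so the conversion back stays in range.
bool survivesDouble(std::int64_t v)
{
    return static_cast<std::int64_t>(static_cast<double>(v)) == v;
}
```

For instance, 2^53 + 1 does not survive a trip through double, which is why a 64-bit floating point load and store would not be a safe substitute for the fild/fistp pair.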

Doesn't alignment guarantee atomicity?

I was just reading this passage from the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, section 8.1.1: Guaranteed Atomic Operations.

Basically it says that the atomicity rules have evolved with each processor generation. For the above example of 64-bit atomic reads and writes, the Pentium and all later processors perform quadword (64-bit) memory operations atomically if the address is aligned on a 64-bit boundary.

So if you're running on Pentium or newer processors, just make sure the location is 64-bit aligned. I would assume the same goes for atomic increments, etc...

Yes, but you still need to use an instruction which operates on a 64-bit value. If you just read or write a 64-bit aligned variable in C or C++, a 32-bit compiler will generate two 32-bit read/write instructions, so the access is not atomic. The only instructions operating on 64-bit values in IA-32 are the ones mentioned in the post: cmpxchg8b, the FPU instructions, and possibly MMX/SSE instructions.
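
With a C++11 compiler, one way to get both the alignment and a genuine 64-bit access is to let std::atomic choose the instruction. This is a sketch: the SharedCounter type and helper names are invented for the example, and which instruction is actually emitted on 32-bit x86 is up to the compiler.

```cpp
#include <atomic>
#include <cstdint>

// std::atomic aligns the object as required and makes the compiler access
// it with one atomic operation instead of two separate 32-bit moves.
struct SharedCounter {
    std::atomic<std::int64_t> value{0};
};

std::int64_t readCounter(const SharedCounter& c)
{
    return c.value.load();
}

void writeCounter(SharedCounter& c, std::int64_t v)
{
    c.value.store(v);
}

// Reports whether the implementation uses real atomic instructions here
// rather than falling back to an internal lock.
bool counterIsLockFree(const SharedCounter& c)
{
    return c.value.is_lock_free();
}
```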

Can these routines be used for atomic increment/decrement?

Hi,

I came across this page after searching for a way to do 64-bit atomic operations when running on x86. Can I use atomicWrite64() to increment/decrement an __int64?

Also, will the routines work for unsigned __int64 (assuming I create new functions)?

For example:

__int64 counter = 0;
cout << atomicRead64(&counter); // outputs 0
atomicWrite64(&counter, counter + 1); // increment counter by 1
cout << atomicRead64(&counter); // outputs 1
atomicWrite64(&counter, counter - 1); // decrement counter by 1
cout << atomicRead64(&counter); // outputs 0
atomicWrite64(&counter, counter + 55); // increment counter by 55
cout << atomicRead64(&counter); // outputs 55
atomicWrite64(&counter, counter - 32); // decrement counter by 32
cout << atomicRead64(&counter); // outputs 23

Alternatively, I came across the following code snippet for incrementing a 64-bit value:

//most common use of InterlockedCompareExchange
//It's more efficient to use the z flag than to do another compare
#pragma warning(disable:4035) //no return statement: the result is left in al
inline bool
InterlockedSetIfEqual(volatile unsigned __int64 *dest
                      ,unsigned __int64 exchange
                      ,unsigned __int64 comperand)
{
    //value returned in eax
    __asm {
        lea esi,comperand;
        lea edi,exchange;
       
        mov eax,[esi];
        mov edx,4[esi];
        mov ebx,[edi];
        mov ecx,4[edi];
        mov esi,dest;
        //lock CMPXCHG8B [esi] is equivalent to the following except
        //that it's atomic:
        //ZeroFlag = (edx:eax == *esi);
        //if (ZeroFlag) *esi = ecx:ebx;
        //else edx:eax = *esi;
        lock CMPXCHG8B [esi];           
        mov eax,0;
        setz al;
    }
}
#pragma warning(default:4035)

inline unsigned __int64 InterlockedIncrement(volatile unsigned __int64 * ptr)
{
    unsigned __int64 comperand;
    unsigned __int64 exchange;
    do {
        comperand = *ptr;
        exchange = comperand+1;
    }while(!InterlockedSetIfEqual(ptr,exchange,comperand));
    return exchange;
}

In your first example, the operations are no longer atomic since you have broken them into separate read and write parts. If these operations were running on different threads then the final value would no longer be guaranteed to be 23.
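
To make the problem concrete, here is a deterministic, single-threaded replay of one bad interleaving of two threads that each try to increment the counter using a separate read and write. It is a sketch for illustration only: plain variables stand in for the atomicRead64 and atomicWrite64 calls, and the function name is made up.

```cpp
#include <cstdint>

// Replays this interleaving of two "threads" A and B:
//   A reads, B reads, A writes, B writes.
std::int64_t replayLostUpdate()
{
    std::int64_t counter = 0;

    std::int64_t seenByA = counter;  // A: atomicRead64(&counter)
    std::int64_t seenByB = counter;  // B: atomicRead64(&counter), before A has written

    counter = seenByA + 1;           // A: atomicWrite64(&counter, seenByA + 1)
    counter = seenByB + 1;           // B: overwrites A's update with the same value

    return counter;                  // 1, although two increments were performed
}
```

Each individual read and write is atomic, but the increment as a whole is not, so one update is lost.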

The InterlockedIncrement snippet you found shows the correct way to do an atomic increment on x86. It uses a cmpxchg8b instruction and retries until it is successful.

Here's a more general version, a 64-bit atomic add:

inline __int64 getAndAdd64(void* p, __int64 addValue)
{
   int addValueLow = static_cast<int>(addValue);
   int addValueHigh = static_cast<int>(addValue>>32);
   __asm
   {
      mov edi, p
      mov eax, [edi]      //read current value non-atomically here...
      mov edx, [edi+4]    //  it's just a guess, if it's wrong we'll try again
   tryAgain:
      mov ebx, addValueLow
      mov ecx, addValueHigh
      add ebx, eax
      adc ecx, edx
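      lock cmpxchg8b qword ptr [edi]
      jnz tryAgain        //on failure edx:eax holds the fresh value, so recompute the sum
   }
   //no return statement: edx:eax holds the previous value, which MSVC returns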
}
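
For compilers without MSVC inline assembly, the same retry loop can be sketched with std::atomic. The names getAndAddViaCas and wrappingAdd are assumptions for this example. In practice std::atomic's own fetch_add does the same job, so this is only meant to show the structure of the loop.

```cpp
#include <atomic>
#include <cstdint>

// Adds with wrap-around, like the add/adc pair, by going through unsigned
// arithmetic (signed overflow would be undefined behaviour in C++).
std::int64_t wrappingAdd(std::int64_t a, std::int64_t b)
{
    return static_cast<std::int64_t>(
        static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b));
}

// Hypothetical portable counterpart of getAndAdd64: atomically adds
// addValue to target and returns the value target held before the add.
std::int64_t getAndAddViaCas(std::atomic<std::int64_t>& target,
                             std::int64_t addValue)
{
    // Just a guess at the current value; if it's wrong we'll try again.
    std::int64_t expected = target.load(std::memory_order_relaxed);

    // On failure expected is refreshed, so the sum is recomputed from the
    // value actually found in memory.
    while (!target.compare_exchange_weak(expected,
                                         wrappingAdd(expected, addValue))) {
        // retry
    }
    return expected;
}
```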

Atomic increment/decrement of 64-bit counters on a 32-bit system

Is there any way to atomically increment and decrement a 64-bit counter on a 32-bit system (Windows/Linux)?

Yes, see my reply to Richard B for an example. But don't forget you can also use InterlockedIncrement64 on Windows, and GCC has builtin functions such as __sync_add_and_fetch which can be used on Linux.
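
A thin wrapper over the two could look something like the sketch below. The wrapper name incrementAndGet64 is made up for this example. Both underlying calls return the incremented value. Note that on 32-bit x86, GCC needs to target a processor with cmpxchg8b (for example -march=i586 or later) for the 8-byte builtin to be available.

```cpp
#include <cstdint>
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical wrapper: atomically increments *counter and returns the
// new value, using whatever the platform provides.
std::int64_t incrementAndGet64(volatile std::int64_t* counter)
{
#if defined(_WIN32)
    // Win32 API: takes a LONG64 volatile* and returns the incremented value.
    return InterlockedIncrement64(reinterpret_cast<volatile LONG64*>(counter));
#else
    // GCC/Clang builtin: adds 1 and returns the new value.
    return __sync_add_and_fetch(counter, static_cast<std::int64_t>(1));
#endif
}
```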