Atomic Operations in C++ AMP

Hi, I’m Steve Deitz, a developer on the C++ AMP team, and in this post, I’ll review the atomic operations available in C++ AMP, mention some important caveats, and show a simple example.

Atomic operations are the only safe and correct way to communicate and synchronize between the threads in a non-tiled parallel_for_each and between the threads in different tiles of a tiled parallel_for_each.  Note that threads within the same tile of a tiled parallel_for_each can also communicate and synchronize using tile_static memory and calls to tiled_index::barrier.wait .

Let’s take a look at a list of these atomic operations.

A list of atomic operations

C++ AMP provides atomic operations for addition, subtraction, increment, decrement, exchange, compare-and-exchange, maximum, minimum, and, or, and xor.  These are defined in amp.h as follows:

int atomic_fetch_add(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_add(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic addition of val to the memory location pointed to by dest , and returns the original value at the memory location pointed to by dest.

int atomic_fetch_sub(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_sub(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic subtraction of val to the memory location pointed to by dest , and returns the original value at the memory location pointed to by dest.

int atomic_fetch_inc(int * dest) restrict(amp)
unsigned int atomic_fetch_inc(unsigned int * dest) restrict(amp)

Performs an atomic increment to the memory location pointed to by dest, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_dec(int * dest) restrict(amp)
unsigned int atomic_fetch_dec(unsigned int * dest) restrict(amp)

Performs an atomic decrement to the memory location pointed to by dest, and returns the original value at the memory location pointed to by dest.

int atomic_exchange(int * dest, int val) restrict(amp)
unsigned int atomic_exchange(unsigned int * dest, unsigned int val) restrict(amp)
float atomic_exchange(float * dest, float val) restrict(amp)

Sets the value at the memory location pointed to by dest to val as an atomic operation, and returns the original value at the memory location pointed to by dest.

bool atomic_compare_exchange(int * dest, int * expected, int val) restrict(amp)
bool atomic_compare_exchange(unsigned int * dest, unsigned int * expected, unsigned int val) restrict(amp)

Performs an atomic compare-and-exchange of val to the memory location pointed to by dest, storing val at the memory location only if *expected matches the value at the memory location. Returns true if the operation is successful, false otherwise. If the operation is unsuccessful, *expected is set to the value pointed to by dest.

int atomic_fetch_max(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_max(unsigned int * dest, unsigned int val) restrict(amp)

Atomically computes the maximum of val and the value at the memory location pointed to by dest, storing the maximum into the memory location, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_min(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_min(unsigned int * dest, unsigned int val) restrict(amp)

Atomically computes the minimum of val and the value at the memory location pointed to by dest, storing the minimum into the memory location, and returns the original value at the memory location pointed to by dest.

int atomic_fetch_and(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_and(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-and operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

int atomic_fetch_or(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_or(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-or operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

int atomic_fetch_xor(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_xor(unsigned int * dest, unsigned int val) restrict(amp)

Performs an atomic bitwise-xor operation of val to the memory location pointed to by dest , and returns the original value of the memory location pointed to by dest .

Caveats

There are two important points to keep in mind when using C++ AMP’s atomic operations.  Fortunately, while these limitations are subtle, they shouldn’t affect most code.  They follow from the semantics of DirectX and the design of GPU hardware.  If you think that you’ve got a kernel that could benefit were these limitations somehow removed, please let us know.  We may be able to help you redesign your kernel, and we would be curious to see such examples.

  1. A normal read may not see the results of an atomic write to a particular memory location. Moreover, atomic operations should not be mixed with normal writes to the same memory location. This is sometimes referred to as weak atomicity.
  2. Unlike interlocked operations in C++, the above-listed atomic operations do not imply a memory fence of any kind.  Even the order of atomic operations on distinct variables may be rearranged. For example, if one thread atomically increments X and then atomically increments Y, it is possible for another thread to observe those increments in reverse order.

We will provide additional information about the finer details of the C++ AMP memory model in upcoming blog posts.

Example: Counting Exceptional Occurrences

Suppose you want to count the number of threads that run into an exceptional occurrence of some kind. This is similar to a reduction, but because there are presumably going to be very few occurrences, it may be better to simply use the atomic increment function.  Such a code may look as follows:

int numExceptionalOccurrences;
array_view<int> count(1, &numExceptionalOccurrences);
parallel_for_each(/* compute domain */, [=] (index<1> idx) restrict(amp)
    {
        if (/* exceptional occurrence */)
        {
            atomic_fetch_inc(&count(0));
        }
    });

Note the use of an array_view to write the number of exceptional occurrences to memory.  You must use either an array or an array_view to communicate results to the CPU, even when the results consist of just one item like in the example above.

Closing Thoughts

Atomic operations are a powerful abstraction and an important programming tool for getting certain kernels to perform well on GPUs.  C++ AMP exposes these operations through a simple interface so you can easily use them as necessary.