tile_barrier in C++ AMP

C++ AMP offers a tiled model to help users take advantage of the fast and programmable cache on GPU hardware, which is exposed as tile_static memory in C++ AMP.  In this blog post, I will first dive into a synchronization mechanism for coordinating threads in a tile, and then describe the C++ AMP APIs provided to perform that synchronization.

Barrier

Each tile of threads has its own tile_static memory storage that is shared and accessible by all threads in the tile.  These threads need to coordinate their accesses to tile_static memory by synchronizing their execution. 

An example will help ground the concepts, so consider these fictitious requirements:

  • There are only two threads in a 1-dimensional tile – thread 0 and thread 1.
  • They share a tile_static variable “x”.
  • Both threads work in a loop.
  • During each iteration, thread 0 produces a value and writes it to “x”, and thread 1 wants to read the “x” written by thread 0 and use the value for its own computation.

Assume we initially write code to satisfy the requirements as follows:

    tile_static int x;  // x is shared by thread 0 and 1,
                        // which are in the same tile

    for (int i = 0; i < N; i++) {
        if (tid.local[0] == 0) {
            // some computation
            x = ... // write
        }
        if (tid.local[0] == 1) {
            ... = x; // read x
            // do something
        }
    }

The code above is incorrect, of course. We need to use some synchronization to ensure that the read from thread 1 happens after the write from thread 0. 

In C++ AMP, one synchronization mechanism available is embodied in a class we call tile_barrier. It blocks the execution of all threads in a tile until all of them have reached the point where it is called. In the rest of the post, I will simply refer to it as a “barrier”.  However, please keep in mind that it only takes effect for the threads in the same tile.

Revisiting the above example, this time using the barrier construct, we can try to make sure that the write happens before the read, as follows:

    for (int i = 0; i < N; i++) {
        if (tid.local[0] == 0) {
            // some computation
            x = ... // write
        }
        // barrier op
        if (tid.local[0] == 1) {
            ... = x; // read x
            // do something
        }
    }

Now thread 1 is guaranteed to read, during a given iteration, the value of x only after it has been written by thread 0 in that same iteration of the loop.

However, the code is still incorrect! Why?  After “x” is written, we also need to block the execution of thread 0 to ensure that thread 1 has read the value before it is overwritten in thread 0’s next iteration.  So we need another barrier operation:

    for (int i = 0; i < N; i++) {
        if (tid.local[0] == 0) {
            // some computation
            x = ... // write
        }
        // barrier op
        if (tid.local[0] == 1) {
            ... = x; // read x
            // do something
        }
        // barrier op
    }

(Note though that the last barrier operation in the last go-around of the loop is redundant. One could structure the loop slightly differently in order to avoid this redundancy.)
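
For instance, one possible restructuring (just a sketch, and assuming N >= 1) peels the last iteration out of the loop, so the trailing barrier only runs when there is a next iteration:

    for (int i = 0; i < N - 1; i++) {
        if (tid.local[0] == 0) {
            // some computation
            x = ... // write
        }
        // barrier op
        if (tid.local[0] == 1) {
            ... = x; // read x
            // do something
        }
        // barrier op
    }
    // peeled last iteration: no trailing barrier needed
    if (tid.local[0] == 0) {
        // some computation
        x = ... // write
    }
    // barrier op
    if (tid.local[0] == 1) {
        ... = x; // read x
        // do something
    }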

Now you have seen how a barrier synchronizes the threads in a tile.

Barrier Also Implies Memory Fence

When I introduced the barrier concept earlier, I omitted the fact that a barrier always implies a memory fence in C++ AMP.  For more detail on the C++ AMP memory model and memory fences, please read section 8.1.2 of the open spec.  We will also blog about it in the future. In this post, I will just briefly introduce the concept of a memory fence without rigorously defining it or explaining all the subtleties.

In C++ AMP, a memory fence ensures that memory accesses are visible to other threads in the thread tile, and are executed according to program order.  It means that compilers and processors will not reorder reads/writes across the fence.

In the earlier example, it means that the compiler/processor will not move the write after the first barrier or move the read before the first barrier. This in turn implies that when the read is performed (by thread 1), it will indeed read the value written by thread 0.

In C++ AMP, there are two kinds of memory accesses that may need to be synchronized – global memory accesses and tile_static memory accesses. Global memory (aka device memory in some circles) is the memory allocated through concurrency::array.  Note that concurrency::array_view can wrap either kind of memory depending on how it was constructed.  As a result, there are three flavors of memory fences:  one for global memory only, one for tile_static memory only, and one for both.  Accordingly, there are also three corresponding flavors of barriers:

  • barrier with global memory fence
  • barrier with tile_static memory fence
  • barrier with all memory fence

So, for example, if you use a global memory fence, it applies only to global memory accesses, and the compiler and hardware are still at liberty to reorder reads and writes to tile_static variables across the fence.
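
To make that concrete, here is a hedged sketch in the placeholder style of the earlier snippets, where “av” stands in for data residing in global memory (e.g. accessed through a concurrency::array):

    tile_static int s;
    s = ...;        // tile_static write
    av[idx] = ...;  // global memory write

    // barrier with global memory fence
    // The write to av is ordered with respect to the barrier for all threads
    // in the tile, but the tile_static write to s may still be reordered
    // across it; use a tile_static or all memory fence to order s as well.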

Correct Usage of Barriers

For a correct C++ AMP program, all threads in a tile must hit all barriers uniformly.  In other words, a barrier must not occur in divergent code: if one thread in a tile reaches a barrier call via a path of control flow, then all threads in the tile must reach that call via the exact same path.

The previous code snippet is an example of uniform control flow.  All threads in the tile hit the two barriers along the same control flow path.

It is illegal to place a barrier in divergent code, e.g.,

    if (tid.local[0] == 0) {
        // barrier op
    }

In this case, only thread 0 in the tile would reach the particular barrier enclosed by the if condition, so C++ AMP reports an error like:

   test.cpp(34) : error C3561: tile barrier operation found in varying flow control when compiling the call graph for the entry function for an amp target at: 'test.cpp(28)'

The first line number (“34”) points to the location of the barrier call. The second (“28”) shows the location of the lambda/function supplied to tiled parallel_for_each.
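
A typical fix is to hoist the barrier out of the divergent branch so that every thread in the tile reaches it along the same control flow path, e.g.:

    if (tid.local[0] == 0) {
        // work performed only by thread 0
    }
    // barrier op -- now reached uniformly by all threads in the tile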

For the rules about control flow uniformity, please read section 8.1.1 of the open spec. We will also blog about the subtleties and caveats, with examples, in more detail. Stay tuned!

tile_barrier class

We have introduced the concepts and usage of barriers for C++ AMP. It’s time to go through exactly how they are surfaced in the API.

concurrency::tile_barrier is the class that provides the barriers. It can only be constructed by the system, and it is passed to the tiled parallel_for_each lambda/functor as part of its tiled_index parameter. You cannot construct it yourself, but it can be copy constructed.

First of all, the tile_barrier class offers a method:

  • void tile_barrier::wait() const

It provides the functionality of a barrier with an all memory fence.

The tile_barrier class also provides three more methods for barriers:

  • void tile_barrier::wait_with_all_memory_fence() const
  • void tile_barrier::wait_with_global_memory_fence() const
  • void tile_barrier::wait_with_tile_static_memory_fence() const

The first one is the same as “wait()”.  You can think of “wait()” as a shortcut for it to save typing. The other two should be self-explanatory by name (based on my earlier introduction in this blog post).  All these methods are annotated with restrict(amp).
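
To tie the API back to the earlier two-thread example, here is a minimal sketch of a complete kernel. The function and variable names (process, data) and the loop count are purely illustrative, and the extent of data is assumed to be evenly divisible by the tile size:

    #include <amp.h>
    using namespace concurrency;

    void process(array_view<int, 1> data)
    {
        static const int tile_size = 2;        // two threads per tile
        parallel_for_each(data.extent.tile<tile_size>(),
            [=](tiled_index<tile_size> tid) restrict(amp)
        {
            tile_static int x;                 // shared by the threads in a tile
            for (int i = 0; i < 4; i++)
            {
                if (tid.local[0] == 0)
                    x = i;                     // thread 0 produces a value
                tid.barrier.wait();            // the write is visible before the read
                if (tid.local[0] == 1)
                    data[tid.global] += x;     // thread 1 consumes the value
                tid.barrier.wait();            // the read completes before x is overwritten
            }
        });
    }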

By the way, the memory fence operations without the "wait" are also available as free functions in the concurrency namespace:

  • void all_memory_fence(const tile_barrier & _Barrier)
  • void global_memory_fence(const tile_barrier & _Barrier)
  • void tile_static_memory_fence(const tile_barrier & _Barrier)
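
As a rough illustration (continuing the hypothetical sketch above, where tid is the tiled_index parameter), a free fence function orders memory accesses without introducing the synchronization point that wait() does:

    tile_static int flag;
    flag = 1;                               // tile_static write
    tile_static_memory_fence(tid.barrier);  // fence only: ordering, no blocking
    // later tile_static accesses by this thread are not moved before the fence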

One More Example

Now that we have introduced the tile_barrier class, let’s revisit a more realistic example that uses it.  I hope you have read or played with our Matrix Multiplication Sample.  In the tiled version of matrix multiplication, part of the code looks like this:

    for (int i = 0; i < N; i += tile_size)
    {
        tile_static float localB[tile_size][tile_size];
        tile_static float localA[tile_size][tile_size];

        localA[localIdx[0]][localIdx[1]] = av_a(globalIdx[0], i + localIdx[1]);
        localB[localIdx[0]][localIdx[1]] = av_b(i + localIdx[0], globalIdx[1]);

        tidx.barrier.wait();

        for (unsigned int k = 0; k < tile_size; k++)
        {
            temp_c += localA[localIdx[0]][k] * localB[k][localIdx[1]];
        }

        tidx.barrier.wait();
    }

You’ll recall that “tidx” is typed as “tiled_index<tile_size, tile_size>” and is the only parameter to the enclosing lambda supplied to the parallel_for_each.  “barrier” is a member of the tiled_index class and is of type tile_barrier.

The original blog post has explained the algorithm in detail, so I’m not going to repeat it here, but there is one thing I want to bring to your attention.  Note that the barriers are only used to coordinate the writes to and reads from the tile_static arrays localA and localB.  So you could actually replace “wait” with “wait_with_tile_static_memory_fence”; the program would still be correctly synchronized and might yield better performance depending on the hardware vendor’s JIT and architecture.
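
For instance, both barrier calls in the loop above could become:

    // only tile_static accesses (localA and localB) need ordering here
    tidx.barrier.wait_with_tile_static_memory_fence();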

In Closing

C++ AMP provides the barrier operations via the tile_barrier class to help programmers coordinate and synchronize memory accesses among threads in the same tile. In most cases (as shown in the samples we have posted), you probably only need the tile_barrier::wait() method. In scenarios where the other barrier operations can be useful (typically for performance reasons), please understand their implications and apply them carefully.