Divisibility Requirement of tiled_extent in C++ AMP

In this post I’ll discuss a requirement of C++ AMP’s tiled model, while subsequent posts will demonstrate how to deal with the requirement in various ways. The restriction is this:

The extent of a tiled parallel_for_each computation must be divisible by the static tile extent

Background

Let’s recall that C++ AMP offers a tiled model of parallel computing, which allows you to tap the GPU’s fast programmable cache where you can place tile_static variables and use intra-tile synchronization facilities such as tile barriers and atomic operations on tile_static variables. In the tiled model, which is available for 1, 2 and 3 dimensions, you have to statically provide, as template parameters, the tile extents. So for example, in order to perform a 480-by-640 parallel computation, you’d write:

       parallel_for_each(
             extent<2>(480,640),
             [=] (index<2> idx) restrict(amp) {
                 // my kernel's code goes here
             }
       );

In order to tile the above computation into 16-by-16 tiles, you’d specify a tiled extent as follows:

       parallel_for_each(
             tiled_extent<16,16>(extent<2>(480,640)),
             [=] (tiled_index<16,16> tidx) restrict(amp) {
                 // my kernel's code goes here
             }
       );

This would create a 480/16-by-640/16, or 30-by-40 extent of tiles, each of which has 16-by-16 threads. So overall you'll be getting 1200 tiles, each executing 256 threads.

Alternatively, you can use the tile “factory” function of class extent to create the tiled_extent:

       parallel_for_each(
             extent<2>(480,640).tile<16,16>(),
             [=] (tiled_index<16,16> tidx) restrict(amp) {
                 // my kernel's code goes here
             }
       );

Divisibility requirement

The extents that you are tiling (480 by 640 in the above example) must be divisible by the respective tile dimension (16 by 16 in the above example). This is trivially satisfied by the above example because it’s using compile-time-known extents, which are divisible by 16. However, the extents of a parallel_for_each call could be constructed and specified as dynamic entities and in that case the possibility arises that they are not a multiple of the tile extents.

“What happens if the divisibility requirement is violated?”

If parallel_for_each is invoked with a tiled_extent whose underlying extent is not evenly divided by the static tile extents, then an exception of type invalid_compute_domain is thrown, with text similar to the following:

Concurrency::parallel_for_each (tiling): unsupported compute domain, the extent of dimension 1 of the compute domain (100) cannot be evenly divided by the extent of dimension 1 of the tile (16).

Note that you may create an object of type tiled_extent which violates the divisibility requirement, e.g.:

             auto myTiledExtent = extent<2>(100,200).tile<16,32>();

This in itself will not raise an exception; the exception is raised only when such a tiled extent is passed to parallel_for_each.
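
To make the timing concrete, here is a minimal sketch (the function name and the array_view parameter are just for illustration), showing that constructing the tiled_extent succeeds and only the dispatch throws:

       #include <amp.h>
       #include <iostream>
       using namespace concurrency;

       // A minimal sketch: the tiled_extent is constructed without complaint;
       // invalid_compute_domain surfaces only at the parallel_for_each call.
       void try_dispatch(array_view<float, 2> data)   // e.g., a 100-by-200 view
       {
           auto myTiledExtent = data.extent.tile<16, 32>();   // no exception here
           try
           {
               parallel_for_each(myTiledExtent,
                   [=] (tiled_index<16, 32> tidx) restrict(amp) {
                       data[tidx.global] += 1.0f;
                   });
           }
           catch (invalid_compute_domain& ex)
           {
               std::cout << ex.what() << std::endl;   // names the offending dimension
           }
       }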

“Why doesn’t C++ AMP allow uneven tiled extents?”

C++ AMP is built on top of DirectCompute, which only allows the execution of “whole” tiles. In DirectCompute, when you author a shader, you have to statically select tile dimensions, then, at runtime, DirectCompute requires callers to dynamically specify the number of tiles to dispatch. DirectCompute deduces the total number of threads by multiplying the number of tiles with the static tile extents. Thus, DirectCompute always executes whole tiles.
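
To illustrate the arithmetic, here is a trivial sketch in plain C++; it only mirrors the model and does not call any actual DirectCompute API:

       #include <iostream>

       // Illustrative arithmetic only: the tile shape is static (chosen when the
       // shader is authored), the tile count is dynamic (specified at dispatch
       // time), and the total thread count is their product: always whole tiles.
       int main()
       {
           const int tile_y = 16, tile_x = 16;     // static tile extents
           const int tiles_y = 30, tiles_x = 40;   // number of tiles dispatched
           std::cout << tiles_y * tile_y << " x " << tiles_x * tile_x   // 480 x 640
                     << " threads" << std::endl;
           // A dimension of 100 threads is unreachable: no whole number of
           // 16-thread tiles multiplies out to 100.
       }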

“Is this restriction something Microsoft could have worked around in the compiler?”

You may be wondering why this requirement doesn’t affect the simple execution model of parallel_for_each, which is also built on top of DirectCompute and must therefore also launch “whole” tiles. The way we made this work is by “massaging” the shape of arbitrary extents into a few built-in regular tiled extents, which we schedule behind the scenes. Still, in many cases we end up with some leftover DirectCompute threads. The accelerator stub code that C++ AMP generates knows how to translate the DirectCompute thread ID into the C++ AMP thread ID, and it also knows how to filter out extraneous threads. In effect, the body of your lambda resides inside an “if” statement, which is only executed if the DirectCompute thread ID is in range.

       // Stub pseudo-code
       index<N> myAmpIdx = direct_compute_to_amp_index_mapping(myDirectComputeIdx);
       if (myAmpIdx is within the compute domain)
           call the kernel code with myAmpIdx as the thread ID

This filtering only affects a small number of tiles, so even though it introduces divergence, which is generally something to avoid on the GPU, it is mostly harmless in practice because the actual amount of divergence is low.

In the tiled model, on the other hand, it would have been illegal for our compiler to filter out extraneous threads, because that would have prevented the usage of tile barriers. A tile barrier can only be inserted into the code in a place where all threads reach it uniformly (see the C++ AMP open spec, section 8), i.e. the compiler must be able to tell that either all or none of the threads in a tile will reach the barrier. Stub filtering code, by construction, would have placed every barrier inside the kernel in non-uniform control flow, and would therefore have precluded the use of barriers altogether.
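
For concreteness, here is a sketch of what such stub filtering would have looked like in the tiled model. The in_range() filter is made up, and this code would not compile: the barrier sits in control flow that is not tile-uniform.

       // Hypothetical only, this does NOT compile. If the runtime filtered out
       // extraneous threads by wrapping the kernel body in a per-thread range
       // check, any tile barrier inside the body would sit in non-tile-uniform
       // control flow, which the compiler rejects.
       parallel_for_each(
           extent<2>(480, 640).tile<16, 16>(),
           [=] (tiled_index<16, 16> tidx) restrict(amp) {
               if (in_range(tidx.global))   // made-up per-thread filter
               {
                   // ... kernel work ...
                   tidx.barrier.wait();     // error: barrier in non-uniform control flow
               }
           });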

More complicated code transformations are possible, but at prohibitive costs. In the spirit of C++’s design, this is a case where the truth about the underlying execution model has to be exposed to the developer, who must deal with it, rather than the system trying to work around this limitation in costly ways.

“What can I do to work with or around this requirement?”

There are four ways to deal with the requirement; you can choose one of the following:

  1. Get an external guarantee that your input size is always divisible by the tile extents you chose. For example, you may be working with a known image format that always has the same extent, or you may be able to pad your data on the CPU to the next tile-size multiple.
  2. Pad the extent up, launching a small number of extra threads around the edges, and then in your code make sure those extra threads participate in barrier synchronization but don’t have any other visible side effects (a sketch of this approach follows the list).
  3. Truncate the extent down, launching fewer threads than originally needed. In this case, you have two sub-options to compensate for the missing threads:
    1. In the first option, some of the remaining threads will have to do both the original work assigned to them, and the work of those threads that got truncated.
    2. In the second option, your original kernel doesn’t take care of the work that was supposed to be done by truncated/eliminated threads. Instead, you perform it in a separate kernel, or you perform the remaining work on the CPU.
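
As a preview of option (2), here is a minimal sketch, assuming a 1D computation with 256-thread tiles; it uses tiled_extent::pad(), which rounds the extent up to the next tile-size multiple:

       #include <amp.h>
       #include <vector>
       using namespace concurrency;

       // Option (2) sketch: pad the extent up; the padding threads still reach
       // every barrier, but are kept from reading or writing out of bounds.
       void process_padded(const std::vector<float>& host_data)
       {
           const int n = static_cast<int>(host_data.size());   // possibly not a multiple of 256
           array_view<const float, 1> av(n, host_data);

           parallel_for_each(
               extent<1>(n).tile<256>().pad(),
               [=] (tiled_index<256> tidx) restrict(amp) {
                   tile_static float shared[256];
                   const int gid = tidx.global[0];
                   // Padding threads load a neutral value instead of reading out of bounds.
                   shared[tidx.local[0]] = (gid < n) ? av[gid] : 0.0f;
                   tidx.barrier.wait();   // uniform: every thread in the tile gets here
                   // ... tile-wide work; only threads with gid < n produce results ...
               });
       }

The key point is that the range check guards the memory accesses, never the barrier itself, so barrier usage remains tile-uniform.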

The route you’ll take depends on a few factors. If your application allows for it, option (1) is ideal. Options (2) and (3.a) will increase the complexity of your kernel’s code, which will make it harder to maintain, and the extra conditionals (e.g., if tests against the thread ID) sprinkled through it may also slow it down (due to additional register pressure, in addition to the cost of evaluating the conditionals at runtime), but you’ll avoid the extra kernel dispatch of option (3.b). Option (3.b) has the reverse tradeoff: a more specialized kernel for each task (main processing vs. leftover processing), at the cost of an extra dispatch. Option (3.b) also has the benefit that it sometimes lends itself to a divide-and-conquer approach:

  1. You start by writing a simple kernel that handles all input sizes
  2. Then you write a tiled kernel that handles only evenly divided input sizes
  3. Then you modify your code to dispatch to the optimized kernel for the truncated, evenly divided portion of the data, and then dispatch to the simple kernel to process the leftover data around the boundaries (if necessary), as in the sketch below.
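
Here is a minimal sketch of these steps, assuming the work is simply doubling each element; it uses tiled_extent::truncate(), which rounds the extent down to a tile-size multiple:

       #include <amp.h>
       using namespace concurrency;

       // Option (3.b) sketch: an optimized tiled pass over the truncated,
       // evenly divided prefix, then a simple pass over the leftover elements.
       void double_all(array_view<float, 1> av)
       {
           const int n = av.extent[0];
           auto main_part = av.extent.tile<256>().truncate();   // rounds down to a multiple of 256

           // Tiled kernel over the evenly divided portion of the data.
           parallel_for_each(main_part,
               [=] (tiled_index<256> tidx) restrict(amp) {
                   // ... tile_static staging and barriers would go here ...
                   av[tidx.global] *= 2.0f;
               });

           // Simple (non-tiled) kernel over the leftovers, if any.
           const int done = main_part[0];
           if (done < n)
           {
               parallel_for_each(extent<1>(n - done),
                   [=] (index<1> idx) restrict(amp) {
                       av[done + idx[0]] *= 2.0f;
                   });
           }
       }

Each kernel stays simple and specialized; the price is the extra dispatch discussed above.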

I will demonstrate each of these approaches in future blog posts describing tiled_extent::pad and tiled_extent::truncate.