Using Constant Memory in C++ AMP

In this blog post, I’m going to explain how you can make use of a GPU’s constant memory with C++ AMP.  I will first introduce what constant memory is and how it is accessible in C++ AMP. I will then share some implementation details and usage tips.

Constant Memory

On accelerators such as GPUs, constant memory is a type of memory for read-only data. It is usually backed by a fast on-chip constant cache. The benefits of using constant memory are:

  • it is fast because it is backed by an on-chip cache per SM (streaming multiprocessor);
  • holding read-only data in constant memory reduces the use of registers and tile_static memory, which are best reserved for thread-specific and tile-specific data, and thus improves occupancy on the SMs.

In DirectCompute, upon which the current version of C++ AMP is built, constant memory is exposed as a buffer resource type called Constant Buffer.

How to Use Constant Buffers in C++ AMP

As you may have noticed, C++ AMP does not directly expose DirectX (DX) constant buffers in the programming model. That stays true to our goal of offering a C++ data parallel programming model rather than a hardware-specific one, at least at the public surface layer. However, there is still a (portable) way to make use of DX constant buffers and enjoy the performance benefits.

Remember that the entry point to C++ AMP (and hence to executing code on an accelerator) is the concurrency::parallel_for_each call, which takes a lambda or a functor. In short, everything captured by the lambda or the functor is marshaled to the accelerator via a DX constant buffer.

For example,

  int x = 10;
  parallel_for_each(
    e, // e is of type extent<1>
    [x](index<1> idx) restrict(amp) { // index rank must match the extent rank
       ... = x; // read x
    }
  );
  

“x” is copied to the device using a constant buffer. In the lambda, the read of “x” is served from that constant buffer.

Sometimes, you may want to access a read-only array, for example

  int x[3];
  parallel_for_each(
    e, // e is of type extent<1>
    [x](index<1> idx) restrict(amp) {
       // ...
    }
  );
  

However, you will get a compilation error like:

    x.cpp(11) : error C3478: 'x' : an array cannot be captured by-value

    x.cpp(13) : error C3590: 'x': by-reference capture or 'this' capture is unsupported if the lambda is amp restricted

Basically, capturing an array (essentially capturing a pointer) is not allowed by restrict(amp). A workaround is to wrap the array in a struct and capture the struct by value:

  struct Wrapper
  { 
    int data[3];
  };

  Wrapper x;

  // init values in x.data

  parallel_for_each(
    e, // e is of type extent<1>
    [x](index<1> idx) restrict(amp) {
       ... = x.data[1]; // served by constant cache
    }
  );
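
If the data already lives in a plain C array, the wrapping step can be made generic. The sketch below is host-side C++ only (it does not include amp.h, so it shows the capture pattern rather than a real kernel). ArrayWrapper, wrap_array and read_through_capture are hypothetical names introduced here for illustration; they are not part of C++ AMP, and whether a templated wrapper like this is accepted inside a restrict(amp) lambda has not been verified here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: holds a fixed-size array by value, so the
// whole array is copied whenever the struct is copied or captured.
template <typename T, std::size_t N>
struct ArrayWrapper {
    T data[N];
};

// Hypothetical helper: copies an existing C array into a wrapper.
template <typename T, std::size_t N>
ArrayWrapper<T, N> wrap_array(const T (&src)[N]) {
    ArrayWrapper<T, N> w;
    for (std::size_t i = 0; i < N; ++i) {
        w.data[i] = src[i];
    }
    return w;
}

// Host-side stand-in for a kernel: the lambda captures the wrapper by
// value and reads from its own copy of the array.
int read_through_capture(int i) {
    const int coeffs[3] = {10, 20, 30};
    const ArrayWrapper<int, 3> x = wrap_array(coeffs);
    auto kernel = [x](int idx) -> int { return x.data[idx % 3]; };
    return kernel(i);
}
```

In a real C++ AMP program the lambda would be passed to parallel_for_each with restrict(amp), as in the example above; only the wrapping and by-value capture are demonstrated here.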
  

Now you have seen how to marshal data into the DX constant buffer so that reads of such read-only data are served by the constant buffer. The next question you may have is “How much data can I put there?”.

In DirectCompute, each constant buffer can hold up to 64K bytes, and each shader can use up to 14 constant buffers. In this release of C++ AMP, however, you can only marshal in 16K bytes in total via a single constant buffer (the reason for the 16K upper bound is given in the next section, which looks deeper under the covers). The 16K bytes hold both the metadata that compiler-generated code uses internally (for the extent information, and for each concurrency::array captured by reference) and the variables captured by value in the lambda. If the limit is exceeded, the upcoming VS 11 Beta reports an error like the following:

    amp.h(6494) : error C3578: The size of the function object passed to concurrency::parallel_for_each cannot exceed 16384 bytes

    x.cpp(16) : see reference to function template instantiation 'void Concurrency::parallel_for_each ...

x.cpp(16) is the location where the parallel_for_each is invoked.
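
If you would rather catch an oversized capture before it reaches parallel_for_each, a compile-time size check is one option. The following is a minimal sketch in plain C++11; kMaxFunctionObjectBytes, fits_in_limit, SmallTable and LargeTable are hypothetical names, and the 16384 figure is simply the one quoted in the error message above. Because the compiler also stores its own metadata in the same 16K bytes, passing this check is necessary but not sufficient.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Limit quoted in the C3578 error message above.
constexpr std::size_t kMaxFunctionObjectBytes = 16384;

// Hypothetical helper: true if an object of type F is no larger than the
// limit. It does not account for compiler-generated metadata.
template <typename F>
constexpr bool fits_in_limit() {
    return sizeof(F) <= kMaxFunctionObjectBytes;
}

struct SmallTable { std::int32_t data[1024]; };  // 4096 bytes
struct LargeTable { std::int32_t data[8192]; };  // 32768 bytes

static_assert(fits_in_limit<SmallTable>(), "SmallTable is within the limit");
static_assert(!fits_in_limit<LargeTable>(), "LargeTable exceeds the limit");
```

For a lambda, the same check can be applied to its closure type, for example fits_in_limit<decltype(my_lambda)>(), where my_lambda is whatever lambda you intend to pass.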

Another piece of advice is to not annotate the lambda as “mutable”, so that the captured variables cannot be written to. I will explore the reason behind this advice in the next section.

So this is what you normally need to know about using constant memory in C++ AMP.  For those of you who are interested in knowing more about constant memory, follow me to the next section.

More (than you ever wanted to know) about Constant Buffers

As mentioned earlier, one benefit of using constant memory is reduced register usage. In the context of DirectX, this is because a constant buffer reference can be used directly as an input operand in arithmetic instructions. So, unlike accesses to global memory or tile_static memory, there is no need for an extra instruction that first loads the value from the constant buffer into a register before the register is used as an operand of the arithmetic instruction. Constant buffer elements can be used just like registers. This is true at least for the bytecode generated by the HLSL compiler (which we use under the covers at compile time). It is still up to the hardware vendor to JIT the bytecode to machine code for a specific device, though.

As mentioned earlier, in our implementation the lambda object (which contains all captured variables) is copied to the device using a constant buffer. In the generated device code, the lambda object is initialized (re-created) from elements of the constant buffer. This normally allows the HLSL compiler to generate code that maps the whole lambda object to the constant buffer. As a result, there is no need to allocate registers for the variables captured by value at the parallel_for_each call site, because they are read-only and the reads can be served from the constant buffer. This is great.

There is another interesting aspect of constant buffers: they can be dynamically indexed. Keep this in mind! Assume you have a local array:

  int x[3] = {1, 2, 3}; // local array
  ... = x[a % b];  // a or b is not known at compilation time
  

Given that the value of a or b is only known at runtime, the compiler cannot determine which element of “x” each thread will read. It therefore has to generate code that dynamically indexes into array “x”. However, normal registers are accessed by name and do not support indexing. So in DirectCompute, array “x” is allocated to a special kind of storage called a thread-local indexable array. How the indexable array is supported is up to the hardware vendor. However, it is very likely to be spilled to global memory, relying on the L1 data cache (if there is one) to reduce the read latency.

Now, back to constant buffers: a constant buffer is fast and it does support dynamic indexing. If “x” is marshaled in via a constant buffer, the read (though dynamically indexed) can be served directly from the constant buffer. For example, consider code like:

  Wrapper x;

  // init values in x.data

  parallel_for_each(
    e, // e is of type extent<1>
    [x](index<1> idx) restrict(amp) {
       ... = x.data[idx[0] % 3]; // served by constant buffer
    }
  );
  

The read is still served from a constant buffer.

There is one caveat in the implementation, though. In DirectX, a constant buffer is actually an array of 4K (4096) short vectors, each of which is composed of four 32-bit values (16 bytes). A constant buffer is only indexable at the granularity of these 16-byte-aligned short vectors. In order to make use of the dynamic indexing capability of the constant buffer, C++ AMP makes a trade-off here: only the first 4 bytes of each short vector are used to hold the read-only data marshaled in. With 4096 vectors and 4 usable bytes in each, this gives the 16K byte limit on constant data.

Normally, variables that are captured by value cannot be modified in the lambda’s body. Otherwise, you would get an error like:
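
The arithmetic behind the limit can be written out as a few constants. This is only a sketch of the numbers described above, in plain C++11; the constant names and element_byte_offset are hypothetical, and the exact placement of each element inside the buffer is an assumption drawn from the description (one 32-bit value in the first component of each 16-byte vector), not something taken from the C++ AMP sources.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kVectorsPerBuffer   = 4096; // "4K" short vectors
constexpr std::size_t kBytesPerVector     = 16;   // four 32-bit values
constexpr std::size_t kUsedBytesPerVector = 4;    // only the first component

// Raw capacity of a DX constant buffer: 4096 * 16 = 65536 bytes (64K).
constexpr std::size_t kRawBytes = kVectorsPerBuffer * kBytesPerVector;

// Capacity when only 4 bytes per vector are used: 4096 * 4 = 16384 (16K).
constexpr std::size_t kUsableBytes = kVectorsPerBuffer * kUsedBytesPerVector;

// Assumed layout: the i-th marshaled 32-bit value sits at the start of the
// i-th short vector, so consecutive values are 16 bytes apart.
constexpr std::size_t element_byte_offset(std::size_t i) {
    return i * kBytesPerVector;
}

static_assert(kRawBytes == 64 * 1024, "64K bytes of raw capacity");
static_assert(kUsableBytes == 16 * 1024, "16K bytes of usable capacity");
```

In other words, three quarters of the buffer is given up so that every marshaled 32-bit value lands on an indexable 16-byte boundary.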

   error C2166: l-value specifies const object

However, if you annotate the lambda with the keyword mutable, you are allowed to modify these variables. (Note that we are considering disallowing the use of an entry function object with a non-const call operator in the future, which would effectively make “mutable” not useful. Please read Mutable lambdas considered harmful in C++ AMP.) Note that in the current implementation your update goes to the copy of the lambda that the compiler recreates for each thread on the accelerator side, not to the original lambda that lives on the host side. So what happens when a captured variable is modified?

Basically, for the lambda object (which contains all the captured variables), if there are either

  1. dynamically indexed writes,
  2. dynamically indexed reads coexisting with any writes,

then the current compiler will not be able to map the lambda object to the constant buffer, and therefore will not generate code that serves the reads directly from a constant buffer. For example,

    Wrapper x;
    // init x
    int y = 0, z = 2;
    parallel_for_each(e, [&, x, y, z] (index<1> idx) mutable restrict(amp) {
        y = 5;                       // write to a captured variable
        ... = x.data[idx[0] % z];    // dynamically indexed read
    });
  

In the above example, “x”, “y” and “z” are captured by value, and the lambda is mutable. “y” is written in the kernel, and then some element of “x.data” is read via dynamic indexing. In this case, the read is not served directly from the constant buffer. Instead, a thread-local indexable array is allocated, and each captured variable is loaded from the constant buffer into that array. Afterwards, all accesses to the captured variables, whether reads or writes, are served from the indexable array.

So using “mutable” and changing a captured variable as in the above example is something you want to avoid from a performance perspective. If you need to write into anything that you captured, allocate a local variable, copy the captured value into it, and modify the copy. As a good practice, you should probably avoid the “mutable” keyword altogether with C++ AMP. (Again, please note that we are considering disallowing the use of an entry function object with a non-const call operator in the future, which would effectively make “mutable” not useful. Please read Mutable lambdas considered harmful in C++ AMP.)
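
The copy-then-modify advice looks roughly like this. The sketch is host-side C++ without amp.h, so it shows the shape of the lambda body rather than a real kernel; scaled_element is a hypothetical function name, and Wrapper mirrors the struct used earlier in this post.

```cpp
#include <cassert>

struct Wrapper {
    int data[3];
};

// Host-side stand-in for a kernel body. The lambda is NOT mutable: the
// captured wrapper is only read, and all writes go to a local copy.
int scaled_element(const Wrapper& table, int i, int scale) {
    auto kernel = [table, scale](int idx) -> int {
        int local = table.data[idx % 3]; // copy the captured value
        local *= scale;                  // modify the copy, not the capture
        return local;
    };
    return kernel(i);
}
```

Since nothing captured is ever written, the lambda object stays read-only, which is the condition described above for reads being served directly from the constant buffer.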

Now you know how to take advantage of the DX constant buffer to serve normal reads as well as dynamic reads of read-only data.

However, how well a specific GPU handles a specific access pattern is hardware dependent. Please refer to the documentation provided by your hardware vendor. For example, section 5.3 of the Programming Guide for AMD Accelerated Parallel Processing explains the three levels of performance for the constant memory type, which also apply to DX constant buffers when used on ATI hardware.

In Closing

This blog post showed you how to use constant memory on GPUs in C++ AMP (essentially, anything you capture by value, up to the aforementioned limit). Your feedback is welcome in our forum!