Mutable lambdas considered harmful in C++ AMP

Hello, my name is Jonathan Emmett and I’m a developer on the C++ AMP team. In this post I’ll be talking about the problems with mutable lambdas and function objects with non-const function call operators in C++ AMP kernels. There are a number of potential problems with these objects, and while they are currently allowed by Microsoft’s implementation of C++ AMP, they may be disallowed in a future release or update, as well as in other implementations.

Throughout this post I’ll be talking about const kernel functions, so let’s define that term. A const kernel function is a const-qualified operator() function in a function object passed as the kernel parameter of parallel_for_each. When a lambda expression is used as the kernel parameter, a const kernel function is a lambda that does not use the keyword mutable. Like any const C++ member function, a const kernel function cannot modify member variables (captured variables, in the case of lambdas) or call non-const functions on itself or its members.

Some examples of const kernel functions:

//a lambda that does not use the mutable keyword is a const kernel function
parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
...
});

class krnl
{
public:
    krnl(array_view<uint32_t, 1> a) : av(a) {}
    //an operator() function declared const is a const kernel function
    void operator()(index<1> idx) const restrict(amp) { ... }
private:
    array_view<uint32_t, 1> av;
};
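
The const rule itself is ordinary C++, so it can be checked on the host without amp.h. The following is a minimal sketch, assuming a C++17 compiler. The names is_const_kernel, plain_lambda_is_const and mutable_lambda_is_const are hypothetical and exist only for this illustration, and the lambdas omit restrict(amp) so that they compile anywhere.

```cpp
#include <type_traits>

// Host-only sketch: no amp.h and no restrict(amp). All names are hypothetical.

// True when a callable of type F can be invoked through a reference-to-const,
// which is only possible when its operator() is const-qualified.
template <typename F>
constexpr bool is_const_kernel(const F&)
{
    return std::is_invocable_v<const F&, int>;
}

inline bool plain_lambda_is_const()
{
    int n = 3;
    // No mutable keyword: operator() is const, so n can only be read.
    auto kernel = [=](int idx) { return idx + n; };
    return is_const_kernel(kernel);
}

inline bool mutable_lambda_is_const()
{
    int n = 3;
    // mutable keyword: operator() is non-const and may write its copy of n.
    auto kernel = [=](int idx) mutable { return idx + --n; };
    return is_const_kernel(kernel);
}
```

In this sketch plain_lambda_is_const() returns true and mutable_lambda_is_const() returns false, which is the distinction the rest of this post relies on.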

Const-ness in C++ AMP kernels

If you take a look at the declaration of the parallel_for_each template function in amp.h you can see that the kernel parameter is passed as a reference to a const object in each of the overloads:

template <int _Rank, typename _Kernel_type>
void parallel_for_each(const extent<_Rank>& _Compute_domain,
    const _Kernel_type &_Kernel);

One would expect that when the implementation invokes the operator() member function on the kernel, it will require a const version to exist. However, a known bug in the first release of C++ AMP allows you to pass an object with a non-const kernel function to parallel_for_each even though the parameter is const. Even more surprisingly, if you happen to provide both a const and a non-const operator(), the implementation will bind to the non-const version, and the same is true of calls to any member functions of the function object or its members!

This behavior is an artifact of early developer previews of C++ AMP, in which parallel_for_each required non-const operator() functions; that requirement has since been dropped. We may, in the future, correct this bug and make passing function objects with non-const operator() functions an error, and/or cause the implementation to bind to the const operator() over the non-const one if both are available.
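
For comparison, this is what standard overload resolution does on the host when a function object is reached through a reference-to-const. It is a sketch only: dual_kernel, invoke_like_parallel_for_each and invoke_non_const are hypothetical names that are not part of amp.h, and nothing here runs on an accelerator.

```cpp
// Host-only sketch of standard C++ overload resolution. Hypothetical names.
struct dual_kernel
{
    // Selected when the object is accessed through a reference-to-const.
    int operator()(int) const { return 1; }
    // Selected only when the object is accessed as non-const.
    int operator()(int) { return 2; }
};

// Mirrors the shape of the parallel_for_each declaration:
// the kernel arrives as a reference to a const object.
template <typename Kernel>
int invoke_like_parallel_for_each(const Kernel& kernel)
{
    return kernel(0);
}

// Takes the kernel as a non-const reference, for contrast.
template <typename Kernel>
int invoke_non_const(Kernel& kernel)
{
    return kernel(0);
}
```

Given a dual_kernel object k, invoke_like_parallel_for_each(k) picks the const overload and invoke_non_const(k) picks the non-const one. The bug described above is surprising precisely because it departs from the first case.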

More importantly, it’s a bad idea to modify data that is not local to the kernel function, except data in C++ AMP-supported data containers like array, array_view, and writeonly_texture_view. The kernel parameter is declared as a reference-to-const by design, and the implementation assumes that the captured data will not be modified. Modifying captured data may produce unexpected or accelerator-dependent behavior because of how data is copied to and from the accelerator.

In the current v1 implementation, each thread sees its own copy of the captured data, and changes to that data are not copied back to the host. Existing C++ AMP code that relies on a mutable kernel function is therefore likely using the mutable values only as per-thread temporary scratch space. For example, the following code compiles today despite the mutable lambda, but it relies on undefined behavior:

void find_nth_set_bit(int32_t n, const uint32_t * pIn, int32_t * pOut, int32_t len)
{
    array_view<const uint32_t> in(len, pIn);
    array_view<int32_t> out(len, pOut);
    out.discard_data();
    parallel_for_each(in.extent, [=](index<1> idx) mutable restrict(amp) {
        uint32_t val = in[idx];
        for (int32_t i = 0; i < 32; i++)
        {
            if (val & 1)
            {
                //Undefined behavior - modification of captured variable
                if (--n == 0)
                {
                    out[idx] = i;
                    return;
                }
            }
            val >>= 1;
        }
        out[idx] = -1;
    });
    out.synchronize();
}

This code finds the n-th set bit in each element of an array of 32-bit integers and outputs an array containing those indices (error checking and handling corner cases have been elided for space). This code requires each thread to see its own copy of ‘n’ to work correctly. This may be the case today, but is not required by the C++ AMP specification.

Why reference to const?

By declaring the kernel parameter as a reference-to-const, the implementation gets a lot of flexibility in how to store the captured data on various accelerator types. In particular, this allows the implementation the freedom to make copies of the captured data, or not, depending on which would be more efficient. On a CPU-based accelerator it is likely to be far more efficient to not have to make copies of the data for each thread. If writes to the captured data were supported, C++ AMP would need to define semantics for how these writes behave and ensure they are consistent across all accelerators, potentially at a heavy cost in efficiency when the underlying hardware does not match the model chosen. On a GPU accelerator, writing to the captured data can impact how the variables are mapped to GPU hardware.

When you rely on this bug to make writes to captured data, you are subject to the whims of the implementation. What could be changes to per-thread copies on one accelerator may manifest as race conditions on another. This effect could become more pronounced as new types of accelerators are supported by C++ AMP.
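
To make the difference concrete, here is a host-only simulation of the find_nth_set_bit kernel under two storage models. It is a sketch under stated assumptions, not a description of how any accelerator actually stores captured data: make_kernel, run_with_private_copies and run_with_shared_state are hypothetical names, the "threads" run sequentially in a plain loop, and amp.h is not used at all.

```cpp
#include <cstdint>
#include <vector>

// Host-only simulation, not C++ AMP. All names are hypothetical.
// The kernel body matches the mutable lambda above, minus restrict(amp).
inline auto make_kernel(std::int32_t n)
{
    return [n](std::uint32_t val) mutable -> std::int32_t {
        for (std::int32_t i = 0; i < 32; i++)
        {
            // Writes the captured n each time a set bit is seen.
            if ((val & 1) && --n == 0) return i;
            val >>= 1;
        }
        return -1;
    };
}

// Model 1: every simulated thread works on its own copy of the kernel object.
inline std::vector<std::int32_t> run_with_private_copies(
    std::int32_t n, const std::vector<std::uint32_t>& in)
{
    const auto kernel = make_kernel(n);
    std::vector<std::int32_t> out;
    for (std::uint32_t v : in)
    {
        auto copy = kernel;          // private copy of the captured n
        out.push_back(copy(v));
    }
    return out;
}

// Model 2: all simulated threads share one kernel object. Run sequentially
// here; with truly concurrent threads the shared write would also be a race.
inline std::vector<std::int32_t> run_with_shared_state(
    std::int32_t n, const std::vector<std::uint32_t>& in)
{
    auto kernel = make_kernel(n);
    std::vector<std::int32_t> out;
    for (std::uint32_t v : in) out.push_back(kernel(v));
    return out;
}
```

With n = 2 and three input elements all equal to 6 (bits 1 and 2 set), the first model yields {2, 2, 2}, while the second yields {2, -1, -1} because the first element has already consumed the shared counter. The kernel source is identical in both cases; only the storage model differs.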

Making mutable kernels const

If you happen to be using non-const kernel functions in existing C++ AMP code, we strongly suggest changing them to be const, by removing the mutable keyword from the lambda or adding the const keyword to functors. It’s possible that existing non-const kernel functions may only be non-const because they were written using an early developer preview that required the kernels to be mutable, or simply because of a missing const keyword that wasn’t caught earlier because of the type-checking bug. These cases are simple to change.

If the kernel function modifies captured data only as a kind of scratch space, as in the above example, you may be able to correct the problem by introducing local variables. The example could be rewritten to avoid mutable by introducing a local variable to modify instead of the captured variable:

void find_nth_set_bit(int32_t n, const uint32_t * pIn, int32_t * pOut, int32_t len)
{
    array_view<const uint32_t> in(len, pIn);
    array_view<int32_t> out(len, pOut);
    out.discard_data();
    parallel_for_each(in.extent, [=](index<1> idx) restrict(amp) {
        uint32_t val = in[idx];
        //Per-thread copy of n
        int32_t tmp = n;
        for (int32_t i = 0; i < 32; i++)
        {
            if (val & 1)
            {
                //Well-defined - modification of per-thread local variable
                if (--tmp == 0)
                {
                    out[idx] = i;
                    return;
                }
            }
            val >>= 1;
        }
        //No n-th set bit: same -1 sentinel as the original version
        out[idx] = -1;
    });
}

Each thread is guaranteed to have its own private set of local variables, so the behavior is well-defined across accelerators.

For changes that you actually need to see outside the kernel, you will need to look into one of the AMP-provided data structures like array, array_view, or writeonly_texture_view. You may also need to make use of the various atomic operations provided by C++ AMP to avoid race conditions when reading and writing the values. Since the current implementation doesn’t copy changes back from the accelerator to host, no existing kernels should be relying on mutability to achieve this.
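
A const kernel can still publish results because a view behaves like a handle: writing through it changes the data it refers to, not the kernel object itself. The sketch below illustrates that idea in plain host C++. host_view, square_kernel and run_square are hypothetical stand-ins invented for this example; host_view is only a rough analogy for array_view, and the loop is a sequential substitute for the threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a view type: a non-owning handle whose element
// access is const, much like a const pointer to non-const data.
template <typename T>
struct host_view
{
    T* data;
    std::size_t size;
    T& operator[](std::size_t i) const { return data[i]; }
};

// A const "kernel": it never modifies its own members, yet it can
// write results through the view it holds.
struct square_kernel
{
    host_view<int> out;
    void operator()(std::size_t idx) const
    {
        out[idx] = static_cast<int>(idx * idx);
    }
};

inline std::vector<int> run_square(std::size_t len)
{
    std::vector<int> storage(len, 0);
    const square_kernel kernel{ { storage.data(), storage.size() } };
    // Sequential stand-in for the per-index kernel invocations.
    for (std::size_t i = 0; i < len; ++i) kernel(i);
    return storage;
}
```

Each index writes to its own element here, so no atomics are needed. When several threads must update the same element, that is where the C++ AMP atomic operations mentioned above come in.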

Bypassing the system – don’t!

Using only const kernel functions will allow the compiler to detect and emit errors for unintended or unsupported writes in most cases, as well as ensure compliance with the intended API. Of course, C++ also has some mechanisms that allow a determined programmer to override the const-ness of an object. In C++ AMP kernels they are likely to lead to undefined behavior. The two most obvious ways are casting away const and declaring data members mutable.

A const_cast or C-style cast within a kernel can allow you to remove the const-ness from a variable in order to write to it. This is not supported, and if detected the compiler will respond with warning C4880.

Declaring a data member mutable allows it to be modified even from a const member function. Use of mutable data members in C++ AMP kernels is currently undefined behavior and may be detected by the compiler in a future release.
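
As a reminder of what the mutable data member pattern looks like, so that it is easy to spot in a code review, here is a host-only sketch. sneaky_kernel and count_hidden_writes are hypothetical names, and the code is plain C++ rather than a C++ AMP kernel. It is well-defined on the host; the point is that the const qualifier no longer tells you the object is left unmodified.

```cpp
// Host-only sketch of the pattern to avoid in kernels. Hypothetical names.
struct sneaky_kernel
{
    // mutable: writable even through a const operator().
    mutable int calls = 0;

    int operator()(int idx) const
    {
        ++calls;        // hidden write to the function object's own state
        return idx;
    }
};

inline int count_hidden_writes()
{
    const sneaky_kernel kernel{};
    kernel(0);
    kernel(1);
    kernel(2);
    return kernel.calls;   // the const object has been modified three times
}
```

In a C++ AMP kernel the same hidden write would land in captured data, with all of the accelerator-dependent consequences described earlier.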

Conclusion

Unless you have been using C++ AMP since the early developer previews, your parallel_for_each kernel functions are probably already const, especially if you are using lambdas. If you happen to have non-const kernel functions, updating them to be const will keep them compatible with potential updates of C++ AMP that could disallow non-const kernels, keep them compatible with third-party implementations of C++ AMP, and help you avoid relying on implementation- or accelerator-specific behavior. As always, comments and questions are welcome below or on the MSDN Forum.