Simplifying single-dimensional C++ AMP Code

Hello! My name is Daniel Griffing and I’m a test engineer on the C++ AMP team.

C++ AMP provides a set of capabilities that scale well to data of any rank (number of dimensions): concurrency::extent encapsulates the shape of the data, and concurrency::index identifies a given element across the dimensions. Additionally, data is wrapped in a concurrency::array_view, which exposes it in a multi-dimensional way. For data with Rank=1, however, programmers are accustomed to specifying size and index using integer values (e.g. when using std::vector). This blog post looks at some ways in which C++ AMP code can be simplified when dealing with 1-D data.

For illustration, we’ll use a simple function that multiplies each value in a std::vector by a scalar value.

A non-C++ AMP version of this code could be:

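#include <cstddef>
#include <vector>

std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    // Use an unsigned index to match the type returned by size().
    for (std::size_t i = 0; i < outv.size(); ++i)
    {
        outv[i] = data[i] * multiplier;
    }
    return outv;
}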

Here’s an equivalent function in C++ AMP code:

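#include <amp.h>
#include <vector>
using namespace concurrency;

std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    // Wrap the host vectors so the accelerator can access them.
    array_view<float, 1> input(data.size(), data);
    array_view<float, 1> output(outv.size(), outv);
    parallel_for_each(extent<1>(data.size()), [=](index<1> idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    // Copy the results back into outv before returning it.
    output.synchronize();
    return outv;
}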

This code differs from the non-C++ AMP code in two ways. We’ll look at ways to simplify the code for each of these items.

· The use of array_view<float, 1>, which allows data to be transferred to and from the accelerator device.

· The use of the C++ AMP concurrency::parallel_for_each construct taking two arguments:

  • An extent<1> defining the length of the data set.
  • An amp-restricted lambda function with an index<1> value as its only argument.

array_view<T, Rank = 1>

As you may have seen in previous blog posts, the array_view type is defined with a default value of 1 for its Rank template parameter.

So, we can write:

 array_view<float> input(data.size(), data);

…instead of:

 array_view<float, 1> input(data.size(), data);
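
The mechanism at work here is an ordinary C++ default template argument. The sketch below illustrates it with view_stub, a hypothetical stand-in type invented for this example (it is not the real concurrency::array_view), so it compiles without amp.h:

```cpp
#include <type_traits>

// Hypothetical stand-in for a rank-parameterized view; NOT the real
// concurrency::array_view. Only the template header matters here.
template <typename T, int Rank = 1>
struct view_stub
{
    static constexpr int rank = Rank;
};

// Omitting Rank selects the default, so both spellings name the same type.
static_assert(std::is_same<view_stub<float>, view_stub<float, 1>>::value,
              "view_stub<float> is view_stub<float, 1>");
static_assert(view_stub<float>::rank == 1, "default rank is 1");
static_assert(view_stub<float, 2>::rank == 2, "an explicit rank is kept");
```

Because the two spellings name the same type, switching between them should not change behavior.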

parallel_for_each with integers

In our Rank=1 example code, the extent<1> type is used to express the size of the input data and the lambda function takes an index<1> argument providing the index into the data set. To minimize the changes from the original, non-C++ AMP code, it would be convenient to express these values using an integer type, as that code does.

A wrapper function for parallel_for_each over 1-D data that uses int arguments in place of extent<1> and index<1> can be written quickly and with a small amount of code, as follows.

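template <typename Kernel>
void parallel_for_each(int ext_size, Kernel kernel)
{
    // Adapt the int-based kernel to the index<1> signature C++ AMP expects.
    auto krn = [=] (index<1> idx) restrict(amp)
    {
        kernel(idx[0]);
    };
    // Forward to the C++ AMP overload with a 1-D extent.
    concurrency::parallel_for_each(extent<1>(ext_size), krn);
}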

We first wrap the user-provided kernel in a lambda that invokes the kernel with the integer value idx[0]. The wrapping lambda is then passed to the C++ AMP parallel_for_each overload that takes an extent<1>.
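
To experiment with this adapter pattern without a C++ AMP toolchain, a CPU-only sketch may help. The names index1, for_each_index, for_each_int and multiply_cpu are hypothetical stand-ins invented for this example. They run sequentially and only mimic the shape of index<1> and parallel_for_each:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for concurrency::index<1>.
struct index1
{
    int value;
    int operator[](int) const { return value; }
};

// Hypothetical stand-in for the extent-based parallel_for_each;
// it simply runs the kernel sequentially on the CPU.
template <typename Kernel>
void for_each_index(int ext_size, Kernel kernel)
{
    for (int i = 0; i < ext_size; ++i)
    {
        kernel(index1{i});
    }
}

// Same shape as the wrapper above: adapt an int-based kernel to the
// index-based entry point by unwrapping idx[0].
template <typename Kernel>
void for_each_int(int ext_size, Kernel kernel)
{
    assert(ext_size >= 0);
    auto krn = [=](index1 idx)
    {
        kernel(idx[0]);
    };
    for_each_index(ext_size, krn);
}

// Usage: the kernel is written against a plain int index.
inline std::vector<float> multiply_cpu(const std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    float* out = outv.data();       // pointers are cheap to capture by value
    const float* in = data.data();
    for_each_int(static_cast<int>(data.size()), [=](int idx)
    {
        out[idx] = in[idx] * multiplier;
    });
    return outv;
}
```

The user's kernel never sees the index type. Only the small adapter lambda does, which is why the same idea carries over to the real index<1>.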

In our example, the call to parallel_for_each can now be rewritten to use int arguments:

parallel_for_each(data.size(), [=](int idx) restrict(amp)
{
    output[idx] = input[idx] * multiplier;
});
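
One detail worth keeping in mind: data.size() returns an unsigned size type, which is implicitly narrowed to the wrapper's int parameter. For very large containers, a checked conversion may be preferable. The helper below, checked_extent_size, is a hypothetical name introduced for this sketch and is not part of C++ AMP:

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of C++ AMP): convert a container size to the
// int expected by the wrapper, throwing instead of silently truncating.
inline int checked_extent_size(std::size_t size)
{
    if (size > static_cast<std::size_t>(std::numeric_limits<int>::max()))
    {
        throw std::length_error("size does not fit in an int extent");
    }
    return static_cast<int>(size);
}
```

With such a helper, the call could pass checked_extent_size(data.size()) as its first argument in place of data.size().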

Summary

The resulting code, using the default Rank and our helper function, is:

std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float> input(data.size(), data);
    array_view<float> output(outv.size(), outv);
    // multiply each element of the input array_view by multiplier
    parallel_for_each(data.size(), [=](int idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    output.synchronize();
    return outv;
}
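
A simple way to gain confidence in the simplified version is to compare its output with the serial loop from the start of the post. The sketch below assumes two hypothetical helpers, multiply_reference and nearly_equal, and uses a tolerance in case the accelerator result is not bit-identical to the CPU result:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference, equivalent to the non-C++ AMP loop.
inline std::vector<float> multiply_reference(const std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    for (std::size_t i = 0; i < outv.size(); ++i)
    {
        outv[i] = data[i] * multiplier;
    }
    return outv;
}

// Hypothetical helper: true when both vectors have the same length and every
// pair of elements differs by at most the given absolute tolerance.
inline bool nearly_equal(const std::vector<float>& a,
                         const std::vector<float>& b,
                         float tolerance = 1e-6f)
{
    if (a.size() != b.size())
    {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        if (std::fabs(a[i] - b[i]) > tolerance)
        {
            return false;
        }
    }
    return true;
}
```

A test could then check that nearly_equal(multiply(data, m), multiply_reference(data, m)) holds for a few representative inputs.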

In this post we looked at a few ways to simplify C++ AMP code for data with Rank=1. We made use of the default value for Rank in array_view<> and implemented a wrapper for parallel_for_each that allows the use of int indices in our execution kernel. This is one example of a higher-level abstraction on top of parallel_for_each that can be achieved using template programming.

If you have feedback, I’d love to hear it in the comments section below.