Simplifying single-dimensional C++ AMP Code
Hello! My name is Daniel Griffing and I’m a test engineer on the C++ AMP team.
C++ AMP provides a great set of capabilities that scale well for N-rank (dimensional) data: concurrency::extent encapsulates the shape of the data and concurrency::index specifies a given element across the dimensions, while concurrency::array_view wraps the data and exposes it in a multi-dimensional way. For data with Rank=1, however, programmers are accustomed to specifying size and index using integer values (e.g. when using std::vector). This post looks at some ways in which C++ AMP code can be simplified when dealing with 1-D data.
We’ll use a simple example of a function that multiplies each value in a std::vector by some scalar value for illustration.
A non-C++ AMP version of this code could be:
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    for (std::size_t i = 0; i < outv.size(); ++i)
    {
        outv[i] = data[i] * multiplier;
    }
    return outv;
}
Here’s an equivalent function in C++ AMP code:
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float, 1> input(data.size(), data);
    array_view<float, 1> output(outv.size(), outv);
    parallel_for_each(extent<1>(data.size()), [=](index<1> idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    output.synchronize();
    return outv;
}
This code differs from the non-C++ AMP code in two ways. We’ll look at ways to simplify the code for each of these items.
· The use of array_view<float, 1> to transfer data to and from the accelerator device.
· The use of the C++ AMP concurrency::parallel_for_each construct, which takes two arguments:
- An extent<1> defining the length of the data set.
- An amp-restricted lambda function taking an index<1> value as its only argument.
array_view<T, Rank = 1>
As you may have seen in previous blog posts, the array_view type is defined with a default value of 1 for its Rank template parameter.
So, we can write:
array_view<float> input(data.size(), data);
…instead of:
array_view<float, 1> input(data.size(), data);
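The shorter spelling works because of an ordinary C++ default template argument. Here is a minimal sketch, using a hypothetical mini_view class (not the real AMP type) just to illustrate the declaration pattern:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical miniature of array_view's declaration, showing how a
// default Rank template argument permits the shorter one-parameter
// spelling. This is an illustration only, not the real AMP type.
template <typename T, int Rank = 1>
struct mini_view
{
    static const int rank = Rank;
    std::size_t size;
    T* ptr;
    mini_view(std::size_t s, std::vector<T>& v) : size(s), ptr(v.data()) {}
};
```

With this declaration, mini_view<float> v(data.size(), data); means exactly the same as mini_view<float, 1> v(data.size(), data);.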
parallel_for_each with integers
In our Rank=1 example code, the extent<1> type is used to express the size of the input data and the lambda function takes an index<1> argument providing the index into the data set. To minimize the changes from the original, non-C++ AMP code, it would be convenient to express these values using an integer type as was the case in the original non-C++ AMP code.
A wrapper function for parallel_for_each over 1-D data that takes int arguments in place of extent<1> and index<1> can be written quickly and with a small amount of code, as follows.
template <typename Kernel>
void parallel_for_each(int ext_size, Kernel kernel)
{
    auto krn = [=](index<1> idx) restrict(amp)
    {
        kernel(idx[0]);
    };
    concurrency::parallel_for_each(extent<1>(ext_size), krn);
}
We first wrap the user-provided kernel in a lambda that invokes it with the integer value idx[0]. The wrapped kernel is then passed to the existing C++ AMP parallel_for_each overload; qualifying that call with concurrency:: ensures we invoke the library overload rather than recursing into our own wrapper.
In our example, the call to parallel_for_each can now be rewritten to use int arguments:
parallel_for_each(data.size(), [=](int idx) restrict(amp)
{
    output[idx] = input[idx] * multiplier;
});
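The adapter technique itself is plain C++ and can be compiled without <amp.h>. Here is a sketch using a hypothetical serial stand-in (serial_for_each with an index1 struct in place of concurrency::parallel_for_each and index<1>) to show the same wrapping pattern:

```cpp
#include <vector>

// Hypothetical serial stand-in for concurrency::parallel_for_each, so
// the adapter technique can be shown without <amp.h>. Like the AMP
// overload, it passes an index-like object to the kernel.
struct index1
{
    int value;
    int operator[](int) const { return value; }
};

template <typename Kernel>
void serial_for_each(int ext_size, Kernel kernel)
{
    for (int i = 0; i < ext_size; ++i)
        kernel(index1{ i });
}

// The adapter: wrap an int-taking kernel in a lambda that unpacks the
// index before forwarding, mirroring the AMP wrapper above.
template <typename Kernel>
void serial_for_each_int(int ext_size, Kernel kernel)
{
    auto krn = [=](index1 idx) { kernel(idx[0]); };
    serial_for_each(ext_size, krn);
}
```

A caller can then write serial_for_each_int(n, [&](int i) { /* ... */ }); exactly as in the AMP example above, with the index arriving as a plain int.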
Summary
The resulting code, using our parallel_for_each wrapper and the default Rank of array_view, is:
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float> input(data.size(), data);
    array_view<float> output(outv.size(), outv);
    // multiply each element of the input array_view by multiplier
    parallel_for_each(data.size(), [=](int idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    output.synchronize();
    return outv;
}
In this post we looked at a few ways to simplify C++ AMP code for data with Rank=1. We made use of the default value for Rank in array_view<> and implemented a wrapper for parallel_for_each that allows the use of int indices in our execution kernel. This is one example of a higher-level abstraction on top of parallel_for_each that can be achieved using template programming.
If you have feedback, I’d love to hear it in the comments section below.