Simplifying single-dimensional C++ AMP Code

Hello! My name is Daniel Griffing and I’m a test engineer on the C++ AMP team.

C++ AMP scales well to N-rank (N-dimensional) data: concurrency::extent encapsulates the shape of the data, concurrency::index identifies a given element across the dimensions, and concurrency::array_view wraps the data to expose it in a multi-dimensional way. For data with Rank=1, however, programmers are accustomed to specifying size and index using integer values (e.g. when using std::vector). This post looks at some ways in which C++ AMP code can be simplified when dealing with 1-D data.

For illustration, we’ll use a simple function that multiplies each value in a std::vector by a scalar value.

A non-C++ AMP version of this code could be:

```
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    // Use an unsigned index so the comparison matches the type returned by size().
    for (std::size_t i = 0; i < outv.size(); ++i)
    {
        outv[i] = data[i] * multiplier;
    }
    return outv;
}
```
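For comparison, the same element-wise operation can also be written with std::transform. This is only a sketch and not part of the original example: the name `multiply_reference` is hypothetical, and the function takes its input by const reference. It makes explicit that each output element depends only on the matching input element, which is the property that lets the loop be handed to a parallel construct.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU-only reference version (name chosen for illustration).
// Each output element is computed independently from one input element.
std::vector<float> multiply_reference(const std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    std::transform(data.begin(), data.end(), outv.begin(),
                   [multiplier](float x) { return x * multiplier; });
    return outv;
}
```

A function like this can also serve as a baseline when checking the results of an accelerated version.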

Here’s an equivalent function in C++ AMP code:

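```
#include <amp.h>
#include <vector>
using namespace concurrency;

std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    // Wrap the host vectors so the accelerator can access them.
    array_view<float, 1> input(data.size(), data);
    array_view<float, 1> output(outv.size(), outv);
    // Run the lambda once for every index in the 1-D extent.
    parallel_for_each(extent<1>(data.size()), [=](index<1> idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    // Copy the results back into outv before returning it.
    output.synchronize();
    return outv;
}
```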

This code differs from the non-C++ AMP code in two ways. We’ll look at ways to simplify the code for each of these items.

· The use of array_view<float, 1>, which allows data to be transferred to and from the accelerator device.

· The use of the C++ AMP concurrency::parallel_for_each construct taking two arguments:

• An extent<1> defining the length of the data set.
• An amp-restricted lambda function with an index<1> value as its only argument.

array_view<T, Rank = 1>

As you may have seen in previous blog posts, the array_view type is defined with a default value of 1 for its Rank template parameter.

So, we can write:

`array_view<float> input(data.size(), data);`

instead of:

`array_view<float, 1> input(data.size(), data);`
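The mechanism behind this is an ordinary C++ default template argument. The sketch below uses a hypothetical stand-in template, `view_sketch` (not the real concurrency::array_view, so it builds without amp.h), to show that omitting the argument names exactly the same type as spelling it out.

```cpp
#include <type_traits>

// Hypothetical stand-in for illustration only; NOT concurrency::array_view.
// Like array_view, it gives its Rank template parameter a default value of 1.
template <typename T, int Rank = 1>
struct view_sketch
{
    static constexpr int rank = Rank;
};

// Omitting Rank selects the default, so both spellings name the same type.
static_assert(std::is_same<view_sketch<float>, view_sketch<float, 1>>::value,
              "view_sketch<float> is view_sketch<float, 1>");

// A different Rank produces a distinct type.
static_assert(!std::is_same<view_sketch<float>, view_sketch<float, 2>>::value,
              "rank 1 and rank 2 views are different types");
```

Because the two spellings are the same type, switching between them changes nothing about how the code behaves.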

parallel_for_each with integers

In our Rank=1 example code, the extent<1> type expresses the size of the input data, and the lambda function takes an index<1> argument providing the index into the data set. To minimize the changes from the original, non-C++ AMP code, it would be convenient to express these values using an integer type instead.

A wrapper function for parallel_for_each over 1-D data that takes int arguments in place of extent<1> and index<1> can be written quickly and with a small amount of code, as follows.

```
template <typename Kernel>
void parallel_for_each(int ext_size, Kernel kernel)
{
    // Adapt the int-based kernel to the index<1> signature C++ AMP expects.
    auto krn = [=] (index<1> idx) restrict(amp)
    {
        kernel(idx[0]);
    };
    concurrency::parallel_for_each(extent<1>(ext_size), krn);
}
```

We first wrap the user-provided kernel in a lambda that invokes the kernel with the integer value idx[0]. This wrapping lambda is then passed, together with an extent<1> built from ext_size, to concurrency::parallel_for_each.
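The adapter pattern itself does not depend on C++ AMP, so it can be tried on any compiler. The following is a minimal CPU-only sketch under stated assumptions: `index1_sketch`, `for_each_index_sketch`, `for_each_int_sketch` and `multiply_sketch` are hypothetical names standing in for index<1>, the library's parallel_for_each, our wrapper and the example function. The loop runs sequentially and there is no restrict(amp).

```cpp
#include <vector>

// Hypothetical stand-in for concurrency::index<1>; illustration only.
struct index1_sketch
{
    int value;
    int operator[](int) const { return value; }
};

// Stand-in for the library entry point: calls the kernel once per index,
// sequentially on the CPU, passing an index object rather than an int.
template <typename Kernel>
void for_each_index_sketch(int size, Kernel kernel)
{
    for (int i = 0; i < size; ++i)
    {
        kernel(index1_sketch{ i });
    }
}

// The adapter: accepts a kernel that takes an int and wraps it in a lambda
// that unpacks idx[0], mirroring the wrapper shown above.
template <typename Kernel>
void for_each_int_sketch(int size, Kernel kernel)
{
    auto krn = [=](index1_sketch idx)
    {
        kernel(idx[0]);
    };
    for_each_index_sketch(size, krn);
}

// The example function, written against the int-based adapter.
std::vector<float> multiply_sketch(const std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    const float* in = data.data();
    float* out = outv.data();
    for_each_int_sketch(static_cast<int>(data.size()), [=](int idx)
    {
        out[idx] = in[idx] * multiplier;
    });
    return outv;
}
```

The kernel is captured by value in the wrapping lambda, just as in the C++ AMP wrapper, so the caller's lambda only ever sees a plain int.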

In our example, the call to parallel_for_each can now be rewritten to use int arguments:

```
parallel_for_each(data.size(), [=](int idx) restrict(amp)
{
    output[idx] = input[idx] * multiplier;
});
```

Summary

The resulting code, using the default Rank of array_view and our helper function, is:

```
std::vector<float> multiply(std::vector<float>& data, float multiplier)
{
    std::vector<float> outv(data.size());
    array_view<float> input(data.size(), data);
    array_view<float> output(outv.size(), outv);
    // multiply each element of the input array_view by multiplier
    parallel_for_each(data.size(), [=](int idx) restrict(amp)
    {
        output[idx] = input[idx] * multiplier;
    });
    output.synchronize();
    return outv;
}
```

In this post we looked at a few ways to simplify C++ AMP code for data with Rank=1. We made use of the default value for Rank in array_view<> and implemented a wrapper for parallel_for_each that allows int indices in our execution kernel. This is one example of a higher-level abstraction, on top of parallel_for_each, that can be achieved using template programming.

If you have feedback, I’d love to hear it in the comments section below.

1. Matthew - Developer says:

FYI The html markup/formatting of the second section of code is not quite right. Displays as plain text not <pre>

2. Thanks Matthew!  This has been fixed.

3. Ben Voigt [Visual C++ MVP] says:

It still doesn't appear to be fixed.  Talking about the code snippet immediately following "Here’s an equivalent function in C++ AMP code:".

4. Ken Domino says:

This is a great trick.  Thanks for sharing.