Data under the covers in C++ AMP

In a previous post I explained how data is captured and passed to the accelerator in C++ AMP. In this post I will share the details of how host data is made available and accessed on the accelerator in DirectCompute terms.

A couple of disclaimers:

1. Everything I am going to share should be treated as an implementation detail that could change.
2. Because I am describing our current underpinnings (DirectCompute), very little of this is likely to apply to other implementations of the C++ AMP open spec.

Data in a Compute Shader (aka kernel)

Since the lambda passed to a parallel_for_each maps, under the covers, to a compute shader, host data needs to be transformed into a form consumable by the shader. DirectX requires that any data passed to a shader first be wrapped in a resource object, which describes properties of the underlying data such as its size in bytes. A resource is then made available to a shader through a view, which defines how the underlying resource will be accessed on the accelerator.
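To make this concrete, below is a minimal sketch of the kind of resource creation involved, assuming an existing ID3D11Device* named device and host data in a std::vector<int> named data (both names are introduced here for illustration). C++ AMP performs the equivalent wrapping on your behalf, and the exact flags it uses internally may differ.

#include <d3d11.h>
#include <vector>

// Wrap host data in a structured ID3D11Buffer that a compute shader
// can both read from and write to.
ID3D11Buffer* wrap_host_data(ID3D11Device* device, const std::vector<int>& data)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = static_cast<UINT>(data.size() * sizeof(int)); // size in bytes
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(int);

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = data.data(); // initial contents come from the host

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &init, &buffer);
    return buffer;
}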

Mapping host data to resources and views

C++ AMP abstracts away these shader concepts by packaging host data into the corresponding resource and view objects. The captured data containers array and array_view store host data on the accelerator using the ID3D11Buffer resource. All other host data, such as local variables, is stored in constant buffers. The example below shows the underlying resource for each host data variable: two arrays, in and out, are accessed through the views in_view and out_view, and the data is incremented by the value inc inside the parallel_for_each.

#include <amp.h>
using namespace concurrency;

extent<1> e(size);

array<int, 1> in(e, data.begin());    // ID3D11Buffer
array<int, 1> out(e);                 // ID3D11Buffer
array_view<int, 1> in_view(in);       // ID3D11Buffer
array_view<int, 1> out_view(out);     // ID3D11Buffer
int inc = 1;                          // constant buffer

parallel_for_each(e, [=](index<1> idx) restrict(amp)
{
    out_view[idx] = in_view[idx] + inc;
});
out_view.synchronize();

In this blog post, we will go into the details of how C++ AMP maps data containers to the different types of ID3D11Buffer. To read more about how local host variables are stored in constant buffers, please refer to the blog post on Using Constant Memory in C++ AMP.

Read-Only vs. Read-Write access (SRV vs. UAV)

Resources are made available inside a parallel_for_each using either an ID3D11ShaderResourceView (SRV) or an ID3D11UnorderedAccessView (UAV). An SRV is used when the underlying data will only be read; a UAV is used when the underlying data will be both read and written.
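As an illustration, here is a sketch of how the two view types are created over the structured buffer from the earlier sketch; C++ AMP issues equivalent calls internally. The function name create_views and the parameter elementCount are assumptions of this sketch.

// Create a read-only view (SRV) and a read-write view (UAV) over a
// structured buffer of elementCount elements.
void create_views(ID3D11Device* device, ID3D11Buffer* buffer, UINT elementCount,
                  ID3D11ShaderResourceView** srvOut,
                  ID3D11UnorderedAccessView** uavOut)
{
    // Read-only access: SRV.
    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_UNKNOWN;               // structured buffers use UNKNOWN
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.FirstElement = 0;
    srvDesc.Buffer.NumElements = elementCount;
    device->CreateShaderResourceView(buffer, &srvDesc, srvOut);

    // Read-write access: UAV over the same buffer.
    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_UNKNOWN;
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 0;
    uavDesc.Buffer.NumElements = elementCount;
    device->CreateUnorderedAccessView(buffer, &uavDesc, uavOut);
}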

It is useful to know when C++ AMP prefers read-only access (SRV) over read-write access (UAV). Using SRVs can yield performance benefits: both the compiler and the accelerator hardware may take advantage of the read-only nature of an SRV-bound buffer to optimize the code and its execution.

Further, DirectX has resource limits on the number of UAVs and SRVs that can be used in a compute shader. For example, a shader can use a maximum of 64 UAVs and 128 SRVs.

The table below summarizes when C++ AMP prefers to use an SRV (read-only access) or a UAV (read-write access). Note that the mappings below are preferences, not mandates; for example, the compiler may decide to use an SRV regardless of the mapping if it can detect that a buffer is only read and never written.

Data Container | Data Type | const qualified¹ | Preferred access
---------------|-----------|------------------|------------------
array          | n/a       | No               | read-write (UAV)
array          | n/a       | Yes              | read-only (SRV)
array_view     | T         | n/a              | read-write (UAV)
array_view     | const T   | n/a              | read-only (SRV)

¹ Marking the lambda mutable has no effect on the choice of view.
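In code, the four rows of the table correspond to declarations like these (the names a, ca, av, and cav are introduced here for illustration, using the extent e from the earlier snippets):

array<int, 1> a(e);               // read-write (UAV)
const array<int, 1> ca(e);        // read-only (SRV)
array_view<int, 1> av(a);         // read-write (UAV)
array_view<const int, 1> cav(a);  // read-only (SRV)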

If you accidentally use more SRVs than the limit, your app will fail to compile with the error shown below:

error C3579: The number of read-only instances of concurrency::array and concurrency::graphics::texture passed to concurrency::parallel_for_each cannot exceed 128

If you use more UAVs than the limit, your app will fail to compile with the error shown below:

error C3580: The number of writable instances of concurrency::array and concurrency::graphics::texture passed to concurrency::parallel_for_each cannot exceed 64

Examples and error messages

Let us look again at the sample from the previous section. The two arrays in and out are stored in read-write buffers and accessed through the read-write views in_view and out_view:

extent<1> e(size);

array<int, 1> in(e, data.begin());  // read-write buffer
array<int, 1> out(e);               // read-write buffer
array_view<int, 1> in_view(in);     // read-write view
array_view<int, 1> out_view(out);   // read-write view
int inc = 1;                        // constant buffer

// 2 UAVs used
parallel_for_each(e, [=](index<1> idx) restrict(amp)
{
    out_view[idx] = in_view[idx] + inc;
});
out_view.synchronize();

In this next piece of code, the array in is used only for reading and can benefit from a read-only buffer and a read-only view:

extent<1> e(size);

const array<int, 1> in(e, data.begin());  // read-only buffer
array<int, 1> out(e);                     // read-write buffer
array_view<const int, 1> in_view(in);     // read-only view
array_view<int, 1> out_view(out);         // read-write view
int inc = 1;                              // constant buffer

// 1 SRV and 1 UAV used
parallel_for_each(e, [=](index<1> idx) restrict(amp)
{
    out_view[idx] = in_view[idx] + inc;
});
out_view.synchronize();

If a buffer is used as read-write in some kernels and read-only in others, we can declare just the view as read-only, as shown in this next snippet:

extent<1> e(size);

array<int, 1> in(e, data.begin());     // read-write buffer
array<int, 1> out(e);                  // read-write buffer
array_view<const int, 1> in_view(in);  // read-only view
array_view<int, 1> out_view(out);      // read-write view
int inc = 1;                           // constant buffer

// 1 SRV and 1 UAV used
parallel_for_each(e, [=](index<1> idx) restrict(amp)
{
    out_view[idx] = in_view[idx] + inc;
});
out_view.synchronize();
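For instance, a later kernel could wrap the same in array in a read-write view to update it in place (in_rw_view is a name introduced here for illustration):

// Read-write view over the same "in" buffer, for a kernel that writes to it
array_view<int, 1> in_rw_view(in);

// 1 UAV used
parallel_for_each(e, [=](index<1> idx) restrict(amp)
{
    in_rw_view[idx] += inc;  // "in" is now both read and written
});
in_rw_view.synchronize();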

If you accidentally create a read-write array_view over a read-only array:

// read-only array
const array<int, 1> arr(e);

// read-write view
array_view<int, 1> arr_view(arr); // Error!

the app will fail with a compilation error like the one shown below:

error C2664: 'Concurrency::array_view<_Value_type,_Rank>::array_view(Concurrency::array<_Value_type,1> &) restrict(cpu, amp)' : cannot convert parameter 1 from 'const Concurrency::array<_Value_type,_Rank>' to 'Concurrency::array<_Value_type,_Rank> &'
with
[
_Value_type=int,
_Rank=1
]
Conversion loses qualifiers
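The fix is to make the view read-only to match the array:

// read-only view over the read-only array
array_view<const int, 1> arr_view(arr); // OK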

That concludes my behind-the-scenes tour of how the C++ data you capture in the lambda gets mapped to the equivalent DirectCompute/HLSL programming model concepts.