Staging Arrays in C++ AMP

Hi there, my name is Weirong Zhu, and I’m a developer on the C++ AMP team.  In this blog post, I will show you how to use staging arrays to optimize your data transfers between the host and a C++ AMP accelerator_view.

When you use the new concurrency::array data container introduced by C++ AMP, you are in charge of the data transfer between host memory (e.g. STL data containers) and GPU memory (e.g. arrays on GPU accelerator_views). Below is a simple example that copies between a std::vector and a device array.

 // A std::vector allocated on host
 std::vector<int> hostVector(numElems);
  
 // populate hostVector
 int i = 0;
 std::generate(hostVector.begin(), hostVector.end(), [&i](){return i++;});
  
 // A concurrency::array allocated on an accelerator_view
 array<int, 1> deviceArray(numElems, deviceAcclView);
  
 // Copy from the host vector to the device array
 copy(hostVector.begin(), hostVector.end(), deviceArray);
  
 // process deviceArray on device
  
  
 // Copy from the device array to the host buffer
 copy(deviceArray, hostVector.begin());
  
 // process hostVector on host

Note that constructing the array and copying data into it can also be combined into a single statement by using a constructor like:

 array<int, 1> deviceArray(numElems, hostVector.begin(), hostVector.end(), deviceAcclView);

Now, let’s dive under the covers of the copy function.  As you know, the current version of C++ AMP is implemented on top of DirectX 11.  When a copy takes place from a host buffer to a device buffer, the following steps are involved:

  1. Create a CPU-side staging buffer (a buffer with usage flag D3D11_USAGE_STAGING).
  2. Map the staging buffer to obtain a CPU pointer.  This pointer can be used to copy host data (e.g. hostVector in the above example) into the staging buffer.
  3. Copy from the host buffer to the mapped staging buffer.
  4. Unmap the staging buffer, then perform a copy from the staging buffer to the GPU buffer (e.g. deviceArray in the above example).
  5. Release the staging buffer.

For D3D users who use APIs like UpdateSubresource to perform the copy-in directly, some of the above steps are hidden by the API.

The steps for copying from a device buffer to a host buffer are similar, except that the copy direction is reversed.

A staging buffer is system memory that meets the alignment requirements for DMA transfer to/from the GPU.  During the transfer, the memory allocated for the staging buffer is pinned so the device can access it using DMA without the intervention of the CPU.  However, in DirectX, staging buffers are only allowed to be the source or destination of a data transfer, and cannot be accessed by a shader.

As shown above, when a “copy” from C++ AMP code is performed, there are two underlying copies involved, as well as the creation, map, unmap, and release of the staging resource. This can be costly, especially if the application needs to copy repeatedly into and out of the same host buffer.
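
The steps above can be modeled in portable C++ as a rough sketch. This is not the C++ AMP runtime or Direct3D code: CopyStats and staged_copy are hypothetical names, and plain std::vector storage stands in for both the staging buffer and the GPU buffer. It only illustrates that one logical copy costs two physical copies plus a buffer creation and release.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical counters, used only to make the cost of a staged copy visible.
struct CopyStats {
    int buffers_created = 0;
    int physical_copies = 0;
};

// Models a host-to-device copy that goes through a temporary staging buffer.
// 'device' stands in for GPU memory and must already have the same size as
// 'host', just as deviceArray is sized before copy() is called.
inline void staged_copy(const std::vector<int>& host,
                        std::vector<int>& device,
                        CopyStats& stats)
{
    assert(device.size() == host.size());

    // Step 1: create the CPU-side staging buffer.
    std::vector<int> staging(host.size());
    ++stats.buffers_created;

    // Step 2: "map" the staging buffer to obtain a CPU pointer.
    int* mapped = staging.data();

    // Step 3: copy host data into the mapped staging buffer.
    if (!host.empty()) {
        std::memcpy(mapped, host.data(), host.size() * sizeof(int));
    }
    ++stats.physical_copies;

    // Step 4: "unmap", then copy from the staging buffer to the device buffer.
    mapped = nullptr;
    device.assign(staging.begin(), staging.end());
    ++stats.physical_copies;

    // Step 5: the staging buffer is released when 'staging' goes out of scope.
}
```

In this model, calling staged_copy N times creates N temporary buffers and performs 2N physical copies, which is the overhead that matters when the same host buffer is transferred repeatedly.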

For many applications, the data stays on the GPU for repeated computations once it has been transferred.  The performance of the data transfer is then not the dominant part of the execution, so it is not that important. However, there are also plenty of cases where the data transfer performance does matter.  For those scenarios, C++ AMP offers staging arrays.

A staging array is still a concurrency::array.  In this release, it is backed directly by a DirectX 11 staging buffer and is accessible on the CPU (with some cautions, described later in this post).

Staging arrays are differentiated from normal arrays by their construction.  A normal array is constructed as:

 array<int, 1> deviceArray(numElems, deviceAcclView);

Note that if you don’t specify the second parameter, the default view of the default accelerator is selected for you. A staging array is constructed with a second accelerator_view, as:

 array<int, 1> stagingArray(numElems, acclView1, acclView2);

The interpretation is:

  1. The stagingArray is physically located on acclView1, so the “accelerator_view” property of stagingArray returns the value of acclView1.  As a result, it is accessible on acclView1.
  2. The second accelerator_view parameter is called the associated accelerator_view.  It is a hint to the runtime that you will often transfer data between this array and other arrays on acclView2, so the implementation should be optimized for such data transfers.

The C++ AMP runtime will try to honor the hint as much as it can.  In the first release, it only helps in the case where acclView1 is a CPU accelerator_view and acclView2 is a device accelerator_view, as:

 accelerator cpuAccelerator = accelerator(accelerator::cpu_accelerator);
 array<int, 1> stagingArray(numElems, cpuAccelerator.default_view, deviceAcclView);

As constructed above, the stagingArray can be read and written in CPU code, just like an array constructed with cpuAccelerator without the second parameter acclView2.  It can also be copied to another array on deviceAcclView efficiently, without the extra copy and buffer creation/release: steps 1, 3, and 5 from the under-the-covers description of copy above are omitted.  You can access the stagingArray on the host either by using the [] operator or by using the pointer obtained from the data() member function.  Now we can rewrite our first example by replacing the std::vector with a staging array, so that the staging array is used as the container for the host-side processing.

 // A staging array allocated on host
 array<int, 1> stagingArray(numElems, cpuAccelerator.default_view, deviceAcclView);
  
 // populate stagingArray on host
 int i = 0;
 std::generate(stagingArray.data(), stagingArray.data() + numElems, 
              [&i](){return i++;});
  
 // A concurrency::array allocated on an accelerator_view
 array<int, 1> deviceArray(numElems, deviceAcclView);
  
 // Copy from the staging array to the device array
 copy(stagingArray, deviceArray);
  
 // process deviceArray on device
  
  
 // Copy from the device array to staging array
 copy(deviceArray, stagingArray);
  
 // process stagingArray on host
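
To see why this saves work, here is a rough model in portable C++. It is not the C++ AMP implementation: StagingModel is a hypothetical class, and a std::vector stands in for the device array. The point it illustrates is that the host fills long-lived, transfer-ready storage in place, so each transfer is a single physical copy with no temporary buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a staging array: host-accessible storage that
// lives for the whole computation instead of being created for each copy.
class StagingModel {
public:
    explicit StagingModel(std::size_t n) : storage_(n) {}

    // Host code reads and writes the staging storage directly.
    int& operator[](std::size_t i) { return storage_[i]; }
    std::size_t size() const { return storage_.size(); }

    // Models copy(stagingArray, deviceArray): one physical copy and no
    // temporary buffer. 'device' must already have the same size.
    void copy_to(std::vector<int>& device) {
        assert(device.size() == storage_.size());
        device.assign(storage_.begin(), storage_.end());
        ++physical_copies_;
    }

    int physical_copies() const { return physical_copies_; }

private:
    std::vector<int> storage_;
    int physical_copies_ = 0;
};
```

Compared with the five-step path described earlier, the model has no per-transfer buffer creation (step 1), no host-to-staging copy (step 3), and no release (step 5).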

We also recommend that you use concurrency::array_view and let the runtime take care of the copy operations implicitly. You can get the benefit of improved copy performance by wrapping a staging array with an array_view and using the array_view afterwards.  For example,

 // create an array_view over the stagingArray located on the CPU accelerator 
 array_view<int, 1> av(stagingArray);   
  
 // Use av on deviceAcclView, runtime manages the copy to deviceAcclView 
 parallel_for_each(deviceAcclView, av.extent, [=] (index<1> idx) restrict(amp) 
 {     av[idx] = idx[0]; 
 });   
  
 // Access av on the CPU, runtime manages the copy out
 const int * p = av.data();
 for (int i = 0; i < numElems; i++)  
 {     
    cout << "av[ " << i << "] = " << p[i] << endl; 
 }

In the above example, because the synchronization of the array_view is managed by the runtime implicitly, the runtime can take advantage of the fact that av is an array_view of a staging array that has deviceAcclView as the associated accelerator_view and optimize for the data transfer to/from deviceAcclView.

In future releases, we may extend the interpretation of staging arrays for other purposes. For example, we may allow direct access to the stagingArray (which is physically located on acclView1) from computation executing on acclView2 (i.e. zero-copy).  Thus, it is future-proof to create an array_view over the staging array and use the array_view afterwards.  In the above example, if computation on acclView2 becomes able to directly access stagingArray, there is no need to modify the code; the C++ AMP runtime will just skip the copy.

In summary, you can do all CPU side data preparation, initialization, and computation using stagingArray or av directly.  Data transfer between it and the arrays on stagingArray’s associated accelerator_view is more efficient.

However, you need to be careful when using a staging array. As mentioned before, it is backed by a DirectX 11 staging buffer that is allocated from system memory, and such memory can be a precious resource.  Also, while a copy operation is in progress, the CPU pointer to the memory becomes invalid.  To use a staging array safely, follow these rules:

(1) You cannot access the array on the CPU while a copy operation that uses the staging array, or a parallel_for_each that uses an array_view over that staging array, is in flight concurrently.  For example:

 // CPU Thread 1
 // Bad, there is a copy using stagingArray
 // in progress on thread 2.
 int x = stagingArray[0];
 stagingArray[1] = 1;
  
 // CPU Thread 2 (running concurrently with thread 1)
 copy(stagingArray, deviceArray);

 // CPU Thread 1
 // Bad, there is a parallel_for_each
 // using av in progress on
 // thread 2.
 int x = av[0];
 av[1] = 1;
  
 // CPU Thread 2 (running concurrently with thread 1)
 parallel_for_each(deviceAcclView, av.extent, [=] (index<1> idx) restrict(amp) {
     av[idx] = idx[0];
 });

(2) You should be careful about caching the pointers obtained via the array::data() method or &stagingArray[idx]. These pointers are not guaranteed to remain valid once a copy operation or parallel_for_each involving the staging array or its array_view starts, so they must not be cached for use across such intervening operations.  Otherwise, you will get undefined behavior (e.g. an access violation).  For example,

 // CPU thread 1
  
 int * p1 = stagingArray.data();
 const int * p2 = &stagingArray[0];
  
 qsort(p1, stagingArray.extent.size(), sizeof(int), cmpFunc); // ok
  
 int x = *p2; // ok 
  
 copy(stagingArray, deviceArray);
  
 int y = stagingArray[0]; // ok, using the [] operator is safe
 stagingArray[0] = y; // ok, using the [] operator is safe
  
 *p1 = 0; // undefined, do not use cached pointer 
          // across copy that uses stagingArray
 int z = *p2; // undefined, do not use cached pointer across copy
  
 int * p3 = &av[0];
 int w = *p3; // ok
  
 const int * p4 = &stagingArray[0];
  
 // Use av on deviceAcclView
 parallel_for_each(deviceAcclView, av.extent, [=] (index<1> idx) restrict(amp) {
     av[idx] = idx[0];
 });
  
 int r = *p3; // undefined, do not use cached pointer across parallel_for_each
 r += *p4; // undefined, do not use cached pointer across parallel_for_each

In general, you should consider using a staging array only if data transfer performance is critical for your application.

Having said all that, you may be wondering what kind of performance improvement you can get if you use a staging array.  The answer is the classic one: it depends :-). There are many factors (CPU architecture, GPU architecture, PCI express architecture, motherboard chipset, memory controller, memory speed, etc.) that can impact data transfer performance, so try it out with your workload in your hardware environment to see if the performance gain is worth the use of this feature for your scenario.
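
One simple way to try it out is a small timing helper like the sketch below. It uses only the standard library; average_ms is a hypothetical name, not part of C++ AMP. It assumes the operation you pass in is synchronous, i.e. it has finished by the time the call returns. The untimed warm-up run is there on the assumption that the first call may include one-time setup cost.

```cpp
#include <chrono>

// Hypothetical helper: runs 'op' once as an untimed warm-up, then runs it
// 'iterations' more times and returns the average wall-clock time per run
// in milliseconds. Returns 0.0 without calling 'op' if iterations <= 0.
template <typename Op>
double average_ms(Op op, int iterations)
{
    if (iterations <= 0) {
        return 0.0;
    }

    typedef std::chrono::steady_clock clock;

    op(); // warm-up run, excluded from the measurement

    const clock::time_point start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
    }
    const clock::time_point stop = clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

For example, you could compare average_ms([&]() { copy(hostVector.begin(), hostVector.end(), deviceArray); }, 100) against average_ms([&]() { copy(stagingArray, deviceArray); }, 100) using the buffer sizes your application actually transfers.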