Shared memory support in C++ AMP – array_view

Article
07/11/2013

The previous post in this series provided an introduction to CPU/GPU shared memory support in C++ AMP. In this post we will take a closer look at how the C++ AMP runtime implicitly uses shared memory underneath array_view objects to boost the performance of your existing C++ AMP code.

Implicit use of shared memory by array_view

The array_view abstraction in C++ AMP is designed to let users program against a container that is accessible for reading and writing both on the CPU and on any accelerator. This frees the programmer from having to manage synchronization of data between the CPU and GPU, which may need to be performed differently for different types of accelerators (such as an on-die integrated GPU vs. a discrete GPU connected over the PCIe bus).

Starting with Visual Studio 2013 on Windows 8.1, the C++ AMP runtime will leverage shared memory support on accelerators where appropriate (or where explicitly indicated by the programmer) to eliminate or reduce the cost of implicit data transfers for synchronizing array_view contents between the CPU and accelerator_view(s). Any implicit memory allocations made when accessing an array_view on an accelerator_view will be created with the accelerator’s default_cpu_access_type. The extent of the performance benefit achieved through implicit use of shared memory underlying an array_view largely depends on the data source of the array_view.

array_view with CPU memory data source (heap memory, STL array or vector containers, concurrency::array allocated on CPU accelerator)

For such array_view(s), the cost of synchronizing between the CPU and an accelerator_view on an accelerator with a default CPU access_type other than access_type_none will be greatly reduced but not completely eliminated. Up to Visual Studio 2012, synchronizing between the CPU and such an accelerator_view consisted of two steps: one copy between the CPU source and a temporary intermediate staging buffer, and a second copy between the staging buffer and accelerator_view memory. With shared memory support, the copy to an intermediate staging buffer is no longer needed for such accelerators, and the data can be copied directly between the CPU source memory and the accelerator_view memory.

Note that the extent of this saving depends on the accelerator’s default CPU access_type. For accelerators with a default CPU access_type of access_type_read_write, the intermediate copy is eliminated both when copying from the CPU to accelerator memory and vice versa. If an accelerator has a default CPU access_type of access_type_write, the intermediate copy to a staging buffer is eliminated only when synchronizing modifications from the CPU or another accelerator_view to the target accelerator_view. Synchronizing from the accelerator_view to the CPU or another accelerator_view still needs an intermediate copy to a temporary staging buffer, since the source accelerator_view buffers are accessible on the CPU only for writing and cannot be read from on the CPU.

 // Set the default CPU access_type on the target accelerator to be access_type_write
 accelerator accl = accelerator(…);
 accl.set_default_cpu_access_type(access_type_write);
 accelerator_view av = accl.default_view;
  
 std::vector<float> src(size, 0);
 array_view<float> arrView(src);
  
 // When synchronizing the contents from CPU to the accelerator_view “av”
 // the runtime will directly copy from the source vector memory to the
 // accelerator_view memory buffer without requiring an intermediate copy
 // to a temporary staging buffer
 parallel_for_each(av, arrView.extent, [=](const index<1>& idx) restrict(amp) {
     arrView[idx] += idx[0];
 });
  
 // When synchronizing the contents from the accelerator_view to the CPU
 // the runtime will need to copy from the accelerator_view memory 
 // to the source vector memory through an intermediate temporary staging buffer, 
 // since the accelerator_view buffer is not accessible for reading on the CPU by default.
 const float* pData = arrView.data();
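
The copy-count rule for CPU memory data sources can also be written down as a small lookup. The sketch below is only an illustrative model in standard C++: the CpuAccess and SyncDirection enums and the copies_for_cpu_source function are hypothetical names invented here, not part of the C++ AMP API, and the model covers only the three access types discussed above.

```cpp
// Hypothetical stand-ins for illustration only; these are NOT the
// concurrency::access_type values from <amp.h>.
enum class CpuAccess { none, write, read_write };
enum class SyncDirection { cpu_to_accelerator, accelerator_to_cpu };

// Models how many copies one synchronization needs for an array_view whose
// data source is ordinary CPU memory, following the rules in the text:
//   2 = copy through a temporary staging buffer, 1 = direct copy.
int copies_for_cpu_source(CpuAccess access, SyncDirection dir) {
    switch (access) {
    case CpuAccess::read_write:
        // Staging copy eliminated in both directions.
        return 1;
    case CpuAccess::write:
        // Staging copy eliminated only when writing to the accelerator_view.
        return dir == SyncDirection::cpu_to_accelerator ? 1 : 2;
    case CpuAccess::none:
        // No shared memory: same two-step path as in Visual Studio 2012.
        return 2;
    }
    return 2;  // Defensive default for values outside the enum.
}
```

Under this model, the sample above (default CPU access_type of access_type_write) needs one copy when the parallel_for_each runs and two copies when data() is called on the CPU afterwards.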

array_view with a staging array as data source

When constructing a staging array, if the specified associated_accelerator_view has a default CPU access_type of access_type_read_write, the runtime will create a shared memory allocation that is directly accessible both on the CPU and on the associated_accelerator_view. Hence, synchronization between the accelerator_view and the CPU for an array_view created with such a staging array as its data source will not incur any data transfers. If the associated_accelerator_view for the staging array does not have a default CPU access_type of access_type_read_write, a normal staging array will be constructed and the synchronization characteristics of such an array_view will be the same as in Visual Studio 2012.

 // The accelerator corresponding to the accelerator_view “av” has default CPU
 // access type of access_type_read_write.
 accelerator_view av = accelerator(…).default_view;
  
 // Create a staging array associated with the target accelerator_view. 
 // The runtime implicitly creates a shared memory allocation accessible both on 
 // CPU and the accelerator.
 accelerator_view cpuAv = accelerator(accelerator::cpu_accelerator).default_view;
 concurrency::array<float> src(size, initData.begin(), cpuAv, av);
 array_view<float> arrView(src);
  
 // When synchronizing the contents from CPU to the accelerator_view 
 // the runtime will not require any data copying at all
 parallel_for_each(av, arrView.extent, [=](const index<1>& idx) restrict(amp) {
     arrView[idx] += idx[0];
 });
  
 // When synchronizing the contents from the accelerator_view to the CPU
 // the runtime will not need any data copying at all
 const float* pData = arrView.data();

array_view with a CPU-accessible array allocated on an accelerator_view as data source

For such array_view(s), the cost of synchronizing between the CPU and the source array’s accelerator_view will be completely eliminated if the source array has CPU access of type access_type_read_write. By default, the CPU access_type of an array is determined by the target accelerator’s default_cpu_access_type; it can also be explicitly specified when constructing an array, as will be described in a later post. Otherwise, synchronization between the accelerator_view and the CPU will have the same characteristics as for an array_view created with a CPU memory data source.

array_view without a data source

An array_view without a data source uses the first allocation made for the array_view as its internal data source. Hence, if such an array_view is first accessed on the CPU, it will have the same behavior as an array_view created over a CPU memory data source. If the array_view is first accessed on an accelerator_view, the behavior will be the same as that of an array_view whose data source is an array created on that accelerator_view (with a CPU access type of access_type_auto).

Concurrently accessing a cpu_shared array on the CPU and the GPU

As with a staging array, it is illegal to access an array with CPU access concurrently on the CPU and the GPU; doing so results in undefined behavior. Note that this is illegal even if non-overlapping portions of the data are being referenced through an array_view, and even if the data is only read from and not written to.

Also (again as with staging arrays), CPU references/pointers to elements of a CPU-accessible array allocated on an accelerator with shared memory support are invalidated after the array is accessed on the accelerator, and thus should not be cached across accesses on the accelerator. Following an accelerator access, the data() method on the array/array_view can be used to obtain a valid CPU pointer, which can be safely used only until the next accelerator access to the array.

In closing

If you are already using the array_view type for accessing data across the CPU and an accelerator in your C++ AMP code, simply recompiling your application with Visual Studio 2013 can significantly boost its performance when running on a Windows 8.1 system with an integrated GPU or the WARP accelerator. The elimination of (or reduction in) the cost of data synchronization between the CPU and the accelerator presents huge acceleration opportunities for your application: you can offload data parallel computations to an integrated GPU accelerator through C++ AMP without worrying about the overhead of transferring data to and from the accelerator.

In the next post, we will look at how C++ AMP allows fine-grained control over the use of shared memory allocations in your code through the concurrency::array type. Until then, please keep your feedback, comments and questions coming, below or in our MSDN forum.