Shared memory support in C++ AMP – a deeper dive

The previous posts in this series on shared memory support in C++ AMP covered:

  1. Introduction to shared memory support in C++ AMP
  2. Implicit use of shared memory underlying array_view by the C++ AMP runtime

In this final post of the series, we will dive deeper into the topic and look at:

· How C++ AMP programmers can exercise fine-grained control over the use of shared memory allocations through the array type.

· The use of shared memory on discrete GPU accelerators (connected to the CPU over PCIe).

· The rationale behind the design choices made by C++ AMP w.r.t. implicit use of shared memory.

Constructing array objects accessible both on the accelerator and the CPU

The default_cpu_access_type setting for an accelerator globally determines if and how shared memory is used when allocating memory on that accelerator. Additionally, C++ AMP programmers can explicitly control the usage of shared memory in their application by specifying the desired CPU access_type for the accelerator memory allocation corresponding to an array object. This provides fine-grained control over the use of shared memory allocations (and the type of CPU access they have) for each array data container used in your C++ AMP code.

explicit array(const extent<N>& extent, accelerator_view av, access_type _Cpu_access_type = access_type_auto)

The _Cpu_access_type parameter has been added to array constructors (only the overloads where an accelerator_view is explicitly specified) and specifies the CPU access type for the constructed array. By default, the parameter has a value of access_type_auto, which results in the creation of an array with accelerator::get_default_cpu_access_type() CPU access. Also, creating an array with the access_type_auto CPU access type finalizes the default_cpu_access_type setting for the accelerator, and users will not be able to subsequently change the setting through the accelerator::set_default_cpu_access_type method.

An attempt to construct an array with CPU access on an accelerator without shared memory support will result in a runtime_exception.

An array object constructed with a CPU access_type specified can be both read from and written to on the accelerator_view where the array is allocated. Additionally, the array is also accessible on the CPU for the type of accesses allowed by array::cpu_access_type. Any access of a type other than that allowed by array::cpu_access_type results in undefined behavior; e.g. reading, on the CPU, an array created with access_type_write CPU access has undefined behavior.
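
For illustration, here is a minimal sketch of constructing a CPU-accessible array and populating it directly from CPU code. The array size and function name are illustrative, and the snippet assumes it runs on an accelerator that supports shared memory:

 #include <amp.h>
 using namespace concurrency;
  
 void createSharedArray()
 {
     accelerator accl;  // default accelerator
  
     if (accl.supports_cpu_shared_memory) {
         // Allocate the array in shared memory, readable and writable on the CPU
         array<float, 1> data(1024, accl.default_view, access_type_read_write);
  
         // Since the array is CPU accessible, we can obtain a raw CPU pointer to
         // its storage and initialize it directly, without a staging array or an
         // explicit copy. With access_type_none CPU access this pointer could not
         // be dereferenced on the CPU at all, and with access_type_write it could
         // only be written through (reads would be undefined behavior).
         float *pData = data.data();
         for (int i = 0; i < 1024; ++i) {
             pData[i] = static_cast<float>(i);
         }
     }
 }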

 

The CPU access_type of an array object can be queried using a new method in the array type:

__declspec(property(get=get_cpu_access_type)) access_type cpu_access_type

access_type get_cpu_access_type() const

Returns the CPU access type for this array.

 

In the next section we will look at an example of explicit shared memory usage through the array type.

Shared memory support on discrete GPU accelerators

So far we have talked about shared memory in the context of integrated GPU and WARP accelerators, which share physical RAM with the CPU. However, discrete GPUs (connected to the CPU through PCIe), which have dedicated local video RAM separate from the CPU RAM, are also typically capable of directly accessing CPU memory over the PCIe bus (aka zero-copy). Such remote accesses to CPU RAM over PCIe have much higher latency and lower bandwidth than accesses to the dedicated local GPU RAM. Regardless, depending on the memory access pattern of the compute kernel, using these high-latency, low-bandwidth shared memory accesses on discrete GPU accelerators may still be beneficial in some scenarios, compared to the combined cost of copying data to dedicated GPU memory and then accessing it locally from there.

The typical scenario where use of shared memory on discrete GPU accelerators is beneficial is when each byte of some data located in shared CPU memory is accessed just once during the compute kernel’s execution on the accelerator. Since each byte of data is accessed just once during the kernel’s execution, the slower accesses over PCIe are still relatively better compared to explicitly copying the data to local GPU RAM and subsequently performing local accesses to this data from the GPU memory. On the other hand, if a byte of data is accessed multiple times by different threads during the kernel’s execution, shared memory accesses to that byte require the byte to be transferred multiple times across the slow PCIe bus. A more performant approach in such scenarios would be to copy the data to the local GPU RAM once and subsequently have much faster accesses to the (cached) data locally from different threads on the accelerator. 

 accelerator accl = accelerator();
 array<float, 2> arrA(numRowsA, numColsA, matA, accl.default_view);
 array<float, 2> arrB(numColsA, numColsB, matB, accl.default_view);
  
 array<float, 2> *pArrC = NULL;
 // The elements of the output array are just written to once on the
 // accelerator. Let's allocate a CPU accessible (shared memory) array
 // so the results can be read directly on the CPU, instead of
 // allocating a dedicated GPU memory array and copying its contents
 // to the CPU after the computation
 if (accl.supports_cpu_shared_memory) {
     pArrC = new array<float, 2>(numRowsA, numColsB, accl.default_view, access_type_read);
 }
 else {
     pArrC = new array<float, 2>(numRowsA, numColsB, accl.default_view);
 }
  
 array<float, 2> &arrC = *pArrC;
  
 parallel_for_each(arrC.extent, [&](const index<2>& idx) restrict(amp) {
     float temp = 0;
     for (int k = 0; k < arrA.extent[1]; ++k) {
         temp += arrA(idx[0], k) * arrB(k, idx[1]);
     }
  
     arrC[idx] = temp;
 });
  
  
 float* matC = NULL;
  
 // If the output array arrC is CPU accessible, let's directly get a CPU
 // pointer to the array data. Otherwise, we need to copy the arrC contents
 // to the CPU from accelerator memory
 if ((arrC.cpu_access_type == access_type_read) ||
     (arrC.cpu_access_type == access_type_read_write)) {
     matC = arrC.data();
 }
 else {
     matC = new float[arrC.extent.size()];
     copy(arrC, matC);
 }

 

Understanding C++ AMP runtime’s design choices w.r.t. automatic use of shared memory

Current state of the hardware

One would think that since on-die integrated GPUs share the same physical RAM with the CPU, there should never be a need for a separate allocation and copy to access data residing in CPU memory on the GPU – right? Unfortunately, that is not quite true for all current hardware (though it is getting there). There are two key factors at play here.

Firstly, integrated GPUs from different hardware vendors have different CPU and GPU access bandwidth and latency characteristics for shared memory. On AMD’s Llano and Trinity APUs the GPU can access memory either via a low-bandwidth bus that allows the GPU to snoop the CPU cache (for coherence) or via a high-bandwidth Radeon bus that bypasses the CPU caches. The latter offers nearly 3 times the bandwidth of the former but can only be used for memory that is not readable on the CPU. Thus, on such parts, using shared memory for GPU accesses when the memory is also to be read on the CPU results in reduced memory access bandwidth on the GPU, compared to an allocation that is dedicated to GPU accesses or is only writable by the CPU. The next generation of AMD APUs (codenamed Kaveri) is expected to narrow this gap. On Intel Ivy Bridge and Haswell CPUs, the GPU/CPU memory access bandwidth is unaffected by sharing.

Secondly, currently both on Intel and AMD CPUs with integrated GPUs, the GPU accessible memory is required to be aligned at certain hardware specific boundaries. Regular CPU memory allocations (heap, stack, process data segments) may not meet these alignment requirements; allocations shared across the CPU and the GPU have to be made through the drivers to ensure appropriate alignment requirements.

Automatic use of shared memory by C++ AMP runtime

C++ AMP automatically uses shared memory only on Intel integrated GPU accelerators and WARP. Use of shared memory on these accelerators can only improve performance and never hurt – shared memory access performance on these accelerators is on par with dedicated accelerator memory, and the sharing helps improve performance by eliminating or reducing copying of data between the accelerator and the CPU.

On all other accelerators, C++ AMP programmers are required to explicitly opt into the use of shared memory, based on the memory access pattern of their compute kernels and the memory access performance characteristics of the target accelerator. For example, on AMD APUs the memory access bandwidth on the accelerator for dedicated memory is much higher than for shared memory that is both readable and writable on the CPU. Consequently, automatic use of shared memory on these APUs may result in a loss of performance – some copying of data would be saved, but the kernels may run slower due to the decreased memory access bandwidth on the accelerator. Having said that, there are still scenarios that would benefit from shared memory on these parts; the runtime just cannot currently determine this, since whether it is beneficial depends on the kernel’s memory access pattern. Thus C++ AMP users are advised to explicitly opt into shared memory on such accelerators based on their kernels’ memory access patterns and the memory access performance characteristics of the target accelerator.

Another notable piece of information in this context is that on AMD Llano and Trinity APUs, shared memory that is only writable (not readable) on the CPU has roughly the same accelerator memory access bandwidth as dedicated memory. Thus, on these AMD APUs it will typically be a win to set the “default_cpu_access_type” to “access_type_write”, since this instructs the runtime to automatically use shared memory that is only writable on the CPU. Such memory continues to have high accelerator access bandwidth, and at the same time there is reduced copying when transferring data from the CPU to the accelerator (when synchronizing from the accelerator to the CPU, the old behavior of copying through an intermediate staging buffer still applies).
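
For illustration, such a global opt-in might look like the minimal sketch below. The choice of access_type_write assumes the target is an AMD APU of the kind described above (the runtime does not make that determination for you), and the call is made once during application startup, before any allocations finalize the setting:

 #include <amp.h>
 using namespace concurrency;
  
 int main()
 {
     accelerator accl;  // default accelerator; assumed to be an AMD APU here
  
     if (accl.supports_cpu_shared_memory) {
         // Opt into shared memory that is only writable on the CPU, for all
         // subsequent allocations on this accelerator. This call must be made
         // before the setting is finalized by the first allocation (for example
         // by constructing an array with access_type_auto CPU access).
         accl.set_default_cpu_access_type(access_type_write);
     }
  
     // ... create arrays / array_views and launch parallel_for_each kernels ...
     return 0;
 }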

No shared memory support for concurrency::texture

If you are wondering whether shared memory is also supported for textures, the answer is no. Texture memory layouts are opaque to the end user, since different GPU vendors lay textures out in swizzled form in memory for better 2D or 3D spatial locality; they are not laid out linearly in memory like array or array_view allocations. Thus, use of shared memory for texture data would have a performance downside – either the textures would have to be laid out linearly in memory, resulting in relatively poor 2D spatial locality, or the data would have to be unswizzled on the fly to present a linear layout on the CPU. Consequently, C++ AMP currently does not support shared memory for textures.

To sum it all up

Starting with Visual Studio 2013, on Windows 8.1 C++ AMP applications can take advantage of memory shared between the CPU and the GPU. On a specific set of accelerators (Intel integrated GPUs and WARP), the runtime automatically uses shared memory by default. When targeting these accelerators, if you use array_view objects for managing data access across the CPU and the accelerator, simply recompiling your application with Visual Studio 2013 is likely to significantly improve your application’s performance.

To obtain the performance benefits of shared memory on other integrated GPU accelerators, you can globally specify the type of shared memory to be used by default for allocations on that accelerator, through a single API call during application startup. Finally, fine-grained control over the use of shared memory is available through the ability to explicitly specify the CPU access_type during construction of array containers.

Use of shared memory can significantly boost the performance of C++ AMP code on integrated GPU and WARP accelerators by eliminating or reducing the cost of data transfers between the CPU and the accelerator. On discrete GPU accelerators too, use of shared memory can yield modest performance gains in some scenarios. The availability of shared memory presents great acceleration opportunities for data parallel computations, by enabling offloading of such computations to a C++ AMP accelerator with no (or little) data transfer overheads to worry about.

I would love to hear your feedback, comments and questions below or in our MSDN forum.