Shared memory support in C++ AMP – Introduction


Several modern CPU parts, such as 3rd Gen (and later) Intel Core processors (e.g. Ivy Bridge and Haswell) and all AMD APUs (e.g. Llano and Trinity), have an on-die integrated GPU accelerator that shares the same physical RAM with the CPU. On such parts, copying data from CPU memory for GPU access (and vice versa) is mostly unnecessary. The same applies to WARP, the C++ AMP CPU fallback accelerator.

As described in an earlier post, the concurrency::array_view type in C++ AMP is designed to free programmers from the burden of managing data transfers to and from accelerators. Users can program against the array_view abstraction and leave it to the runtime to determine whether copying is required to make the data accessible on the target hardware. However, before Visual Studio 2013, the C++ AMP runtime always copied data between the CPU and an accelerator, even on WARP and integrated GPU accelerators where a shared system memory allocation can be accessed efficiently from both the CPU and the accelerator. Starting with Visual Studio 2013, on Windows 8.1, C++ AMP supports CPU/GPU shared memory to eliminate data transfers on such accelerators or significantly improve their performance. I will share the details in a series of blog posts, starting with this one.

Querying shared memory support on a C++ AMP accelerator

With Visual Studio 2013 on Windows 8.1 (and later OS versions), C++ AMP accelerators can optionally support shared memory that is directly accessible from both the CPU and the accelerator. The following property of the accelerator type can be used to query shared memory support on a specific accelerator.

bool get_supports_cpu_shared_memory () const
__declspec(property(get=get_supports_cpu_shared_memory)) bool supports_cpu_shared_memory

Returns a boolean flag indicating whether this accelerator supports shared memory that is accessible both on the accelerator and the CPU.
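
As a rough illustration of how this query might be used, the sketch below filters a list of accelerators down to those that report shared memory support. The helper name with_shared_memory is hypothetical and not part of C++ AMP. It is templated on the accelerator type so that it assumes nothing beyond the get_supports_cpu_shared_memory() getter listed above; in a C++ AMP program it would presumably be instantiated with concurrency::accelerator.

```cpp
#include <vector>

// Hypothetical helper (not part of C++ AMP): returns the accelerators that
// report CPU/accelerator shared memory support. It is a template so that it
// relies only on the get_supports_cpu_shared_memory() getter shown above.
template <typename Accelerator>
std::vector<Accelerator> with_shared_memory(const std::vector<Accelerator>& all)
{
    std::vector<Accelerator> result;
    for (const Accelerator& acc : all)
    {
        // Keep only accelerators whose memory can be shared with the CPU.
        if (acc.get_supports_cpu_shared_memory())
            result.push_back(acc);
    }
    return result;
}
```

With amp.h included, calling with_shared_memory(concurrency::accelerator::get_all()) should yield the subset of accelerators on which shared memory can be requested.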

 

Automatic use of shared memory by C++ AMP runtime

On select accelerators, where the CPU/GPU memory access performance characteristics (bandwidth and latency) of shared memory are known to be exactly the same as those of dedicated CPU-only or GPU-only memory, the C++ AMP runtime will use shared memory by default. This eliminates or significantly reduces the cost of copying data to and from such accelerators, without requiring any changes to existing code. For Visual Studio 2013, this list comprises the integrated GPU accelerators in all 3rd Gen and later Intel Core processors (e.g. Ivy Bridge and Haswell), and the WARP (C++ AMP CPU fallback) accelerator. The list of accelerators where shared memory is used by default will be revised in the future to include new accelerators, depending on their shared memory access performance characteristics.

The following method of the accelerator type can be used to query the default CPU access_type for memory allocations. The default_cpu_access_type setting on an accelerator determines if and how shared memory allocations will be used by default when allocating array(s) on that accelerator or when array_view objects are accessed on the accelerator. In other words, this setting controls the default CPU accessibility of memory allocations on the accelerator.

enum access_type
{
    access_type_none,
    access_type_read,
    access_type_write,
    access_type_read_write = access_type_read | access_type_write,
    access_type_auto,
};

 

access_type accelerator::get_default_cpu_access_type() const
__declspec(property(get=get_default_cpu_access_type)) access_type default_cpu_access_type

Returns the default CPU access_type for array allocations on this accelerator and for implicit allocations when array_view objects are accessed on this accelerator. The possible return values have the following meanings:

  • access_type_none: Dedicated allocation only accessible on the accelerator and not on the CPU.
  • access_type_read: Shared memory allocation accessible on the accelerator and only for reading on the CPU.
  • access_type_write: Shared memory allocation accessible on the accelerator and only for writing on the CPU.
  • access_type_read_write: Shared memory allocation accessible on the accelerator and for both reading and writing on the CPU.
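
To make the meaning of these values concrete, here is a small sketch of helpers that interpret an access_type value. The enum is repeated from the listing above only so that the snippet stands alone; in real code the definition comes from amp.h. The helper names cpu_can_read, cpu_can_write and uses_shared_memory are hypothetical and not part of C++ AMP, and treating access_type_auto as "no CPU access" is an assumption made for this sketch, since it is not one of the return values described above.

```cpp
// Stand-in copy of the enum listed above, so that the sketch compiles on its
// own. With amp.h, use concurrency::access_type instead of this copy.
enum access_type
{
    access_type_none,
    access_type_read,
    access_type_write,
    access_type_read_write = access_type_read | access_type_write,
    access_type_auto,
};

// Hypothetical helpers (not part of C++ AMP).

// True if the CPU may read an allocation made with this access type.
inline bool cpu_can_read(access_type t)
{
    return t == access_type_read || t == access_type_read_write;
}

// True if the CPU may write an allocation made with this access type.
inline bool cpu_can_write(access_type t)
{
    return t == access_type_write || t == access_type_read_write;
}

// True if the allocation is shared with the CPU rather than dedicated to the
// accelerator. Assumption: access_type_auto is reported as not shared here.
inline bool uses_shared_memory(access_type t)
{
    return cpu_can_read(t) || cpu_can_write(t);
}
```

A caller could, for instance, pass the value of an accelerator's default_cpu_access_type property to uses_shared_memory to decide whether explicit copies are likely to be avoided on that accelerator.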

                      

For integrated GPU accelerators in 3rd Gen and later Intel Core processors (e.g. Ivy Bridge and Haswell) and for the WARP accelerator, get_default_cpu_access_type returns access_type_read_write by default.

On all other accelerators, the runtime does not use shared memory by default, and get_default_cpu_access_type returns access_type_none. On such accelerators the default behavior is therefore the same as in Visual Studio 2012: data is copied between the CPU and the GPU. However, the following API on the accelerator type lets users override the default CPU access_type used for all memory allocations underlying array/array_view objects on that accelerator.

bool set_default_cpu_access_type(access_type _Default_cpu_access_type)

Sets the default CPU access_type for all array allocations (and implicit array_view allocations) on this accelerator’s accelerator_view(s). Note that this method can set the default CPU access_type only if no array allocations have been made on this accelerator with the access_type_auto value and no array_view objects have been accessed on this accelerator. This method also throws a runtime_exception with the E_INVALIDARG error code if a _Default_cpu_access_type argument other than access_type_none is passed for an accelerator that does not support shared memory (zero-copy).

The function returns a boolean flag indicating if the default CPU access type for the accelerator was successfully overridden.
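
The rules above suggest a defensive calling pattern, sketched below under some assumptions. The helper name try_set_default_cpu_access is hypothetical and not part of C++ AMP. It is a template so that it depends only on the two accelerator methods described in this post, and it assumes that the runtime_exception thrown by the runtime derives from std::exception. It is meant only for requesting shared memory, that is, for values other than access_type_none.

```cpp
#include <exception>

// Hypothetical helper (not part of C++ AMP): tries to override the default
// CPU access type on an accelerator and reports whether the override took
// effect. Intended for requests other than access_type_none.
template <typename Accelerator, typename AccessType>
bool try_set_default_cpu_access(Accelerator& acc, AccessType desired)
{
    // Requesting shared memory on an accelerator without shared memory
    // support would throw, so check for support first.
    if (!acc.get_supports_cpu_shared_memory())
        return false;

    try
    {
        // Returns false if the default can no longer be overridden, for
        // example because an array_view was already accessed on acc.
        return acc.set_default_cpu_access_type(desired);
    }
    catch (const std::exception&)
    {
        // Assumption: the runtime's exception type derives from
        // std::exception. Treat any failure as "override not applied".
        return false;
    }
}
```

In an application this would be called once during startup, before any array is allocated or any array_view is accessed on the accelerator, for example as try_set_default_cpu_access(acc, concurrency::access_type_read_write).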

 

The recommended usage of this feature is for an application to set, during startup, the appropriate default CPU access_type for memory allocations underlying array/array_view objects on the target accelerator. Whether shared memory is beneficial on a specific accelerator, and if so which CPU access_type is appropriate for an allocation, depends on both the CPU/GPU memory access pattern of the application and the CPU/GPU memory access performance characteristics of shared memory allocations on that accelerator. Those performance characteristics can be profiled using micro-benchmarks. Such profiling should be performed either offline (for example, during application installation) or the first time the application runs on a system. The results can then be cached and used in subsequent runs to set the accelerator’s default CPU access_type for array/array_view allocations, without repeating the time-consuming profiling step on each run.

In closing

In this post we looked at the introductory concepts pertaining to shared memory support in C++ AMP. Subsequent posts in this series will dive deeper into the functional and performance aspects of shared memory use in the C++ AMP runtime – stay tuned!

I would love to hear your feedback, comments and questions below or in our MSDN forum.


Comments (4)

  1. LKeene says:

    Why is it that the AMD APUs are not listed as automatically making use of shared memory by default?

  2. ronag says:

    Do we have shared memory support for textures as well?

  3. Alex Voicu says:

    LKeene: I'd hazard a guess that it is due to the unequal performance that AMD APUs have when dealing with shared memory, based on the intended access pattern, due to the fact that they have separate buses as opposed to a single, homogeneous fabric (the L3 in Intel's case). Slide 35 from the following presentation by Boudier and Sellers is particularly telling: amddevcentral.com/…/1004_final.pdf. The same constraints pretty much apply to Trinity (the newer APUs) and, as seen in the above reference, make APU performance with shared memory strictly inferior to performance without it, thus violating one of the requirements Amit mentioned. One would hope that the upcoming Kaveri APUs will ease these constraints.

  4. ronag,

    Shared memory support is not available for textures.

    LKeene,

    Automatic use of shared memory is not turned on by default for AMD APUs because of the differences in memory bandwidth when accessing the data on the accelerator, between shared and dedicated (accessible only on the accelerator) memory.

    On the current generation of AMD APUs (Llano and Trinity), the memory access bandwidth on the accelerator for dedicated memory is much higher than for shared memory that is both readable and writable on the CPU. Consequently, automatic use of shared memory on these APUs may result in a loss of performance: some copying of data is saved, but the kernels may run slower due to the decreased bandwidth on the accelerator. Having said that, there would still be scenarios that benefit from use of shared memory on these cards. The runtime cannot currently determine this, because whether shared memory is beneficial depends on the kernel’s memory access pattern. Thus C++ AMP users are advised to explicitly employ shared memory on these accelerators based on their kernels’ memory access patterns.

    Another important piece of information in this context is that shared memory that is only writable (not readable) on the CPU has roughly the same accelerator memory access bandwidth as dedicated memory. Thus on AMD APUs it will typically be a win to set the “default_cpu_access_type” to “access_type_write”, since this instructs the runtime to automatically use shared memory that is only writable on the CPU. Such memory continues to have high accelerator access bandwidth, and at the same time there is reduced copying when transferring data from the CPU to the accelerator (when synchronizing from the accelerator to the CPU, the old behavior of copying through an intermediate staging buffer still applies). The reason we did not turn on this behavior automatically for AMD APUs is that (unfortunately) there is no good way of identifying whether an accelerator is an AMD APU accelerator.

    Finally, as Alex noted, the upcoming Kaveri generation of AMD APUs will narrow the difference in bandwidth between shared and dedicated memory (but likely not eliminate the gap completely). Overall, the safest strategy for C++ AMP programmers is to profile the memory access characteristics for shared memory on their accelerator and use that to set the “default_cpu_access_type” for the accelerator during application startup. We will share some sample code for such profiling in a future post.

    I will cover both of these topics in greater detail in a subsequent post.