Understanding accelerator_view queuing_mode in C++ AMP

Hello, I am Amit Agarwal, a developer on the C++ AMP team. In this blog post I will share some details pertaining to scheduling of commands (resulting from C++ AMP API calls) on C++ AMP accelerators and how you can exercise control over these scheduling aspects in C++ AMP as per the needs of your application.

C++ AMP accelerator_views and GPU commands

A concurrency::accelerator_view is an isolated resource and execution context/domain on an accelerator. Memory resources such as the concurrency::array data container are allocated on and associated with exactly one accelerator_view and any operations on this data are scheduled as commands on that accelerator_view.

Two types of operations result in commands being scheduled on accelerator_views:

1) Copying data to and/or from data containers (such as concurrency::array) on accelerator_views, results in copy command(s) being scheduled on the accelerator_view(s) associated with the source and/or destination data containers.

2) A concurrency::parallel_for_each call results in a data-parallel kernel execution command being scheduled on the selected accelerator_view.

The following code snippet illustrates this:

// A std::vector container allocated on the host
std::vector<int> hostVector(numElements);

// Default accelerator_view on the WARP accelerator
accelerator warpAccl = accelerator(accelerator::direct3d_warp);
accelerator_view warpAcclView = warpAccl.default_view;

// Default accelerator_view on the REF accelerator
accelerator refAccl = accelerator(accelerator::direct3d_ref);
accelerator_view refAcclView = refAccl.default_view;

// A 1-D integer array of length numElements allocated on warpAcclView
array<int> deviceArray1(numElements, warpAcclView);

// A 1-D integer array of length numElements allocated on refAcclView
array<int> deviceArray2(numElements, refAcclView);

// Copy from hostVector to deviceArray1 results in a copy
// command being scheduled on warpAcclView
copy(hostVector.begin(), hostVector.end(), deviceArray1);

// Copy from deviceArray1 to deviceArray2 results in a copy command being
// scheduled on warpAcclView followed by another copy command on refAcclView
copy(deviceArray1, deviceArray2);

const int constVal = 10;

// parallel_for_each call results in a kernel execution command being
// scheduled on warpAcclView; the array is captured by reference and
// the scalar constVal is captured by value
parallel_for_each(deviceArray1.extent, [=, &deviceArray1](index<1> idx) restrict(amp) {
    deviceArray1(idx) += constVal;
});
  

DMA buffers – Inner workings of DirectX runtime

Let us briefly look at the inner workings of the DirectX runtime with regard to submitting a GPU command for execution on the hardware. The unit of scheduling on hardware is a DMA buffer which is an aggregation of commands and references a set of “allocations” which correspond to the GPU memory resources used by the commands in that DMA buffer. The Windows Device Driver Model (WDDM) virtualizes GPU resources and as part of preparing a DMA buffer for execution on hardware, the DirectX kernel ensures that all “allocations” referenced by the DMA buffer are paged into GPU memory if needed before scheduling the DMA buffer for execution on the hardware.

As described above, there is a non-trivial cost associated with submitting commands to the hardware. Batching multiple commands into a single DMA buffer for submission to the hardware, amortizes the cost of DMA buffer preparation across multiple commands compared to preparing a separate DMA buffer for each GPU command. The savings achieved through such batching of commands may be significant when each command by itself performs a small amount of work making the DMA buffer preparation cost non-trivial in comparison to the command execution time.

accelerator_view queuing_mode – Exposing device command queue through C++ AMP API

C++ AMP allows you to control the behavior described in the previous section through the “queuing_mode” setting on accelerator_views of a DirectX 11 capable accelerator. The queuing_mode setting determines how commands scheduled on that accelerator_view are submitted to the hardware for execution. The queuing_mode for an accelerator_view can be specified (as an optional parameter) when creating a new accelerator_view using the accelerator::create_view method. There are two possible values:

enum queuing_mode {
    queuing_mode_immediate,
    queuing_mode_automatic
};
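As a sketch of how this looks in code (this requires Visual Studio with C++ AMP support, i.e. amp.h on Windows; the accelerator selection here is just the default accelerator for illustration):

```cpp
#include <amp.h>
using namespace concurrency;

int main()
{
    accelerator accl;  // the default accelerator

    // An accelerator_view that submits each command to the
    // hardware immediately
    accelerator_view immediateView = accl.create_view(queuing_mode_immediate);

    // An accelerator_view that lets the runtime batch commands;
    // queuing_mode_automatic is also the default when create_view
    // is called with no argument
    accelerator_view automaticView = accl.create_view(queuing_mode_automatic);

    // The queuing_mode of an existing accelerator_view can be inspected
    queuing_mode qm = immediateView.queuing_mode;
}
```

Each call to create_view produces a distinct accelerator_view; the queuing_mode is fixed at creation time for the lifetime of that view.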

A queuing_mode setting of “queuing_mode_immediate” signifies that all commands scheduled on that accelerator_view are immediately submitted to the hardware for execution. Thus for this queuing_mode setting, each GPU command scheduled on the accelerator_view results in an additional cost of preparing and submitting a DMA buffer to the hardware for execution.

On the other hand, a setting of “queuing_mode_automatic” indicates to the runtime that it is allowed to delay the submission of commands scheduled on that accelerator_view and submit them in batches to the hardware for execution, thereby amortizing the cost of DMA buffer preparation and submission (as discussed in the previous section). Commands scheduled on an accelerator_view with “queuing_mode_automatic” may not be submitted to the hardware immediately; instead, the device driver may queue them in an internal command queue. You may be wondering: “How does the runtime decide when to submit a batch of commands to the hardware for execution?” Keep on reading.

Flushing the command queue

Three events can cause the command queue to be flushed, resulting in the queued commands being submitted to the hardware:

1) Copying the contents of an array to the host or to another accelerator_view results in all previous commands referencing that array resource (including the copy command itself) being submitted for execution on the hardware.

2) Calling the “accelerator_view::flush” or “accelerator_view::wait” method results in all queued commands being submitted to the hardware for execution. The “flush” method just submits the commands for execution, while the “wait” method additionally blocks until the execution of the commands finishes.

3) The IHV device driver internally uses a heuristic to determine when commands are submitted to the hardware for execution. While the actual heuristic differs across IHVs, a common determinant is the amount of GPU memory resources referenced by the collection of commands to be batched: when a new command is queued, if the driver finds that aggregating it with the previous commands in the queue would cause the total resource requirements of the collection to exceed the available physical resources on the device, the previous commands are batched into a DMA buffer and submitted to the device.
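The second trigger, explicit flushing, can be sketched as follows (again assuming a Visual Studio C++ AMP environment; the function and array names here are hypothetical):

```cpp
#include <amp.h>
using namespace concurrency;

void double_on_gpu(array<int, 1>& data)
{
    // The accelerator_view the array was allocated on; the kernel
    // execution command is scheduled on this view
    accelerator_view av = data.accelerator_view;

    parallel_for_each(data.extent, [&data](index<1> idx) restrict(amp) {
        data[idx] *= 2;
    });

    // flush: submit any queued commands to the hardware without blocking
    av.flush();

    // ... CPU work here can overlap with the GPU executing the kernel ...

    // wait: submit any remaining queued commands and block until
    // all of them have finished executing
    av.wait();
}
```

After wait() returns, all commands previously scheduled on that accelerator_view have completed; flush() alone makes no such guarantee, it only ensures the commands are on their way to the hardware.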

Choosing between immediate and deferred queuing_mode

This raises the question: if “queuing_mode_automatic” offers better performance than “queuing_mode_immediate”, why not always use the former? There are two reasons you may prefer “queuing_mode_immediate”:

1) Windows uses a mechanism called “Timeout Detection and Recovery” (TDR) to prevent processes from hogging the GPU, rendering the system display unresponsive, or denying other processes a fair share of the GPU. If a DMA buffer takes longer than 2 seconds to execute, the Windows kernel initiates TDR to reset the offending accelerator_view and aborts all commands scheduled on that accelerator_view. As described previously, the heuristic used by device drivers for batching commands largely depends on the amount of resources used by the commands; the driver cannot predict how long a command will take to execute. Hence, if your commands use a small amount of resources but take a long time to execute, the driver may batch multiple such long-running commands together, creating a DMA buffer that takes longer than 2 seconds to execute. When this happens, TDR kicks in, resetting the accelerator_view and aborting all commands and resources on that accelerator_view.

Thus, if your commands are long running, it is recommended that you use “queuing_mode_immediate” to prevent batching of multiple long-running commands. For long-running commands the DMA buffer creation overhead is insignificant compared to the execution time of the command, so the performance impact is negligible. An alternate option (on Windows 8 only) is to create the accelerator_view with a setting that allows DMA buffers originating from that accelerator_view to run longer than 2 seconds as long as there is no contention for that GPU from the OS or other applications. Further details are beyond the scope of this post and will be covered in a future post.

2) On an accelerator_view with a “queuing_mode_automatic” setting, a command sits in the software queue until its submission to hardware is triggered by one of the three events listed above, which may effectively increase the latency of the command’s execution. This may be undesirable for applications that are sensitive to command execution latency, in which case a setting of “queuing_mode_immediate” is more appropriate. Having said that, this is probably not a common scenario unless you wish to exercise very fine-grained control over the scheduling of your commands on accelerator_views.

The C++ AMP runtime by default uses a queuing_mode setting of “queuing_mode_automatic” on accelerator_views unless explicitly overridden by you in your code when creating new accelerator_views.
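A quick sketch of checking this default (hypothetical program, assuming a Visual Studio C++ AMP environment):

```cpp
#include <amp.h>
#include <iostream>
using namespace concurrency;

int main()
{
    accelerator accl;  // the default accelerator

    // Unless you create your own views with an explicit queuing_mode,
    // the default_view uses queuing_mode_automatic
    if (accl.default_view.queuing_mode == queuing_mode_automatic) {
        std::wcout << L"default_view batches commands (automatic)" << std::endl;
    }
}
```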

Changes in Beta, since the Developer Preview

The descriptions above apply to the Visual Studio Beta release of C++ AMP. I would like to note a couple of changes we made in this area in Beta, since the Developer Preview.

First, in the Developer Preview the names of the members in the queuing_mode enumeration were "immediate" and "deferred", which have been changed in Beta to "queuing_mode_immediate" and "queuing_mode_automatic" respectively.

Second, the default queuing_mode setting in the Developer Preview was "immediate" (now named "queuing_mode_immediate"), which has been switched to “queuing_mode_automatic” in Beta.

In closing

To summarize, if your application results in numerous short-running GPU commands, the default setting of “queuing_mode_automatic” is the recommended choice of queuing_mode for your accelerator computations. However, if long-running kernels are the norm in your application or command execution latency is critical, you should carefully consider the choice of queuing_mode for your accelerator_views – a setting of “queuing_mode_immediate” is generally more appropriate for such scenarios.

Happy acceleration, and feel free to ask questions below, or even better in our MSDN concurrency forum!