accelerator_view selection for C++ AMP parallel_for_each and copy operations

parallel_for_each and copy are among the most common accelerator operations performed in C++ AMP code. A parallel_for_each invocation denotes the execution of a data parallel C++ AMP kernel on a specific accelerator_view, while a copy invocation denotes a transfer of data between a source and a destination container allocated in host or accelerator memory. This blog post dives into the details of how the C++ AMP runtime selects the target accelerator_view(s) for these common accelerator operations.

As prerequisite reading, I encourage you to read our previous blog post on the notion of default accelerator in the C++ AMP programming model.

parallel_for_each target selection

An accelerator_view denotes an execution context on an accelerator, and a parallel_for_each kernel executes on a specific accelerator_view. C++ AMP provides several overloads of the parallel_for_each template function which, for the purposes of this discussion about where a parallel_for_each kernel is executed, can be broadly divided into two categories:

  1. Overloads that explicitly specify the target accelerator_view.
  2. Overloads that leave the selection of the target accelerator_view (the compute placement) to the C++ AMP runtime.

Target accelerator_view explicitly specified

When an accelerator_view argument is explicitly specified in a parallel_for_each call, the kernel is executed on the specified accelerator_view. Any array_views captured in the parallel_for_each kernel are implicitly synchronized to the specified target accelerator_view. array, texture and writeonly_texture_view objects, in contrast, are strictly bound to a specific accelerator_view for their entire lifetime, and any objects of these types captured in the kernel must be associated with the accelerator_view specified as the parallel_for_each target. Capturing an array, texture or writeonly_texture_view object that is associated with a different accelerator_view than the specified parallel_for_each execution target results in a runtime_exception.
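For illustration, here is a minimal sketch of the mismatch case (the names “viewA”, “viewB” and “arrA” are made up; it assumes the same context as the other snippets in this post, i.e. <amp.h> included, using namespace concurrency, and a size variable defined):

 accelerator_view viewA = accelerator().create_view();
 accelerator_view viewB = accelerator().create_view();
  
 // "arrA" is bound to "viewA" for its entire lifetime
 array<float> arrA(size, viewA);
  
 try {
     // The explicitly specified target "viewB" differs from the accelerator_view
     // that the captured array "arrA" is associated with
     parallel_for_each(viewB, arrA.extent, [&arrA](index<1> idx) restrict(amp) {
         arrA[idx] = 0.0f;
     });
 }
 catch (const runtime_exception&) {
     // The accelerator_view mismatch is reported as a runtime_exception
 }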

No target accelerator_view specified

When an accelerator_view execution target is not explicitly specified in the parallel_for_each call, the target accelerator_view for executing the kernel is automatically selected by the runtime.

If the kernel of such a parallel_for_each invocation captures an array, texture or writeonly_texture_view object, the runtime chooses as the target the accelerator_view with which the captured array, texture or writeonly_texture_view object(s) are associated. When multiple objects of these types are captured, they must all be associated with the same accelerator_view; a violation of this rule results in a runtime_exception. Any array_view objects captured in the kernel are implicitly synchronized to the accelerator_view target thus selected.
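As a hypothetical sketch of such a violation (the names “view1”, “view2”, “arrX” and “arrY” are made up; the same context as the other snippets in this post is assumed), capturing arrays bound to different accelerator_views in a single kernel without an explicit target is an error:

 accelerator_view view1 = accelerator().create_view();
 accelerator_view view2 = accelerator().create_view();
  
 array<float> arrX(size, view1);    // bound to "view1"
 array<float> arrY(size, view2);    // bound to "view2"
  
 try {
     // No explicit target is specified; the runtime would infer the target from
     // the captured arrays, but they are associated with different
     // accelerator_views, so the invocation results in a runtime_exception.
     parallel_for_each(arrX.extent, [&arrX, &arrY](index<1> idx) restrict(amp) {
         arrX[idx] = arrY[idx];
     });
 }
 catch (const runtime_exception&) {
     // Captured arrays must all be associated with the same accelerator_view
 }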

If no array, texture or writeonly_texture_view objects are captured in a parallel_for_each kernel and no target accelerator_view is explicitly specified, the C++ AMP open specification does not mandate any policy regarding the runtime’s selection of the target accelerator_view. In such scenarios the selection of the target accelerator_view for executing the kernel is the prerogative of the C++ AMP runtime implementation.

Following is the algorithm used by Microsoft’s C++ AMP implementation for selecting the target accelerator_view in such scenarios. The goal of this algorithm is to find a target accelerator_view on the default accelerator based on the current caching state of the array_view objects captured in the parallel_for_each kernel, such that no implicit data transfers (for referenced array_view objects) are required to execute the kernel on the selected accelerator_view.

  1. Determine the set of accelerator_views on which ALL array_views referenced in the parallel_for_each kernel have valid cached copies. An array_view whose contents have been discarded can be accessed on any accelerator_view without requiring an implicit data transfer, and hence does not participate in determining this set.
  2. From the above set, filter out any accelerator_views that are not on the default accelerator. Additionally filter out accelerator_views that do not have the capabilities required by the parallel_for_each kernel (debug intrinsics, number of writable array_view/array/writeonly_texture_view objects).
  3. The default accelerator_view of the default accelerator is selected as the parallel_for_each execution target, if it is contained in the set from step 2, or if that set is empty. Otherwise, any accelerator_view from the set is arbitrarily selected as the target.

Let us look at some examples.

Consider a scenario where two array_views, “arrViewB” and “arrViewC”, are captured in a parallel_for_each kernel, and the target accelerator_view for the parallel_for_each invocation is not explicitly specified. The content of “arrViewB” was previously cached on accelerator_view “acclView1” and the content of “arrViewC” has been discarded. Following the algorithm described above, the C++ AMP runtime selects “acclView1” as the target accelerator_view for executing the kernel, a choice that requires no implicit data transfers for the captured array_views.

 std::vector<float> vA(size), vB(size);
  
 // Initialize vA and vB
 ...
  
 array_view<float> arrViewA(size, vA), arrViewB(size, vB);
 accelerator_view acclView1 = accelerator().create_view();
  
 // The target accelerator_view "acclView1" is explicitly specified for this
 // parallel_for_each invocation. Following this parallel_for_each invocation,
 // the content of "arrViewB" would be cached on "acclView1"
 parallel_for_each(acclView1, arrViewB.extent, [arrViewA, arrViewB](index<1> idx) restrict(amp) {
     arrViewB[idx] += arrViewA[idx];
 });
  
 std::vector<float> vC(size);
 array_view<float> arrViewC(size, vC);
 arrViewC.discard_data();
  
 // The target accelerator_view for this parallel_for_each invocation is not
 // explicitly specified and the runtime chooses "acclView1" as the target
 // accelerator_view to execute this kernel based on the caching state of the
 // captured array_views.
 parallel_for_each(arrViewC.extent, [=](index<1> idx) restrict(amp) {
     arrViewC[idx] = fast_math::sqrt(arrViewB[idx]);
 });

In the next example, an array explicitly allocated on accelerator_view “acclView2” is the data source of the array_view “arrViewD”. The parallel_for_each invocation captures array_view “arrViewD” in its kernel and an explicit accelerator_view target has not been specified. Here, the runtime selects accelerator_view “acclView2” as the execution target, since “acclView2” is the only location with a valid cached copy of the array_view “arrViewD” content, and this choice of execution target avoids any implicit data transfers for the captured array_view.

 std::vector<float> vD(size);
  
 // Initialize vD
 ...
  
 accelerator_view acclView2 = accelerator().create_view();
  
 // An array explicitly allocated on accelerator_view “acclView2”
 array<float> arrD(size, vD.begin(), acclView2);
 array_view<float> arrViewD(arrD);
  
 // The target accelerator_view for this parallel_for_each invocation is not
 // explicitly specified and the runtime chooses "acclView2" as the target
 // accelerator_view for executing this kernel.
 parallel_for_each(arrViewD.extent, [=](index<1> idx) restrict(amp) {
     arrViewD[idx] *= arrViewD[idx];
 });

Selecting source and destination locations for copy

A C++ AMP copy operation denotes a transfer of data between a source and a destination buffer allocated in host or accelerator memory. An STL iterator or an array is bound to a specific memory allocation: STL iterators reference host memory, while an array is bound to a specific accelerator_view and references a memory allocation on that accelerator_view. When either of these is used as the source or destination of a copy operation, the actual buffer to be copied to/from and its location (host CPU or an accelerator_view) is unambiguously defined.

However, an array_view denotes an abstract view of a data container which is ubiquitously accessible on the host or any accelerator_view, and is not bound to a specific location. So when an array_view object is the source or destination of a copy operation, what is the underlying buffer (and its location) that the data is copied from/to? This choice is left to the implementation since, functionally, any location would be correct; after all, an array_view is accessible from any location. Here I will describe the choices made by Microsoft’s C++ AMP implementation in such scenarios, which are driven by the current caching state of the source/destination array_view object(s) with the objective of optimizing the performance of such copy operations.
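Before turning to the array_view rules, here is a quick sketch of the unambiguous case (the names “vSrc”, “vDst”, “someView” and “arrOnAcc” are made up; the same context as the other snippets in this post is assumed):

 std::vector<float> vSrc(size), vDst(size);
 accelerator_view someView = accelerator().create_view();
 array<float> arrOnAcc(size, someView);
  
 // The source is host memory referenced by STL iterators and the destination is
 // an array bound to "someView", so both locations are unambiguously defined.
 copy(vSrc.begin(), vSrc.end(), arrOnAcc);
  
 // Likewise, copying from the array back to host memory through an iterator.
 copy(arrOnAcc, vDst.begin());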

First the destination location is determined. As mentioned earlier, it is unambiguously defined when the destination is an STL iterator or an array object. If the destination is an array_view, the location of the source container/allocation underlying the array_view is the chosen copy destination. Now, if the source of the copy is an array_view object, there may be multiple valid cached copies of the data on different locations. Among the available options, the runtime picks the source copy from which transfers to the destination location would be fastest. Its order of preference for choosing the source copy is as follows:

  1. A copy on the same location as the destination of the copy operation. Copies between buffers on the same location are the fastest.
  2. A copy on the CPU – copies between the CPU and accelerator memory are faster than copying across accelerators.
  3. Any valid copy of the source data.

The snippet below illustrates how these rules play out:

 accelerator_view acclView = accelerator().create_view();
 accelerator_view acclView2 = accelerator(accelerator::direct3d_warp).create_view();
  
 std::vector<float> srcData(size);
 array_view<const float> srcArrView(size, srcData);
  
 array<float> dstData(size, acclView), dstData2(size, acclView2);
 array_view<float> dstArrView(dstData);
  
 parallel_for_each(acclView, srcArrView.extent, [srcArrView, dstArrView](index<1> idx) restrict(amp) {
     dstArrView[idx] = srcArrView[idx];
 });
  
 // Destination location for this copy is "acclView" (dstArrView's data source location)
 // The contents of "srcArrView" are currently cached both on the CPU and the
 // accelerator_view "acclView". This copy operation uses the cached copy on "acclView"
 // as the copy source, since copying data between buffers on the same accelerator_view
 // is much faster (using a kernel) compared to transferring data from the CPU.
 copy(srcArrView, dstArrView);
  
 // Destination location for this copy is "acclView2" (array dstData2's location)
 // The contents of srcArrView are currently cached both on the CPU and the
 // accelerator_view "acclView". This copy operation uses the cached copy on the CPU
 // as the copy source, since it is faster to copy from the CPU to an accelerator,
 // compared to copying across different accelerators.
 copy(srcArrView, dstData2);

In closing

I hope the information in this post helps you better understand the performance characteristics of parallel_for_each and copy operations involving array_views. It should be particularly useful in making an appropriate choice between explicitly specifying the target accelerator_view and letting the runtime pick one for your parallel_for_each invocations.

As always, I would love to hear your questions, thoughts, comments or feedback below or in our MSDN forum.