concurrency::array_view - Caching and data coherence

The previous posts in this series on C++ AMP array_view covered:

  1. Introduction to array_view and some of its key semantic aspects
  2. Implicit synchronization on destruction of array_views
  3. array_view discard_data function

In this post I will talk about the caching scheme underlying array_views.

Caching and the core coherence protocol

The runtime is responsible for presenting a coherent view of the underlying data source to the application whenever it is accessed through an array_view, on any location. There are two aspects to providing this coherence. First, synchronization is performed (only) on demand, at the point in the program where the array_view is accessed. The amount of data transferred (if a transfer is required as part of the synchronization) is kept to the minimum essential for providing access to just the section of data referenced by the array_view. Second, the runtime caches any copy made on a location (accelerator_view or CPU) other than the home location of the array_view’s data source. These cached copies are reused for subsequent accesses, until a copy is invalidated by a modification on another location (different from the cached copy’s location).
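
For example, in the snippet below (assuming #include <amp.h>, #include <vector> and using namespace concurrency;, as in the other snippets in this post), the comments indicate where, under this scheme, the implicit transfers would take place:

 std::vector<float> data(1024, 1.0f);
 array_view<float> dataView(1024, data); // no transfer: construction merely
                                         // registers the view with the runtime
  
 parallel_for_each(dataView.extent, [=](index<1> idx) restrict(amp) {
     dataView[idx] *= 2.0f;              // first accelerator access: the data
                                         // referenced is copied to the accelerator
 });                                     // results stay cached on the accelerator
  
 float first = dataView[0];              // first CPU access after the kernel:
                                         // the modified data is synchronized back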

The previous blog post defines the “data source” of an array_view. For each unique data source (essentially concurrency::array objects and top-level array_views created from CPU data), the runtime maintains a directory that tracks the caching state of all array_views referencing that data source. At any given point in the program, an array_view’s caching state can be invalid (aka un-cached), read-only (aka shared) or read-write (aka dirty and exclusive); this denotes the caching state of the section of the data source that the array_view references (note that from a caching perspective, write-only and read-write are the same).

Two array_views are said to overlap when they have the same data source and at least one element of the underlying data source is referenced by both the array_views.
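
For instance, in this small illustration (again assuming #include <amp.h>, #include <vector> and using namespace concurrency;):

 std::vector<float> data(100, 0.0f);
 array_view<float> base(100, data);              // references elements [0, 100)
  
 array_view<float> left  = base.section(0, 60);  // references elements [0, 60)
 array_view<float> right = base.section(40, 60); // references elements [40, 100)
 // "left" and "right" overlap: they have the same data source and both
 // reference the elements in [40, 60)
  
 array_view<float> tail = base.section(60, 40);  // references elements [60, 100)
 // "left" and "tail" do NOT overlap, even though they share a data source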

The following invariants always hold for a data source’s cache directory in a well formed C++ AMP program (as defined by sections 8.2, 8.3 and 8.4 of the C++ AMP open specification and as described in the introductory post of this series):

a) An array_view that is dirty (has read-write access) can have a valid exclusive cached copy of the data on one and only one location.

b) An array_view can have multiple read-only (shared) cached copies of data on different locations.

c) When an array_view is dirty (has read-write access) with an exclusive cached copy on a specific location, other overlapping array_views can simultaneously have shared and/or exclusive cached copies on (and only on) the same location.

d) When an array_view has shared cached copies on one or more locations, other overlapping array_views can simultaneously have shared cached copies on the same or other locations.

At the heart of array_view cache management is an algorithm that performs the necessary data transfers and cache directory state updates when read and/or write access to an array_view is required on a specific accelerator_view or the CPU. The following is a much-simplified gist of this algorithm:

a) No data transfer is needed if the array_view for which access is requested (referred to as current array_view henceforth), or another array_view that subsumes (entirely contains) the section of data referenced, already has a valid cached copy on the target accelerator_view.

b) Otherwise:

a. All overlapping array_views that are dirty (have read-write access) are synchronized to the home storage location. The caching state of those overlapping array_views is also updated appropriately.

b. If the location on which access to the current array_view has been requested is different from the home storage location of the data source, the section of data referenced by the current array_view is copied from the home storage location to the target location. The caching state of the current array_view is also updated appropriately. (A simplified sketch of this bookkeeping appears after the note below.)

[Note: Details about how “discarded” data or “refresh” is handled have been intentionally elided for simplicity.]
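
To make the bookkeeping concrete, below is a minimal model of this algorithm in ordinary C++. It is purely illustrative: none of these types (CacheDirectory, ViewEntry, CachingState) exist in the C++ AMP runtime, locations are modeled as plain strings, and details such as invalidating other cached copies on a write are elided just as in the gist above.

 #include <string>
 #include <vector>
  
 // Hypothetical model only: none of these types exist in the C++ AMP runtime.
 enum class CachingState { Invalid, ReadOnly, ReadWrite };
  
 struct ViewRange { int begin, end; }; // section of the data source referenced
  
 inline bool Overlaps(ViewRange x, ViewRange y) {
     return x.begin < y.end && y.begin < x.end;
 }
  
 inline bool Subsumes(ViewRange outer, ViewRange inner) {
     return outer.begin <= inner.begin && inner.end <= outer.end;
 }
  
 struct ViewEntry {
     ViewRange    range;    // elements of the data source this view references
     std::string  location; // "cpu" or the name of an accelerator_view
     CachingState state;
 };
  
 // One directory per data source, tracking all array_views that reference it.
 struct CacheDirectory {
     std::string            homeLocation; // where the data source itself lives
     std::vector<ViewEntry> views;
  
     // Called when "current" requires read and/or write access on "target".
     void Synchronize(ViewEntry& current, const std::string& target, bool forWrite) {
         // Step (a): no transfer is needed if the current view, or a view that
         // subsumes its range, already has a valid cached copy on the target.
         for (const ViewEntry& v : views) {
             if (v.location == target && v.state != CachingState::Invalid &&
                 Subsumes(v.range, current.range)) {
                 current.location = target;
                 current.state = forWrite ? CachingState::ReadWrite
                                          : CachingState::ReadOnly;
                 return;
             }
         }
         // Step (b.a): flush all overlapping dirty views to the home location
         // (the actual copy is elided here) and update their caching state.
         for (ViewEntry& v : views) {
             if (v.state == CachingState::ReadWrite &&
                 Overlaps(v.range, current.range)) {
                 v.state = CachingState::ReadOnly;
             }
         }
         // Step (b.b): if the target differs from the home location, copy just
         // the section referenced by the current view from home to the target
         // (copy elided), then update the current view's caching state.
         current.location = target;
         current.state = forWrite ? CachingState::ReadWrite : CachingState::ReadOnly;
     }
 };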

Guidelines with regard to caching and coherence

Below are some general guidelines to follow to avoid unnecessary data transfers when using array_views in your C++ AMP code. The Concurrency Visualizer in Visual Studio reports the actual amount of data implicitly transferred when using array_views. This is a useful tool for detecting whether your use of array_views results in data transfers exceeding what is actually required, due to a violation of one or more of these guidelines.

Use section or projection

Guideline A: Use section or projection to obtain an array_view that references just the portion of the data that needs to be accessed.

This minimizes the size of the implicit data transfers that happen when the array_view is accessed, limiting them to the portion of the data source referenced by the array_view.

Note that even when an array_view references just a portion of a data source, in its first release (Visual Studio 2012) the C++ AMP runtime allocates memory for the entire size of the data source on each location where it is accessed. Hence, while use of sections helps minimize the size of the implicit data transfers (just the section of data referenced by the array_view is transferred), you cannot use sections to reduce the working set size of your algorithm when the entire data source does not fit in the available accelerator memory. One of our previous blog posts describes various techniques for chunking large data across multiple C++ AMP kernels.

The following example code violates “Guideline A”, resulting in more data than required being transferred from the host to the accelerator:

 // Multiply a matrix with a specified row (rowIndex) of another matrix
 array_view<const float, 2> matrix1View(numRows, numCols, pMatrix1Data);
  
 // Guideline A violation: Captures the entire matrix2 in the parallel_for_each
 // though just one of the rows needs to be accessed on the accelerator
 array_view<const float, 2> matrix2View(numRows, numCols, pMatrix2Data);
  
 array_view<float> outVectorView(numRows, pOutVectorData);
 outVectorView.discard_data();
  
 parallel_for_each(extent<1>(numRows), [=](index<1> idx) restrict(amp) {
     array_view<const float> currMatrix1Vector = matrix1View[idx[0]];
     array_view<const float> matrix2Vector = matrix2View(rowIndex);
  
     float temp = 0.0f;
     for (int i = 0; i < matrix2Vector.extent[0]; ++i) {
         temp += currMatrix1Vector(i) * matrix2Vector(i);
     }
  
     outVectorView(idx) = temp;
 });

 

Following “Guideline A” minimizes the amount of data transferred and leads to better-performing code, as shown below:

 // Multiply a matrix with a specified row of another matrix
 array_view<const float, 2> matrix1View(numRows, numCols, pMatrix1Data);
 array_view<const float, 2> matrix2View(numRows, numCols, pMatrix2Data);
  
 // Using Guideline A: Use projection to create an array_view on the host
 // representing the row of matrix2 that needs to be accessed on the accelerator
 array_view<const float> matrix2VectorView = matrix2View(rowIndex);
  
 array_view<float> outVectorView(numRows, pOutVectorData);
 outVectorView.discard_data();
  
 parallel_for_each(extent<1>(numRows), [=](index<1> idx) restrict(amp) {
     array_view<const float> currMatrix1Vector = matrix1View[idx[0]];
  
     float temp = 0.0f;
     for (int i = 0; i < matrix2VectorView.extent[0]; ++i) {
         temp += currMatrix1Vector(i) * matrix2VectorView(i);
     }
  
     outVectorView(idx) = temp;
 });

 

array_view object lifetime

Guideline B: If you need to access all or a section of a data source’s contents multiple times (such as in multiple concurrency::parallel_for_each invocations), create an array_view referencing the section of data to be accessed, and do not delete this array_view object until all accesses have been made.

When an array_view is destructed, its entry is removed from the cache directory of the array_view’s data source and the runtime loses the existing caching information/state for the array_view. Creating a new array_view in such scenarios may result in additional data transfers even if the cached copies corresponding to the old destructed array_view objects were valid.

The example code below violates “Guideline B” resulting in unnecessary data transfers: 

 void MatrixVectorMultiply(const float *pMatrixData,
                           const float *pInVectorData,
                           float *pOutVectorData,
                           int numRows,
                           int numCols)
 {
     array_view<const float, 2> matrixView(numRows, numCols, pMatrixData);
     array_view<const float> inVectorView(numCols, pInVectorData);
  
     array_view<float> outVectorView(numRows, pOutVectorData);
     outVectorView.discard_data();
  
     parallel_for_each(extent<1>(numRows), [=](index<1> idx) restrict(amp) {
         array_view<const float> currMatrixVector = matrixView(idx[0]);
  
         float temp = 0.0f;
         for (int i = 0; i < inVectorView.extent[0]; ++i) {
             temp += currMatrixVector(i) * inVectorView(i);
         }
  
         outVectorView(idx) = temp;
     });
  
     outVectorView.synchronize();
 }
  
 // Guideline B violation: The matrix data is used on the accelerator
 // in multiple parallel_for_each invocations, multiplying it with 2
 // different vectors. But since we recreate the matrix array_view on each
 // call, the runtime is not able to reuse the cached data from the first
 // parallel_for_each.
 MatrixVectorMultiply(pMatrixData, pInVector1Data, pOutVector1Data, numRows, numCols);
 MatrixVectorMultiply(pMatrixData, pInVector2Data, pOutVector2Data, numRows, numCols);

 

By following “Guideline B”, the amount of data transferred can be significantly reduced:

 void MatrixVectorMultiply(const array_view<const float, 2> &matrixView,
                           const float *pInVectorData,
                           float *pOutVectorData)
 {
     array_view<const float> inVectorView(matrixView.extent[1], pInVectorData);
  
     array_view<float> outVectorView(matrixView.extent[0], pOutVectorData);
     outVectorView.discard_data();
  
     parallel_for_each(outVectorView.extent, [=](index<1> idx) restrict(amp) {
         array_view<const float> currMatrixVector = matrixView(idx[0]);
  
         float temp = 0.0f;
         for (int i = 0; i < inVectorView.extent[0]; ++i) {
             temp += currMatrixVector(i) * inVectorView(i);
         }
  
         outVectorView(idx) = temp;
     });
  
     outVectorView.synchronize();
 }
  
 // Using Guideline B: Create an array_view representing the matrix and
 // reuse it for multiplication with both vectors to avoid transferring the
 // matrix data to the accelerator multiple times
 array_view<const float, 2> matrixView(numRows, numCols, pMatrixData);
  
 MatrixVectorMultiply(matrixView, pInVector1Data, pOutVector1Data);
 MatrixVectorMultiply(matrixView, pInVector2Data, pOutVector2Data);

 

Choose the right home location for the array_view’s data source

Guideline C: When possible, choose an appropriate home storage location for the array_view’s data source.

- If the data is accessed through array_views only on the host and one accelerator_view, pick the data source’s location to be where most of the data is modified. If the appropriate home location is an accelerator_view, you can create a concurrency::array on the required accelerator_view and use it as the data source to the array_view.

- If the data is accessed through array_views on multiple accelerator_views and the host, choose the CPU as the home location of the data source.

If you are creating a data source from scratch for use in your C++ AMP code (as opposed to using an existing data source shared by other modules of your application), you have greater liberty in choosing the home location of the data source (such as through the accelerator_view argument to the concurrency::array constructor), without affecting other parts of the application.

When an array_view is modified on a location and subsequently accessed on another location, the implicit data transfers are routed through the home storage location of the data source. Hence the home storage location of the data source influences the performance of these implicit data transfers. For example, in the first release of C++ AMP, data transfers across different accelerator_views are routed through the CPU and thus a CPU data source is optimal for such usage scenarios.
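
The snippet below sketches both of these choices (it assumes #include <amp.h>, #include <vector> and using namespace concurrency;; the variable names are illustrative):

 std::vector<float> cpuData(4096, 0.0f);
  
 // Data mostly modified on a single accelerator_view: make that view the home
 // location by backing the array_view with a concurrency::array placed there.
 accelerator_view gpuView = accelerator().default_view;
 concurrency::array<float> gpuArr(4096, gpuView);  // home location: gpuView
 array_view<float> gpuBackedView(gpuArr);
  
 // Data shared between the CPU and multiple accelerator_views: keep the data
 // source in CPU memory, since cross-accelerator transfers are routed through
 // the CPU in this release.
 array_view<float> cpuBackedView(4096, cpuData);   // home location: CPU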

The code below does not adhere to “Guideline C”, causing the amount of data transferred to exceed what is really needed: 

 template <typename BinaryFunction>
 float ReduceRow(array_view<const float> rowView,
                 const BinaryFunction &func)
 {
     // Guideline C violation: The temporary data source used for intermediate
     // computations is located in CPU memory, though it is mostly modified and
     // used on the accelerator_view and just a small portion is finally read
     // on the CPU.
     std::vector<float> tempVec(rowView.extent[0] / 2);
     array_view<float> tempView(rowView.extent / 2, tempVec);
     tempView.discard_data();
  
     parallel_for_each(tempView.extent, [=](index<1> idx) restrict(amp) {
         float a = rowView(idx[0]);
         float b = rowView(idx[0] + (rowView.extent[0] / 2));
         tempView(idx) = func(a, b); 
     });
  
     for (int stride = tempView.extent[0] / 2; stride > 0; stride /= 2) 
     {
         parallel_for_each(extent<1>(stride), [=](index<1> idx) restrict(amp) {
             float a = tempView(idx);
             float b = tempView(idx[0] + stride);
             tempView(idx) = func(a, b);
         });
     }
  
     // Though just one element of the array_view is read on the CPU,
     // this results in the entire "tempView" being synchronized to the CPU,
     // since modifications on other accelerator_views are first synchronized
     // to the home location of the data source before access is provided
     // on the target location.
     return tempView.section(0, 1)[0];
 }

 

Following “Guideline C” minimizes the amount of data implicitly transferred when accessing the array_view tempView on the CPU: 

 template <typename BinaryFunction>
 float ReduceRow(array_view<const float> rowView,
                 const BinaryFunction &func)
 {
     // Guideline C: Allocate the temporary data source on the accelerator_view
     // where the data is mostly modified and accessed; viz. the accelerator_view
     // where the parallel_for_each is launched.
     concurrency::array<float> tempArr(rowView.extent / 2);
     array_view<float> tempView(tempArr);
  
     parallel_for_each(tempView.extent, [=](index<1> idx) restrict(amp) {
         float a = rowView(idx[0]);
         float b = rowView(idx[0] + (rowView.extent[0] / 2));
         tempView(idx) = func(a, b); 
     });
  
     for (int stride = tempView.extent[0] / 2; stride > 0; stride /= 2) 
     {
         parallel_for_each(extent<1>(stride), [=](index<1> idx) restrict(amp) {
             float a = tempView(idx);
             float b = tempView(idx[0] + stride);
             tempView(idx) = func(a, b);
         });
     }
  
     // Since the home location of the data source of "tempView" is the
     // accelerator_view where its contents are modified, reading an element
     // of tempView on the CPU does not require synchronizing the array_view
     // to the data source's home location. Just the referenced part (one
     // element in this case) is transferred from the accelerator_view
     // to the CPU.
     return tempView.section(0, 1)[0];
 }
  

In closing

In this post we looked at the caching and coherence policies underlying the array_view implementation. Later posts will dive into other functional and performance aspects of array_view - stay tuned!

I would love to hear your feedback, comments and questions below or in our MSDN forum.

[Note that some of the information in this blog post is specific to Microsoft’s implementation of the C++ AMP open specification – other implementations may have different characteristics as long as they adhere to the semantics stipulated by the open specification.]