Using multiple accelerators from C++ AMP

Hi, my name is Jerry Higgins and I am the test lead for C++ AMP and PPL. In this post, I will be describing a fun little project to distribute matrix multiplication across multiple accelerators or even one accelerator for matrices that are too large to fit in the accelerators memory. I've taken advantage of fork and join parallelism in PPL as well as the massive parallelism unleashed by C++ AMP.

Overview of solution

I used a chunk and stream approach, chunking the A & C matrices to the accelerators in parallel by taking advantage of one of my favorite tools from PPL, parallel_for, then streaming the B matrix across eachaccelerator with C++ AMP parallel_for_each. This approach limits the amount of redundant data transfers but does not eliminate them.

One of the things that I really like about the programming model for C++ AMP is that it allows you to remain in the dimensionality of your data as well as the dimensionality of your computation. Rather than nesting linear constructs to iterate over multiple dimensions, our data containers and the way parallel_for_each is expressed with an extent and index, allow the programmer to remain closer to the original problem domain. In this context, it is easier to think of matrix multiplication as a two dimensional computation across the C matrix. See chunking data across multiple C++ AMP kernels for more insight into this and other strategies.

In an earlier iteration of this algorithm, I first partitioned the matrices across accelerators before chunking and streaming. This had a couple of disadvantages; for starters it added an unnecessary constraint that the matrices be divisible by the number of accelerators, secondly the memory locality on the CPU side was not as good, as it created larger gaps in memory during concurrent memory accesses. The current approach achieves good data locality with simple load balancing between accelerators and the control flow is more straightforward.

I'll be describing the code top down, starting from main.

Main

First we create the 3 single precision floating point matrices in flat memory on the CPU and initialize the input matrices, A and B with random values using the Mersenne Twister random number generator that comes with the standard template library. We also zero initialize the output matrix for cleanliness. Next we get all available accelerators and for the ones that are not emulated, we push their default accelerator views into a vector that we will later pass to the mxm_amp_distribute function.

 void main() {
  const int MATRIX_SZ = 1024;
  const int STREAM_WIDTH = 512;
  const unsigned int M = MATRIX_SZ;
  const unsigned int N = MATRIX_SZ;
  const unsigned int W = MATRIX_SZ;
  float *A = new float[M*N];
  float *B = new float[N*W];
  float *C = new float[M*W];
  rand_init(A, M*N); // initialize with random numbers
  rand_init(B, N*W); // initialize with random numbers
  zero_init(C, M*W); // initialize with zero
  std::vector<accelerator> accelerators = accelerator::get_all();
  std::vector<accelerator_view> accelerator_views;
  std::for_each(accelerators.begin(), accelerators.end(), [&] (accelerator& accl) {
    if(!accl.is_emulated) { // if not emulated
      std::wstring wdescr = std::wstring(accl.description);
      std::wcout << accl.description << std::endl;
      accelerator_views.push_back(accl.default_view); // get the default accelerator view
    }
  });
  mxm_amp_distribute(accelerator_views, STREAM_WIDTH, A, B, C, M, N, W);
}

mxm_amp_distribute

The mxm_amp_distribute function is where the coarse grained parallelism is used to distribute the work across accelerators occurs. First the parameters are checked, then each accelerator_view gets its own staging buffer for streaming the B matrix from the CPU and these are pushed into a vector. Using staging arrays for the B matrix streams increased performance for this workload.

 void mxm_amp_distribute(std::vector<accelerator_view> acc_views, const unsigned int stream_width, const float* A, const float* B, float* C,
                             const unsigned int M, const unsigned int N, const unsigned int W) {
  // Performs matrix multiplication C = A * B
  // throw runtime error if M, N and stream_width are not all multiples of TILE_SIZE
  if( ((M % stream_width) != 0) || ((N % stream_width) != 0) || ((W % stream_width) != 0)) {
    throw std::runtime_error("M, N and W must all be multiples of stream_width");
  }
  if(acc_views.size() == 0) {
    throw std::runtime_error("acc_views is empty, nowhere to run");
  }
  // Create a vector of staging arrays for streams of matrix B, one for each accelerator
  std::vector<array<float, 2>> b_vect;
  b_vect.reserve(acc_views.size());
  accelerator cpu_accelerator = accelerator(accelerator::cpu_accelerator);
  std::for_each(acc_views.begin(), acc_views.end(), [&] (accelerator_view acc_vw) {
    b_vect.emplace_back(extent<2>(N, stream_width),cpu_accelerator.default_view, acc_vw);
  });

The outer while loop will continue execution until the end of data has been reached. The inner parallel_for loop calculates the offset where the particular accelerator will start its work using the accelerator’s index and the stream_width parameter, each task checking to make sure it does not overreach its bounds. This last check is necessary as the matrix size may not be evenly divisible by the number of accelerators. Next, the dispatch_chunk function is called and the party moves to the accelerator.

 unsigned int w_start = 0; // indicates where to start processing
while(w_start < W) {
  parallel_for((size_t) 0, acc_views.size(), [&] (size_t accelerator_ix) {
    unsigned int stream_offset = w_start + (stream_width * accelerator_ix);
    if (stream_offset < W) {
      dispatch_chunk(acc_views[accelerator_ix], A, B, C, M, N, W, stream_width, stream_offset, b_vect[accelerator_ix]);
    }
  });
  w_start += (acc_views.size() * stream_width);
}

dispatch_chunk

The dispatch_chunk function creates C++ AMP arrays that are fixed to the underlying storage for the A & C matrices. Extents are created for the data as well as the computation to be done for each invocation of the parallel_for_each. The shape of the computation is square in this case and its size is fixed to the stream_width parameter.

 void dispatch_chunk(accelerator_view acc_vw, const float* A, const float* B, float* C, const unsigned int M, const unsigned int N, const unsigned int W, 
                     const unsigned int stream_width, const unsigned int stream_offset, array<float,2> &staging_b) {
  extent<2> extent_A(stream_width, N);
  extent<2> extent_C(stream_width, M);
  extent<2> extent_compute(stream_width, stream_width);
  // chunk storage on GPU for A & C
  array<float, 2> matrix_a(extent_A, A + (stream_offset * N), acc_vw);
  array<float, 2> matrix_c(extent_C, C + (stream_offset * M), acc_vw);

Next, we wrap an array_view around this accelerator’s staging array for the B matrix, which was passed in as a parameter.

 array_view<float,2> matrix_b(staging_b)

The outer for loop is used to break the B matrix into stream_width streams using the populate_stream function and passing it to the mxm_calc_stream function which calculates that stream's corresponding column of C. This is done until an entire stream_width chunk of C is resolved and copied back out to its CPU backed storage.

// stream B to GPU

 for(unsigned int start_column = 0; start_column < M; start_column += stream_width) {
  populate_stream(staging_b.data(), B, start_column, stream_width, N, M);
  mxm_calc_stream(extent_compute, N, start_column, matrix_a, matrix_b, matrix_c);
}
copy(matrix_c, C + (stream_offset * M));

populate_stream

The populate_stream function copies a column stream from a matrix to a staging array, maintaining row major order. The underlying data pointer is used for performance reasons, because on the cpu side using the index operator is slower (note that on the accelerator, this is not an issue).

 // populate B matrix column stream
void populate_stream(float* column_stream_ptr, const float* const arr_in, unsigned int start_column, unsigned int stream_width, 
                        unsigned int arr_length, unsigned int arr_width) {
  for(unsigned int row = 0; row < arr_length; row++) {
    memcpy(column_stream_ptr + (row * stream_width), arr_in + (row * arr_width)+ start_column, stream_width * sizeof(float));
  }
}

mxm_calc_stream

The mxm_calc_stream function is just another matrix multiplication implementation. I currently use a matrix multiplication which takes advantage of tile_static memory.

Since the A matrix is read from within the parallel_for_each kernel, it is copied to the accelerator. As C is only written to, data for C is not copied from the CPU to the accelerator. However, since C is captured in the lambda and written to, storage is allocated on the accelerator.

 // matrix multiply, resolve C for this stream
void mxm_calc_stream(extent<2> extent_compute, unsigned int N, unsigned int start_column, array<float,2>const &matrix_a, 
                            array_view<float,2> const &matrix_b, array<float,2> &matrix_c) {
  parallel_for_each(extent_compute.tile<TILE_SIZE,TILE_SIZE>(), [=, &matrix_a, &matrix_c] (tiled_index<TILE_SIZE,TILE_SIZE> t_idx) restrict(amp) {
    float accum_c = 0; // accumulate sum in local register to avoid extra round
    // trips to global memory
    for (unsigned int i = 0; i < N; i += TILE_SIZE)
    {
      tile_static float local_B[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix B
      tile_static float local_A[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix A
      local_A[t_idx.local[0]][t_idx.local[1]] = matrix_a(t_idx.global[0], i + t_idx.local[1]);
      local_B[t_idx.local[0]][t_idx.local[1]] = matrix_b(i + t_idx.local[0], t_idx.global[1]);
      t_idx.barrier.wait(); // wait for all threads in the tile to pull their corresponding values into tile_static memory
      for (unsigned int k = 0; k < TILE_SIZE; k++)
      {
        accum_c += local_A[t_idx.local[0]][k] * local_B[k][t_idx.local[1]]; // accumulate partial sum for current tile
      }
      t_idx.barrier.wait(); // wait for all threads in the tile 
    } 
    matrix_c(t_idx.global[0], t_idx.global[1] + start_column) = accum_c; // done with this tile, update global memory
  });
}

That's all for this post. Thanks for tuning in and I hope you enjoy programming in C++ AMP as much as we do. Feedback as always welcome in the comments or in our MSDN Forum.

multi_gpu_mxm.zip