Using multiple accelerators from C++ AMP

Hi, my name is Jerry Higgins and I am the test lead for C++ AMP and PPL. In this post, I will be describing a fun little project to distribute matrix multiplication across multiple accelerators or even one accelerator for matrices that are too large to fit in the accelerators memory. I’ve taken advantage of fork and join parallelism in PPL as well as the massive parallelism unleashed by C++ AMP.

Overview of solution

I used a chunk and stream approach, chunking the A & C matrices to the accelerators in parallel by taking advantage of one of my favorite tools from PPL, parallel_for, then streaming the B matrix across each accelerator with C++ AMP parallel_for_each. This approach limits the amount of redundant data transfers but does not eliminate them.

One of the things that I really like about the programming model for C++ AMP is that it allows you to remain in the dimensionality of your data as well as the dimensionality of your computation. Rather than nesting linear constructs to iterate over multiple dimensions, our data containers and the way parallel_for_each is expressed with an extent and index, allow the programmer to remain closer to the original problem domain. In this context, it is easier to think of matrix multiplication as a two dimensional computation across the C matrix. See chunking data across multiple C++ AMP kernels for more insight into this and other strategies.

In an earlier iteration of this algorithm, I first partitioned the matrices across accelerators before chunking and streaming. This had a couple of disadvantages; for starters it added an unnecessary constraint that the matrices be divisible by the number of accelerators, secondly the memory locality on the CPU side was not as good, as it created larger gaps in memory during concurrent memory accesses. The current approach achieves good data locality with simple load balancing between accelerators and the control flow is more straightforward.

I’ll be describing the code top down, starting from main.


First we create the 3 single precision floating point matrices in flat memory on the CPU and initialize the input matrices, A and B with random values using the Mersenne Twister random number generator that comes with the standard template library. We also zero initialize the output matrix for cleanliness. Next we get all available accelerators and for the ones that are not emulated, we push their default accelerator views into a vector that we will later pass to the mxm_amp_distribute function.

void main() {
  const int MATRIX_SZ = 1024;
  const int STREAM_WIDTH = 512;
  const unsigned int M = MATRIX_SZ;
  const unsigned int N = MATRIX_SZ;
  const unsigned int W = MATRIX_SZ;
  float *A = new float[M*N];
  float *B = new float[N*W];
  float *C = new float[M*W];
  rand_init(A, M*N); // initialize with random numbers
  rand_init(B, N*W); // initialize with random numbers
  zero_init(C, M*W); // initialize with zero
  std::vector<accelerator> accelerators = accelerator::get_all();
  std::vector<accelerator_view> accelerator_views;
  std::for_each(accelerators.begin(), accelerators.end(), [&] (accelerator& accl) {
    if(!accl.is_emulated) { // if not emulated
      std::wstring wdescr = std::wstring(accl.description);
      std::wcout << accl.description << std::endl;
      accelerator_views.push_back(accl.default_view); // get the default accelerator view
  mxm_amp_distribute(accelerator_views, STREAM_WIDTH, A, B, C, M, N, W);


The mxm_amp_distribute function is where the coarse grained parallelism is used to distribute the work across accelerators occurs. First the parameters are checked, then each accelerator_view gets its own staging buffer for streaming the B matrix from the CPU and these are pushed into a vector. Using staging arrays for the B matrix streams increased performance for this workload.

void mxm_amp_distribute(std::vector<accelerator_view> acc_views, const unsigned int stream_width, const float* A, const float* B, float* C,
								const unsigned int M, const unsigned int N, const unsigned int W) {
  // Performs matrix multiplication C = A * B
  // throw runtime error if M, N and stream_width are not all multiples of TILE_SIZE
  if( ((M % stream_width) != 0) || ((N % stream_width) != 0) || ((W % stream_width) != 0)) {
    throw std::runtime_error("M, N and W must all be multiples of stream_width");
  if(acc_views.size() == 0) {
    throw std::runtime_error("acc_views is empty, nowhere to run");
  // Create a vector of staging arrays for streams of matrix B, one for each accelerator
  std::vector<array<float, 2>> b_vect;
  accelerator cpu_accelerator = accelerator(accelerator::cpu_accelerator);
  std::for_each(acc_views.begin(), acc_views.end(), [&] (accelerator_view acc_vw) {
    b_vect.emplace_back(extent<2>(N, stream_width),cpu_accelerator.default_view, acc_vw);

The outer while loop will continue execution until the end of data has been reached. The inner parallel_for loop calculates the offset where the particular accelerator will start its work using the accelerator’s index and the stream_width parameter, each task checking to make sure it does not overreach its bounds. This last check is necessary as the matrix size may not be evenly divisible by the number of accelerators. Next, the dispatch_chunk function is called and the party moves to the accelerator.

unsigned int w_start = 0; // indicates where to start processing
while(w_start < W) {
  parallel_for((size_t) 0, acc_views.size(), [&] (size_t accelerator_ix) {
    unsigned int stream_offset = w_start + (stream_width * accelerator_ix);
    if (stream_offset < W) {
      dispatch_chunk(acc_views[accelerator_ix], A, B, C, M, N, W, stream_width, stream_offset, b_vect[accelerator_ix]);
  w_start += (acc_views.size() * stream_width);


The dispatch_chunk function creates C++ AMP arrays that are fixed to the underlying storage for the A & C matrices. Extents are created for the data as well as the computation to be done for each invocation of the parallel_for_each. The shape of the computation is square in this case and its size is fixed to the stream_width parameter.

void dispatch_chunk(accelerator_view acc_vw, const float* A, const float* B, float* C, const unsigned int M, const unsigned int N, const unsigned int W, 
						const unsigned int stream_width, const unsigned int stream_offset, array<float,2> &staging_b) {
  extent<2> extent_A(stream_width, N);
  extent<2> extent_C(stream_width, M);
  extent<2> extent_compute(stream_width, stream_width);
  // chunk storage on GPU for A & C
  array<float, 2> matrix_a(extent_A, A + (stream_offset * N), acc_vw);
  array<float, 2> matrix_c(extent_C, C + (stream_offset * M), acc_vw);

Next, we wrap an array_view around this accelerator’s staging array for the B matrix, which was passed in as a parameter.

array_view<float,2> matrix_b(staging_b)

The outer for loop is used to break the B matrix into stream_width streams using the populate_stream function and passing it to the mxm_calc_stream function which calculates that stream’s corresponding column of C. This is done until an entire stream_width chunk of C is resolved and copied back out to its CPU backed storage.

// stream B to GPU

for(unsigned int start_column = 0; start_column < M; start_column += stream_width) {
  populate_stream(, B, start_column, stream_width, N, M);
  mxm_calc_stream(extent_compute, N, start_column, matrix_a, matrix_b, matrix_c);
copy(matrix_c, C + (stream_offset * M));


The populate_stream function copies a column stream from a matrix to a staging array, maintaining row major order. The underlying data pointer is used for performance reasons, because on the cpu side using the index operator is slower (note that on the accelerator, this is not an issue).

// populate B matrix column stream
void populate_stream(float* column_stream_ptr, const float* const arr_in, unsigned int start_column, unsigned int stream_width, 
						unsigned int arr_length, unsigned int arr_width) {
  for(unsigned int row = 0; row < arr_length; row++) {
    memcpy(column_stream_ptr + (row * stream_width), arr_in + (row * arr_width)+ start_column, stream_width * sizeof(float));


The mxm_calc_stream function is just another matrix multiplication implementation. I currently use a matrix multiplication which takes advantage of tile_static memory.

Since the A matrix is read from within the parallel_for_each kernel, it is copied to the accelerator. As C is only written to, data for C is not copied from the CPU to the accelerator. However, since C is captured in the lambda and written to, storage is allocated on the accelerator.

// matrix multiply, resolve C for this stream
void mxm_calc_stream(extent<2> extent_compute, unsigned int N, unsigned int start_column, array<float,2>const &matrix_a, 
							array_view<float,2> const &matrix_b, array<float,2> &matrix_c) {
  parallel_for_each(extent_compute.tile<TILE_SIZE,TILE_SIZE>(), [=, &matrix_a, &matrix_c] (tiled_index<TILE_SIZE,TILE_SIZE> t_idx) restrict(amp) {
    float accum_c = 0; // accumulate sum in local register to avoid extra round
    // trips to global memory
    for (unsigned int i = 0; i < N; i += TILE_SIZE)
      tile_static float local_B[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix B
      tile_static float local_A[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix A
      local_A[t_idx.local[0]][t_idx.local[1]] = matrix_a([0], i + t_idx.local[1]);
      local_B[t_idx.local[0]][t_idx.local[1]] = matrix_b(i + t_idx.local[0],[1]);
      t_idx.barrier.wait(); // wait for all threads in the tile to pull their corresponding values into tile_static memory
      for (unsigned int k = 0; k < TILE_SIZE; k++)
        accum_c += local_A[t_idx.local[0]][k] * local_B[k][t_idx.local[1]]; // accumulate partial sum for current tile
      t_idx.barrier.wait(); // wait for all threads in the tile 
    matrix_c([0],[1] + start_column) = accum_c; // done with this tile, update global memory

That’s all for this post. Thanks for tuning in and I hope you enjoy programming in C++ AMP as much as we do. Feedback as always welcome in the comments or in our MSDN Forum.

Comments (6)

  1. Adam says:

    Can I create FDTD program with multiGPU? I would like to create dynamic table of "array_view" or "array" objects depend on number of ACC. In FDTD I calculate FDTD loop many many times on ACC, i don't want copy back data after one iteration. In your example you create array object in function caused by paraller_for. I don't want do that because i want calculate my loop non stop without initialize date with every iteration. Can I initialize array or array_view object at the beginning and after that calculate FDTD loop on few ACCs?

  2. bobyg says:

    Hi Adam,

    Just for clarification, by FDTD I assume you mean Finite-difference time-domain method, correct?

    Yes, you can initialize array or array_view object at the beginning and pass it along to parallel_for_each loops. array_view is smart that it will copy data to GPU only if the data is not present in the GPU already. please do let us know if you run into any issues. thanks

  3. Adam says:

    Thank you Boby,

    I was able to do it. Now I have problem with transfer data. In FDTD I need transfer part of my data from one GPU to another(here it that method:…/stamp.jsp) . Could you please let me know how I can quickly transfer part data from array from one to another GPU? I don't need copy all data only one part from array. Thanks.

  4. Łukasz Mendakiewicz says:

    Hi Adam,

    To copy only a part of an array you need to specify the source section you are interested in, for example to copy only the first element of A to B you'd write:

    copy(A.section(0, 1), B);

    You can read more about sections here:…/handling-subsections-of-data-in-c-amp.aspx

    And more about global copy functions here:…/transferring-data-between-accelerator-and-host-memory.aspx

    Hope that helps.

  5. AdamWUT says:

    Thank you so much Łukasz, I've just been looking for this 🙂

  6. andrei says:


    I'm trying to prove that using two or more GPUs can improve your time, but it doesn't seem like that.

    can anybody please tell me what i have done wrong?

    the time for a 2 sli is almost 50%-90% bigger than on an single GPU

    Thank you

    struct TaskData


    int id;

    accelerator_view view;

    int startRow;

    extent<2> ext;

    TaskData(accelerator a, int i) : view(a.default_view), id(i) {}

    static std::vector<TaskData> Configure(const std::vector<accelerator>& accls, int rows, int cols)


    std::vector<TaskData> tasks;

    int startRow = 0;

    int rowsPerTask = int(rows / accls.size());

    if (rows%accls.size() != 0)

    rowsPerTask += 1;

    int i = 0;

    std::for_each(accls.cbegin(), accls.cend(), [=, &tasks, &i, &startRow](const accelerator& a)


    TaskData t(a, i++);

    t.startRow = std::max(0, startRow);

    // std::cout << "start row: " << t.startRow<<"n";

    int endRow = std::min(startRow + rowsPerTask, rows);

    // std::cout << "end row: " << endRow<<"n";

    t.ext = extent<2>(endRow – t.startRow, cols);

    // int nrOperatii = endRow – t.startRow;

    // std::cout << "operatii: " << nrOperatii << "n";


    startRow += rowsPerTask;


    return tasks;



    void MatrixMultiGpuExample(const std::vector<accelerator>& accls, const int rows, const int cols)


    std::vector<TaskData> tasks = TaskData::Configure(accls, rows, cols);

    //  Initialize matrices

    std::vector<int> vA(rows * cols);

    std::vector<int> vB(rows * cols);

    std::vector<int> vC(rows * cols);

    for (int i = 0; i < rows * cols; i++)

    vA[i] = i;

    for (int i = 0; i < rows * cols; i++)

    vB[i] = 100;

    /*std::cout << "a:n";

    for (int unu = 0; unu < rows; unu++)


    for (int doi = 0; doi < cols; doi++)

    std::cout << vA[unu*cols + doi] << " ";

    std::cout << std::endl;


    std::cout << "nb:n";

    for (int unu = 0; unu < rows; unu++)


    for (int doi = 0; doi < cols; doi++)

    std::cout << vB[unu*cols + doi] << " ";

    std::cout << std::endl;


    //  Calculation

    double time2 = TimeFunc(tasks[0].view, [&]()


    parallel_for_each(tasks.begin(), tasks.end(), [=, &vC](TaskData& t)


    array_view<const int, 2> a(t.ext, &vA[t.startRow * cols]);

    array_view<const int, 2> b(t.ext, &vB[t.startRow * cols]);

    array<int, 2> c(t.ext, t.view);

    parallel_for_each(t.view, t.ext, [=, &c](index<2> idx) restrict(amp)


    c[idx] = a[idx] + b[idx];


    // Wait is required as copy_to may cause all GPUs to block.


    array_view<int, 2> outData(t.ext, &vC[t.startRow  * cols]);




    std::cout << "nnnnoul c:n";

    for (int unu = 0; unu < rows; unu++)


    for (int doi = 0; doi < cols; doi++)

    std::cout << vC[unu*cols + doi] << " ";

    std::cout << std::endl;


    std::wcout << " " << tasks.size() << " GPU matrix weighted average (p_f_e) took                       " << time2 << " (ms)" << std::endl;