Matrix Multiplication Sample

Matrix multiplication is a ubiquitous algorithm and no parallel library is complete without a matrix multiplication sample. Moth has described this previously - see Matrix Multiplication with C++ AMP for serial and C++ AMP non-tiled version and also see the C++ AMP tiled version of matrix multiplication. Now we have it in a Visual Studio project too so you can download it at the end of my blog post here.

main – Entry point

Let’s start with function main, here two input matrix are filled with random data, then serial - C++ implementation (used as reference to verify result) is run, followed by C++ AMP versions (simple and tiled version) and finally verify the C++ AMP implementation results with the Serial implementation.

mxm_single_cpu - C++ Serial

This is a simple N cube function.

mxm_amp_simple - C++ AMP Non-Tiled

When compared to Serial or CPU implementation, two outer for loops are replaced with parallel_for_each and the innermost loop is similar to CPU implementation. In the parallel loop we specify the creation of one GPU thread per element of the output matrix. In this sample array_view is being used to pass in/out data. Input matrixes are passed in as array_view<const T, n> and output matrix is specified to be array_view<T, n>. Using discard_data() method, data transfer can be optimized, I have explained in more detail about this in my earlier blog post about the Binomial Options sample.

mxm_amp_tiled - C++ AMP Tiled

In this implementation the same number of GPU threads is created as in mxm_amp_simple. In mxm_amp_simple, each thread is reading all its operands from GPU global memory. Accessing GPU global memory is expensive in time, when compared to using tile_static memory. Also between threads, the same data is read again and again. By making threads co-operatively read and synchronize, each thread can reduce unnecessary reads. This can be done by reading lesser number of elements per thread into tile_static memory and then calculating the result from the data in tile_static memory.

Here is how it is implemented: The kernel consists of two nested for loops. In the outer for loop, each thread will read one element from both the input matrixes. So a total of tile_size by tile_size threads will read sub matrixes of size tile_size by tile_size from two input matrixes and save it to tile_static memory (in our code the localA and localB tile_static variables).

Below is the kernel code snippet from the sample (for float data type):

    index<2> localIdx = tidx.local;
    index<2> globalIdx =;

    for (int i = 0; i < N; i += tile_size)
       tile_static float localB[tile_size][tile_size];
       tile_static float localA[tile_size][tile_size];

       localA[localIdx[0]][localIdx[1]] = av_a(globalIdx[0], i + localIdx[1]);
       localB[localIdx[0]][localIdx[1]] = av_b(i + localIdx[0], globalIdx[1]);


       for (unsigned int k = 0; k < tile_size; k++)
           temp_c += localA[localIdx[0]][k] * localB[k][localIdx[1]];


    av_c[tidx] = temp_c;

av_a and av_b are input matrixes, where each iteration of the outer loop brings in data along the columns of av_a and along rows while accessing av_b. A thread’s global index (globalIdx) helps map input elements along one of the dimensions, and the other dimension is calculated. So in our code, when a thread accesses an element in av_a matrix, the row is constant (derived from global property of tiled_index), and the column index is calculated by summing the thread’s tile local index plus the number of tile_size advances made so far (represented by the loop control variable i). The loop control variable, gives the new index in terms of tile_size data like 0*tile_size, 1*tile_size, 2*tile_size, 3*tile_size, etc. For matrix av_b, the column is constant while the row is calculated similar to what we just described for av_a. Below is the code showing how input matrix indexes are calculated:

    localA[localIdx[0]][localIdx[1]] = av_a(globalIdx[0], i + localIdx[1]);
    localB[localIdx[0]][localIdx[1]] = av_b(i + localIdx[0], globalIdx[1]);

Now each thread should synchronize (using tile_barrier) with all other threads in its tile. This will ensure that the necessary data are in tile_static memory before the innermost loop gets executed.

The innermost loop is similar to mxm_amp_simple except that the data is read from the tile_static memory (instead from global memory). The partial result calculated is stored in the thread’s local variable (temp_c) until the outer loop has traversed along a dimension. After completing the inner loop, threads needs to synchronize again so to ensure that threads which complete early won’t overwrite the tile_static memory that other threads may still be reading.

Finally, before a GPU thread completes, it will store the calculated result in the output matrix (av_c). When this function returns, the array_view destructor is called and the data associated it is synchronized.

Download the sample

Please download the attached sample project of the matrix multiplication that I mentioned here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.

Comments (2)

  1. zzz says:

    Hello. I benched the FindString vs findstr and it was over 10x slower.

    Since I was looking for something that performs FindString but needs to be fast and preferably able to run in parallel, I came across this paper  "ERA Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings".

    Could you incorporate the algorithm in the paper in a sample show casing the parallel APIs in a FindString type of application? Given a list of strings, find *some*string*parts* from a list containing abcsomedefstringghparts – so it would match that string in the list.

    I think this could be very practical in many scenarios where you have large list of strings and want to do a partial match.

  2. BharathM says:

    Hi zzz,

    From your comment I couldn’t determine if you are using C++ AMP or not. The confusing part is you have added link to a paper using MPI. If you are looking for a MPI based implementation this is not the right forum.

    If you are looking for a C++ AMP based solution, then you may have to wait. I will add it to my list of samples and will get to it later (can't really give any ETA). On the other hand, I can point to some paper which talk about GPU based solution and should help you implement it.

    One good paper to start with is:…/Mars.html

    Hope this helps


Skip to main content