Memory Coalescing with C++ AMP

The primary goal of offloading work to a GPU is to improve performance – either because you need to reduce your coffee consumption while waiting for an answer, or because you are tired of rushing to the wall to plug in your laptop. However, performance may be limited by several aspects of GPU execution. This blog post illustrates an important one: efficient use of memory bandwidth. As discussed in our top-level post on C++ AMP performance guidance, this matters most for memory-bound loops, and later in this post we will illustrate the difference between memory-bound and compute-bound kernels.

Hardware Background

Some basic hardware background will help motivate this topic. When data is read from memory into a processor, there is a natural width for such transfers. The NVIDIA Quadro 2000M in my laptop has a 128-bit memory interface, and NVIDIA documentation indicates that as much as 128 bytes may be loaded per request. This means that no matter how small a data item you request, at least 128 bits (and perhaps 128 bytes) will come into the processor. If you don’t use the extra data, it is discarded. Thus if you load a 32-bit integer and need nothing else from that transfer, you are getting at best one fourth of the available bandwidth. That is not very efficient.

This basic structure is true on a CPU as well, but a CPU has a data cache which holds onto the extra data, so as long as you get around to using it “fairly soon” the problem is avoided. How soon is “fairly soon” depends on how much other data you touch in the meantime. On a GPU, kernels typically use a lot of data very quickly, so the CPU-style data cache approach is generally not very effective for these workloads. That said, some GPUs do use data caching techniques to help when they can, and the good news is that the techniques discussed here are also valuable for CPU code with similar access patterns that use the cache poorly, since data sitting unused in the cache pushes out other data that might have been used.

One important thing to remember about GPUs is that they schedule a set of contiguous threads together. We talked about this earlier on this blog when we introduced the term warp of threads. What this means is that when thread number i loads a value from memory, thread i+1 almost always performs a load at the very same time. When these loads refer to adjacent memory locations, the GPU hardware coalesces them so that only a single memory transfer is used to fetch them. If all of the threads in a warp make adjacent loads, the data transferred from memory is used efficiently.

Applying it to C++ AMP: contiguous and strided indexing

For example, consider a simple copy kernel:

void my_copy(array_view<float,1> x, array_view<const float,1> y) {
    parallel_for_each(x.extent, [=](index<1> idx) restrict(amp) {
        x[idx] = y[idx];
    });
}

Here, values of “idx” are consecutive for threads in the same warp, so a warp reads a contiguous vector of values from “y” and then writes a contiguous vector into “x”. An important design decision in C++ AMP was to make the natural mapping of index values to warps fit well with the row-major layout of arrays. One implication is that any rank-1 array or array_view always refers to contiguous memory. When you form a section of a higher-dimensional object, there can be gaps between rows but never within a single row.
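As an illustration (a minimal sketch, not code from this post), a section of a rank-2 array_view keeps each row contiguous even though consecutive rows of the section are separated in the underlying storage:

#include <amp.h>
#include <vector>
using namespace concurrency;

void section_example() {
    std::vector<float> data(1024 * 1024, 1.0f);
    array_view<float, 2> full(1024, 1024, data);

    // A 100x100 window into "full": within each row of "sub" the elements are
    // adjacent in memory, but consecutive rows are 1024 floats apart.
    array_view<float, 2> sub = full.section(index<2>(10, 20), extent<2>(100, 100));

    parallel_for_each(sub.extent, [=](index<2> idx) restrict(amp) {
        sub[idx] *= 2.0f;   // loads and stores along a row still coalesce
    });
}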

In contrast, we might have a strided access:

void my_copy_strided(array_view<float,1> x, array_view<const float,1> y) {
    parallel_for_each(x.extent, [=](index<1> idx) restrict(amp) {
        x[idx] = y[idx*2];
    });
}

In this case, consecutive threads access every other value of “y”, so half of the data values fetched from memory by the hardware are discarded. Does this mean you should never use strided access? No, but it does mean such accesses are more expensive; if you can eliminate them, your code will use memory more efficiently and likely run faster. In this simple example we can show that “how much faster” depends on the accelerator. Here are some examples from machines I have access to (names have been changed to protect the innocent), where performance is reported in gigabytes per second measured over multi-megabyte arrays (largest value over several runs).

Kernel (GB/s)        GPU A    GPU B    GPU C    GPU D
my_copy               24.1     98.7    150.8    205.1
my_copy_strided       16.2     81.4    107.2    139.5

Note that for “A”, “C”, and “D”, the strided kernel achieves about two thirds of the bandwidth of the simple copy, which is right in line with expectation: to copy N useful bytes, the strided kernel moves 3N bytes of raw data (2N read, of which half is discarded, plus N written), versus 2N bytes for my_copy. For “B” the strided version is also slower, but not by as much as one might expect, which suggests the strided loop is not completely memory bound on that accelerator.

Data Structure Choice: SoA vs AoS

The choice of a data structure will impact memory efficiency even if the indexing looks good. For example, consider:

struct S {
    float left, right, other;
};

void my_copy_AOS(array_view<float,1> x, array_view<const S,1> y) {
    parallel_for_each(x.extent, [=](index<1> idx) restrict(amp) {
        x[idx] = y[idx].left;
    });
}

Even though the indexing pattern is the same here, the fields of structure “S” are adjacent in memory, so successive “left” fields are 12 bytes apart, separated by 8 bytes of data the kernel never touches. This kernel uses only one third of the data fetched from memory; the remaining fields are loaded but discarded. This is the well-known “array of structures” (AoS) problem. For performance reasons, we recommend using the “structure of arrays” (SoA) pattern whenever possible. Thus, if it is possible to change the program to the following, you would expect to see improved performance. Conveniently, this is also true on the CPU when the arrays are large enough.

struct S2 {
    array_view<float,1> left, right, other;
};

void my_copy_SOA(accelerator_view av, array_view<float,1> x, S2 & y) {
    parallel_for_each(av, x.extent, [=](index<1> idx) restrict(amp) {
        x[idx] = y.left[idx];
    });
}

It might appear that, since structure “y” is captured in the lambda, the data referenced by array_views “y.right” and “y.other” would be copied to the GPU, or would be seen by the runtime as used in this kernel and therefore treated by DirectX as inputs. However, the C++ AMP compiler determines that these fields are not referenced, so their presence is effectively ignored and has no influence on the implementation of the kernel.
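For completeness, here is a minimal sketch (not from this post; names and sizes are illustrative, and it assumes the usual #include <amp.h>, #include <vector>, and using namespace concurrency) of how the SoA version might be driven from the host:

void run_soa_copy() {
    const int N = 1 << 20;
    std::vector<float> left(N), right(N), other(N), result(N);

    // Three independent, contiguous arrays instead of one array of 3-field structs.
    S2 y = { array_view<float,1>(N, left),
             array_view<float,1>(N, right),
             array_view<float,1>(N, other) };
    array_view<float,1> x(N, result);

    my_copy_SOA(accelerator().default_view, x, y);
    x.synchronize();   // bring the results back into "result"
}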

Here are the summary results for this example, again reported in measured gigabytes per second:

Kernel (GB/s)        GPU A    GPU B    GPU C    GPU D
my_copy_AOS           12.3     63.5     81.9    107.1
my_copy_SOA           24.0     98.6    150.4    204.7

Again, for GPUs “A”, “C”, and “D”, the impact of using only one word out of every three read is that the measured bandwidth is cut in half (and the running time doubles).

Effects when a kernel becomes compute bound

The above examples are fairly extreme: the kernels do little more than move bits around. If we look at a synthetic kernel with more “work” in it, we can see that differences in access pattern have less effect as the kernel changes from memory bound to compute bound.

void unit_test(array_view<float,1> x, array_view<const float,1> y, int n) {
    parallel_for_each(x.extent, [=](index<1> idx) restrict(amp) {
        float f = y[idx];
        float s = f;
        for(int i = 0; i < n; i++)
            s = s*s + f;
        x[idx] = s;
    });
}

void stride_test(array_view<float,1> x, array_view<const float,1> y, int n) {
    parallel_for_each(x.extent, [=](index<1> idx) restrict(amp) {
        float f = y[2*idx];
        float s = f;
        for(int i = 0; i < n; i++)
            s = s*s + f;
        x[idx] = s;
    });
}

Since the number of loads and stores is fixed in each of these kernels, as “n” increases there is more arithmetic, and eventually the arithmetic rather than the memory traffic determines how long the kernel takes. We know that unit_test uses memory more efficiently, but this really only matters while the loops are memory bound, as we can show below. If we vary the “n” parameter on GPU “C” (the horizontal axis below) and measure billions of iterations per second (the vertical axis), we see the performance difference between the two loops shrink as more arithmetic is added to each iteration:

[Figure: billions of loop iterations per second versus “n” for unit_test and stride_test on GPU “C”]

The stride example is so memory bound that adding arithmetic is essentially free until the loop reaches about five iterations, at which point both kernels start spending more time doing arithmetic than loading data.
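As a rough back-of-the-envelope model (my own assumption, not data from this post), the crossover happens roughly where the arithmetic time per element overtakes the memory time per element. A hypothetical helper to estimate that point, given an accelerator’s peak rates:

// Hypothetical sketch: estimate the loop count at which stride_test stops
// being memory bound. Both parameters are assumed peak rates for the device.
double crossover_n(double bytes_per_sec, double flops_per_sec) {
    // Per element, stride_test moves roughly 12 bytes (8 read, since half of
    // each strided fetch is wasted, plus 4 written), and each loop iteration
    // performs about 2 floating-point operations (one multiply, one add).
    const double bytes_per_element   = 12.0;
    const double flops_per_iteration = 2.0;
    double memory_time  = bytes_per_element / bytes_per_sec;       // seconds per element
    double iter_time    = flops_per_iteration / flops_per_sec;     // seconds per iteration
    return memory_time / iter_time;   // iterations needed to "pay for" the memory traffic
}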

What’s Next

Revisit your kernels to determine whether they are compute bound or memory bound. If a kernel is memory bound, revisit its algorithm with the above in mind. Another technique to apply is using tile_static memory to improve memory coalescing, which we’ll look at in a future post.
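As a preview, here is a minimal sketch of that general technique (not the future post’s code; it assumes #include <amp.h>, using namespace concurrency, and matrix dimensions that are multiples of the tile size): staging a matrix transpose through tile_static memory so that both the reads and the writes are coalesced.

static const int TS = 16;

void transpose(array_view<const float,2> in, array_view<float,2> out) {
    parallel_for_each(in.extent.tile<TS, TS>(), [=](tiled_index<TS, TS> tidx) restrict(amp) {
        tile_static float block[TS][TS];

        // Coalesced read: adjacent threads in a row read adjacent elements of "in".
        block[tidx.local[0]][tidx.local[1]] = in[tidx.global];
        tidx.barrier.wait();

        // Coalesced write: adjacent threads write adjacent elements of "out",
        // picking up the transposed value from tile_static memory.
        index<2> outIdx(tidx.tile_origin[1] + tidx.local[0],
                        tidx.tile_origin[0] + tidx.local[1]);
        out[outIdx] = block[tidx.local[1]][tidx.local[0]];
    });
}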