warp or wavefront of GPU threads

When programming GPUs we know that we typically schedule many 1000s of threads and we also know that we can further organize them in many tiles of threads.

Aside: These concepts also exist in other programming models, so in HLSL they are called “threads” and “thread groups”. In CUDA they are called “CUDA threads” and “thread blocks”. In OpenCL they are called “work items” and “work groups”. But we’ll stick with the C++ AMP terms of “threads” and “tiles (of threads)”.

From a correctness perspective and from a programming model concepts perspective, that is the end of the story.

The hardware scheduling unit

However, from a performance perspective, it is interesting to know that the hardware has an additional bunching of threads which in NVIDIA hardware is called a “warp”, in AMD hardware is called a “wavefront”, and in other hardware that at the time of writing is not available on the market, it will probably be called something else. If I had it my way, they would be called a “team” of threads, but I lost that battle.

A “warp” (or “wavefront”) is the most basic unit of scheduling of the NVIDIA (or AMD) GPU. Other equivalent definitions include: “is the smallest executable unit of code” OR “processes a single instruction over all of the threads in it at the same time” OR “is the minimum size of the data processed in SIMD fashion”.

A “warp” currently consists of 32 threads on NVIDIA hardware. A “wavefront” currently consists of 64 threads in AMD hardware. Each vendor may decide to change that, since this whole concept is literally an implementation detail, and new hardware vendors may decide to come up with other sizes.

Note that on CPU hardware this concept of most basic level of parallelism is often called a “vector width” (for example when using the SSE instructions on Intel and AMD processors). The vector width is characterized by the total number of bits in it, which you can populate, e.g. with a given number of floats, or a given number of doubles. The upper limits of CPU vector widths is currently lower than GPU hardware.

So without going to any undesirable extremes of tying your implementation to a specific card, or specific family of cards or a hardware vendor’s cards, how can you easily use this information?

Avoid having diverged warps/wavefronts

Note: below every occurrence of the term “warp” can be replaced with the term “wavefront” and the meaning of the text will not change. I am just using the shorter of the two terms :-).

All the threads in a warp execute the same instruction in lock-step, the only difference being the data that they operate on in that instruction. So if your code does anything that causes the threads in a warp to be unable to execute the same instruction, then some threads in the warp will be diverged during the execution of that instruction. So you’d be leaving some compute power on the table.

Obvious examples of divergence are

  1. When you pick a tile size that is not a multiple of the warp size.
  2. When you have branching where the whole warp is unable to take the same branch. Note that beyond the very obvious if statement, a for loop or the C++ short-circuit evaluation or the tertiary operator can also result in branching.

Consider this simple line of code, in some restrict(amp) code, working on 1-dimensional data, with an index<1> variable named idx, essentially representing the thread ID

 if (idx[0] >= XYZ)
{
    my_array[idx] += A;
}
else
{
    my_array [idx] *= B;
}

If XYZ is equal to or a multiple of the warp size, then no divergence occurs. So while your code has branching, from a warp divergence perspective there is no harm and all threads are kept busy all the time. Good.

If XYZ is smaller than the warp size OR it is larger but not a multiple of the warp size, then divergence will occur. For example, if warp is of an imaginary size 4, and XYZ is 7, consider what happens during the execution of the first two warps.

The first warp of 4 threads (with idx[0] of 0,1,2,3) all evaluate the conditional in one cycle and it is false for all of them so all threads take the else clause, where in a single cycle they all multiply their respective element in the array by B , which they access with their thread id, i.e. idx. No problem.

The second warp of 4 threads (with idx[0] of 4,5,6,7) all evaluate the conditional and it is false for the first 3 but true for the last one. So at that point the first three threads proceed to add A to their respective element in the array BUT the fourth thread is idling at that point since it is diverged so you are essentially wasting it. The next thing that happens is that the else path is taken by the fourth thread so it can add B to its array element BUT now the other 3 threads are diverged at this point, so you are wasting the opportunity of having them do work during that cycle. Bad.

Now, there are many scenarios where warp divergence is unavoidable, and you have to balance the tradeoffs between this and other considerations. However, it is good to know that if you have the choice, for today’s hardware, starting with a tile size of a multiple of 64 and ensuring that conditionals and loops would not diverge threads at the 64 boundary, is going to result in better thread utilization.

To be clear, there are many other considerations for fully utilizing your hardware, but this is just one fairly easy one to take care of; for example, perhaps you can pre-sort your data, such that close-by threads follow the same code paths...

warp is not the same as direct3d_warp

Do not confuse the term warp as introduced in this blog post to mean a team of threads, with the accelerator WARP that I introduced in a previous blog post. No relationship.