Gaussian Blur using C++ AMP

In image processing, applying a filter to an image is a very common operation, and Gaussian blur is one such filter. In this blog post I’ll share a C++ AMP implementation.
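Before looking at the AMP code, it helps to see what the filter itself is. The sketch below (standard C++, not part of the attached sample; the function name and parameters are my own) builds a normalized 2D Gaussian weight matrix of the kind such a blur applies to each pixel's neighborhood:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized (2*radius+1) x (2*radius+1) Gaussian weight matrix.
// Each weight falls off with squared distance from the center; dividing by
// the sum makes the weights sum to 1, so the blur preserves overall brightness.
std::vector<float> make_gaussian_kernel(int radius, float sigma) {
    int width = 2 * radius + 1;
    std::vector<float> weights(width * width);
    float sum = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
            weights[(dy + radius) * width + (dx + radius)] = w;
            sum += w;
        }
    }
    for (float& w : weights) w /= sum;  // normalize
    return weights;
}
```

The attached sample may precompute or hard-code its weights differently; this is only to make the math concrete.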

main – Program entry point

In main(), we create an instance of the gaussian_blur class, apply the filter (execute), and validate the results (verify). The constructor generates random input data for the computation.


In the gaussian_blur function, the input data is stored in a concurrency::array. It then launches a parallel_for_each computation (the kernel) that uses the simple model. The kernel is implemented in the gaussian_blur::gaussian_blur_simple_amp_kernel function. At the end, the result is copied from the GPU back to host memory.


This function implements a C++ AMP kernel. For each input data point, one GPU thread applies the filter. The thread for a given data point reads the neighboring data points along both the x-axis and the y-axis and applies the filter to the point. The filter is applied in the nested “for” loops that traverse the neighbors, and the “if” statement inside bounds-checks each access against the array dimensions. The result is then stored in the output array.
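To show the same per-point logic without requiring C++ AMP, here is a standard-C++ sketch (names, the square size×size layout, and the weights parameter are my assumptions, not the sample's actual code). In the AMP kernel, the two outer loops disappear: each (y, x) iteration becomes one GPU thread, and only the inner neighbor loops with the bounds check remain:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU sketch of the per-point filter: for every output point, visit neighbors
// within `radius` along x and y, bounds-check each access, and accumulate
// weighted input values. Out-of-range neighbors are simply skipped.
void blur_cpu(const std::vector<float>& in, std::vector<float>& out,
              const std::vector<float>& weights, int size, int radius) {
    int kwidth = 2 * radius + 1;
    for (int y = 0; y < size; ++y) {          // becomes the GPU thread grid...
        for (int x = 0; x < size; ++x) {      // ...in the AMP version
            float acc = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {   // traverse neighbors
                for (int dx = -radius; dx <= radius; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if ((nx >= 0) & (ny >= 0) & (nx < size) & (ny < size)) {
                        acc += in[ny * size + nx] *
                               weights[(dy + radius) * kwidth + (dx + radius)];
                    }
                }
            }
            out[y * size + x] = acc;
        }
    }
}
```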


This function validates the results computed on the GPU. The same input data is used to compute the results again on the CPU. Finally, the CPU and GPU results are compared to determine correctness.
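A comparison along these lines could look as follows (a sketch, not the sample's actual verify code; the function name and tolerance parameter are mine). Since GPU and CPU floating-point results can differ slightly, comparing within a small tolerance is more robust than exact equality:

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare two result buffers element-wise within a tolerance.
bool verify_results(const std::vector<float>& expected,
                    const std::vector<float>& actual, float tolerance) {
    if (expected.size() != actual.size()) return false;
    for (size_t i = 0; i < expected.size(); ++i) {
        if (std::fabs(expected[i] - actual[i]) > tolerance) {
            std::printf("mismatch at %zu: %f vs %f\n",
                        i, expected[i], actual[i]);
            return false;
        }
    }
    return true;
}
```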

Download the sample

Please download the attached sample of the Gaussian Blur that we discussed here, run it on your hardware, and try to understand what the code does so you can learn from it. As always, you will need Visual Studio 11.

Comments (2)

  1. David Cuccia says:

    Hi, thanks for the sample. I was curious – wouldn't the logic at line 36:

    if ((x >= 0) & (y >= 0) & (x < size) & (y < size))

    …cause divergent warps near the boundaries? Curious how you might suggest optimizing this (if it deserves optimizing).


  2. Yes, it causes divergent warps near the boundaries. Since most warp threads don’t execute divergent code, this shouldn’t be a problem, so it doesn’t deserve optimizing in this simple sample.

    However, another way to improve performance is to use tiles, which improves resource utilization:

    parallel_for_each(extent<2>(P1*TS, P2*TS).tile<TS, TS>(), [=](tiled_index<TS, TS> tidx) restrict(amp) { ... });
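For completeness, one common alternative to the boundary branch (not what the sample or the reply above suggests) is clamp-to-edge indexing: instead of skipping out-of-range neighbors, clamp the coordinate into range so every thread runs the same instructions, at the cost of edge pixels being weighted toward the border values. A standard-C++ sketch, with names of my own choosing:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Clamp an index into [0, size-1].
inline int clamp_index(int i, int size) {
    return std::min(std::max(i, 0), size - 1);
}

// Blur one point using clamp-to-edge addressing: no "if" on the neighbor
// coordinates, so there is a single code path and no warp divergence at the
// image boundary (edge pixels are effectively replicated).
float blur_point_clamped(const std::vector<float>& in,
                         const std::vector<float>& weights,
                         int size, int radius, int x, int y) {
    int kwidth = 2 * radius + 1;
    float acc = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = clamp_index(x + dx, size);
            int ny = clamp_index(y + dy, size);
            acc += in[ny * size + nx] *
                   weights[(dy + radius) * kwidth + (dx + radius)];
        }
    }
    return acc;
}
```

Whether this beats the branch depends on the hardware and image size; as the reply notes, for this simple sample the divergence is unlikely to matter.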