Bitonic Sort is a parallel sorting algorithm well suited to the GPU, and I am sharing a C++ AMP implementation of it in this post. My sample is ported from a corresponding DirectX SDK sample that explains the algorithm in detail, so please read the DirectCompute sort description on MSDN.

#### main – Entry point

Here the input data is randomly generated, sorted using the Bitonic Sort algorithm, and the output data is then verified to be sorted.
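As a rough CPU-side sketch of that generate → sort → verify flow (the helper names below are my own, and `std::sort` stands in for the GPU sort so the snippet is self-contained):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Generate n random unsigned ints; a stand-in for the sample's input setup.
std::vector<unsigned int> make_random_input(std::size_t n, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<unsigned int> dist(0, 1u << 15);
    std::vector<unsigned int> data(n);
    for (auto& v : data)
        v = dist(gen);
    return data;
}

// The verification step: confirm the output really is in ascending order.
bool verify_sorted(const std::vector<unsigned int>& data)
{
    return std::is_sorted(data.begin(), data.end());
}
```

In the actual sample the sort in the middle is `bitonic_sort_amp`, running on the accelerator.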

#### bitonic_sort_amp – C++ AMP Tiled Implementation

This function is called from *main* with the input and output buffers. These buffers are encapsulated in *concurrency::array* objects and passed to the kernels for computation. Initially the data is sorted in chunks of *BITONIC_TILE_SIZE* elements. If the data size is larger than *BITONIC_TILE_SIZE*, the algorithm breaks the data into multiples of *BITONIC_TILE_SIZE* – this is implemented inside the second *for* loop.
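For reference, the complete bitonic network that the AMP version distributes across tiles can be written as a single-threaded CPU sketch (`bitonic_sort_cpu` is my own illustration, assuming a power-of-two element count as the GPU sample does):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal single-threaded bitonic sort; data.size() must be a power of two.
void bitonic_sort_cpu(std::vector<unsigned int>& data)
{
    const std::size_t n = data.size();
    for (std::size_t level = 2; level <= n; level *= 2)    // bitonic sequence size
        for (std::size_t j = level / 2; j > 0; j /= 2)     // compare-exchange distance
            for (std::size_t i = 0; i < n; ++i)
            {
                std::size_t partner = i ^ j;
                if (partner > i)                           // handle each pair once
                {
                    bool ascending = ((i & level) == 0);   // direction alternates by block
                    if (ascending == (data[i] > data[partner]))
                        std::swap(data[i], data[partner]);
                }
            }
}
```

The GPU version runs the inner compare-exchange passes in parallel, keeping passes with small *j* inside a tile and handling larger distances via the transpose trick described below.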

```cpp
// Then sort the rows and columns for the levels > than the block size
// Transpose. Sort the Columns. Transpose. Sort the Rows.
for (UINT level = (BITONIC_TILE_SIZE * 2); level <= NUM_ELEMENTS; level = level * 2)
```

This loop takes data already sorted along rows, interprets the data subset as a 2D matrix, and transposes it (see the Transpose section of the aforementioned article) so that the column data can be sorted; it then sorts that data. It transposes again to restore the row layout, and sorts the rows. This continues until all the data is sorted.
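The payoff of the transpose can be sketched on the CPU (the function name and the assumption of an n × n row-major matrix are mine, not the sample's): after transposing, two elements that were a whole row apart end up adjacent, so the tile-sized kernel can compare them without leaving its row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// View a 1D buffer as an n x n row-major matrix and transpose it:
// element (r, c) moves to (c, r). A compare distance that was a multiple
// of n in the original layout becomes a small within-row distance.
std::vector<int> transpose_1d_as_matrix(const std::vector<int>& in, std::size_t n)
{
    std::vector<int> out(in.size());
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            out[c * n + r] = in[r * n + c];
    return out;
}
```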

#### bitonic_sort_kernel – C++ AMP sorting kernel (Tiled)

This kernel, called from *bitonic_sort_amp*, sorts one row of *BITONIC_TILE_SIZE* elements at a time. Each of the *BITONIC_TILE_SIZE* threads reads one element from GPU global memory into *tile_static* memory to avoid redundant reads. The sorting is then done in *tile_static* memory: each thread picks the min or the max of its pair, and synchronizes before writing the result back to *tile_static* memory. Eventually each thread copies its result from *tile_static* memory (indexed using the tile-local id) out to global memory (indexed using the global thread id).
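One compare-exchange pass of that min/max scheme can be sketched on the CPU (`bitonic_pass` is a hypothetical name; the snapshot copy stands in for the barrier that separates the *tile_static* read from the write):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One pass of the in-tile kernel: each "thread" tid reads its own element
// and its partner at distance j, then keeps the min or the max depending
// on the sort direction of the block it sits in at this level.
void bitonic_pass(std::vector<unsigned int>& tile,
                  unsigned int level, unsigned int j)
{
    std::vector<unsigned int> snapshot = tile;   // stands in for the barrier
    for (unsigned int tid = 0; tid < tile.size(); ++tid)
    {
        unsigned int partner = tid ^ j;
        bool keep_min = ((tid & j) == 0) == ((tid & level) == 0);
        tile[tid] = keep_min ? std::min(snapshot[tid], snapshot[partner])
                             : std::max(snapshot[tid], snapshot[partner]);
    }
}
```

Running the passes for every *level* up to the tile size leaves the tile fully sorted.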

#### transpose_kernel – C++ AMP matrix transpose kernel (Tiled)

This kernel, called from *bitonic_sort_amp*, transposes a 2D square matrix in tiles of size *TRANSPOSE_TILE_SIZE \* TRANSPOSE_TILE_SIZE*. Given the input data as a 1D vector, we could have computed linear addresses from the thread index and tile index to perform the transpose. Another solution is to use the *view_as* member function to view the 1D vector as a 2D matrix, and then transpose it like a regular 2D matrix.
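Assuming a square matrix whose side is a multiple of the tile size, the tiled staging pattern can be sketched on the CPU (`transpose_tiled` is my own illustration, not the sample's kernel): each tile is staged through a small local buffer, the analog of *tile_static* memory, which on the GPU lets both the global read and the global write stay coalesced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled transpose of an n x n row-major matrix; n must be a multiple of tile.
void transpose_tiled(const std::vector<int>& in, std::vector<int>& out,
                     std::size_t n, std::size_t tile)
{
    std::vector<int> local(tile * tile);     // analog of tile_static storage
    for (std::size_t tr = 0; tr < n; tr += tile)
        for (std::size_t tc = 0; tc < n; tc += tile)
        {
            // Stage one tile, transposing it on the way into local memory.
            for (std::size_t r = 0; r < tile; ++r)
                for (std::size_t c = 0; c < tile; ++c)
                    local[c * tile + r] = in[(tr + r) * n + (tc + c)];
            // Write the transposed tile to its mirrored position.
            for (std::size_t r = 0; r < tile; ++r)
                for (std::size_t c = 0; c < tile; ++c)
                    out[(tc + r) * n + (tr + c)] = local[r * tile + c];
        }
}
```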

#### view_as – member function of concurrency::array

This function allows an *array* of higher rank to be reshaped into an *array* of lower rank, or one of lower rank into higher rank. Below is a code snippet where a lower-rank *array* is reshaped to a higher rank:

```cpp
array<int, 1> a(extent<1>(100));
array_view<int, 2> av = a.view_as(extent<2>(10, 10));
```

#### Download the sample

Please download the attached project of the Bitonic Sort that we discussed here, run it on your hardware, and try to understand what the code does and learn from it. You will need, as always, Visual Studio 11.

This is great — I've just spent the last week implementing this myself, though since I'm not as familiar with optimising memory access patterns I expect this to perform much better (benchmarking will have to wait for a few hours — on the wrong laptop right now).

Any chance of this being included in the final release (as a concurrency::parallel_sort(array[_view]) overload, perhaps)?

Okay, I'll revise my praise down somewhat – the tiling used here doesn't seem to improve performance as much as I'd expected. I've posted more details (and my own code) on the forum: social.msdn.microsoft.com/…/90090ae1-44a9-470a-8be9-f40b4392c3d1