Sorting in PPL

Article
01/14/2011

The Parallel Patterns Library (PPL), a programming model that resembles the C++ Standard Template Library (STL), was introduced in Visual Studios 2010. PPL provides general-purpose containers and algorithms for performing fine-grained parallelism. Additional containers and algorithms were shipped as part of the Sample Pack/ConcRT Extras, and in its latest version (v0.33), an updated version of parallel sort (along with 2 other algorithms for sorting in parallel) were introduced. In this blog, I shall elaborate on these sorting algorithms and compare/contrast them:

The following parallel sorting algorithms are provided:

1. Parallel sort - a general-purpose compare-based in-place sort (parallel quick sort)

2. Parallel buffered sort - a faster general-purpose compare-based sort that require O(N) space (parallel merge sort + quick sort)

3. Parallel radix sort - an integer key-based stable sort that requires O(N) space

For general purpose compare based sorting algorithms, quicksort is very efficient for most of the cases. The standard sorting algorithm, std::sort, uses Introsort (quicksort/heapsort). Quicksort has a minimal average compare and swap times for sorting random sequences. The parallel version of quicksort has O((N/P) log(N)) time complexity when the number of processors P is much less than data length N, and in a practical situation, it usually achieves O(N/P) for many cases. The default sort algorithm parallel_sortis implemented with a parallel quicksort.

However, when the number of processors increases, parallel quicksort suffers from the time complexity changing to O((N/P) log(N/P) + 2N(P – 1)/P) since the initial splitting stage of quicksort cannot be parallelized, which affects the scalability of the algorithm over cores. Scalability bottleneck becomes more apparent especially when the compare operation is very expensive. This can be overcome by using a parallel merge to replace the initial stage of parallel quicksort and make the average time complexity O((N/P) log N). However, to make the parallel merge efficient, it will require O(N) extra space. This algorithm is implemented as the parallel_buffered_sort.

For the special objects with integer-based keys a more efficient algorithm may take advantage of the integer’s radix property. Hash-based radix sort can directly compute the destination of the key without requiring comparisons. This algorithm can achieve roughly O(N/P) complexity on a machine with P processors. It also requires O(N) space to avoid contention. This algorithm is implemented as the parallel_radixsort.

The space and time complexity relationship between the sort algorithms can be portrayed as follows:

Summary:

To summarize, the table below compares/contrasts the various parallel sorting algorithms in the sample pack:

	Parallel Sort	Parallel Buffered Sort	Parallel Radix Sort
Description	General-purpose compare-based in-place sort	Faster general-purpose compare-based sort that requires O(N) space	Integer key-based stable sort that requires O(N) space
Sorting mechanism	Compare-Based : a binary functor provides “less than” symmetric meaning (not three-way compare)	Compare-Based : a binary functor provides “less than” symmetric meaning (not three-way compare)	Hash Based : Project (or hash) target object to an integral key
Sort Stability	Unstable	Unstable	Stable
Memory requirements	No extra memory requirements	Requires extra O(N) space	Requires extra O(N) space
Time Complexity
Iterator access	Random	Random	Random

Example:

Given below is a code sample illustrating sorting of a vector using serial sort and the different parallel sorts.

 /// Sample code introducing the various parallel sorts
/// Compile with: /EHsc

#include <windows.h>
#include <vector>
#include <algorithm>
#include <iostream>
#include <ppl_extras.h>

const size_t NUM_ELEMS_TO_SORT = 10000000;

/// <summary>
/// Helper function that clocks and returns run times of a function
/// </summary>
template<typename Func>
const unsigned int ClockFunc(Func func)
{
    unsigned __int64 start=0;
    unsigned __int64 end=0;

    QueryPerformanceCounter((LARGE_INTEGER *) &start);
    func();
    QueryPerformanceCounter((LARGE_INTEGER *) &end);

    return (unsigned int)(end - start);
}


/// <summary>
/// Function that returns a vector with the same data(value) contents
/// on every call
/// </summary>
std::vector<size_t> GetVector()
{
    //Set const seed
    srand(0);
    std::vector<size_t> vec(NUM_ELEMS_TO_SORT);
    std::generate(vec.begin(),vec.end(),rand);
    return std::move(vec);
}

int main()
{
    //Std Sort (serial)
    {
        std::vector<size_t> v(GetVector());
        std::cout << "Std Sort: " << ClockFunc([&v]() { std::sort(v.begin(), v.end()); }) << std::endl;
    }

    //Parallel Sort
    {
        std::vector<size_t> v(GetVector());
        std::cout << "Parallel Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_sort(v.begin(), v.end()); }) << std::endl;
    }

    //Parallel Buffered Sort
    {
        std::vector<size_t> v(GetVector());
        std::cout << "Parallel Buffered Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_buffered_sort(v.begin(), v.end()); }) << std::endl;
    }

    //Parallel Radix Sort
    {
        std::vector<size_t> v(GetVector());
        std::cout << "Parallel Radix Sort: " << ClockFunc([&v]() { Concurrency::samples::parallel_radixsort(v.begin(), v.end()); }) << std::endl;
    }

    return 0;
}

Output:

When built for x86 retail and run on my 4-core machine:

Std Sort: 2342996

Parallel Sort: 861919

Parallel Buffered Sort: 977655

Parallel Radix Sort: 414180

Additionally, can also visualize the workings/behavior of the different sort algorithms with varying data patterns, concurrency index and comparator/hasher cost with the Parallel Sort Demo (under ConcRT_SamplePack->Samples->ParallelSortDemo) shipped as part of the Sample Pack.

Now that you are familiar with the different parallel sorting algorithms provided in the sample pack, is there a guideline on when to use a particular sort algorithm (and conversely, when not to use them)? In my next blog, I’ll explore this topic based on an experiment that I conducted.

Sorting in PPL

Summary:

Example:

Output:

Additional resources