In a directed graph, determining transitive closure or connectivity between the nodes is an important problem. Transitive closure is required in reachability analysis of networks representing distributed and parallel systems. In this blog post I am sharing a transitive closure sample implementation using C++ AMP. The sample is based on the research paper: All-Pairs Shortest-Paths for Large Graphs on the GPU (pdf).

#### main – Entry point

This is a driver function which will generate input data randomly, then compute the transitive closure on the CPU using the generated data. Then it initiates the computation on the GPU using the same generated data and eventually it compares the results between the CPU and the GPU computations.

#### verify_transitive_closure – Verify result

This function does two verifications:

- Verifies the CPU results against the GPU results for connectivity between nodes i.e. whether they are connected or not connected, and if connected directly or indirectly.
- Using the GPU results, for any indirectly connected nodes, it recursively verifies if the path between the nodes in question actually exist.

#### transitiveclosure_cpu_naive – C++ Serial

This algorithm has three nested *for* loops, where the outer loop index is used as an intermediate node which may connect two nodes in graph. The inner two loops run through the graph and for any unconnected node, check if they are connected via the intermediate node (specified by the outer *for* loop). If the two nodes are connected via intermediate node, update the connected nodes with the intermediate node index and mark the two nodes as being connected indirectly.If there are N nodes in a graph, connectivity information between nodes is stored in 2D array (of size NxN), for example, the connectivity information between node i and j are stored in an array *graph[i, j]*.

#### transitiveclosure_amp_tiled – C++ AMP Tiled Implementation

This is a three stage algorithm which is best described in the research paper linked at the start of this blog post. A key point mentioned in the paper and implemented in this sample is reduced memory copy operations. In the sample, the copy-in to the GPU global memory happens only at the beginning and the copy-out happens only at the end of function. There is no copy between different invocations of *parallel_for_each*; once an *array* is allocated in GPU global memory and data is copied there, then subsequent kernel invocations reuse this array without additional copies.

#### Download the sample

Please download the attached project of the Transitive Closure that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.

I love the nativeconcurrency blog and I appreciate the great articles.

But this implementation doesn't seem to deliver on performance. Or I did something wrong.

I compiled the TransitiveClosure sample with the latest Ultimate VS 2012 RC and ran the app on three different machines:

1. HP Z800, Win 7, Geforce GTX 590

2. Dell 690, Win 8441, Geforce GTX 570

3. HP Z800, Win 7, 2x ATI Radeon 4850 Crossfire

I also added code that uses Simon Wybranski's Timer class (blogs.msdn.com/…/high-resolution-timer-for-c.aspx).

The results are rather disappointing.

Do you have an explanation for why those GPU computations are so slow?

Best wishes,

Bernd Paradies

DETAILS:

1. HP Z800, Win 7, Geforce GTX 590

Using device : NVIDIA GeForce GTX 590

Generating input data..Done

Running the Transitive Closure Sample…

Number of Vertices = 64

Iterations = 1

Computation on CPU…Elapsed: 0.458448ms

Computation on GPU…Elapsed: 48.1272ms

2. Dell 690, Win 8441, Geforce GTX 570

Using device : NVIDIA GeForce GTX 570

Generating input data..Done

Running the Transitive Closure Sample…

Number of Vertices = 64

Iterations = 1

Computation on CPU…Elapsed: 0.603965ms

Computation on GPU…Elapsed: 46.2473ms

3. HP Z800, Win 7, 2x ATI Radeon 4850 Crossfire

Using device : Software Adapter

WARNING!! Running on very slow emulator! Only use this accelerator for debugging

.

Generating input data..Done

Running the Transitive Closure Sample…

Number of Vertices = 64

Iterations = 1

Computation on CPU…Elapsed: 0.44133ms

Computation on GPU…Elapsed: 240.123ms

Hi Bernd

Thank you for your question, please allow me to offer some generic comments, while addressing the specific question you asked.

1. Our samples are aimed to help folks get started with the API, they are definitely not aimed to be the fastest best possible implementation of an algorithm. This sample simply translates the algorithm from the paper it references to C++ AMP.

2. When measuring C++ AMP code, there are a bunch of extra considerations to take into account (beyond just using the timer), and the sample doesn’t do those. I don’t know if you did when you modified the code. Please start here and the links it points to: blogs.msdn.com/…/data-warm-up-when-measuring-performance-with-c-amp.aspx

3. To get speed up from offloading computations to the GPU, it typically takes really large amounts of data. For example, for this sample try changing the num_vertices from (1 << 6) to (1 << 10). Since 64 seems to be too small a number, you should see benefits with 1024.

Cheers

Daniel

Hello Daniel,

I recompiled TransitiveClosure using (1<<10) vertices and I also repeated the computation on the GPU, so that we can treat the first run as the "warm-up" phase as suggested by the article you were referring to.

The results are now much better (see below). The 2nd GPU run is almost twice as fast as the "warm-up" run.

Now, the only remaining question I have is about the 2x ATI Radeon 4850 Crossfire machine.

Why does AMP C++ not support that graphics card?

Is there a list of GPUs and graphic cards that AMP C++ support?

Thanks,

– Bernd

Using device : NVIDIA GeForce GTX 590

Generating input data..Done

Running the Transitive Closure Sample…

Number of Vertices = 1024

Iterations = 1

Computation on CPU…Elapsed: 703.089ms

Computation on GPU…Elapsed: 113.645ms

Computation on GPU…Elapsed: 61.5971ms

Using device : NVIDIA GeForce GTX 570

Generating input data..Done

Running the Transitive Closure Sample…

Number of Vertices = 1024

Iterations = 1

Computation on CPU…Elapsed: 808.335ms

Computation on GPU…Elapsed: 107.151ms

Computation on GPU…Elapsed: 64.7328ms

Hi Bernd, C++ AMP supports all hardware with a DirectX 11 driver.

http://www.danielmoth.com/…/What-DX-Level-Does-My-Graphics-Card-Support-Does-It-Go-To-11.aspx