In a directed graph, determining transitive closure or connectivity between the nodes is an important problem. Transitive closure is required in reachability analysis of networks representing distributed and parallel systems. In this blog post I am sharing a transitive closure sample implementation using C++ AMP. The sample is based on the research paper: All-Pairs Shortest-Paths for Large Graphs on the GPU (pdf).
main – Entry point
This is a driver function which will generate input data randomly, then compute the transitive closure on the CPU using the generated data. Then it initiates the computation on the GPU using the same generated data and eventually it compares the results between the CPU and the GPU computations.
verify_transitive_closure – Verify result
This function does two verifications:
- Verifies the CPU results against the GPU results for connectivity between nodes i.e. whether they are connected or not connected, and if connected directly or indirectly.
- Using the GPU results, for any indirectly connected nodes, it recursively verifies if the path between the nodes in question actually exist.
transitiveclosure_cpu_naive - C++ Serial
This algorithm has three nested for loops, where the outer loop index is used as an intermediate node which may connect two nodes in graph. The inner two loops run through the graph and for any unconnected node, check if they are connected via the intermediate node (specified by the outer for loop). If the two nodes are connected via intermediate node, update the connected nodes with the intermediate node index and mark the two nodes as being connected indirectly.If there are N nodes in a graph, connectivity information between nodes is stored in 2D array (of size NxN), for example, the connectivity information between node i and j are stored in an array graph[i, j].
transitiveclosure_amp_tiled - C++ AMP Tiled Implementation
This is a three stage algorithm which is best described in the research paper linked at the start of this blog post. A key point mentioned in the paper and implemented in this sample is reduced memory copy operations. In the sample, the copy-in to the GPU global memory happens only at the beginning and the copy-out happens only at the end of function. There is no copy between different invocations of parallel_for_each; once an array is allocated in GPU global memory and data is copied there, then subsequent kernel invocations reuse this array without additional copies.
Download the sample
Please download the attached project of the Transitive Closure that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.