Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop
Question: How do I parallelize for-loops with no loop carried dependencies?
Question: How do I parallelize for-loops with loop carried dependencies?
Answer: It depends (on the dependencies)!!
As cryptic as the answer is, it really comes down to modeling the dependencies and usually altering the algorithm to break loop carried dependencies. Sometimes this might be possible, other times, it may not be feasible, perhaps because of performance and/or because of the effort involved in parallelizing the approach.
Lets dive down a little deeper into this with an example. Consider the following dynamic programming algorithm that uses the Smith-Waterman approach to align 2 biological sequences. At a high-level, the Smith-Waterman approach for aligning two biological sequences a and b consists of the following two main steps:
1) Populating a m x n matrix, thereby finding the winning node that corresponds to a row & column
2) Backtracking from the winning node, to find optimal alignment
The edges (row=0's and col=0's) have predefined value (=0). We calculate score of every node based on the maximum values of:
i) Local alignment (match/mismatch) score : dependent on diagonal max local align score + similarity/mismatch score from ScoringMatrix(H(i-1, j-1) + SimilarityScore(A[i], B[j]))
ii) Deletion score (dependent on left cell's deletion and score values)
iii) Insertion score (dependent on top cell's insertion and score values)
To simplify, every element[i, j] in the matrix has dependencies on [i, j-1], [i-1, j-1] and [i-1, j] values
Fig 1. Depiction of dependencies in the matrix
When parsing the matrix serially, the most trivial approach is to have two loops (1 based on query length and 1 based on subject length).
Pseudo-code for the serial approach is given below:
hVal = similarity(row, col) + blosum62(row,col)
eVal[row][col] = max(eVal[row-1][col]-gapExtn, score[row-1][col]-gapOpen)
fVal[row][col] = max(fVal[row][col-1]-gapExtn, score [row][col-1]-gapOpen)
score [row][col] = max(hVal, eVal[row][col],fVal[row][col], 0)
traceback[row][col] = winner(hVal, eVal[row][col],fVal[row][col], 0)
maxscore=max(maxscore, score [row][col])
Fig 2. Serial approach
To give you an idea of what it looks like in C++ (code snippet):
For effectively parallelizing the inner loop, we need to ensure that there are no loop-dependencies within the iterations of the inner loops (iterations of the outer loop may be dependent on its previous iterations and that's ok since we are not parallelizing the outer loop). We achieve this by calculating diagonals.
Caveat: Parallelizing loops that are too fine grained or consume a lot of bandwidth is a documented anti-pattern and contribute negatively to overall performance and/or scaling.
Fig 3. Parallel Wavefront approach
In this approach, (a0,b0) is calculated first, and then [(a0,b1), (a1,b0)] are calculated in parallel next, and then [(a0, b2), (a1,b1), (a2,b0)] are calculated in parallel and so on and so forth.
There are some points to pay attention to here:
- The parallelized algorithm above increases the memory footprint when compared to the sequential algorithm
- The performance and scalability of the approach elaborated above is coupled to the size of the matrix (subject length and query length)
- In theory, we can implement a hybrid model, where, for small iteration lengths (iterlength <= a), the matrix-wavefronts are populated serially, but for larger iteration lengths (iterlength > a), the matrix-wavefronts are populated in parallel. Defining a is beyond the scope of this blog
There should be better locality and cache coherence if the granularity was higher, which would make it more scalable than the fine grained approach. In the approach elaborated above, the wavefront was implemented to perform calculations with chunk size = 1 (1 element). Bumping the chunk size to n and implementing the wavefront on these larger chunks should yield better results. For example, a large chunk could be 64 elements from (a0-7,b0-7)
The example above was to demonstrate a pattern for breaking loop-carried dependencies and can be extrapolated and used in scenarios where one may have (the same or a subset of) the type of dependencies elaborated here. This is just one such patterns and there may exist other approaches/patterns which may align better with the given dependencies. Please consider the cost to parallelize, complexity/debugging cost, performance and scalability when parallelizing an existing algorithm.
References used for this blog:
1) Generating Parallel Programs from the Wavefront Design Pattern: http://plg.uwaterloo.ca/~stevem/papers/HIPS02.pdf
2) A Parallel Wavefront Algorithm for Efficient Biological Sequence Comparison: http://www.dehne.carleton.ca/publications/pdf/2-67.pdf