We're entering an era where CPU clock speeds will soon cease to scale upwards and instead CPU manufacturers are planning to put more and more independent cores on a chip. Intel plans to release an 80-core chip within 5 years. Consequently the research community is setting their eye on the manifold barriers to writing correct and efficient highly-parallel programs. Will we be able to continue to get the same boosts in computational performance that we have been getting through raw clock speed?

You might think it's just a matter of building effective tools for producing programs that can take advantage of highly parallel platforms, and training programmers to take advantage of them. But there are complexity theory results to show that there are some problems that, under widely accepted assumptions, fundamentally cannot be parallelized, and this is the theory of P-completeness. For these problems, it may be literally impossible to get any significant benefit out of throwing additional cores at them - such problems are called *inherently sequential*. More to the point, however long it takes your computer to solve these problems *right now*, that's about how fast it's going to stay for a long time.

The complexity class NC (Nick's Class, named after Nick Pippenger by Steven Cook) is the set of problems that can be solved in polylogarithmic time (O(log^{k} *n*) time for some integer *k*) by a machine with polynomially many processors. For example, given a list of *n* numbers, the typical way to add them up on a single processor would be with a loop, as in this C# snippet:

public static int AddList(List<int> list) {

int sum = 0;

foreach (int i in list) {

sum += i;

}

return sum;

}

But now suppose we had *n*/2 processors. Then each processor can add just two of the numbers, producing a list of *n*/2 numbers with the same sum. We add these in pairs as well, continuing until just one number remains, the sum. This fan-in algorithm requires only O(log *n*) time, placing the problem in NC (or technically the decision problem version of it).

We can use similar ideas to parallelize many algorithms. For example, the best known sequential algorithms for matrix multiplication requires time O(*n*^{2.376}) for an *n* × *n* matrix, but given O(*n*^{3}) processors, we can solve it in O(log *n*) time: for each entry, we assign *n* processors to compute the products that are added to form that entry, then use the fan-in procedure above to add them together. This can still be done in O(log *n*) time. There are efficient parallel algorithms for other common problems too, like sorting, string matching, and algorithms in graph theory and computational geometry.

In real life of course we don't have such a polynomial number of processors, so this is in some ways an impractical definition; all it really tells us is that if we have *many* processors, we can solve the problem dramatically faster. Practical parallel algorithms pay more attention to *speedup* - how many times faster the algorithm is with *k* processors than with one. Effective speedup requires a good way of dynamically decomposing the problem. But speedup is difficult to characterize formally in a machine-independent way, so we'll stick with the definition of NC.

The famous P = NP problem asks whether all problems whose solutions can be verified in polynomial time, can also be solved in polynomial time. In this problem, the problems in P are considered "feasible to solve", and most computer scientists believe P ≠ NP - that is, that some problems which we can feasibly verify cannot be solved feasibly. We believe this in part because we have shown that many important problems are NP-hard. To say a problem is NP-hard means there is a polynomial time algorithm that can take any problem in NP and convert it into an instance of that problem; consequently, if an NP-hard problem can be solved in polynomial time, so can any problem in NP.

But when you have a large number of processors, the question turns around: we no longer consider P feasible, because any algorithm using just one CPU would face a dramatic slowdown compared to algorithms that take advantage of all of them. Now NC algorithms are considered feasible, and we ask: can all problems with efficient sequential solutions be highly parallelized? That is, is NC = P?

Just as with P = NP, nobody knows. However, just as there are NP-hard problems, there are P-hard problems. A P-hard problem is a problem where there is an NC algorithm (running in polylogarithmic time on a polynomial number of processors) to convert any problem in P into an instance of that problem. If we had an NC algorithm to solve a P-hard problem, it would imply that NC = P, providing a constructive and mechanical way of parallelizing any problem we can solve sequentially in polynomial time. Just as most researchers believe P ≠ NP, most researchers believe NC ≠ P, which implies that no P-hard problem has an NC algorithm to solve it.

One of the most important P-complete problems is *linear programming*: given a linear target function in terms of some variables, and a number of linear inequalities expressing constraints between those variables, find the optimal (maximum) value of the target function. It finds uses in operations research, microeconomics, and business management, and it would be great if we could exploit large number of cores to tackle larger instances of these problems. But no one has found an efficient parallel algorithm for this problem.

Another important example is the *circuit-value problem*: given a digital logic circuit and its inputs, compute the output of the circuit. There is a trivial linear time solution in the number of gates: continually get the next gate that you know both inputs of and compute its output, until you're done. It would seem that certain gates don't influence each other and be computed in parallel. But because the wires can express complex interdependencies between values and can create long chains of dependencies, there's no clear way to parallelize a general circuit. Besides obvious applications to circuit design and simulation, many problems can be solved by families of circuits, and if we can evaluate them quickly we could solve these problems. But not only has no one has found a general way of parallelizing such evaluation, they've shown that even if we restrict the circuits so that wires can't cross, or such that only the inputs can be negated, the problem remains P-complete, and so difficult to parallelize (these are the *planar *and *monotone *circuit-value problems, respectively).

The first problem ever shown to be NP-complete, as well as an important problem in practice, is the boolean satisfiability problem: given a formula using AND, OR, and NOT, with a number of variables set to either true or false, determine if there is some way of setting the variables to make the whole formula evaluate to true. The P-complete version of this problem is called Horn satifiability.

Any boolean formula can be reduced efficiently to a form where it represents a three-layer circuit: a literal is either a variable or a negated variable; a clause is the OR of some literals; and an entire formula is the AND of some clauses. A "Horn clause" is a clause where all of the variables are negated except possibly one. This provides an efficient solution to the problem, since Horn clauses can be rewritten as implications (not(x1) or not(x2) or x3 is the same thing as (x1 and x2) implies x3). Formulas made up of Horn clauses appear in the context of logical deduction systems, such as the resolution systems used to model problems in AI. Again, no one has found an effective way to parallelize finding solutions for these formulas.

In the world of P = NP, people have long been studying ways to "get around" the infeasibility of finding a polynomial-time solution for NP-hard problems, such as approximation algorithms, special cases, fixed-parameter tractable solutions, and so on. We could ask the same questions about the NC = P question: could we find an approximate solution much more quickly by using all our cores, instead of running a classical algorithm to find an exact solution on just one? Are there special cases of these problems which are parallelizable? But comparatively little investigation has been done into this area, although some examples exist (like Karger and Motwani's "An NC Algorithm For Minimum Cuts").

Finally, I should mention that P-complete is relevant not only to the study of what problems can be efficiently parallelized, but also to what problems can be solved in a small amount of space. I'll leave this for another time though.

Nice work. Although I wish you can post more often.

PingBack from http://smoothspan.wordpress.com/2007/09/08/a-multicore-language-timetable-waste-more-hardware/

This is really neat stuff. Have you read any of the Google infrastructure papers (particularly the ones on <a href="http://research.google.com/archive/mapreduce.html">MapReduce</a> and <a href="http://research.google.com/archive/sawzall.html">Sawzall</a>)? They use a highly restricted programming model which allows them to automatically parallelize operations across hundreds or thousands of machines. The Sawzall paper includes a couple of examples in which the exact solution could not be easily distributed, so they used a parallelizable approximation algorithm.