Scaling out PLINQ: DryadLINQ at PDC09 and Supercomputing09

Stephen Toub - MSFT

November 11th, 20090 0

PLINQ enables developers to scale up computations in order to leverage the multiple cores available in modern hardware. For many problem domains, this is quite useful and sufficient. What happens, however, when a workload being processed is so big that even a manycore machine is insufficient to adequately handle the load? This can be the case with massively complex calculations. Moreover, in the age of “big data,” more and more we as an industry are encountering problems as we attempt to analyze and churn through data sets measured in terabytes and even petabytes.

In such situations, one solution is to scale out, distributing the computation amongst multiple computers aggregated into a computing cluster. That’s the domain of HPC, High Performance Computing, for which Microsoft provides valuable support through Windows HPC Server.

For those of you attending PDC09 or SC09 next week, we’re excited that we’ll be showcasing DryadLINQ, a project from Microsoft Research that runs LINQ queries on an HPC Server cluster. Instead of just using an IEnumerable, DryadLINQ provides a PartitionedTable entity that represents data partitioned across the nodes of the cluster. You can create a PartitionedTable from any IEnumerable, or you can preload the data onto the cluster manually. When you write a LINQ expression on a PartitionedTable, that expression is parsed and shipped to the cluster, which executes the expression in a distributed fashion. The Dryad execution infrastructure running on HPC Server and targeted by DryadLINQ handles the communication amongst the cluster nodes, deals with data partitioning, ensures proper aggregation, and so forth. The cluster nodes themselves run PLINQ in order to fully utilize as many cores as are available in the machine in order to achieve maximum speedup locally. Results can be streamed back into your program, or you can have them saved back into another partitioned table on the cluster to be reused by a subsequent query or manually at a later time.

Here’s an example of using DryadLINQ to implement the core of the PageRank algorithm:

public static IQueryable<Rank> Step(
    IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // Join pages with ranks, and disperse updates.
    var updates = from page in pages
                         join rank in ranks on page.name equals rank.name
                         select page.Disperse(rank);

    // Re-accumulate.
    return from list in updates
              from rank in list
              group rank.rank by rank.name into g
              select new Rank(g.Key, g.Sum());
}
…
var pages = PartitionedTable.Get<Page>(“pages.pt”);
var ranks = pages.Select(page => new Rank(page.name, 1.0));
ranks = Step(pages, ranks);
ranks.ToPartitionedTable<Rank>(“ranks.pt”);

As you can see, the actual algorithm is just a standard LINQ query and could run as well on a single machine (also in parallel if you used PLINQ’s AsParallel operator). However, as the data source is a PartitionedTable loaded from the cluster, this query will run now in a distributed fashion utilizing the remote environment, and all with very little additional effort.

If you’ll be at PDC09, we encourage you to go to John Vert’s presentation on DryadLINQ (currently scheduled in room 515A on Tuesday at 3:00 PM). If you’ll be attending SC09, come visit us at the Microsoft Exhibitor Booth for live demos, more details, and conversations about scenarios. Dryad and DryadLINQ are currently research projects, and we’d love any feedback you may have.

See you all soon!