The Dryad Project: Microsoft Research Introduction Series

06/21/2010

Hello, everyone.

In this installment of the "Microsoft Research Introduction Series" (※1), I introduce the Dryad project.

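The Dryad project is a research project on a "programming model" for running parallel and distributed programs efficiently. Its scope ranges from small clusters to large data centers. As discussed later, Dryad is not a solution for one specific execution environment; think of it as a more general concept and foundation for running programs efficiently in parallel and distributed environments.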

The following is taken from the Dryad project page at Microsoft Research.

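(The passages below are quoted from the English page and paper. Each quote is followed by a plain-language rendering that favors readability over literal wording, so the original text is kept alongside it for readers who need the exact source. If you spot an error or have a suggestion, please let me know.)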

Overview

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

Put simply: Dryad gives a programmer the resources of a computer cluster or data center for running data-parallel programs, and hides the concurrency. The programmer can use thousands of machines, each with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

Put simply: a Dryad programmer writes several sequential programs and connects them with one-way channels. The computation forms a directed graph in which the programs are the vertices and the channels are the edges. A Dryad job is a graph generator that can produce any directed acyclic graph (DAG), and the graph can even be changed while the job runs, in response to important events in the computation.
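
The vertex-and-channel structure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Dryad's API: `run_dag`, the vertex names, and the single-process execution are all assumptions made for the example. Each vertex is an ordinary sequential function, each edge is a one-way channel, and a vertex runs once all of its producers have finished.

```python
from collections import deque


def run_dag(vertices, edges):
    """Run sequential programs (vertices) connected by one-way channels (edges).

    vertices: dict mapping a name to a function that takes a list of inputs
    edges:    list of (producer, consumer) pairs; each pair is one channel
    Hypothetical helper for illustration; everything runs in one process.
    """
    consumers = {name: [] for name in vertices}
    pending = {name: 0 for name in vertices}
    for producer, consumer in edges:
        consumers[producer].append(consumer)
        pending[consumer] += 1

    inbox = {name: [] for name in vertices}
    # Vertices with no incoming channel can start immediately.
    ready = deque(name for name, count in pending.items() if count == 0)
    outputs = {}
    while ready:
        name = ready.popleft()
        outputs[name] = vertices[name](inbox[name])
        for consumer in consumers[name]:
            inbox[consumer].append(outputs[name])
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    if len(outputs) != len(vertices):
        raise ValueError("graph has a cycle; the job graph must be acyclic")
    return outputs


# Two independent read-and-sort branches feeding one merge vertex.
vertices = {
    "read_a": lambda inputs: [3, 1, 2],
    "read_b": lambda inputs: [6, 5, 4],
    "sort_a": lambda inputs: sorted(inputs[0]),
    "sort_b": lambda inputs: sorted(inputs[0]),
    "merge": lambda inputs: sorted(inputs[0] + inputs[1]),
}
edges = [
    ("read_a", "sort_a"),
    ("read_b", "sort_b"),
    ("sort_a", "merge"),
    ("sort_b", "merge"),
]
result = run_dag(vertices, edges)
```

None of the vertex functions contains any concurrency code. In a real system the two branches could run on different machines, and that decision would belong to the runtime rather than to the programmer.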

Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google's map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

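Put simply: Dryad is highly expressive and subsumes other computation frameworks such as Google's MapReduce and relational algebra. It also takes care of job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.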

The Dryad Software Stack

As a proof of Dryad's versatility, a rich software ecosystem has been built on top Dryad:

As evidence of how widely Dryad applies, the following software ecosystem has been built on top of it.

SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad's fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft's AdCenter log processing pipelines.
With SSIS (SQL Server Integration Services) on Dryad, many SQL Server instances run, each as a separate Dryad vertex, and rely on Dryad's fault tolerance and scheduling. The system runs in production as part of one of Microsoft AdCenter's log-processing pipelines.

DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
DryadLINQ generates Dryad computations from queries written with LINQ (Language-Integrated Query), the query extensions to C#.

The distributed shell is a generalization of the pipe concept from the Unix shell in three ways. If Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. The distributed shell generalizes Unix pipes in three ways:
1. It allows processes to easily connect multiple file descriptors of each process -- hence the 2-D aspect.
2. It allows the construction of pipes spanning multiple machines, across a cluster.
3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
The distributed shell generalizes the pipe concept of the Unix shell. Unix pipes build one-dimensional (1-D) process structures, whereas the distributed shell lets the programmer build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:

1. Processes can easily connect several of their file descriptors to one another, which is where the 2-D aspect comes from.
2. Pipes can span multiple machines across a cluster.
3. Pipelines are virtualized: by time-multiplexing processors and buffering results, a pipeline can run with many more processes than there are machines.
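
The third point, running more logical processes than there are machines, can be illustrated with a small Python sketch. It is not the distributed shell itself: `run_virtual_pipeline` is a hypothetical helper, worker threads stand in for machines, and a plain list stands in for the buffered channel between stages.

```python
from concurrent.futures import ThreadPoolExecutor


def run_virtual_pipeline(stages, partitions, workers):
    """Run len(stages) * len(partitions) logical processes on `workers` threads.

    Hypothetical helper for illustration. Each (stage, partition) pair is one
    logical process; the pool time-multiplexes them over the worker threads,
    and the `buffered` list holds one stage's results until the next stage
    consumes them.
    """
    buffered = list(partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in stages:
            buffered = list(pool.map(stage, buffered))
    return buffered


stages = [
    lambda xs: [x * 2 for x in xs],  # stage 1: double every value
    lambda xs: [x + 1 for x in xs],  # stage 2: add one to every value
    sum,                             # stage 3: total each partition
]
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

# 3 stages x 4 partitions = 12 logical processes on only 2 workers.
totals = run_virtual_pipeline(stages, partitions, workers=2)
```

Here `totals` is `[8, 16, 24, 32]`. Adding workers changes how long the run takes, not the result, which is the property that virtualizing the pipeline is meant to provide.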

Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.
Several languages compile to distributed shell processes. PSQL was an early one and has recently been replaced by Scope.

Publications

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.

Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

# The Dryad presentation at the Google campus (given as a Google Tech Talk)

Presentation slides from a talk on Dryad at University of California at Santa Cruz, by Michael Isard, February 2008.

Another presentation, given at Microsoft Live Labs by Mihai Budiu, March 2008.

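So far this has been a quick tour of Dryad based on its project page. As the paper "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks" describes, Dryad itself is a foundational concept for parallel computers in a broad sense: anything from multi-core parallel execution on a single computer, through small clusters, to large data centers.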

Think of the items listed under "The Dryad Software Stack" above as software built on this Dryad concept.

Among these Dryad implementations, DryadLINQ is one you can actually download. If Dryad interests you, do give it a try. The license (and the procedure) differs depending on whether your purpose is academic or commercial, so check the license terms that apply to you; if you agree to them, you can download and use it. Note that using DryadLINQ requires Windows HPC Server 2008 and a cluster environment.

Figure: Overview of the DryadLINQ execution environment (from the "Dryad and DryadLINQ Installation and Configuration Guide")

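One more point about DryadLINQ. As you may know, the LINQ family also includes PLINQ, a data query technology for parallel execution. The two divide the work: DryadLINQ uses PLINQ when generating code for multi-core execution, and handles the rest itself, namely generating the query plan that parallelizes processing across machines.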

Finally, on the relationship with Google MapReduce, which many readers will be curious about, here is what "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks" says.

The fundamental difference between the two systems is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs, and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation, lets us build on a greater library of basic subroutines, and, together with the ability to exploit TCP pipes and shared-memory for data edges, can bring substantial performance gains. At the same time, our implementation is general enough to support all the features described in the MapReduce paper.

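Put simply: the fundamental difference between the two systems (Dryad and MapReduce) is that MapReduce requires a sequence of map/distribute/sort/reduce operations, whereas a Dryad application can specify an arbitrary communication DAG (directed acyclic graph). In particular, a graph vertex can consume multiple inputs and produce multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation and makes a larger library of basic subroutines available to build on. Combined with the ability to use TCP pipes and shared memory for data edges, it can also bring substantial performance gains. At the same time, the Dryad implementation is general enough to support all the features described in the MapReduce paper.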
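
To make the comparison concrete, the sketch below writes a MapReduce-style word count as one fixed graph shape: map vertices, a distribute step that routes each key to exactly one reducer, and reduce vertices that sort and sum. It is an assumption-laden illustration in plain Python, not code from Dryad or MapReduce; the function names and the CRC32-based routing are chosen only for the example.

```python
import zlib
from collections import Counter


def map_vertex(lines):
    """Map: emit a (word, 1) pair for every word in this input partition."""
    return [(word, 1) for line in lines for word in line.split()]


def distribute(pairs, n_reducers):
    """Distribute: route each pair to one reducer by a deterministic hash."""
    buckets = [[] for _ in range(n_reducers)]
    for word, count in pairs:
        index = zlib.crc32(word.encode("utf-8")) % n_reducers
        buckets[index].append((word, count))
    return buckets


def reduce_vertex(pairs):
    """Sort and reduce: sum the counts for each word this reducer received."""
    totals = Counter()
    for word, count in sorted(pairs):
        totals[word] += count
    return dict(totals)


def word_count(input_partitions, n_reducers=2):
    """Wire the vertices together; one channel per mapper-reducer pair."""
    mapped = [map_vertex(partition) for partition in input_partitions]
    routed = [distribute(pairs, n_reducers) for pairs in mapped]
    reduced = [
        reduce_vertex([pair for buckets in routed for pair in buckets[r]])
        for r in range(n_reducers)
    ]
    merged = {}
    for part in reduced:
        merged.update(part)  # safe: each word lives on exactly one reducer
    return merged


counts = word_count([["a b a"], ["b c"]])
```

`counts` is `{"a": 2, "b": 2, "c": 1}`. The wiring in `word_count` is the only part that is specific to MapReduce. A general DAG could instead give a vertex several inputs of different types, or several outputs, which is the flexibility the quoted passage describes.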

That concludes this installment of the "Microsoft Research Introduction Series" (※1), which introduced the Dryad project.

I hope you will look forward to the next one.

See you then!

※1

Like the earlier "Understanding ASP.NET in Pictures series: How Routing Works", this is a joke. No such series exists, so please bear with me.
