To get your big-data applications to work, you need to read in the data –but what if your data is too big? What if the dataset you are reading requires more memory than is available on any single machine? This is a common constraint for all disciplines that handle large datasets. With “Cloud Numerics” you can read data, in parallel, across several machines such as several Windows Azure cluster nodes.
This post walks you through the “Cloud Numerics” IParallelReader interface, which is part of an input framework that facilitates the parallel loading of distributed arrays from storage.
The IParallelReader interface supports the following general workflow:
- The master process determines the partitioning of work for all of the worker processes based on some information, provided by the calling routine, about the dataset that needs to be loaded.
- Next, the master process and each worker process loads a portion of the distributed array based on the partitioning scheme you pre-determined. Each process returns a Local.NumericDenseArray.
- Finally, the Local.NumericDenseArrays created in the previous step are assembled to form the Distributed.DenseArray that is the output of the parallel reader.
|The IParallelReader interface leverages the distributed runtime. For more details, please see this post.|
This data input framework takes care of all communication and assembly, which means you only need to supply the relevant source code for the first two steps of the workflow and then define how the Distributed.NumericDenseArray is assembled. This required information is provided by an instance of a class that implements the IParallelReader<T> interface. It must contain the following:
|•||object  ComputeAssignment(int totalLocales)|
|This function is run only at the master process and it determines the partitioning of work across the other worker processes. It must return an array (let’s say the array is called “result”) of size totalLocales --where result[i] comprises the instructions that will be sent to rank i.
result must be an array of a serializable type.
|•||Numerics.Local.NumericDenseArray<T> ReadWorker(object assignment)|
|This function runs at all ‘locales’ using the instructions provided by the ComputeAssignment call; rank i is called with result[i] as argument. This function performs the work of loading a portion of the stored object and returns a local NumericsDenseArray.|
|•||An integer property called DistributedDimension|
|DistributedDimension informs the framework how to assemble the local pieces into a final distributed array. For DistributedDimension = j the Local.NumericDenseArrays are assembled along the jth dimension to construct the Distributed.NumericDenseArray.|
Once the required class is defined, it can be used by creating a new instance and calling the LoadData<T> method. For example:
|Information that is common to all the ranks does not need to be placed in the result of ComputeAssignment. It can be stored in your IParallelReader<T> class; the class instance passed to LoadData is sent to all the processes by the framework. This happens before ComputeAssignment is called and so changes made by it to the instance are only seen by the master process.|
We’ve provided a sample CSV file reader that implements the interface described above. The IParallelReader<T> class contains fields that hold the name of the file and other common information such as its length. ComputeAssignment uses the length information to create an array of (start, end) offset pairs that serve as instructions for the ranks. The ReadWorker method then has the file name (common to all the ranks) and its own (start, end) offset pair. With this information each rank opens the file, reads lines from the starting to the ending offset and returns the local array. Note that we have to be a little careful here for the cases where the starting or ending offset lands in the middle of a line. Finally, the resulting distributed array is created by the framework using the DistributedDimension property. The sample CSV file reader can be downloaded from the “Cloud Numerics” page on the Microsoft Connect site. To request access to the site, navigate to this link: sign-up here. Once you are registered for the Microsoft codename “Cloud Numerics” lab, you can download the software with this link: download here.
Here is an example that reads a set of files: