“Cloud Numerics” Example: Statistics Operations to Azure Data

This post demonstrates how to use the Microsoft.Numerics C# API to perform statistical operations on data in Windows Azure blob storage. We go through the steps of loading data using the IParallelReader interface, performing distributed statistics operations, and saving results to blob storage. Along the way, we highlight code samples from the application.

  Note!
  You will need to download and install the “Cloud Numerics” lab in order to run this example. To begin that process, click here.

Before You Run the Sample Application

Before you run the sample “Cloud Numerics” statistics application, complete the instructions in the “Cloud Numerics” Getting Started wiki post to:

  • Create a Windows Azure account (if you do not have one already).
  • Install “Cloud Numerics” on your local computer where you build and develop applications with Visual Studio.
  • Configure and deploy a cluster in Azure (only if you have not done so already).
  • Submit the sample C# “Cloud Numerics” program to Windows Azure as a test that your cluster is running properly.
  • Download the project file and source code for the sample “Cloud Numerics” statistics application.

 

Blob Locations and Sample Application Download

You can download the sample application from the Microsoft Connect site (connect.microsoft.com). If you have not already registered for the lab, you can do that here. Registering for the lab provides you access to the “Cloud Numerics” lab materials (installation package, reference documentation, and sample applications).

  Note!
  If you are signed into Microsoft Connect, and you have already registered for your invitation to the “Cloud Numerics” lab, you can access the various sample applications using this link.

 

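For your convenience, we have staged sample datasets of pseudorandom numbers in Windows Azure Blob Storage. The small and medium datasets are in the publicly readable containers “smalldata” and “mediumdata” of the storage account at https://cloudnumericslab.blob.core.windows.net.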

These datasets are intended merely as examples to get you started. Also, feel free to customize the sample application code to suit your own datasets.

Choosing the Mode: Run on Local Development Machine or on Windows Azure

To run the application on your local workstation:

  1. Set StatisticsCloudApplication as your StartUp project within Visual Studio. (From Solution Explorer in the Visual Studio IDE, right-click the StatisticsCloudApplication subproject and select Set as Startup Project.) Although your application will run on your local workstation, it will continue to use Windows Azure storage for data input and output.
  2. Change Start Option paths for the project properties to reflect your local machine.
    a.   Right-click the StatisticsCloudApplication subproject, and select Properties.
    b.   Click the Debug tab.
    c.   In the Start Options section of the Debug tab, edit the following fields to reflect the paths on your local development machine:
         -   For the Command line arguments field, change:
             c:\users\roastala\documents\visual studio 2010\Projects… to 
             c:\users\<YourUsername>\documents\visual studio 2010\Projects…
         -   For the Working directory field, change:
             c:\users\roastala\documents\… to
             c:\users\<YourUsername>\documents\…
         --Where c:\users\<YourUsername>\ reflects the home directory of the user on the local development machine where you installed the “Cloud Numerics” software.

To submit the application to Windows Azure (run on Windows Azure rather than locally):

Set AppConfigure as the StartUp project. (From Solution Explorer in the Visual Studio IDE, right-click the AppConfigure subproject and select Set as Startup Project.)

  Note!
  If you have already deployed your cluster or if it was pre-deployed by your site administrator, do not deploy it again. Instead, you only need to build the application and submit the main executable as a job.

Step 1: Supply Windows Azure Storage Account Information for Output

To build the application you must have a Windows Azure storage account for storing the output. Replace the string values “myAccountKey” and “myAccountName” with your own account key and name.

 static string outputAccountKey = "myAccountKey";
static string outputAccountName = "myAccountName";

The application creates a public blob for the output under this storage account. See Step 4 for details.

Step 2: Read in Data from Blob Storage Using IParallelReader Interface

Let us take a look at the code in the AzureArrayReader.cs file.

The input array in this example is in Azure blob storage, where each blob contains a subset of the columns of the full array. By using the Microsoft.Numerics.Distributed.IO.IParallelReader interface, we can read the blobs in a distributed fashion and concatenate the slabs of columns into a single large distributed array.

First, we implement the ComputeAssignment method, which assigns blobs to the MPI ranks of our distributed computation.

public object[] ComputeAssignment(int nranks)
{
    Object[] blobs = new Object[nranks];

    // Count the blobs in the input container
    var blobClient = new CloudBlobClient(accountName);
    var matrixContainer = blobClient.GetContainerReference(containerName);
    var blobCount = matrixContainer.ListBlobs().Count();

    // Give each rank a contiguous range of at most ceil(blobCount / nranks) blobs
    int maxBlobsPerRank = (int)Math.Ceiling((double)blobCount / (double)nranks);
    int currentBlob = 0;
    for (int i = 0; i < nranks; i++)
    {
        // step drops to zero once all blobs have been assigned
        int step = Math.Max(0, Math.Min(maxBlobsPerRank, blobCount - currentBlob));
        // Each assignment is { index of first blob, number of blobs }
        blobs[i] = new int[] { currentBlob, step };
        currentBlob = currentBlob + step;
    }
    return blobs;
}
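
To make the assignment arithmetic concrete, here is a minimal sketch that reproduces the same partitioning without any Azure calls. The names AssignmentSketch and Partition are hypothetical and not part of the sample application; only the arithmetic is taken from ComputeAssignment, and the blob count is passed in rather than queried from a container.

```csharp
using System;

// Hypothetical helper, not part of the sample application: the same
// arithmetic as ComputeAssignment, with the blob count supplied by the caller.
public static class AssignmentSketch
{
    // Returns one { firstBlobIndex, numberOfBlobs } pair per rank.
    public static int[][] Partition(int blobCount, int nranks)
    {
        var assignments = new int[nranks][];
        int maxBlobsPerRank = (int)Math.Ceiling((double)blobCount / nranks);
        int currentBlob = 0;
        for (int i = 0; i < nranks; i++)
        {
            // Ranks past the end of the blob list receive a count of zero.
            int step = Math.Max(0, Math.Min(maxBlobsPerRank, blobCount - currentBlob));
            assignments[i] = new int[] { currentBlob, step };
            currentBlob += step;
        }
        return assignments;
    }
}
```

For example, Partition(5, 4) gives the pairs {0, 2}, {2, 2}, {4, 1} and {5, 0}. The last rank is assigned no blobs, which is the case the ReadWorker method handles by returning an empty array.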

Next, we implement the property DistributedDimension, which in this case is initialized to 1 so that slabs will be concatenated along the column dimension.

 public int DistributedDimension
{
    get { return 1; }
    set { }
}

The ReadWorker method:

  1. Reads the blob metadata that describes the number of rows and columns in a given slab.
  2. Checks that the slabs have an equal number of rows so they can be concatenated columnwise.
  3. Reads the binary data from blobs.
  4. Constructs a local NumericDenseArray.

public msnl.NumericDenseArray<double> ReadWorker(Object assignment)
{
    var blobClient = new CloudBlobClient(accountName);
    var matrixContainer = blobClient.GetContainerReference(containerName);
    int[] blobs = (int[])assignment;
    long i, j, k;
    msnl.NumericDenseArray<double> outArray;
    var firstBlob = matrixContainer.GetBlockBlobReference("slab0");
    firstBlob.FetchAttributes();
    long rows = Convert.ToInt64(firstBlob.Metadata["dimension0"]);
    long[] columnsPerSlab = new long[blobs[1]];
    if (blobs[1] > 0)
    {
        // Get blob metadata, validate that each piece has equal number of rows
        for (i = 0; i < blobs[1]; i++)
        {
            var matrixBlob = matrixContainer.GetBlockBlobReference("slab" + (blobs[0] + i).ToString());
            matrixBlob.FetchAttributes();
            if (Convert.ToInt64(matrixBlob.Metadata["dimension0"]) != rows)
            {
                throw new System.IO.InvalidDataException("Invalid slab shape");
            }
            columnsPerSlab[i] = Convert.ToInt64(matrixBlob.Metadata["dimension1"]);
        }

        // Construct output array
        outArray = msnl.NumericDenseArrayFactory.Create<double>(new long[] { rows, columnsPerSlab.Sum() });

        // Read data
        long columnCounter = 0;
        for (i = 0; i < blobs[1]; i++)
        {
            var matrixBlob = matrixContainer.GetBlobReference("slab" + (blobs[0] + i).ToString());
            var blobData = matrixBlob.DownloadByteArray();
            for (j = 0; j < columnsPerSlab[i]; j++)
            {
                for (k = 0; k < rows; k++)
                {
                    outArray[k, columnCounter] = BitConverter.ToDouble(blobData, (int)(j * rows + k) * 8);
                }
                columnCounter = columnCounter + 1;
            }
        }
    }
    else
    {
        // If a rank was assigned zero blobs, return empty array
        outArray = msnl.NumericDenseArrayFactory.Create<double>(new long[] { rows, 0 });
    }
    return outArray;
}
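
The offset expression (j * rows + k) * 8 in ReadWorker implies that each blob stores its slab as 8-byte doubles, one column after another. The following sketch illustrates that layout with plain .NET arrays. SlabDecodeSketch, Encode and Decode are hypothetical names introduced for illustration, and the sketch assumes the bytes were written on a machine with the same byte order as the reader, since BitConverter uses the platform's endianness.

```csharp
using System;

// Hypothetical helper illustrating the byte layout that ReadWorker assumes:
// 8-byte doubles stored column by column (column-major order).
public static class SlabDecodeSketch
{
    // Packs a [rows, cols] array into bytes, one column after another.
    public static byte[] Encode(double[,] slab)
    {
        int rows = slab.GetLength(0), cols = slab.GetLength(1);
        var bytes = new byte[rows * cols * sizeof(double)];
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                Buffer.BlockCopy(BitConverter.GetBytes(slab[k, j]), 0,
                                 bytes, (j * rows + k) * sizeof(double), sizeof(double));
        return bytes;
    }

    // Reverses Encode using the same offset formula as ReadWorker.
    public static double[,] Decode(byte[] blobData, int rows, int cols)
    {
        var slab = new double[rows, cols];
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                slab[k, j] = BitConverter.ToDouble(blobData, (j * rows + k) * sizeof(double));
        return slab;
    }
}
```

An Encode method along these lines could be a starting point for staging your own datasets, provided you also set the "dimension0" and "dimension1" metadata values that ReadWorker reads.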

When the Microsoft.Numerics.Distributed.IO.Loader.LoadData method is called with an instance of the reader, the ReadWorker instances are executed in parallel on each rank, and the LoadData method automatically takes care of concatenating the local pieces produced by the ReadWorkers.
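
As a purely local analogy for concatenation along dimension 1, the sketch below joins two-dimensional arrays side by side. It is not how LoadData is implemented (the source does not show that); ConcatSketch and ConcatColumns are hypothetical names, and the code only illustrates why every piece must have the same number of rows and why a piece with zero columns is harmless.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local analogy, not the LoadData implementation:
// joins pieces with equal row counts along the column dimension.
public static class ConcatSketch
{
    public static double[,] ConcatColumns(IList<double[,]> pieces)
    {
        if (pieces.Count == 0)
            throw new ArgumentException("No pieces to concatenate");
        int rows = pieces[0].GetLength(0);
        if (pieces.Any(p => p.GetLength(0) != rows))
            throw new ArgumentException("Pieces must have equal row counts");

        // Total width is the sum of the piece widths; empty pieces add nothing.
        var result = new double[rows, pieces.Sum(p => p.GetLength(1))];
        int columnOffset = 0;
        foreach (var piece in pieces)
        {
            int cols = piece.GetLength(1);
            for (int k = 0; k < rows; k++)
                for (int j = 0; j < cols; j++)
                    result[k, columnOffset + j] = piece[k, j];
            columnOffset += cols;
        }
        return result;
    }
}
```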

Step 3: Compute Statistics Operations on Distributed Data

The source code in the Statistics.cs file implements the statistics operations performed on distributed data.

The sample data is stored at:

 static string inputAccountName = @"https://cloudnumericslab.blob.core.windows.net";

This is a storage account for our data. It contains the samples of random numbers in publicly readable containers named “smalldata” and “mediumdata.”

At the beginning of the main entry point of the application, we initialize the Microsoft.Numerics distributed runtime. This allows us to execute distributed operations by calling Microsoft.Numerics library methods.

 Microsoft.Numerics.NumericsRuntime.Initialize();

Next, we instantiate the array reader described earlier, and read data from blob storage.

 var dataReader = new AzureArrayReader.AzureArrayReader(inputAccountName, arraySize);

var x = msnd.IO.Loader.LoadData<double>(dataReader);

The output x is a columnwise distributed array loaded with the sample data. We then compute the statistics of the data: min, max, mean, median and percentiles, and write the results to an output string.

 // Compute summary statistics: max, min, mean, median
output.AppendLine("Summary statistics\n");
var xMin = ArrayMath.Min(x);
output.AppendLine("Minimum, " + xMin);
var xMax = ArrayMath.Max(x);
output.AppendLine("Maximum, " + xMax);
var xMean = Descriptive.Mean(x);
output.AppendLine("Mean, " + xMean);
var xMedian = Descriptive.Median(x);
output.AppendLine("Median, " + xMedian);

// Compute 10% quantiles
var tenPercentQuantiles = Descriptive.QuantilesExclusive(x, 10, 0).ToLocalArray();

Because x is a distributed array, the call resolves to the overload of QuantilesExclusive that distributes processing over the nodes of the Azure cluster. Note that the result of the quantiles operation is itself a distributed array. We copy it to a local array in order to write the result to an output string.
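
For checking results on a small sample, a local counterpart of these statistics can be sketched with LINQ. Everything in LocalStatsSketch is hypothetical and runs on a plain double[] rather than a distributed array. In particular, the source does not define the interpolation rule behind Descriptive.QuantilesExclusive, so the QuantileExclusive method below assumes the common convention of interpolating at fractional rank p * (n + 1); its values may differ from the library's.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical local counterpart of the summary statistics, for small arrays.
public static class LocalStatsSketch
{
    // Middle element, or the mean of the two middle elements when n is even.
    public static double Median(double[] data)
    {
        var sorted = data.OrderBy(v => v).ToArray();
        int n = sorted.Length;
        return n % 2 == 1 ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    }

    // ASSUMPTION: linear interpolation at 1-based rank p * (n + 1), clamped
    // to the data range. The library's definition may differ.
    public static double QuantileExclusive(double[] data, double p)
    {
        var sorted = data.OrderBy(v => v).ToArray();
        int n = sorted.Length;
        double rank = p * (n + 1);
        if (rank <= 1) return sorted[0];
        if (rank >= n) return sorted[n - 1];
        int lower = (int)Math.Floor(rank);
        double fraction = rank - lower;
        return sorted[lower - 1] + fraction * (sorted[lower] - sorted[lower - 1]);
    }

    // Builds "name, value" lines in the same style as the sample's output string.
    public static string Summarize(double[] data)
    {
        var inv = CultureInfo.InvariantCulture; // keeps "." as the decimal separator
        var output = new StringBuilder();
        output.AppendLine("Minimum, " + data.Min().ToString(inv));
        output.AppendLine("Maximum, " + data.Max().ToString(inv));
        output.AppendLine("Mean, " + data.Average().ToString(inv));
        output.AppendLine("Median, " + Median(data).ToString(inv));
        return output.ToString();
    }
}
```

Formatting with the invariant culture is a deliberate choice in this sketch: on a machine whose locale uses a comma as the decimal separator, default formatting would produce values that clash with the comma-separated layout.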

Step 4: Write Results to Blob Storage as a .csv File

The application, by default, writes the result to the file system of the virtual cluster. This storage is not permanent; the file will be removed when you delete the cluster. To keep the results, the application creates a public blob in the Windows Azure storage account you supplied in Step 1.

 // Write output to blob storage
var storageAccountCredential = new StorageCredentialsAccountAndKey(outputAccountName, outputAccountKey);
var storageAccount = new CloudStorageAccount(storageAccountCredential, true);
var blobClient = storageAccount.CreateCloudBlobClient();
var resultContainer = blobClient.GetContainerReference(outputContainerName);
resultContainer.CreateIfNotExist();
var resultBlob = resultContainer.GetBlobReference(outputBlobName);

// Make result blob publicly readable,
// so it can be accessed using URI
// https://<accountName>.blob.core.windows.net/statisticsresult/statisticsresult
var resultPermissions = new BlobContainerPermissions();
resultPermissions.PublicAccess = BlobContainerPublicAccessType.Blob;
resultContainer.SetPermissions(resultPermissions);

resultBlob.UploadText(output.ToString());

You can then view and download the results by using a web browser to open the blob. For example, the syntax for the URI would be:

https://<accountName>.blob.core.windows.net/statisticsresult/statisticsresult

--Where <accountName> is the name of the storage account you supplied in Step 1 (outputAccountName).
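
If you want to construct that address in code, a small helper such as the following may be convenient. ResultUriSketch and BuildBlobUri are hypothetical names; the sketch assumes the default public blob endpoint pattern shown above and that the container and blob names need no URL escaping.

```csharp
using System;

// Hypothetical helper: builds the public address of a blob from its parts,
// following the URI pattern shown above.
public static class ResultUriSketch
{
    public static Uri BuildBlobUri(string accountName, string containerName, string blobName)
    {
        return new Uri(string.Format(
            "https://{0}.blob.core.windows.net/{1}/{2}",
            accountName, containerName, blobName));
    }
}
```

For example, BuildBlobUri("myaccount", "statisticsresult", "statisticsresult") yields the address of the result blob for a storage account named myaccount. Because the container permissions were set to allow public read access to blobs, the resulting URI should be readable without credentials, for instance from a browser.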