Hadoop for .NET Developers: Programmatically Loading Data to AVS

NOTE This post is one in a series on Hadoop for .NET Developers.

As mentioned in an earlier post, the WebHDFS client assumes a Hadoop cluster employs HDFS but can be configured to work with a cluster leveraging AVS. If you are working with a persistent HDInsight in Azure cluster (based on AVS), then the WebHDFS client is likely a good option for you to explore.

That said, Azure Blob Storage, the storage service foundation of AVS, provides additional options for working with your data. In this blog post, I want to show you how to load data directly to Azure Blob Storage so that it may be accessible to an HDInsight cluster that may or may not exist at the time the data is loaded.

In this exercise, we’ll load the ufo_awesome.tsv file to the Azure Blob Storage service provisioned with your HDInsight in Azure cluster. You could run these same steps against a storage service provisioned through other means, on which you later deploy your HDInsight cluster, to achieve the same ends demonstrated here:

1. Navigate to the Azure portal and locate the Storage icon on the left-hand side of the page.

2. Click on the Storage icon to access the storage provisioned for your account. If you only have an HDInsight cluster associated with your account, you should see just one item in the list to the right of the Storage icon. If you have more than one HDInsight cluster or other services deployed, you may see other storage items in the list.

3. Record the name of the storage item associated with your HDInsight cluster. In the code that follows, this name will serve as the value of the user variable.

4. Click on the name of the storage item associated with your HDInsight cluster.

5. Click on the Manage Access Keys item at the bottom of the current page and record either the primary or secondary access key in the resulting dialog. This key will be used as the value of the key variable in the code that follows.

6. Close the Manage Access Keys dialog and locate the Containers item at the top of the storage page.

7. Click on the Containers item and record the name of the container associated with your HDInsight cluster. The name of the container will be used as the value of the containerName variable in the code that follows.

8. Launch Visual Studio and create a new C# console application project.

9. Use NuGet to load the Windows Azure Storage package.

10. If it is not already open, open the Program.cs file and add the following directives to it:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

11. In the Main method, add the following code with the appropriate substitutions for the user, key and containerName variables:

//define storage account
string user = "MyAccount";
string key = "MyKey";
StorageCredentials myCredentials = new StorageCredentials(user, key);

//connect to storage
CloudStorageAccount myAccount = new CloudStorageAccount(myCredentials, true);
CloudBlobClient myClient = myAccount.CreateCloudBlobClient();

//access storage associated with cluster
string containerName = "mycontainer";
CloudBlobContainer myContainer = myClient.GetContainerReference(containerName);

//define new blob
string destFileName = "demo/ufo/in/ufo_awesome.tsv";
CloudBlockBlob myBlockBlob = myContainer.GetBlockBlobReference(destFileName);

//stream data to new blob
string srcFilename = @"c:\temp\ufo_awesome.tsv";
using (var fileStream = System.IO.File.OpenRead(srcFilename))
{
    myBlockBlob.UploadFromStream(fileStream);
}

12. Run the application to load the data file to Azure.

The code is fairly straightforward, and if any part of it is unclear, there is an abundance of how-to’s on loading data to Azure Blob Storage that can guide your interpretation of it. The one piece that needs to be called out is the destination file’s name.

If you want your file in Azure Blob Storage to be picked up by HDInsight, it needs to be defined using a file name that includes the full file path starting from root but NOT including the leading forward-slash used to designate root. For example, our destination file’s name (as assigned to the destFileName variable) is demo/ufo/in/ufo_awesome.tsv. To HDInsight, this will appear as if there is a file named ufo_awesome.tsv under the /demo/ufo/in folder structure. Even though we will reference this file as existing under the demo folder, which itself resides under root, we do not provide the forward-slash denoting root when we load the file to Azure Blob Storage.
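The naming rule above can be captured in a small helper. What follows is only a minimal sketch: the BlobPaths class and ToBlobName method are hypothetical names introduced for illustration and are not part of the Azure Storage library. It assumes you start from the path as HDInsight will see it and want the blob name to pass to GetBlockBlobReference.

```csharp
using System;

public static class BlobPaths
{
    // Hypothetical helper: converts a path as Hadoop sees it
    // (e.g. /demo/ufo/in/ufo_awesome.tsv) into the blob name to use
    // in Azure Blob Storage by dropping the leading forward-slash.
    public static string ToBlobName(string hadoopPath)
    {
        if (string.IsNullOrWhiteSpace(hadoopPath))
        {
            throw new ArgumentException("Path must not be empty.", "hadoopPath");
        }

        // Remove the slash (or slashes) that designate root.
        string blobName = hadoopPath.Trim().TrimStart('/');

        if (blobName.Length == 0)
        {
            throw new ArgumentException("Path must name a file under root.", "hadoopPath");
        }

        return blobName;
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: demo/ufo/in/ufo_awesome.tsv
        Console.WriteLine(BlobPaths.ToBlobName("/demo/ufo/in/ufo_awesome.tsv"));
    }
}
```

A name that already lacks the leading forward-slash passes through unchanged, so the helper can be applied to the destFileName value without first checking its format.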

To verify the ufo_awesome.tsv file is accessible to HDInsight, perform the following steps:

1. Navigate to the Azure portal.

2. Click on the HDInsight icon on the left-hand side of the portal page.

3. Click on the HDInsight item on the right-hand side of the page.

4. In the resulting page, click on the Connect icon at the bottom of the page and open the Remote Desktop file (.RDP) that downloads from the page.

5. Within the Remote Desktop session, connect to the HDInsight cluster using the user name and password combination set when the cluster was defined.

6. Launch the Hadoop Command Prompt and issue the following statement to verify the ufo_awesome.tsv file is visible to Hadoop:

hadoop fs -ls /demo/ufo/in

7. To view the contents of the file through Hadoop, issue the following statement at the command prompt:

hadoop fs -cat /demo/ufo/in/ufo_awesome.tsv

NOTE The cat command may choke on the file as it is a bit large for the command-prompt buffers. If this happens, you can use hadoop fs -tail /demo/ufo/in/ufo_awesome.tsv to see the last kilobyte of data in the file.

8. Close the command prompt and exit the remote desktop session to complete the exercise.
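If no cluster is running yet, a rough client-side check is also possible from the same console project. This is a sketch under assumptions, not a replacement for the Hadoop-side verification above: it reuses the placeholder values for user, key and containerName from the upload code (which must be replaced with your real values), relies on the synchronous Exists and ListBlobs methods of the Windows Azure Storage package used earlier, and uses a hypothetical class name, VerifyUpload.

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;

class VerifyUpload
{
    static void Main()
    {
        // Placeholders: substitute the values recorded from the portal.
        string user = "MyAccount";
        string key = "MyKey";
        string containerName = "mycontainer";

        // Connect to the same container the file was loaded into.
        StorageCredentials credentials = new StorageCredentials(user, key);
        CloudStorageAccount account = new CloudStorageAccount(credentials, true);
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference(containerName);

        // Check the blob by name; note there is no leading forward-slash.
        CloudBlockBlob blob =
            container.GetBlockBlobReference("demo/ufo/in/ufo_awesome.tsv");
        Console.WriteLine("Blob exists: {0}", blob.Exists());

        // List everything stored under the demo/ufo/in/ prefix,
        // using a flat listing so nested blobs are included.
        foreach (IListBlobItem item in container.ListBlobs("demo/ufo/in/", true))
        {
            Console.WriteLine(item.Uri);
        }
    }
}
```

This only confirms that the blob is present in storage under the expected name; whether HDInsight resolves it at /demo/ufo/in is still best confirmed with the hadoop fs commands shown in the steps above.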