One of the cool features of the Microsoft Distribution of Hadoop (MDH) is the native support for Windows Azure Blob Storage.
When performing HDFS operations, one can omit the scheme by default, such that:
hadoop fs -lsr /mobile
is equivalent to:
hadoop fs -lsr hdfs:///mobile
These commands default to the HDFS scheme. Although Hadoop comes with its own Hadoop Distributed File System (HDFS), it also has a general-purpose file system abstraction. With MDH, a new scheme, “asv”, has been introduced to allow one to seamlessly access Windows Azure Blob Storage. Thus one can write commands such as:
hadoop fs -copyFromLocal C:\SampleData\MobileSampleDataBrief.txt asv://mobiledata/data/sampledata0.txt
hadoop fs -copyFromLocal C:\SampleData\MobileSampleDataFull.txt asv://mobiledata/data/sampledata1.txt
Performing a query on the data:
hadoop fs -lsr asv://mobiledata/data
You will see:
c:\apps\dist>hadoop fs -lsr asv://mobiledata/data
-rwxrwxrwx 1 0 2012-01-05 18:10 /data/$$$.$$$
-rwxrwxrwx 1 4000 2011-12-30 20:01 /data/sampledata0.txt
-rwxrwxrwx 1 5015508 2011-12-30 20:01 /data/sampledata1.txt
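To make the addressing in the commands above concrete, here is a minimal Python sketch of how a path argument can be read. It is not part of MDH or Hadoop; the helper names qualify and split_asv are hypothetical. It also assumes, going by the listing above, that the authority part of an asv URI (mobiledata) names the storage container and that the remainder is the blob path.

```python
import re
from urllib.parse import urlsplit

# Matches a leading URI scheme such as "hdfs:/" or "asv:/".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/")


def qualify(path, default_scheme="hdfs"):
    """Prefix a scheme-less absolute path with the default scheme (illustrative only)."""
    if _SCHEME.match(path):
        return path  # already carries a scheme, leave untouched
    return f"{default_scheme}://{path}"


def split_asv(uri):
    """Split an asv URI into (container, blob_path).

    Assumption: the URI authority is the blob container name.
    """
    parts = urlsplit(uri)
    if parts.scheme != "asv":
        raise ValueError(f"not an asv URI: {uri}")
    return parts.netloc, parts.path


print(qualify("/mobile"))
print(qualify("asv://mobiledata/data"))
print(split_asv("asv://mobiledata/data/sampledata0.txt"))
```

Under that assumption, the /data/... paths in the listing are relative to the mobiledata container rather than to the HDFS root.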
Using CloudXplorer, one can validate the data in Azure:
To enable this feature one just has to update the core-site.xml configuration file and add the following node entries (of course with one’s account name and key):
And that’s it. So now onto streaming jobs.
In a previous post I covered running Hadoop Streaming jobs. The usual way of running these jobs is to use HDFS files:
However, with the addition of the “asv” scheme, one can now use Windows Azure Blob Storage directly for running streaming jobs:
As you can see, in this instance the MapReduce job is pulling the data from the Windows Azure Blob Storage.
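As a rough illustration of how little changes between the two cases, the following Python sketch builds the argument list for a streaming job. The -input, -output, -mapper and -reducer options are the standard Hadoop Streaming ones. The jar location, the executable names, the HDFS paths and the streaming_args helper are placeholders assumed for this example, not values from the original job.

```python
def streaming_args(input_uri, output_uri, mapper, reducer,
                   jar="lib/hadoop-streaming.jar"):  # placeholder jar path
    """Build the argument list for a Hadoop Streaming job (illustrative only)."""
    return [
        "hadoop", "jar", jar,
        "-input", input_uri,
        "-output", output_uri,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Hypothetical job reading its input from HDFS.
hdfs_job = streaming_args("/mobile/data", "/mobile/output",
                          "Mapper.exe", "Reducer.exe")

# The same job reading its input from Windows Azure Blob Storage instead.
asv_job = streaming_args("asv://mobiledata/data", "/mobile/output",
                         "Mapper.exe", "Reducer.exe")

# Positions at which the two argument lists differ.
changed = [i for i, (a, b) in enumerate(zip(hdfs_job, asv_job)) if a != b]

print(" ".join(asv_job))
print(changed)
```

Only the value passed to -input differs; the mapper, reducer and the rest of the invocation stay the same. On a machine with hadoop on the PATH, the list could be handed to subprocess.run to launch the job.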
The beauty of the integration of MDH with Windows Azure Blob Storage is that you do not need to copy all your raw data from Azure to HDFS. If your data has been collected in Windows Azure, then you can run your MapReduce jobs directly against the data at source.