How do I get data from an Azure Storage account into an Azure-deployed Cloudera cluster running HDFS? This question has been coming up quite frequently lately, so I thought I might answer it with a post.
Paige Liu has done an excellent job documenting how to configure Cloudera to read from an Azure Storage account. You can read her step-by-step instructions here.
Assuming you’ve already moved data into your Azure Storage account using Azure Storage Explorer, the CLI, PowerShell, AzCopy, the Import/Export Service, or some other data transfer mechanism, and you’ve completed the configuration changes Paige details, you are ready to transfer data using a command like one of the following:
hadoop fs -cp wasbs://mydata@brysmi.blob.core.windows.net/myfiles/ /myhdfsfiles/
hadoop distcp wasbs://mydata@brysmi.blob.core.windows.net/myfiles/ /myhdfsfiles/
In these two examples, brysmi is the name of my Azure Storage account, mydata is the name of the container in that Storage account which holds my files, and myfiles is the folder path under that container where the files (blobs) have landed. (The second path, i.e. /myhdfsfiles/, is the path in HDFS where I want the files copied.)
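The source path follows the layout wasbs://<container>@<account>.blob.core.windows.net/<path>. As a minimal sketch of how the pieces fit together, the helper below assembles that path from its three parts and prints the resulting copy command without running it. The function name wasbs_uri is my own invention for illustration and is not part of Hadoop; swap in your own account, container, and folder names.

```shell
# Hypothetical helper (not a Hadoop command): build a wasbs:// URI.
#   $1 = storage account name
#   $2 = container name
#   $3 = folder path inside the container
wasbs_uri() {
  # ${3#/} drops one leading slash so the URI never contains "//" after the host.
  printf 'wasbs://%s@%s.blob.core.windows.net/%s\n' "$2" "$1" "${3#/}"
}

# Values taken from the examples in this post.
SRC="$(wasbs_uri brysmi mydata myfiles/)"

# Print the command for review instead of executing it.
echo "hadoop distcp $SRC /myhdfsfiles/"
```

Printing the command first is a cheap way to catch a swapped account and container name before any data moves.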
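Finally, it’s important to note that distcp works much faster when you have numerous files with a large total storage volume, because it runs as a MapReduce job and copies files in parallel across the cluster, whereas hadoop fs -cp copies them through a single client process.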
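If you want some control over that parallelism, distcp accepts a -m option that caps the number of simultaneous copy (map) tasks. The sketch below only composes and prints such a command. The value 20 is an arbitrary assumption on my part, not a recommendation, and the source path assumes the wasbs://<container>@<account>.blob.core.windows.net/<path> layout with the account and container names used in this post.

```shell
# Source and destination from the examples in this post.
SRC="wasbs://mydata@brysmi.blob.core.windows.net/myfiles/"
DST="/myhdfsfiles/"

# Assumed value: upper bound on simultaneous copy tasks. Tune it to your cluster.
MAPS=20

# Compose the command as a string so it can be reviewed before it is run.
DISTCP_CMD="hadoop distcp -m $MAPS $SRC $DST"

# Print it; paste the output into a shell on a cluster node to run the copy.
echo "$DISTCP_CMD"
```

More map tasks is not always better: the copy can be limited by Storage account throughput or by the number of files, so it is worth trying a couple of values on a sample folder first.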