When you use HDInsight Hadoop or Spark clusters in Azure, they are automatically pre-configured to access Azure Storage Blobs via the hadoop-azure module, which implements the standard Hadoop FileSystem interface. You can learn more about how HDInsight uses blob storage at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/
In this article, I will show how we can configure a local install of Spark 1.6.2 (e.g. running on a Windows 10 laptop, a Mac, or a Linux machine) to read and write Azure Storage Blobs using the wasb[s]:// URI scheme.
Install Spark (if you don't have it yet)
To get started, we download Spark 1.6.2 (Jun 25 2016) with the "Pre-built for Hadoop 2.6" package type from http://spark.apache.org/downloads.html and unzip it into a directory called C:\Spark. I am walking through these steps on my Windows 10 laptop, which is why I am using C:\Spark. Your path will be different if you aren't on Windows, but the configuration concepts will be similar.
Make sure that your SPARK_HOME and HADOOP_HOME environment variables are set properly. In my case, I created a small spark-env.cmd file in the conf directory that sets these variables, and I run it before starting the spark-shell.
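As a quick sanity check before starting the spark-shell, something like the following minimal Python sketch can confirm that both variables are set and point to real directories. This is only an illustration: the helper name check_spark_env is hypothetical, and Python is simply my choice for the check, not something this setup requires.

```python
import os


def check_spark_env(env, required=("SPARK_HOME", "HADOOP_HOME")):
    """Return a list of problems found; an empty list means the variables look fine."""
    problems = []
    for name in required:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    # Check the environment of the current process and report anything suspicious.
    for problem in check_spark_env(os.environ):
        print(problem)
```

If the script prints nothing, both variables are set and point to existing directories. It does not verify that those directories actually contain Spark or Hadoop files.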
To avoid seeing a lot of verbose logs in the spark-shell window, I also set my conf/log4j.properties file to log only warnings (i.e. log4j.rootCategory=WARN, console instead of log4j.rootCategory=INFO, console).
Now, we should confirm that the regular bin/spark-shell is working properly.
If you are trying this on Windows and get errors when trying to run the spark-shell, please see my other article Resolving Spark 1.6.0 "java.lang.NullPointerException, not found: value sqlContext" error when running spark-shell on Windows 10 (64-bit).
Get Two Required JAR files
In order to access Azure Storage Blobs, we need to make sure that two required JAR files are available when we run the spark-shell: hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar.
If we are running Spark on a cluster with a regular Hadoop distribution, these files are likely already available to Spark. When we run Spark without Hadoop, we can get them by downloading the Hadoop 2.7.0 binaries (http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/), copying the two JARs out of the hadoop-2.7.0\share\hadoop\tools\lib folder, and putting them somewhere in our Spark folder (like C:\Spark\lib).
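To confirm that the copy step worked, a small check along these lines can list any JAR that is still missing. It is a sketch under assumptions: the file names are the ones passed to --jars later in this article, C:\Spark\lib is the example folder from my setup, and missing_jars is a hypothetical helper name.

```python
from pathlib import Path

# JAR file names as used with --jars later in this article.
REQUIRED_JARS = ("hadoop-azure-2.7.0.jar", "azure-storage-2.0.0.jar")


def missing_jars(lib_dir, required=REQUIRED_JARS):
    """Return the names of required JAR files that are not present in lib_dir."""
    lib = Path(lib_dir)
    return [name for name in required if not (lib / name).is_file()]


if __name__ == "__main__":
    # C:\Spark\lib is the example location; adjust it to your own Spark folder.
    for name in missing_jars(r"C:\Spark\lib"):
        print(f"missing: {name}")
```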
Create or Modify core-site.xml
We create (or modify) the core-site.xml configuration file in the C:\Spark\conf directory so that it contains the required hadoop-azure property fs.azure.account.key.AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net, with its value set to the secret access key of the Azure Storage account.
In the screenshot above, the Azure Storage Account name is "avdatarepo1" and the key is obfuscated.
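For readers who cannot see the screenshot, the following minimal Python sketch generates a core-site.xml with that single property, using the standard Hadoop configuration/property/name/value layout. The function name build_core_site and the placeholder key YOUR_STORAGE_ACCOUNT_KEY are assumptions for illustration; substitute your own account name and key.

```python
import xml.etree.ElementTree as ET


def build_core_site(account_name, account_key):
    """Build core-site.xml text containing the storage account key property."""
    config = ET.Element("configuration")
    prop = ET.SubElement(config, "property")
    name = ET.SubElement(prop, "name")
    name.text = f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    value = ET.SubElement(prop, "value")
    value.text = account_key
    # ET.indent exists only on Python 3.9+; the output is valid XML either way.
    if hasattr(ET, "indent"):
        ET.indent(config)
    return '<?xml version="1.0"?>\n' + ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # "avdatarepo1" is the account name from this article; the key is a placeholder.
    print(build_core_site("avdatarepo1", "YOUR_STORAGE_ACCOUNT_KEY"))
```

The printed text can be saved as C:\Spark\conf\core-site.xml. Because the file holds a secret key, it should not be committed to source control or shared.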
Start spark-shell with the required JARs
We now start the spark-shell and tell it to load the two required JARs which implement the FileSystem interface for WASB.
C:\Spark\bin\spark-shell --jars C:\Spark\lib\hadoop-azure-2.7.0.jar,C:\Spark\lib\azure-storage-2.0.0.jar
To avoid having to specify the --jars argument every time we start the spark-shell, we can also modify the conf\spark-defaults.conf file and add the spark.jars property, set to a comma-separated list of file:/// URIs for the two required local .jar files.
Now, we can start spark-shell without the --jars argument by simply using: bin\spark-shell
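The spark.jars entry is easy to mistype, so here is a minimal Python sketch that derives it from the Windows paths used in this article. The helper name spark_jars_line is hypothetical, and the conversion assumes absolute Windows-style paths; the result is a single line with the property name, whitespace, and the comma-separated URIs.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def spark_jars_line(jar_paths):
    """Build a spark-defaults.conf line that sets spark.jars to file:/// URIs."""
    uris = []
    for raw in jar_paths:
        path = PureWindowsPath(raw)
        if not path.is_absolute():
            raise ValueError(f"expected an absolute Windows path, got: {raw}")
        # as_posix() turns backslashes into forward slashes; quote() escapes spaces.
        uris.append("file:///" + quote(path.as_posix(), safe="/:"))
    return "spark.jars " + ",".join(uris)


if __name__ == "__main__":
    print(spark_jars_line([
        r"C:\Spark\lib\hadoop-azure-2.7.0.jar",
        r"C:\Spark\lib\azure-storage-2.0.0.jar",
    ]))
    # prints: spark.jars file:///C:/Spark/lib/hadoop-azure-2.7.0.jar,file:///C:/Spark/lib/azure-storage-2.0.0.jar
```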
Access files stored in Azure Storage Blobs
Once the spark-shell starts, we are able to query the files that are stored in the Azure Storage Account that was configured in core-site.xml.
WASB URI syntax is:
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
When using the wasb:// URI scheme, Spark accesses the data from the Azure Storage Blobs endpoint using unencrypted HTTP. We can use wasbs:// instead to make sure that the data is accessed via HTTPS.
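To make the URI format concrete, the following minimal Python sketch assembles a WASB URI from its parts and defaults to the HTTPS variant. The helper name wasb_uri, the container name "mycontainer", and the blob path are hypothetical; only the account name "avdatarepo1" comes from this article.

```python
def wasb_uri(container, account, path="", secure=True):
    """Assemble a wasb:// or wasbs:// URI for a blob path in a storage container."""
    if not container or not account:
        raise ValueError("container and account must be non-empty")
    scheme = "wasbs" if secure else "wasb"
    # Strip leading slashes so the result has exactly one slash after the host.
    blob_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{blob_path}"


if __name__ == "__main__":
    print(wasb_uri("mycontainer", "avdatarepo1", "data/sample.csv"))
    # prints: wasbs://mycontainer@avdatarepo1.blob.core.windows.net/data/sample.csv
```

A string built this way can then be passed to the usual Spark read calls in the shell, for example sc.textFile(...), provided the account key for that storage account is present in core-site.xml.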
In this article, we described a way to access data stored in Azure Blob Storage from a Spark instance that is running locally on your Windows 10 computer or on another OS. The main goal was to show you which Spark configurations to set.
It is important to note that you are unlikely to use this approach to access a lot of data in Azure Blob Storage from a large production Spark cluster running at a colocation facility or in your own data center, for two reasons: (1) the latency of transferring data from the remote blobs to the Spark executor nodes, and (2) the egress data transfer charges that usually apply when moving data out of cloud services. If you are running Spark on Azure IaaS virtual machines within the same region as the Azure Blob Storage account, the latency would be reasonable and there would usually be no data egress charges. However, in that case, you might want to use the managed Apache Spark for Azure HDInsight, which has all of this pre-configured for you.
I'm looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad