Accessing Azure Storage Blobs from Spark 1.6 that is running locally


When you use HDInsight Hadoop or Spark clusters in Azure, they come pre-configured to access Azure Storage Blobs via the hadoop-azure module, which implements the standard Hadoop FileSystem interface. You can learn more about how HDInsight uses blob storage at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/

In this article, I will show how we can configure a local install of Spark 1.6.2 (e.g. running on my Windows 10 laptop, a Mac, or Linux) to read and write Azure Storage Blobs using the wasb[s]:// URI scheme.

Install Spark (if you don't have it yet)

To get started, we download Spark 1.6.2 (Jun 25 2016) with the "Pre-built for Hadoop 2.6" package type from http://spark.apache.org/downloads.html and unzip it into a directory called C:\Spark. I am walking through these steps on my Windows 10 laptop, which is why I am using C:\Spark. Your path will be different if you aren't on Windows, but the configuration concepts will be similar.

[image: Spark downloads page]

Make sure that your SPARK_HOME and HADOOP_HOME environment variables are set properly. In my case, I created a small spark-env.cmd file in the conf directory that I run before starting the spark-shell:

[image: spark-env.cmd contents]
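
For reference, a minimal spark-env.cmd might look like the following (the exact paths are assumptions based on my C:\Spark layout; adjust them to match your own install):

REM spark-env.cmd - local paths are assumptions; point them at your own Spark and Hadoop folders
set SPARK_HOME=C:\Spark
set HADOOP_HOME=C:\Hadoop
set PATH=%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%PATH%

On Windows, HADOOP_HOME should point to a folder whose bin subdirectory contains winutils.exe (see the troubleshooting article referenced below).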

To avoid seeing a lot of verbose logging in the spark-shell window, I also set my conf/log4j.properties file to log only warnings (i.e. log4j.rootCategory=WARN, console instead of log4j.rootCategory=INFO, console).

Now, we should confirm that the regular bin/spark-shell is working properly.

If you are trying this on Windows and get errors when trying to run the spark-shell, please see my other article Resolving Spark 1.6.0 "java.lang.NullPointerException, not found: value sqlContext" error when running spark-shell on Windows 10 (64-bit).

[image: spark-shell starting successfully]

Get the Two Required JAR Files

In order to access Azure Storage Blobs, we need to make sure that two required JAR files are available when we run the spark-shell:

  • hadoop-azure-2.7.0.jar
  • azure-storage-2.0.0.jar

If we are running Spark on a cluster with a regular Hadoop distribution, these files will likely already be available to Spark. When we run Spark without Hadoop, we can get these files by downloading the Hadoop 2.7.0 binaries (http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/), extracting them from the hadoop-2.7.0\share\hadoop\tools\lib folder, and putting them somewhere in our Spark folder (such as C:\Spark\lib).

[image: the two JAR files copied to C:\Spark\lib]
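
For example, after extracting the Hadoop 2.7.0 download next to C:\Spark, the two JARs can be copied with commands along these lines (the source folder is the hadoop-2.7.0\share\hadoop\tools\lib path mentioned above):

REM Copy the two required JARs from the extracted Hadoop distribution into the Spark lib folder
copy hadoop-2.7.0\share\hadoop\tools\lib\hadoop-azure-2.7.0.jar C:\Spark\lib\
copy hadoop-2.7.0\share\hadoop\tools\lib\azure-storage-2.0.0.jar C:\Spark\lib\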

Create or Modify core-site.xml

We create (or modify) the core-site.xml configuration file in the C:\Spark\conf directory so that it contains the required hadoop-azure property fs.azure.account.key.AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net, with its value set to the secret Azure Storage account access key.

[image: core-site.xml with the fs.azure.account.key property]

In the screenshot above, the Azure Storage account name is "avdatarepo1" and the key is obfuscated.
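
A minimal core-site.xml containing just this property might look like the following, using "avdatarepo1" as the account name and a placeholder instead of the real access key:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Secret access key for the Azure Storage account "avdatarepo1" (placeholder value shown) -->
  <property>
    <name>fs.azure.account.key.avdatarepo1.blob.core.windows.net</name>
    <value>YOUR_AZURE_STORAGE_ACCOUNT_ACCESS_KEY</value>
  </property>
</configuration>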

Start spark-shell with the required JARs

We now start the spark-shell and tell it to load the two required JARs which implement the FileSystem interface for WASB.

C:\Spark\bin\spark-shell --jars C:\Spark\lib\hadoop-azure-2.7.0.jar,C:\Spark\lib\azure-storage-2.0.0.jar

To avoid having to specify the --jars argument every time we start spark-shell, we can also modify the conf\spark-defaults.conf file and set the spark.jars property to the file:/// URIs of the two required local .jar files:

[image: spark-defaults.conf with the spark.jars property]
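
For example, spark-defaults.conf could contain a line like the following (the file paths are assumptions matching the C:\Spark\lib location used above):

# Make the hadoop-azure and azure-storage JARs available to spark-shell without --jars
spark.jars file:///C:/Spark/lib/hadoop-azure-2.7.0.jar,file:///C:/Spark/lib/azure-storage-2.0.0.jar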

Now, we can start spark-shell without the --jars argument by simply using: bin\spark-shell

Access files stored in Azure Storage Blobs

Once the spark-shell starts, we can query the files stored in the Azure Storage account that we configured in core-site.xml.

The WASB URI syntax is:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

[image: reading a blob via wasb:// in spark-shell]
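
As a quick test in the spark-shell, something like the following should work, assuming the "avdatarepo1" account configured above and a hypothetical container named "data" that contains a file sample.csv:

// Count and preview the lines of a blob read via the wasb:// scheme
// (the container name "data" and the file sample.csv are examples)
val lines = sc.textFile("wasb://data@avdatarepo1.blob.core.windows.net/sample.csv")
println(lines.count())
lines.take(5).foreach(println)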

When using the wasb:// URI scheme, Spark accesses the data from the Azure Storage Blobs endpoint over unencrypted HTTP. We can use wasbs:// instead to make sure that the data is accessed via HTTPS.

[image: reading a blob via wasbs://]
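
The same operations work over HTTPS simply by switching the scheme to wasbs://; for example (again with hypothetical container and path names):

// Read the same blob over HTTPS and write a filtered copy back to the storage account
val secured = sc.textFile("wasbs://data@avdatarepo1.blob.core.windows.net/sample.csv")
secured.filter(_.nonEmpty).saveAsTextFile("wasbs://data@avdatarepo1.blob.core.windows.net/output/sample-nonempty")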

Conclusion

In this article, we described a way to access data stored in Azure Blob Storage from a Spark instance that is running locally on your Windows 10 computer or on another OS. The main goal was to show you which Spark configurations to set. It is important to note that you are unlikely to use this approach to access a lot of data in Azure Blob Storage from a large production Spark cluster running at a colocation facility or in your own data center, due to (1) the latency of transferring data from the remote blobs to the Spark executor nodes and (2) the egress data transfer charges that usually apply when moving data out of cloud services. If you are running Spark on Azure IaaS virtual machines within the same region as the Azure Blob Storage account, the latency would be reasonable and there would usually be no data egress charges. However, in that case, you might want to use the managed Apache Spark for Azure HDInsight, which has all of this pre-configured for you.

I'm looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad

Comments (7)

  1. D'Blob says:

    Do we have java api examples for reading blob files in spark job?

  2. mkhl says:

    heh.. it went smoothly with 2.0

  3. Tomas says:

    Thank you.

  4. Joy says:

    I am getting the below exception when trying to access a storage account from Spark. It works on one machine but not on another. Same storage account. What are the places to look? I am a little new to the Spark environment.

    Caused by: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature
    at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:162)
    at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
    at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:177)
    at com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(CloudBlob.java:1468)
    at com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:255)

    1. The exception sounds like the storage account name and/or key are not configured properly in the core-site.xml file.

      1. Subhankar says:

        Hello Arsen,
        I am getting errors when using this with a Linux-based HDP cluster. Ever since I set up the default Hadoop file system on Azure Blob, Spark has been unable to start. It shows errors ranging from HiveSession initialization issues to AzureStorageException concerns. I have tried multiple options and have even made sure that the relevant JAR files are present, but to no effect. I have raised this query on the Hortonworks community but have received no response yet. Here is the link to the same:

        https://community.hortonworks.com/questions/167906/we-are-unable-to-access-sparkspark2-when-we-change.html

        I request you to kindly take a look and let me know if you have any suggestion to the above. Also, if you need further information on the configuration, kindly let me know.
        Looking forward to receiving a response from you. Thank you

        1. To debug this, can you try to remove all of the WASB settings (that you mention in your HDP question at the link provided) from your core-site files and only add one setting there for fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net. Don't point the Spark settings at WASB and try again (so that Spark continues to use its default locations for events and history). Then use a simple sc.textFile("wasb://…").count() to test communication via the shell. When starting spark-shell, provide the location of the JARs via the --jars parameter just to make sure.
