Run H2O.ai in R on Azure HDInsight

06/26/2017

This blog post is authored by Daisy Deng and Abhinav Mithal of the Cloud AI group at Microsoft.

In our previous blog post, we introduced H2O.ai on Azure HDInsight. Currently, H2O can run on Azure HDInsight through its Python or Scala APIs, but R support doesn’t come out of the box. R is popular in the data science community, and many of our customers have asked for R support in H2O.ai on Azure HDInsight. Today, we share our solution for enabling R support for H2O.ai on Azure HDInsight.

Run Apache Spark and H2O in R on HDInsight

Data scientists consistently ask for R on any big data platform. With sparklyr, R users can access the capabilities of Apache Spark. With the help of Sparkling Water and rsparkling, R users can run H2O analytics on top of a Spark cluster, combining the machine learning power of H2O with the data processing of Apache Spark, all in the familiar R programming language.

Setting up R Environment to Run H2O On HDInsight

We provide a few script actions for installing rsparkling on Azure HDInsight. A script action invokes a custom script that customizes the cluster, either at cluster creation time or after the cluster has been created. When creating the HDInsight cluster, you can run the following script action to install the H2O-related R packages on the head nodes and worker nodes: https://bostoncaqs.blob.core.windows.net/scriptaction/install-h2opackages.sh. Please consult our previous blog post for details on cluster creation and script actions.

Install RStudio As a Custom Application on HDInsight

RStudio can be installed as a custom application on Azure HDInsight so that it can be accessed securely through a URI provided by HDInsight. To do so, follow the steps described next. Once your HDInsight cluster is deployed (you can check the status of your cluster under “Deployments” in the corresponding resource group), go to https://portal.azure.com/#create/Microsoft.Template, click “Build your own template in the editor”, and paste the content at https://bostoncaqs.blob.core.windows.net/scriptaction/install-rstudio-as-hdi-app.json into the editor window.

Click “Save” at the bottom left to save the ARM template. The Azure portal will then show you the following picture:

Fill in the proper Azure subscription, resource group, cluster name, and location, agree to the terms, and click “Purchase”. The deployment of RStudio will then start. After the deployment finishes, go to your cluster in the Azure portal, go to “Application”, and click the installed application “rstudio” to get the URI for accessing RStudio on the edge node.

Access RStudio and Run H2O.ai

As shown in the above picture, use the “admin” user to log on to the cluster, and use “sshuser” or any other user you create to access RStudio.

At this point, you are ready to use RStudio to run H2O machine learning on your HDInsight cluster. Remember to add options(rsparkling.sparklingwater.version = "2.0.8") before loading the rsparkling library, i.e. before library(rsparkling). Please be aware that these instructions work with HDInsight 3.5 and Spark version 2.0.2. If you wish to install another version of rsparkling and related packages, please check https://github.com/h2oai/rsparkling/blob/master/README.md for the version match between Spark, H2O and Sparkling Water, and set the proper Sparkling Water version before loading the rsparkling library. You can also download the scripts, change versions, add additional libraries for your application, and run the script action after the cluster is up and running.

An example of how to use the cluster to run H2O.ai is shown below:
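The version check described above can be sketched in a few lines of base R. This is a minimal sketch, not part of rsparkling: the helper sw_matches_spark is hypothetical, and it assumes the convention that a Sparkling Water release targets the Spark release with the same major.minor number (for example, Sparkling Water 2.0.x for Spark 2.0.x). The rsparkling README remains the authoritative source for version matching.

```r
# Hypothetical helper (not part of rsparkling): checks that a Sparkling Water
# version targets the given Spark version by comparing major.minor components.
# Assumption: Sparkling Water 2.0.x pairs with Spark 2.0.x, and so on.
sw_matches_spark <- function(sw_version, spark_version) {
  major_minor <- function(v) {
    parts <- strsplit(v, ".", fixed = TRUE)[[1]]
    paste(parts[1:2], collapse = ".")
  }
  major_minor(sw_version) == major_minor(spark_version)
}

spark_version <- "2.0.2"  # Spark version on HDInsight 3.5, per this post
sw_version    <- "2.0.8"  # Sparkling Water version used in this post

if (!sw_matches_spark(sw_version, spark_version)) {
  stop("Sparkling Water ", sw_version, " does not target Spark ", spark_version)
}

# Set the option first, then load rsparkling so that it picks up the version.
options(rsparkling.sparklingwater.version = sw_version)
if (requireNamespace("rsparkling", quietly = TRUE)) {
  library(rsparkling)
}
```

Setting the option before the library call matters because, as noted above, rsparkling has to see the Sparkling Water version before it is loaded.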

    options(rsparkling.sparklingwater.version = "2.0.8")
    library(rsparkling)
    library(h2o)
    library(sparklyr)
    library(dplyr)

    ## The following is the Spark cluster configuration
    conf <- spark_config()
    conf$spark.driver.memory <- "8G"
    conf$spark.executor.cores <- 8
    conf$spark.executor.memory <- "8G"
    conf$spark.num.executors <- 3
    conf$spark.ext.h2o.cluster.size <- 3
    conf$spark.ext.h2o.default.cluster.size <- 3
    conf$spark.yarn.am.cores <- 2
    conf$spark.yarn.am.memory <- "8G"
    conf$spark.dynamicAllocation.enabled <- "false"
    conf$maximizeResourceAllocation <- "true"
    conf$spark.default.parallelism <- 48
    conf$spark.rpc.message.maxSize <- 1024

    ## Configure the location of Spark home
    Sys.setenv(SPARK_HOME = '/usr/hdp/current/spark2-client')

    ## Start a Spark connection and an H2O connection
    sc <- spark_connect(master = "yarn-client", config = conf, version = "2.0.2")
    h2o_context <- h2o_context(sc, strict_version_check = TRUE)
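Once the Spark connection and H2O context are up, a typical next step is to move a Spark DataFrame into H2O and train a model on it. The following is a minimal sketch rather than part of the original walkthrough: the function name fit_mtcars_glm, the choice of the built-in mtcars data set, and the predictor columns are all illustrative assumptions. It relies on copy_to from dplyr/sparklyr, as_h2o_frame from rsparkling, and h2o.glm from h2o.

```r
# Hypothetical helper, not from the original post: fits a GLM on mtcars through
# H2O, given an open sparklyr connection `sc` on which h2o_context() has
# already been started (as in the connection code above).
fit_mtcars_glm <- function(sc) {
  # Copy a small built-in data set into Spark.
  mtcars_tbl <- dplyr::copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

  # Convert the Spark DataFrame into an H2OFrame.
  mtcars_hf <- rsparkling::as_h2o_frame(sc, mtcars_tbl)

  # Train a GLM that predicts mpg from weight and cylinder count.
  h2o::h2o.glm(
    x = c("wt", "cyl"),
    y = "mpg",
    training_frame = mtcars_hf
  )
}

# Usage on the cluster, after the connection code above has run:
# model <- fit_mtcars_glm(sc)
# print(model)
# sparklyr::spark_disconnect(sc)
```

Wrapping the steps in a function keeps the sketch free of side effects until it is called with a live connection. On a real workload you would replace mtcars with a table already stored in the cluster.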

Enable SSH Tunneling to Access RStudio and Run H2O.ai

If you choose not to install RStudio as a custom application, a quick and dirty way to access RStudio is to first establish an SSH tunnel (by following the documentation here), and then access RStudio on your edge node through a web browser at https://localhost:8787 (RStudio uses port 8787 by default). If you choose to use SSH tunneling to access RStudio, you can install RStudio by running https://bostoncaqs.blob.core.windows.net/scriptaction/install-rstudio-withh2o.sh on the edge node through a script action.