Run H2O.ai in R on Azure HDInsight


In our previous blog post, we introduced H2O.ai on Azure HDInsight. Currently, H2O on Azure HDInsight can be used through its Python or Scala APIs, but R support does not come out of the box. R is popular in the data science community, and many of our customers have asked for R support in H2O.ai on Azure HDInsight. Today, we share our solution for enabling it.

Run Apache Spark and H2O in R on HDInsight

Data scientists have always had a high demand for R on any big data platform. With sparklyr, R users can access the capabilities of Apache Spark. With the help of Sparkling Water and rsparkling, R users can run H2O analytics on top of a Spark cluster, combining the machine learning power of H2O with the data processing of Apache Spark, all in the familiar R programming language.

Setting up R Environment to Run H2O On HDInsight

We provide script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, run the following script action on the head node:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-head.sh

And run the following script action on the worker nodes:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-worker.sh

Please consult "Customize Linux-based HDInsight clusters using Script Action" for more details.

Install RStudio and enable SSH tunneling

After the above script actions have run, RStudio will be installed on your head node. You first need to establish an SSH tunnel (by following the documentation here); you can then access RStudio on your head node through a web browser at https://localhost:8787 (RStudio uses port 8787 by default), and you are ready to run H2O machine learning on your HDInsight cluster. Remember to set options(rsparkling.sparklingwater.version = "2.0.8") before loading the rsparkling library, i.e. before calling library(rsparkling).
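As a rough illustration of the ordering described above, an R session in RStudio on the head node might look like the sketch below. The Sparkling Water version comes from this post. The "yarn-client" master, the mtcars example data, and the choice of a GLM model are assumptions made for illustration only and are not part of the script actions; adjust them to your cluster and data.

```r
# Pin the Sparkling Water version BEFORE loading rsparkling (version taken from this post).
options(rsparkling.sparklingwater.version = "2.0.8")

library(sparklyr)
library(rsparkling)
library(h2o)

# Assumption: Spark on the cluster is reached through YARN in client mode.
sc <- spark_connect(master = "yarn-client")

# Start (or attach to) an H2O context on top of the Spark connection.
h2o_context(sc)

# Illustrative data only: copy R's built-in mtcars data set to Spark,
# then convert the Spark DataFrame to an H2O frame.
mtcars_tbl <- sdf_copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)

# Illustrative model only: fit an H2O GLM predicting mpg from wt and cyl.
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_hf)

# Release the Spark connection when finished.
spark_disconnect(sc)
```

If the options() call is placed after library(rsparkling), the version setting may not take effect, which is why it appears first in the sketch.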

Please be aware that these instructions work with HDInsight 3.5 and Spark version 2.0.2. If you wish to install another version of rsparkling and its related packages, please check https://github.com/h2oai/rsparkling/blob/master/README.md for the matching versions of Spark, H2O and Sparkling Water, and set the proper Sparkling Water version before loading the rsparkling library. You can also download the scripts, add any additional libraries your application needs, and run the script actions after the cluster is up and running.

This blog post was authored by Daisy Deng of the Cloud AI group at Microsoft.

