Benefits of Microsoft R and R Server – Quick Look


Scale your machine learning workloads on R (series)

In my previous post, I described how to leverage your R skills with Microsoft technologies for ordinary business users (marketers, business analysts, etc.) using Power BI.
In this post, I describe the benefits of Microsoft R technologies for professional developers (programmers, data scientists, etc.), with a few lines of code.

Designed for multithreading

R is the most popular statistical programming language, but it has some drawbacks for enterprise use. The biggest one is the lack of parallelism.

Microsoft R Open (MRO) is the renamed successor of the well-known Revolution R Open (RRO). By using MRO you can take advantage of multithreading and high performance, while MRO remains compatible with the base functions of open source (CRAN) R.
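MRO achieves this speed-up through the Intel Math Kernel Library (MKL). If MKL is installed, you can check and control the number of threads it uses. The following is a minimal sketch (assuming the RevoUtilsMath package, which ships with the MKL download):

# Inspect and control the MKL thread count (requires MRO with MKL)
library(RevoUtilsMath)
getMKLthreads()    # number of threads MKL currently uses
setMKLthreads(4)   # e.g., pin MKL to 4 threads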

Note : You can also use other options (snow, parallel, etc.) for parallel computing in R.
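For instance, here is a minimal sketch of explicit parallelism with the built-in parallel package (independent of MRO's multithreading):

# Explicit parallelism with the built-in "parallel" package
library(parallel)
cl <- makeCluster(detectCores() - 1)    # leave one core free
res <- parLapply(cl, 1:8, function(i) sum(rnorm(1e6)))
stopCluster(cl)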

Please see the following official document for benchmark details.

The Benefits of Multithreaded Performance with Microsoft R Open
https://mran.microsoft.com/documents/rro/multithread/

For example, this document says that matrix manipulation is many times faster than in open source R. Let's look at the following simple example.
A is a 10000 x 5000 matrix whose elements are the repeated values 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, ..., and B is the cross-product matrix of A. Here we measure this cross-product operation using system.time().

A <- matrix(1:5, 10000, 5000)     # 10000 x 5000 matrix of repeated 1..5
system.time(B <- crossprod(A))    # B = t(A) %*% A, timed

The results are as follows.
Here I'm using a Lenovo X1 Carbon (with an Intel Core vPro i7), and MRO is over 8 times faster than open source R.

R 3.3.2 [64 bit] (CRAN)

Microsoft R 3.3.2 [64 bit] (MRO)
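By the way, you can confirm which runtime you are actually benchmarking: sessionInfo() prints the R version and, on MRO, the attached Revo packages (e.g., RevoUtils, RevoUtilsMath).

sessionInfo()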

Analysis functions over large amounts of data are also faster than in the standard R runtime.

Note : If you're using RStudio and have installed both open source R and MRO, you can switch the R runtime used by RStudio from the [Tools] – [Global Options] menu.

 

Distribution and Scaling

By using Microsoft R Server (formerly Revolution R Enterprise), you can also distribute and scale R computations across multiple computers.
R Server can run on Windows (SQL Server), Linux, Teradata, and Hadoop (Spark) clusters.
You can also run R Server as one of the workloads on Spark (see the following illustration), and we use a Spark cluster in this post. Using R Server on Spark, you can distribute R algorithms across the Spark cluster.

Note : For Windows, R Server is licensed under SQL Server. You can easily get a standalone R Server (with SQL Server 2016 Developer edition) by using the virtual machine called "Data Science Virtual Machine" in Microsoft Azure.

Note : Spark MLlib is also a machine learning component that harnesses the computing power of Spark, but Python, Java, and Scala are its primary programming languages. (Currently, most of its functions are not supported in R.)

Note : You can also use SparkR, but remember that SparkR currently focuses on data transformation for R computing (i.e., it is not yet mature).

Here I skip how to set up R Server on Spark, but the easiest way is to use Azure Hadoop clusters (called "HDInsight"). You can set up your own experimental environment with just a few steps, as follows.

  1. Create an R Server on Spark cluster (Azure HDInsight). You just enter several values in the HDInsight cluster creation wizard on the Azure Portal, and all the cluster nodes (head nodes, edge nodes, worker nodes, ZooKeeper nodes) are set up automatically.
    Please see "Microsoft Azure – Get started using R Server on HDInsight" for details.
  2. If needed, set up RStudio connected to the Spark cluster (edge node) above. Note that RStudio Server Community Edition is automatically installed on the edge node by HDInsight, so you only need to set up your client environment. (Currently you don't need to install RStudio Server by yourself.)
    Please see "Installing RStudio with R Server on HDInsight" for the client setup.

Note : Here I used Azure Data Lake Store (not Azure Storage Blobs) as the primary storage for the Hadoop cluster. (I also set up a service principal and its access permissions for this Data Lake account.)
For more details, please refer to "Create an HDInsight cluster with Data Lake Store using Azure Portal".

Using RStudio Server on the edge node of the Spark cluster, you can use RStudio in the web browser through an SSH tunnel. (See the following screenshot.)
It's a very convenient way to run and debug your R scripts on R Server.

Here I prepared source data (Japanese stock daily reports) of over 35,000,000 records (over 1 GB).
When I run my R script against this huge data set on my local computer, the script fails with an allocation error or a timeout. In such a case, you can solve the problem by using R Server on a Spark cluster.

Now, here's the R script that I run on R Server.

##### The format of source data
##### (company-code, year, month, day, week, open-price, difference)
#
#3076,2017,1,30,Monday,2189,25
#3076,2017,1,27,Friday,2189,-1
#3076,2017,1,26,Thursday,2215,-29
#...
#...
#####

# Set the Spark compute context
spark <- RxSpark(
  consoleOutput = TRUE,
  extraSparkConfig = "--conf spark.speculation=true",
  nameNode = "adl://jpstockdata.azuredatalakestore.net",
  port = 0,
  idleTimeout = 90000
)
rxSetComputeContext(spark)

# Import data
fs <- RxHdfsFileSystem(
  hostName = "adl://jpstockdata.azuredatalakestore.net",
  port = 0)
colInfo <- list(
  list(index = 1, newName="Code", type="character"),
  list(index = 2, newName="Year", type="integer"),
  list(index = 3, newName="Month", type="integer"),
  list(index = 4, newName="Day", type="integer"),
  list(index = 5, newName="DayOfWeek", type="factor",
       levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
  list(index = 6, newName="Open", type="integer"),
  list(index = 7, newName="Diff", type="integer")
)
orgData <- RxTextData(
  fileSystem = fs,
  file = "/history/testCsv.txt",
  colInfo = colInfo,
  delimiter = ",",
  firstRowIsColNames = FALSE
)

# execute : rxLinMod (lm)
system.time(lmData <- rxDataStep(
  inData = orgData,
  transforms = list(DiffRate = (Diff / Open) * 100),
  maxRowsByCols = 300000000))
system.time(lmObj <- rxLinMod(
  DiffRate ~ DayOfWeek,
  data = lmData,
  cube = TRUE))

# If needed, predict (rxPredict) using the trained model, or save it.
# Here we just plot the means of DiffRate for each DayOfWeek.
lmResult <- rxResultsDF(lmObj)
rxLinePlot(DiffRate ~ DayOfWeek, data = lmResult)
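
# (A sketch of scoring, if needed: rxPredict returns per-row predictions.
#  The call below is illustrative and left commented out.)
# predData <- rxPredict(modelObject = lmObj, data = lmData,
#                       writeModelVars = TRUE)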

# execute : rxCrossTabs (xtabs)
system.time(ctData <- rxDataStep(
  inData = orgData,
  transforms = list(Close = Open + Diff),
  maxRowsByCols = 300000000))
system.time(ctObj <- rxCrossTabs(
  formula = Close ~ F(Year):F(Month),
  data = ctData,
  means = TRUE
))
print(ctObj)

Before running this script, I uploaded the source data to the Azure Data Lake Store that serves as the primary storage of the Hadoop cluster. "adl://..." is the URI of the Azure Data Lake Store account.

The functions prefixed with "rx" or "Rx" above are called ScaleR functions (functions in the RevoScaleR package). These functions are provided for distributed and scalable computing, and each ScaleR function is the scaling counterpart of a corresponding base R function. For example, RxTextData corresponds to read.table or read.csv, rxLinMod corresponds to lm (linear regression model), and rxCrossTabs corresponds to xtabs (cross-tabulation).
You can use these R functions to leverage the computing power of Hadoop clusters. (See the following reference documents for details.)

Microsoft R – RevoScaleR Functions for Hadoop
https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler-hadoop-functions

Microsoft R – Comparison of Base R and ScaleR Functions
https://msdn.microsoft.com/en-us/microsoft-r/scaler/compare-base-r-scaler-functions
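
To see this correspondence concretely, the following is a minimal sketch that fits the same model with base R and with ScaleR on a small local data frame. (It runs in the local compute context; "sample.csv" is a hypothetical local extract of the stock data.)

# Base R vs. ScaleR counterpart on a small local sample
df <- read.csv("sample.csv", header = FALSE,
               col.names = c("Code", "Year", "Month", "Day",
                             "DayOfWeek", "Open", "Diff"))
df$DiffRate <- (df$Diff / df$Open) * 100
fitBase  <- lm(DiffRate ~ DayOfWeek, data = df)         # base R
fitScale <- rxLinMod(DiffRate ~ DayOfWeek, data = df)   # ScaleR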

Note : For more details (descriptions, arguments, etc.) about each ScaleR function, please type "?{function name}" (e.g., "?rxLinePlot") in the R console.

Note : You can also use faster modeling functions implemented by Microsoft, called MicrosoftML (MML). (This package also includes functionality for anomaly detection and deep neural networks.) Currently these functions run only on Windows and not on Spark clusters, but this will be updated in the future.
See "Building a machine learning model with the MicrosoftML package".

The following illustrates the topology of the Spark cluster. The R Server workloads reside on the edge node and the worker nodes.

The edge node plays the role of the development front end, and you interact with R Server through this node. (Currently, R Server on HDInsight is the only cluster type that provides an edge node by default.)
For example, RStudio Server is installed on this edge node, and when you run your R scripts through RStudio in the web browser, this node starts the computations and distributes them to the worker nodes. If a computation cannot be distributed for some reason (whether intentionally or accidentally), the task runs locally on this edge node.
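Which environment actually executes the computation is determined by the compute context. For instance, a minimal sketch (reusing the spark context object defined in the script above):

# Run ScaleR functions locally on the edge node (e.g., for debugging)
rxSetComputeContext("local")
# ... test your ScaleR calls against a small sample here ...

# Switch back to distributed execution on the Spark cluster
rxSetComputeContext(spark)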

While the script is running, look at YARN in the Hadoop Resource Manager (rm). You will find the running ScaleR application in the scheduler. (See the following screenshot.)

When you monitor the worker nodes in the Resource Manager UI, you can see that all nodes are used for the computation.

The ScaleR functions are also provided in Microsoft R Client (built on top of Microsoft R Open), which can run on a standalone computer. (You don't need extra servers.)
Using Microsoft R Client, you can send complete R commands to a remote R Server for execution (using the mrsdeploy package). Or you can learn and test these ScaleR functions on your local computer with Microsoft R Client, and later migrate to distributed clusters.
I think it's a better idea to use Microsoft R Client at development time, because there is a lot of overhead (fees, provisioning, etc.) in using Hadoop clusters during development.
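For example, remote execution with mrsdeploy might look like the following minimal sketch. (The endpoint URL and credentials are placeholders, and it assumes the operationalization feature is configured on the R Server.)

# Remote execution from Microsoft R Client via the mrsdeploy package
library(mrsdeploy)
remoteLogin("https://myrserver.example.com:12800",   # placeholder endpoint
            username = "admin", password = "********",
            session = TRUE, commandline = FALSE)
result <- remoteExecute("x <- rnorm(1000); mean(x)") # runs on the server
remoteLogout()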

 

You can take full advantage of a robust computing platform with Microsoft R technologies!
