Author: Benjamin Wright-Jones
Contributors: Sander Timmer, Derek Norton
Reviewers: Anderson Chan
The results of the recent IEEE Survey (2015) clearly show the rising interest in R (the linga franca of data scientists). In SQL Server 2016, R Services will be available, leveraging the highly scalable and parallel algorithms from the Revolution Analytics engine.
SQL Server 2016 Community Technology Preview (CTP) 3.x onwards provides the ability to enable and experiment with R Services in a familiar database development environment. In this blog, I will refer to the SQL Server 2016 RC2 setup process and documentation links, describing a simple use case and the visualisation of the R plots using Transact-SQL and R.
The recent SQL Server 2016 RC2 release greatly simplifies the installation of the R Services for the database engine; please refer to the following article: Set up SQL Server R Services (In-Database) https://msdn.microsoft.com/en-us/library/mt696069.aspx. In general, there are two use cases for SQL Server R Services, 1) analysing SQL Server data in database (the data is located in a SQL Server database and the computations are performed at the location of the data) which is likely to be the most common pattern and 2) data is extracted from the SQL Server database using an ODBC connection and computations are performed on a computer alongside the SQL Server Platform.
Executing R scripts for Transact-SQL is provided through the system stored procedure sp_execute_external_script
which takes accepts a number of parameters, an example of which is provided below. The full syntax explanation is available here: https://msdn.microsoft.com/en-US/library/mt604368.aspx
@language = N'R'
, @script = N'OutputDataSet <- InputDataSet'
, @input_data = N'SELECT 1 as Col'
WITH RESULT SETS ((col INT NOT NULL));
Next, I thought it would be useful to find out what packages are installed by default. There is a sample script provided in the documentation but I wrote my own (shown below) as I wanted to see details such as directory, version, license.
@language = N'R'
, @script = N'
OutputDataSet <- data.frame(installed.packages());
rPackages <- subset( OutputDataSet, select = -c(Priority, Depends,
Imports, LinkingTo, Suggests, Enhances, License_is_FOSS,
License_restricts_use, OS_type, MD5sum, NeedsCompilation ))'
, @input_data_1 = N''
, @output_data_1_name = N'rPackages'
WITH RESULT SETS
"Package" varchar(100) not null,
"LibPath" varchar(100) not null,
"Version" varchar(100) not null,
"License" varchar(100) not null,
"Built" varchar(10) not null
Data type mapping
The data types implemented in SQL Server 2016 and R are not equal and you have to take care when mapping these in the sp_execute_external_script system stored procedure.
Fortunately, this is where the UNDEFINED keyword becomes useful.
@language = N'R'
, @script = N' str(InputDataSet);'
, @input_data_1 = N' SELECT TOP 10 * FROM [dbo].[tSharkAttackData];'
WITH RESULT SETS UNDEFINED;
Below is a sample of the output using a data set from here http://www.sharkattackfile.net/incidentlog.htm. You will notice I have a lot of categorical variables. There are ten (10) observations (because I restrict the result set) of fifty-seven (57) variables, all of which are referenced as 'Factor'. This is simply because each of the variables are defined as varchar(100) in SQL Server tables.
The data type guidance has recently been updated for the RC2 release https://msdn.microsoft.com/en-US/library/mt590948.aspx which describes the mappings in more detail.
Default library paths
Installing additional R packages in SQL Server 2016 is very simple. You can use the standard R syntax install.packages. However, I recommend checking your default library path prior to installing any additional R packages for SQL Server 2016, otherwise these may be mistakenly installed in the wrong path. Fortunately, this is a simple process. Executing .libPaths() in any R console (for example, using R Tools in Visual Studio 2015 or R Studio) will return the paths which are configured for your environment. If you discover the default path is not correct, then you can either change the default or you can specify the installation directory during installation of R libraries for SQL Server 2016. For my purposes, I specified the following command to install a few additional R libraries.
install.packages(c("labeling"), "C:\\Program Files\\RRO\\RRO-3.2.2-for-RRE-7.5.0\\R-3.2.2\\library", dependencies=T);
You will subsequently be able to verify the libraries has been installed correctly by executing the previously mentioned script (to validate the installed R libraries).
Returning data from R is relatively easy, the stored procedure below was implemented to test a simple query to return an R plot. I used Shark Attack data sets from the following site http://www.sharkattackfile.net/incidentlog.htm. There are no predictive functions implemented in the below example; the script is a simply injecting a T-SQL result set to return a R plot using the ggplot2 library but hopefully provides a useful example.
CREATE PROCEDURE usp_SharkDataAttacksByYear
@script = N'
image_file <- tempfile();
jpeg(filename = image_file, width = 500, height = 500);
df <- dSharkData;
plot(df$NumAttacks, type="o", x=df$Year, xlab="Year", ylab="Number of Attacks", col="blue", pch=18);
title(main="Number of Attacks by Area", col.main="red", font.main=4);
OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6));',
@input_data_1 = N'
SELECT COUNT(*) AS NumAttacks, Country, Area, Year, Type, Activity, Sex
GROUP BY COUNTRY, Area, Year, TYPE, Activity, Sex
ORDER BY NumAttacks DESC;',
@input_data_1_name = N'dSharkData',
@output_data_1_name = N'OutputDataSet'
WITH RESULT SETS
The R script returns an output as a varbinary(max) data type (represented as a series of alpha numerics in the SQL Server result pane). Visualisation tools such as Report Builder or Reporting Services can convert and display this data graphically. I used SQL Server 2016 Report Builder RC2 to display the R plot results, by binding the output of the R script to an image object. There are a few minor configuration steps e.g. connecting to the SQL Server instance and specifying the connection string and appropriate fields.
Data Pipeline Options
The integration of R services into SQL Server 2016 introduces a number of interesting data flow possibilities. With Azure Data Factory, calls to SQL Server 2016 can automate R processes in the pipeline. The Data Management Gateway also provides the option to leverage existing on-premise SQL Server investments with connectivity to Azure services. In contrast, SQL Server Integration Services can execute R commands in the control or data flow to enable alternative scenarios. Power BI, coupled with the Enterprise Data Management Gateway, also permits DirectQuery execution against SQL Server 2016, enabling pass through of R results back to Power BI dashboards.
Training and Modeling
The next blog post will discuss training and modeling using the RevoScaleR package and also the use of the @parallel option in the sp_execute_external_script system stored procedure.
SQL Server 2016 R Services https://msdn.microsoft.com/en-US/library/mt604845.aspx
SQL Server sp_execute_external_script https://msdn.microsoft.com/en-US/library/mt604368.aspx
SQL Server R Services Tutorials https://msdn.microsoft.com/en-us/library/mt591993.aspx