Operationalize your machine learning project using SQL Server 2016 SSIS and R Services

With the release of CTP3 SQL Server 2016 and its native In-database support for the open source R language (SQL Server R Services), users can now call both R and RevoScaleR functions and scripts directly from within a SQL query and benefit from multi-threaded and multi-core in-DB computations. The R integration brings the utility of data science to your applications without the need to ‘export’ the data to your R environment. Moreover, users can now use SQL Server Integration Services (SSIS) to:

          extract data from various on-premises and/or cloud sources to build training data

          extract data from various on-premises and/or cloud sources to prepare production data

          operationalize your R code to build and run a R model as part of your data integration workflow

          operationalize your R code to build a workflow to retrain your R model regularly

          load result from your R script to other destinations such as Excel, Oracle, Teradata and more…

Today, I will use the Adventure Works samples for SQL Server 2016 CTP3 to showcase how we can use SSIS to operationalize a R prediction from doing data preparation, to using the training data to build and save the “trained” model and running prediction using the trained model.


Preparing Data

In this specific example, we will use the IRIS flower dataset from Ronald Fisher that is built-in dataset from R as our data source and we will load this dataset into a SQL Server table called IRIS_RX_DATA. This will be our training data.

First, we create an empty IRIS_RX_DATA table in SQL Server


 Then, we will create a stored procedure called GET_IRIS_DATASET to get the IRIS built-in dataset in R as our data source for training data


 And then we use the SQL Server Integration Service Execute SQL Task to get the IRIS dataset and store the data into the IRIS_RX_DATA table like below





Of course, training data for your business model can be from many different sources. SQL Server Integration Service has many source tasks that allows you to extract data and store it into the SQL Server table like above, depends on where the training data is located:

          ADO.NET source

          Excel source

          Flat File source

          OLE DB source

          ODBC source

          Raw File source

          XML source

          Azure Blob source

          HDFS source

          OData source

          Teradata source

          Oracle source


With the rich connectivity support in SSIS 2016, we can extract data pretty much from everywhere (cloud or on-premises) and in any formats. Also, we can use the SSIS built-in data transformation tasks to achieve data cleansing and data masking as needed before loading it to the SQL Server destination.

Operationalizing creation / update of the “trained” model using latest training data

Now once the training data is prepared in the SQL Server, we can build or refresh a model within SQL Server. Integration Services allows us to operationalize and automate this model building and re-training workflow easily

Let’s try to first create an empty SQL Server table to store the trained model,


Then, we will create a stored procedure called GENERATE_IRIS_RX_MODEL, which has an embedded R script to create a “trained” model using the training data from IRIS_RX_DATA and the built-in rxLinMod linear model function


Here we again use the SQL Server Integration Service Execute SQL Task to execute the stored procedure GENERATE_IRIS_RX_MODEL and store the newly trained model into the IRIS_RX_MODEL table like below




Operationalizing prediction and scoring using the “trained” model

Now we have the “trained” model being stored in the IRIS_RX_MODEL table. Let’s try to run a prediction using it. All we need to do is to run an embedded R script in the SQL query to trigger the rxPredict built-in R function to run the prediction using the model we “trained” above. For manageability, we create a stored procedure in SQL Server called PREDICT_SPECIES_LENGTH to do this


 Again, we use the SQL Server Integration Service Execute SQL Task to execute the stored procedure PREDICT_SPECIES_LENGTH to run the prediction using the trained model




 Of course, we can always use the wide range of destination task to load the prediction output in the SQL Server table or back to the data warehouse or other SSIS-supported destinations, such as:

          ADO NET destination

          Excel destination

          Flat File destination

          OLE DB destination

          ODBC destination

          Raw File destination

          Recordset destination

          SQL Server Destination

          SQL Server Compact destination

          DataReader destination

          HDFS destination

          Azure Blob destination

          Data Streaming Destination



Operationalizing model retraining workflow

Now, we can even build a SSIS package that is run on a regular basis (e.g. once a day or week) to automatically trigger a model retraining workflow just like below:



After extracting the production data, we can create a script task to analyze both the production and training data and decide if the model retraining is necessary. We can add precedence constraint to the script task such that only Remodel Needed constraint is met then the R script is triggered to retrain the model.



Building an advanced analytics data pipeline with machine learning capabilities is just like building LEGO, you can build it with different patterns and possibilities. Now you can use SQL Server Integration Service along with the SQL Server R Services in SQL Server 2016 to operationalize your on-premises machine learning project with R, from data preparation, prediction execution, to model retraining.

Comments (0)

Skip to main content