Operationalize your machine learning project using SQL Server 2016 SSIS and R Services


With the release of CTP3 SQL Server 2016 and its native In-database support for the open source R language (SQL Server R Services), users can now call both R and RevoScaleR functions and scripts directly from within a SQL query and benefit from multi-threaded and multi-core in-DB computations. The R integration brings the utility of data science to your applications without the need to ‘export’ the data to your R environment. Moreover, users can now use SQL Server Integration Services (SSIS) to:

-          extract data from various on-premises and/or cloud sources to build training data

-          extract data from various on-premises and/or cloud sources to prepare production data

-          operationalize your R code to build and run a R model as part of your data integration workflow

-          operationalize your R code to build a workflow to retrain your R model regularly

-          load result from your R script to other destinations such as Excel, Oracle, Teradata and more…

Today, I will use the Adventure Works samples for SQL Server 2016 CTP3 to showcase how we can use SSIS to operationalize a R prediction from doing data preparation, to using the training data to build and save the “trained” model and running prediction using the trained model.

 

Preparing Data

In this specific example, we will use the IRIS flower dataset from Ronald Fisher that is built-in dataset from R as our data source and we will load this dataset into a SQL Server table called IRIS_RX_DATA. This will be our training data.

First, we create an empty IRIS_RX_DATA table in SQL Server

 

 Then, we will create a stored procedure called GET_IRIS_DATASET to get the IRIS built-in dataset in R as our data source for training data

 

 And then we use the SQL Server Integration Service Execute SQL Task to get the IRIS dataset and store the data into the IRIS_RX_DATA table like below

 

 

 

 

Of course, training data for your business model can be from many different sources. SQL Server Integration Service has many source tasks that allows you to extract data and store it into the SQL Server table like above, depends on where the training data is located:

-          ADO.NET source

-          Excel source

-          Flat File source

-          OLE DB source

-          ODBC source

-          Raw File source

-          XML source

-          Azure Blob source

-          HDFS source

-          OData source

-          Teradata source

-          Oracle source

 

With the rich connectivity support in SSIS 2016, we can extract data pretty much from everywhere (cloud or on-premises) and in any formats. Also, we can use the SSIS built-in data transformation tasks to achieve data cleansing and data masking as needed before loading it to the SQL Server destination.

Operationalizing creation / update of the “trained” model using latest training data

Now once the training data is prepared in the SQL Server, we can build or refresh a model within SQL Server. Integration Services allows us to operationalize and automate this model building and re-training workflow easily

Let’s try to first create an empty SQL Server table to store the trained model,

 

Then, we will create a stored procedure called GENERATE_IRIS_RX_MODEL, which has an embedded R script to create a “trained” model using the training data from IRIS_RX_DATA and the built-in rxLinMod linear model function

 

Here we again use the SQL Server Integration Service Execute SQL Task to execute the stored procedure GENERATE_IRIS_RX_MODEL and store the newly trained model into the IRIS_RX_MODEL table like below

 

  

 

Operationalizing prediction and scoring using the “trained” model

Now we have the “trained” model being stored in the IRIS_RX_MODEL table. Let’s try to run a prediction using it. All we need to do is to run an embedded R script in the SQL query to trigger the rxPredict built-in R function to run the prediction using the model we “trained” above. For manageability, we create a stored procedure in SQL Server called PREDICT_SPECIES_LENGTH to do this

 

 Again, we use the SQL Server Integration Service Execute SQL Task to execute the stored procedure PREDICT_SPECIES_LENGTH to run the prediction using the trained model

 

 

 

 Of course, we can always use the wide range of destination task to load the prediction output in the SQL Server table or back to the data warehouse or other SSIS-supported destinations, such as:

-          ADO NET destination

-          Excel destination

-          Flat File destination

-          OLE DB destination

-          ODBC destination

-          Raw File destination

-          Recordset destination

-          SQL Server Destination

-          SQL Server Compact destination

-          DataReader destination

-          HDFS destination

-          Azure Blob destination

-          Data Streaming Destination

 

 

Operationalizing model retraining workflow

Now, we can even build a SSIS package that is run on a regular basis (e.g. once a day or week) to automatically trigger a model retraining workflow just like below:

 

 

After extracting the production data, we can create a script task to analyze both the production and training data and decide if the model retraining is necessary. We can add precedence constraint to the script task such that only Remodel Needed constraint is met then the R script is triggered to retrain the model.

 

Summary

Building an advanced analytics data pipeline with machine learning capabilities is just like building LEGO, you can build it with different patterns and possibilities. Now you can use SQL Server Integration Service along with the SQL Server R Services in SQL Server 2016 to operationalize your on-premises machine learning project with R, from data preparation, prediction execution, to model retraining.

Comments (0)

Skip to main content