Loan Classification using SQL Server 2016 R Services


R Services (In-database) provides a platform for developing and deploying intelligent applications that uncover new insights. You can use the rich and powerful R language and the many packages from the community to create models and generate predictions using your SQL Server data. Because R Services (In-database) integrates the R language with SQL Server, you can keep analytics close to the data and eliminate the costs and security risks associated with data movement.

In this post, we will see how we can leverage SQL Server 2016 as a scoring engine to predict “bad” loans. Loans with a good repayment record are considered “good”, and loans with a less-than-perfect repayment record are considered “bad”.

The Azure Data Science VM comes pre-installed with SQL Server 2016 Developer edition and can be used readily for our scoring experiment. We will use the publicly available dataset of Lending Club loan performance.

This experiment consists of six steps. We shall define a stored procedure for each step and execute them in order.

 

[Diagram: the six steps of the experiment: Data Preparation, Feature Selection, Data Partition, Modelling, Scoring, Model Evaluation]

 

DATA PREPARATION

 

Create a database, say ‘lendingclub’, in SQL Server:

CREATE DATABASE [lendingclub]

Download the loan data into a folder (say C:\lendingclub).

There are 6 csv files: LoanStats3a.csv, LoanStats3b.csv, LoanStats3c.csv, LoanStats3d.csv, LoanStats_2016Q1.csv, LoanStats_2016Q2.csv. Apply the following clean-up steps to the csv files:
1. Remove this line at the beginning of each file: “Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)”
2. Remove these two lines at the end of each file: “Total amount funded in policy code 1: ” and “Total amount funded in policy code 2: ”
3. Remove the three lines reading “Loans that do not meet the credit policy” in LoanStats3a.csv at line numbers 39788, 39789 and 39790.

Now, let us import the data from these 6 csv files into a single table [dbo].[LoanStats] in the lendingclub DB. There are many ways to import csv data into SQL Server (bcp, PowerShell, etc.), but we are going to use something quick and easy: sp_execute_external_script!
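Here is a minimal sketch of such a procedure. It is illustrative rather than the exact original code: the loopback connection string is an assumption, and the table [dbo].[LoanStats] is created by the first rxDataStep write.

CREATE PROCEDURE [dbo].[LoadData]
AS
BEGIN
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    files <- c("LoanStats3a.csv", "LoanStats3b.csv", "LoanStats3c.csv",
               "LoanStats3d.csv", "LoanStats_2016Q1.csv", "LoanStats_2016Q2.csv")
    # Loopback connection so R can write the rows back into the lendingclub DB (assumed)
    connStr <- "Driver=SQL Server;Server=.;Database=lendingclub;Trusted_Connection=true"
    loanStats <- RxSqlServerData(table = "LoanStats", connectionString = connStr)
    for (f in files) {
      csv <- read.csv(file.path("C:/lendingclub", f), stringsAsFactors = FALSE)
      # The first file creates the table; the remaining files append to it
      rxDataStep(inData = csv, outFile = loanStats,
                 append = if (f == files[1]) "none" else "rows")
    }
  '
END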

Executing the above stored procedure creates a table [dbo].[LoanStats] with the loan data:

EXEC [dbo].[LoadData]

Once the data is loaded, let us do some cleaning using SQL and also label loans as good/bad based on the value of the [loan_status] column.
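For example (a sketch: the percentage clean-up assumes the raw csv formatting, and the is_bad labelling follows the loan_status values discussed in the comments below):

-- int_rate and revol_util arrive as text such as "13.5%"; strip the % sign
UPDATE [dbo].[LoanStats] SET [int_rate] = REPLACE([int_rate], '%', '')
UPDATE [dbo].[LoanStats] SET [revol_util] = REPLACE([revol_util], '%', '')
GO

-- Label each loan: 1 = bad (late, defaulted or charged off), 0 = good
ALTER TABLE [dbo].[LoanStats] ADD [is_bad] int
GO
UPDATE [dbo].[LoanStats]
SET [is_bad] = (CASE WHEN [loan_status] IN ('Late (16-30 days)', 'Late (31-120 days)', 'Default', 'Charged Off') THEN 1 ELSE 0 END)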

 

FEATURE SELECTION

 

Before we run feature selection, let’s install the required R packages.

Right-click C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64\Rgui.exe, select Run as Administrator, and run the following R code to install the packages:

install.packages(c("reshape2","ggplot2","ROCR","plyr","Rcpp","stringr","stringi","magrittr","digest","gtable",
"proto","scales","munsell","colorspace","labeling","gplots","gtools","gdata","caTools","bitops"), 
lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")

There are many ways to perform feature (variable) selection. A quick and easy way is to use ggplot to plot, for each numeric variable, the distribution of the is_bad=0 (good) and is_bad=1 (bad) populations. From these plots, we can identify variables whose distributions differ significantly between the two populations.
Here is a stored procedure to plot the distributions.
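A minimal sketch of such a procedure (assumed, not necessarily the original code): it keeps the numeric columns, reshapes them to long format with reshape2, and draws one density line per is_bad class with ggplot2.

CREATE PROCEDURE [dbo].[PlotDistribution]
AS
BEGIN
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    library(reshape2)
    library(ggplot2)
    # Keep only the numeric columns, then reshape to long format
    numericCols <- InputDataSet[, sapply(InputDataSet, is.numeric)]
    melted <- melt(numericCols, id.vars = "is_bad")
    # One facet per variable, one density line per is_bad class
    p <- ggplot(melted, aes(x = value, colour = factor(is_bad))) +
      geom_line(stat = "density") +
      facet_wrap(~variable, scales = "free")
    ggsave(filename = "C:/temp/plots/distribution.jpeg", plot = p,
           width = 20, height = 20)
  ',
  @input_data_1 = N'SELECT * FROM [dbo].[LoanStats]'
END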

Executing the above stored procedure creates a JPEG image in C:\temp\plots

EXEC [dbo].[PlotDistribution]

[Figure: distribution of each numeric variable, plotted separately for factor(is_bad)=0 and factor(is_bad)=1]

 

Good predictors show significant differences between the distribution lines factor(is_bad)=0 (red) and factor(is_bad)=1 (blue); int_rate is one example. ID variables like id and member_id will also appear in the above image (since they are numeric as well), but they cannot be used as predictor variables, so ignore them. Let us choose some predictor variables by quickly analyzing the above image and use them in our random forest model. (NOTE: This is not the final/best list of predictors; you may choose other methods to find predictors as well.)

revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint, total_rec_prncp, all_util

 

DATA PARTITION

 

We shall use SQL to split the loan data into 75% training data ([dbo].[LoanStatsTrain]) and 25% test data ([dbo].[LoanStatsTest]). The training data will be used to build the model, and the resulting model will be applied to the test data for evaluation.
Here is a stored procedure to split the data.
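A sketch of the procedure (assumed implementation: the random-sampling trick follows the “Selecting Rows Randomly from a Large Table” reference below, and id is assumed to uniquely identify a loan):

CREATE PROCEDURE [dbo].[SplitLoans]
AS
BEGIN
-- Send a random ~75% of the loans to the training table ...
SELECT *
INTO [dbo].[LoanStatsTrain]
FROM [dbo].[LoanStats]
WHERE ABS(CHECKSUM(NEWID())) % 100 < 75

-- ... and the remaining ~25% to the test table
SELECT *
INTO [dbo].[LoanStatsTest]
FROM [dbo].[LoanStats]
WHERE [id] NOT IN (SELECT [id] FROM [dbo].[LoanStatsTrain])
END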

Executing the above stored procedure creates 2 tables [dbo].[LoanStatsTrain] and [dbo].[LoanStatsTest]

EXEC [dbo].[SplitLoans]

 

MODELLING

 

Build a random forest model using the variables we identified in Feature Selection and store the model in the table [dbo].[models] to be used later for scoring.
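A sketch of the procedure: the rxDForest formula is the one quoted in the comments below, while the serialization pattern and the single-column layout of [dbo].[models] are assumptions based on the usual R Services approach.

CREATE PROCEDURE [dbo].[BuildModel]
AS
BEGIN
IF OBJECT_ID('dbo.models') IS NULL
  CREATE TABLE [dbo].[models] ([model] varbinary(max))
DELETE FROM [dbo].[models]

INSERT INTO [dbo].[models]
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    # is_bad is numeric, so this builds a regression forest with continuous scores
    randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate + mths_since_last_record +
        annual_inc_joint + dti_joint + total_rec_prncp + all_util, InputDataSet)
    # Serialize the model so it fits in a varbinary(max) column
    trained_model <- data.frame(payload = as.raw(serialize(randomForestObj, connection = NULL)))
  ',
  @input_data_1 = N'SELECT * FROM [dbo].[LoanStatsTrain]',
  @output_data_1_name = N'trained_model'
END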

Executing the above stored procedure builds a random forest model and stores it in [dbo].[models]

EXEC [dbo].[BuildModel]
SELECT * FROM [dbo].[models]

[Screenshot: the serialized model row stored in [dbo].[models]]

 

SCORING

 

Next, a stored procedure that uses the model in [dbo].[models] to score the test data in [dbo].[LoanStatsTest].
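A sketch (assumed details: the predictions table keeps only id, the actual label and the prediction; the original may carry more columns):

CREATE PROCEDURE [dbo].[ScoreLoans]
AS
BEGIN
DECLARE @model varbinary(max) = (SELECT TOP 1 [model] FROM [dbo].[models])

IF OBJECT_ID('dbo.LoanStatsPredictions') IS NOT NULL
  DROP TABLE [dbo].[LoanStatsPredictions]
CREATE TABLE [dbo].[LoanStatsPredictions] ([id] int, [is_bad] int, [is_bad_Pred] float)

INSERT INTO [dbo].[LoanStatsPredictions]
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    randomForestObj <- unserialize(as.raw(model))
    # rxPredict returns a data frame with an is_bad_Pred score column
    scores <- rxPredict(randomForestObj, InputDataSet)
    OutputDataSet <- cbind(InputDataSet[, c("id", "is_bad")], scores)
  ',
  @input_data_1 = N'SELECT * FROM [dbo].[LoanStatsTest]',
  @params = N'@model varbinary(max)',
  @model = @model
END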

Executing the above stored procedure creates a table [dbo].[LoanStatsPredictions] which contains the predictions in the is_bad_Pred column:

EXEC [dbo].[ScoreLoans]
SELECT TOP 10 * FROM [dbo].[LoanStatsPredictions]

[Screenshot: top 10 rows of [dbo].[LoanStatsPredictions], including the is_bad_Pred column]

 

MODEL EVALUATION

 

Now let us plot an ROC curve to evaluate the model and calculate the area under the ROC curve (AUC).
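A sketch of the evaluation procedure (assumed, built on the ROCR package installed earlier):

CREATE PROCEDURE [dbo].[PlotROCCurve]
AS
BEGIN
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    library(ROCR)
    pred <- prediction(InputDataSet$is_bad_Pred, InputDataSet$is_bad)
    perf <- performance(pred, "tpr", "fpr")
    auc <- performance(pred, "auc")@y.values[[1]]
    # Save the ROC curve and return the AUC as a result set
    jpeg(filename = "C:/temp/plots/rocCurve.jpeg")
    plot(perf, main = "ROC Curve")
    abline(a = 0, b = 1, lty = 2)
    dev.off()
    OutputDataSet <- data.frame(AUC = auc)
  ',
  @input_data_1 = N'SELECT is_bad, is_bad_Pred FROM [dbo].[LoanStatsPredictions]'
  WITH RESULT SETS (([AUC] float))
END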

Executing the above stored procedure saves the ROC curve in C:\temp\plots and outputs the area under the curve (AUC):

EXEC [dbo].[PlotROCCurve]

[Figure: ROC curve for the loan classification model]

 

NOTE: We can improve scoring performance (number of predictions per second) using in-memory tables, columnstore indexes and Resource Governor configurations. These steps are detailed, with GitHub code, in this blog post: A walkthrough of Loan Classification using SQL Server 2016 R Services

REFERENCES

 

Lending Club Statistics

Machine Learning for Predicting Bad Loans

Variable Importance Plot and Variable Selection

Machine Learning Templates with SQL Server 2016 R Services

SQL Server R Services Tutorials

Provision the Microsoft Data Science Virtual Machine

sp_execute_external_script (Transact-SQL)

Explore and Visualize the Data (In-Database Advanced Analytics Tutorial)

Selecting Rows Randomly from a Large Table

Receiver operating characteristic

Area under the curve

The Area Under an ROC Curve

 

Comments (22)

  1. Harsh says:

    EXEC [dbo].[LoadData] generates the following error:

    Msg 8114, Level 16, State 1, Line 0
    Error converting data type nvarchar(max) to int.

    1. Hi Harsh,
      In the csv files, please do the following and then try executing [dbo].[LoadData] :
      1. Remove this line at the beginning of file : “Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)”
      2. Remove these lines at the end of file : “Total amount funded in policy code 1: 1443412975
      Total amount funded in policy code 2: 511988838”

      1. Harsh says:

        Thanks Ramkumar, already did that, still getting the same error.

        1. LoanStats3a.csv has 3 unwanted lines at line number 39788, 39789, 39790 – “Loans that do not meet the credit policy”. Remove these 3 lines and try again.

          1. Harsh says:

            Thanks Ramkumar. With that fix I was able to load the data successfully into the DB. However, now I am getting the following error when running the Feature Selection code:

            Msg 39004, Level 16, State 19, Line 34
            A ‘R’ script error occurred during execution of ‘sp_execute_external_script’ with HRESULT 0x80004004.
            Msg 39019, Level 16, State 1, Line 34
            An external script error occurred:
            Error in eval(expr, envir, enclos) : bad allocation
            Calls: source -> withVisible -> eval -> eval -> .Call
            Execution halted

          2. This might be a memory issue – look at the memory usage while running feature selection.
            Also, you may try running the feature selection with fewer rows (say 1000) – modify the input query: @input_data_1 = N'SELECT TOP 1000 * FROM [dbo].[LoanStats]'

          3. Thanks Ramkumar. I was able to walk through the demo with your help. I would like to call out one thing though. The list of packages to be installed in the article is not complete. The complete list should be:

            install.packages(c("reshape2","ggplot2","ROCR","plyr","Rcpp","stringr","stringi","magrittr","digest","gtable","proto","scales","munsell","colorspace","labeling","gplots","gtools","gdata","caTools","bitops"), lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")

  2. I hope you will update the article to include all these correctional steps that you mentioned in the comments. Thanks.

  3. Glad that you were able to run the demo, Harshdeep! I have updated the article to include the csv clean-up steps as well as the updated package list.

    1. Akash Singhal says:

      Hi Harsh,

      I am trying a scenario where I need to run an R script on 1M rows of data. I am trying it with your method, using rxDTree instead of rxDForest because it takes too much time to build the model even on 100,000 rows. Also, both algorithms give an error when I try to execute on the 1M rows. The error is:

      A ‘R’ script error occurred during execution of ‘sp_execute_external_script’ with HRESULT 0x80004004.

      I read that comment where you said it could be a memory issue, but in my case memory went up to 39 GB max before giving this error.

      I have 52 GB RAM on my machine. I know that RevoScaleR can handle very large datasets in SQL Server. I am very excited to see its performance.

      I really want to go in more detail about this if you can help me, really appreciate it.

      Thanks,
      Akash

      1. By default, the R runtime processes associated with R Services (In-database) are limited to using no more than 20% of total machine memory. You will need to define an external resource pool and allocate more memory for R:
        https://msdn.microsoft.com/en-us/library/mt703708.aspx
        https://msdn.microsoft.com/en-us/library/mt703706.aspx
        https://msdn.microsoft.com/en-us/library/mt694065.aspx
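
        For example, a minimal change along these lines (illustrative only; see the linked docs for the full configuration):

        ALTER EXTERNAL RESOURCE POOL "default" WITH (MAX_MEMORY_PERCENT = 40)
        ALTER RESOURCE GOVERNOR RECONFIGURE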

        1. Akash Singhal says:

          @Ramkumar
          I tried to increase the memory available to R scripts from 20% to 80%, but it is still not running on 1M rows. I tried an external resource pool but it did not help much.
          Is there anything you can suggest?
          Thanks

  4. Amir says:

    Hi guys,

    Really good article and work.
    I am curious to know how you converted the classification model output (categorical output: is_bad, is_not_bad) into continuous scoring values?

    1. Thanks Amir.
      rxPredict() has a parameter called “type”, the default value for this parameter is “response” which means rxPredict() returns a vector of predicted values for a regression forest and predicted classes (with majority vote) for a classification forest. In this article, I have built a regression forest using rxDForest() by using a numeric column called is_bad in the formula. Hence rxPredict() returns a vector of predicted values.

      Now, if I generate the same is_bad column with factors YES/NO, I would get a classification forest and thus rxPredict would output predicted classes (categories).

      ALTER TABLE [dbo].[LoanStats] ADD [is_bad] varchar(5)
      UPDATE [dbo].[LoanStats]
      SET [is_bad] = (CASE WHEN loan_status IN ('Late (16-30 days)', 'Late (31-120 days)', 'Default', 'Charged Off') THEN 'YES' ELSE 'NO' END)

      So, the output of rxPredict depends on the forest that you have built(regression/classification) which depends on the column type that you have used in the formula.
      Hope this answers your question.

  5. Giacomo says:

    Hi, I’m trying to perform the query referred to the Modelling step:

    EXEC sp_execute_external_script
    @language = N’R’,
    @script = N’
    randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate + mths_since_last_record + annual_inc_joint + dti_joint + total_rec_prncp + all_util, InputDataSet)
    model withVisible -> eval -> eval -> .Call

    Could anyone help me to solve this issue? What does actually this error message mean? Thanks

    1. Giacomo says:

      I tried to execute it twice, but after more or less 24 mins of processing, I received the following error message:

      Msg 39004, Level 16, State 20, Line 1
      A ‘R’ script error occurred during execution of ‘sp_execute_external_script’ with HRESULT 0x80004005.
      STDERR message(s) from external script:
      Error in eval(expr, envir, enclos) : no message recieved from pipe
      Calls: source -> withVisible -> eval -> eval -> .Call

  6. Giacomo says:

    Sorry for the previous comment, maybe I had an error with cut&paste.

    This is my query:
    EXEC sp_execute_external_script
    @language = N’R’,
    @script = N’
    randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate + mths_since_last_record + annual_inc_joint + dti_joint + total_rec_prncp + all_util, InputDataSet)
    model withVisible -> eval -> eval -> .Call

    1. Hi Giacomo, I believe you are hitting memory issues since it ran for 24 minutes. Can you try building the model with fewer rows and see if it works? Just select the top 1000 rows from the training data: DECLARE @inquery nvarchar(max) = N'SELECT TOP 1000 * FROM [dbo].[LoanStatsTrain]'

      1. Giacomo says:

        Hi Ramkumar,
        thanks for your reply, it has been very helpful.

        I think I have a similar problem with Microsoft R Client, but I’m not sure.
        I’m trying to execute the R code of the modelling step in Microsoft R Client.
        My machine has 32 GB of RAM, which was sufficient for the computation with SQL Server 2016 R Services.
        When processing with Microsoft R Client I receive the error: “Dimensions are too big: 451720, 35186”.

        Is there a difference, in terms of memory usage, between the computation with SQL Server 2016 R Services and Microsoft R Client? Maybe I’m missing something? Thanks

        1. Hi Giacomo,

          As per our documentation at https://msdn.microsoft.com/en-us/microsoft-r/r-client-get-started
          R Client allows you to work with production data locally using the full set of ScaleR functions, but there are some constraints. On its own, the data to be processed must fit in local memory, and processing is capped at two threads for RevoScaleR functions.

          – Prabhat
