Loan Classification using SQL Server 2016 R Services


R Services (In-database) provides a platform for developing and deploying intelligent applications that uncover new insights. You can use the rich and powerful R language and the many packages from the community to create models and generate predictions using your SQL Server data. Because R Services (In-database) integrates the R language with SQL Server, you can keep analytics close to the data and eliminate the costs and security risks associated with data movement.

In this post, we will see how we can leverage SQL Server 2016 as a Scoring Engine to predict "bad" loans. Loans that indicate good repayment behavior are considered "good" and loans that indicate less than perfect repayment behavior are considered "bad".

The Azure Data Science VM comes pre-installed with SQL Server 2016 Developer edition and can readily be used for our scoring experiment. We will use the publicly available Lending Club loan performance dataset.

This experiment consists of 6 steps. We shall define stored procedures in SQL for each of these steps and execute them in order.

 

[Figure: the six steps of the experiment – Data Preparation, Feature Selection, Data Partition, Modelling, Scoring, Model Evaluation]

 

DATA PREPARATION

 

Create a DB, say 'lendingclub', in SQL Server:

CREATE DATABASE [lendingclub]

Download the loan data into a folder (say C:\lendingclub)

There are 6 csv files: LoanStats3a.csv, LoanStats3b.csv, LoanStats3c.csv, LoanStats3d.csv, LoanStats_2016Q1.csv, LoanStats_2016Q2.csv. Apply the following clean-up steps to the csv files:
1. Remove this line at the beginning of each file: "Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)"
2. Remove the two summary lines at the end of each file, beginning with "Total amount funded in policy code 1:" and "Total amount funded in policy code 2:"
3. Remove the three lines containing "Loans that do not meet the credit policy" in LoanStats3a.csv (line numbers 39788, 39789, 39790).

Now, let us import the data from these 6 csv files into a single table [dbo].[LoanStats] in the lendingclub DB. There are many ways to import csv data into SQL Server (bcp, PowerShell, etc.), but we are going to use something quick and easy: sp_execute_external_script!
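Below is a minimal sketch of what [dbo].[LoadData] can look like. The original procedure also contains the full CREATE TABLE for [dbo].[LoanStats] (one column per csv field), which is elided here, so this sketch assumes the table already exists and that the cleaned csv files share the same header:

CREATE PROCEDURE [dbo].[LoadData]
AS
BEGIN
    INSERT INTO [dbo].[LoanStats]
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            # Read the 6 csv files from the download folder and stack them
            # into a single data frame that is returned to SQL Server.
            files <- list.files("C:/lendingclub", pattern = "\\.csv$", full.names = TRUE)
            OutputDataSet <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
        ';
END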

Executing the above stored procedure creates a table [dbo].[LoanStats] with the loan data:

EXEC [dbo].[LoadData]

Once the data is loaded, let us do some cleaning using SQL and also identify good/bad loans based on the value of the column [loan_status]:
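A minimal sketch of this step; the is_bad definition below uses the loan_status values quoted later in the comments on this post (1 marks a bad loan):

-- add a numeric label column: 1 = bad loan, 0 = good loan
ALTER TABLE [dbo].[LoanStats] ADD [is_bad] [int] NULL
GO
UPDATE [dbo].[LoanStats]
SET [is_bad] = CASE WHEN loan_status IN ('Late (16-30 days)', 'Late (31-120 days)',
                                         'Default', 'Charged Off')
                    THEN 1 ELSE 0 END
GO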

 

FEATURE SELECTION

 

Before we run feature selection, let's install the required R packages:

Right-click C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64\Rgui.exe and select Run as Administrator, then run the following R code to install the packages:

install.packages(c("reshape2","ggplot2","ROCR","plyr","Rcpp","stringr","stringi","magrittr","digest","gtable",
"proto","scales","munsell","colorspace","labeling","gplots","gtools","gdata","caTools","bitops"), 
lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")

There are many ways to perform feature selection/variable selection. A quick and easy way is to use ggplot to plot the distribution of each numeric variable, split by the is_bad variable. From this plot, we can identify variables which show significant differences between the is_bad = 0 (good) and is_bad = 1 (bad) populations.
Stored procedure to plot the distributions:
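The procedure is sketched below under some assumptions: it uses reshape2 to melt the numeric columns into long format and ggplot2 to draw one density panel per variable. The jpeg() parameters are the ones quoted in the comments on this post; the rest is an assumed implementation.

CREATE PROCEDURE [dbo].[PlotDistribution]
AS
BEGIN
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            library(reshape2)
            library(ggplot2)
            # keep is_bad plus every numeric column, then reshape to long format
            numericCols <- sapply(InputDataSet, is.numeric)
            longData <- melt(InputDataSet[, numericCols], id.vars = "is_bad")
            # one density panel per variable, coloured by good (0) / bad (1)
            p <- ggplot(longData, aes(x = value, colour = factor(is_bad))) +
                geom_density() +
                facet_wrap(~ variable, scales = "free")
            # assumes the folder C:\temp\plots already exists
            dest_filename <- tempfile(pattern = "ggplot_", tmpdir = "C:\\temp\\plots", fileext = ".jpg")
            print("Creating output plot files:")
            print(dest_filename)
            jpeg(filename = dest_filename, height = 3900, width = 6400, res = 300)
            print(p)
            dev.off()
        ',
        @input_data_1 = N'SELECT * FROM [dbo].[LoanStats]';
END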

Executing the above stored procedure creates a JPEG image in C:\temp\plots

EXEC [dbo].[PlotDistribution]

[Figure: density plot of each numeric variable, split by factor(is_bad)]

 

Good predictors have significant differences between the distribution lines factor(is_bad)=0 (red) and factor(is_bad)=1 (blue); int_rate is a good example. ID variables like id and member_id will also appear in the above image (since they are numeric as well), but they cannot be used as predictor variables, so ignore them. Let us choose some predictor variables by quickly analyzing the above image and use them in our Random Forest model. (NOTE: This is not the final/best list of predictors; you may choose other methods to find predictors as well.)

revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint, total_rec_prncp, all_util

 

DATA PARTITION

 

We shall use SQL to split the loan data into 75% training data ([dbo].[LoanStatsTrain]) and 25% test data ([dbo].[LoanStatsTest]). The training data will be used to build the model, and the built model will be applied to the test data for evaluation.
Stored procedure to split the data:
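A minimal sketch of [dbo].[SplitLoans], using the checksum-based random sampling technique from the references below (the exact split logic of the original may differ):

CREATE PROCEDURE [dbo].[SplitLoans]
AS
BEGIN
    -- roughly 75% of the rows, picked pseudo-randomly, go to the training table
    SELECT * INTO [dbo].[LoanStatsTrain]
    FROM [dbo].[LoanStats]
    WHERE ABS(CAST(BINARY_CHECKSUM(id, NEWID()) AS BIGINT)) % 100 < 75;

    -- the remaining rows go to the test table
    SELECT * INTO [dbo].[LoanStatsTest]
    FROM [dbo].[LoanStats]
    WHERE id NOT IN (SELECT id FROM [dbo].[LoanStatsTrain]);
END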

Executing the above stored procedure creates 2 tables [dbo].[LoanStatsTrain] and [dbo].[LoanStatsTest]

EXEC [dbo].[SplitLoans]

 

MODELLING

 

Build a Random Forest model using the variables we identified in Feature Selection and store the model in a table [dbo].[models] to be used later for scoring.
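A sketch of [dbo].[BuildModel]; the rxDForest() formula is the one quoted in the comments below, while the serialization plumbing and the [dbo].[models] table (a single [model] varbinary(max) column) are assumptions:

CREATE PROCEDURE [dbo].[BuildModel]
AS
BEGIN
    DECLARE @model varbinary(max);
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            # is_bad is numeric (0/1), so this builds a regression forest and
            # rxPredict will later return continuous scores rather than classes
            randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate +
                mths_since_last_record + annual_inc_joint + dti_joint +
                total_rec_prncp + all_util, InputDataSet)
            model <- as.raw(serialize(randomForestObj, connection = NULL))
        ',
        @input_data_1 = N'SELECT * FROM [dbo].[LoanStatsTrain]',
        @params = N'@model varbinary(max) OUTPUT',
        @model = @model OUTPUT;

    DELETE FROM [dbo].[models];
    INSERT INTO [dbo].[models] ([model]) VALUES (@model);
END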

Executing the above stored procedure builds a random forest model and stores it in [dbo].[models]

EXEC [dbo].[BuildModel]
SELECT * FROM [dbo].[models]

[Result: one row in [dbo].[models] containing the serialized random forest model as varbinary(max)]

 

SCORING

 

Stored procedure to use the model in [dbo].[models] and score the test data [dbo].[LoanStatsTest]:
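The inner R script below is the one shared later in the comments on this post; the surrounding plumbing (loading the model, inserting into a pre-created [dbo].[LoanStatsPredictions] table whose columns match the rxPredict output) is an assumed sketch:

CREATE PROCEDURE [dbo].[ScoreLoans]
AS
BEGIN
    DECLARE @model varbinary(max) = (SELECT TOP 1 [model] FROM [dbo].[models]);
    DECLARE @inquery nvarchar(max) = N'SELECT * FROM [dbo].[LoanStatsTest]';

    INSERT INTO [dbo].[LoanStatsPredictions]
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            rfModel <- unserialize(as.raw(model));
            # is_bad_Pred holds the predicted score; id is carried through for joins
            OutputDataSet <- rxPredict(rfModel, data = InputDataSet, extraVarsToWrite = c("id"))
        ',
        @input_data_1 = @inquery,
        @params = N'@model varbinary(max)',
        @model = @model;
END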

Executing the above stored procedure creates a table [dbo].[LoanStatsPredictions] which contains the predictions in the is_bad_Pred column:

EXEC [dbo].[ScoreLoans]
SELECT TOP 10 * FROM [dbo].[LoanStatsPredictions]

[Result: the top 10 rows of [dbo].[LoanStatsPredictions], showing the id and is_bad_Pred columns]

 

MODEL EVALUATION

 

Now let us plot an ROC curve to evaluate the model. We will also calculate the area under the ROC curve (AUC).
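A sketch of [dbo].[PlotROCCurve] using the ROCR package installed earlier; the join of predictions to actual labels and the output shape are assumptions:

CREATE PROCEDURE [dbo].[PlotROCCurve]
AS
BEGIN
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            library(ROCR)
            # predicted scores vs. actual labels
            pred <- prediction(InputDataSet$is_bad_Pred, InputDataSet$is_bad)
            perf <- performance(pred, "tpr", "fpr")
            auc <- performance(pred, "auc")@y.values[[1]]
            dest_filename <- tempfile(pattern = "rocCurve_", tmpdir = "C:\\temp\\plots", fileext = ".jpg")
            jpeg(filename = dest_filename)
            plot(perf, main = paste("ROC Curve, AUC =", round(auc, 4)))
            abline(0, 1, lty = 2)
            dev.off()
            OutputDataSet <- data.frame(AUC = auc)
        ',
        @input_data_1 = N'
            SELECT P.is_bad_Pred, L.is_bad
            FROM [dbo].[LoanStatsPredictions] P
            JOIN [dbo].[LoanStatsTest] L ON P.id = L.id'
    WITH RESULT SETS ((AUC float));
END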

Executing the above stored procedure saves the ROC curve in C:\temp\plots and outputs the area under the curve (AUC) measure:

EXEC [dbo].[PlotROCCurve]

[Figure: ROC curve for the model]

 

NOTE: We can improve scoring performance (number of predictions per second) using in-memory tables, columnstore indexes, and Resource Governor configurations. These steps are detailed, with code on GitHub, in this blog post: A walkthrough of Loan Classification using SQL Server 2016 R Services

REFERENCES

 

Lending Club Statistics

Machine Learning for Predicting Bad Loans

Variable Importance Plot and Variable Selection

Machine Learning Templates with SQL Server 2016 R Services

SQL Server R Services Tutorials

Provision the Microsoft Data Science Virtual Machine

sp_execute_external_script (Transact-SQL)

Explore and Visualize the Data (In-Database Advanced Analytics Tutorial)

Selecting Rows Randomly from a Large Table

Receiver operating characteristic

Area under the curve

The Area Under an ROC Curve

 

Comments (32)

  1. Harsh says:

    EXEC [dbo].[LoadData] generates the following error:

    Msg 8114, Level 16, State 1, Line 0
    Error converting data type nvarchar(max) to int.

    1. Hi Harsh,
      In the csv files, please do the following and then try executing [dbo].[LoadData] :
      1. Remove this line at the beginning of file : "Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)"
      2. Remove these lines at the end of file : "Total amount funded in policy code 1: 1443412975
      Total amount funded in policy code 2: 511988838"

      1. Harsh says:

        Thanks Ramkumar, already did that, still getting the same error.

        1. LoanStats3a.csv has 3 unwanted lines at line numbers 39788, 39789, 39790 - "Loans that do not meet the credit policy". Remove these 3 lines and try again.

          1. Harsh says:

            Thanks Ramkumar. With that fix I was able to load the data successfully in the DB. However, now I am getting the following error when running the Feature Selection code:

            Msg 39004, Level 16, State 19, Line 34
            A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
            Msg 39019, Level 16, State 1, Line 34
            An external script error occurred:
            Error in eval(expr, envir, enclos) : bad allocation
            Calls: source -> withVisible -> eval -> eval -> .Call
            Execution halted

          2. This might be a memory issue – look at the memory usage while running feature selection.
            Also, you may try running the feature selection with fewer rows (say, 1000) – modify the @input_data_1 = N'SELECT TOP 1000 * FROM [dbo].[LoanStats]'

          3. Thanks Ramkumar. I was able to walk through the demo with your help. I would like to call out one thing though. The list of packages to be installed in the article is not complete. The complete list should be:

            install.packages(c("reshape2","ggplot2","ROCR","plyr","Rcpp","stringr","stringi","magrittr","digest","gtable","proto","scales","munsell","colorspace","labeling","gplots","gtools","gdata","caTools","bitops"), lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")

  2. I hope you will update the article to include all the corrective steps that you mentioned in the comments. Thanks.

  3. Glad that you were able to run the demo, Harshdeep! I have updated the article to include the csv clean-up steps as well as the updated package list.

    1. Akash Singhal says:

      Hi Harsh,

      I am trying a scenario where I need to run an R script on 1 M rows. I am trying it with your method; I am using rxDTree instead of rxDForest because it takes too much time to build the model even on 100,000 rows. Also, both algorithms give an error when I try to execute on 1 M rows. The error is:

      A ‘R’ script error occurred during execution of ‘sp_execute_external_script’ with HRESULT 0x80004004.

      I read the comment where you said it could be a memory issue, but in my case memory went up to 39 GB max before giving this error.

      I have 52 GB RAM on my machine. I know that RevoScaleR can handle very large datasets in SQL Server. I am very excited to see its performance.

      I really want to go in more detail about this if you can help me, really appreciate it.

      Thanks,
      Akash

      1. By default, the R runtime processes associated with R Services (In-database) are limited to using no more than 20% of total machine memory. You will need to define an external resource pool and allocate more memory for R :
        https://msdn.microsoft.com/en-us/library/mt703708.aspx
        https://msdn.microsoft.com/en-us/library/mt703706.aspx
        https://msdn.microsoft.com/en-us/library/mt694065.aspx
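        For example, a minimal sketch that raises the memory cap on the default external resource pool (see the linked docs for the full set of options):

        -- R runtime processes run in the "default" external resource pool;
        -- raise its memory cap from the default 20% to 50% of machine memory.
        ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 50);
        ALTER RESOURCE GOVERNOR RECONFIGURE;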

        1. Akash Singhal says:

          @Ramkumar
          I tried to increase the memory used by R scripts from 20% to 80%, but it is still not running on 1 M rows. I tried with an external resource pool but it did not help much.
          Is there anything you can suggest?
          Thanks

  4. Amir says:

    Hi guys,

    Really good article and work.
    I am curious to know how you converted the classification model output (categorical output: is_bad, is_not_bad) into the scoring output (continuous values, I mean)?

    1. Thanks Amir.
      rxPredict() has a parameter called "type", the default value for this parameter is "response" which means rxPredict() returns a vector of predicted values for a regression forest and predicted classes (with majority vote) for a classification forest. In this article, I have built a regression forest using rxDForest() by using a numeric column called is_bad in the formula. Hence rxPredict() returns a vector of predicted values.

      Now if I generate the same is_bad column with factors YES/NO – I would get a classification forest and thus rxPredict would output predicted classes (categories).

      ALTER TABLE [dbo].[LoanStats] ADD [is_bad] varchar(5)
      UPDATE [dbo].[LoanStats]
      SET [is_bad] = (CASE WHEN loan_status IN ('Late (16-30 days)', 'Late (31-120 days)', 'Default', 'Charged Off') THEN 'YES' ELSE 'NO' END)

      So, the output of rxPredict depends on the forest that you have built (regression/classification), which depends on the type of the column that you have used in the formula.
      Hope this answers your question.

  5. Giacomo says:

    Hi, I'm trying to execute the query from the Modelling step:

    EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
    randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate + mths_since_last_record + annual_inc_joint + dti_joint + total_rec_prncp + all_util, InputDataSet)
    model withVisible -> eval -> eval -> .Call

    Could anyone help me to solve this issue? What does actually this error message mean? Thanks

    1. Giacomo says:

      I tried to execute it twice, but after more or less 24 mins of processing, I received the following error message:

      Msg 39004, Level 16, State 20, Line 1
      A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004005.
      STDERR message(s) from external script:
      Error in eval(expr, envir, enclos) : no message recieved from pipe
      Calls: source -> withVisible -> eval -> eval -> .Call

  6. Giacomo says:

    Sorry for the previous comment, maybe I had an error with cut&paste.

    This is my query:
    EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
    randomForestObj <- rxDForest(is_bad ~ revol_util + int_rate + mths_since_last_record + annual_inc_joint + dti_joint + total_rec_prncp + all_util, InputDataSet)
    model withVisible -> eval -> eval -> .Call

    1. Hi Giacomo, I believe you are hitting memory issues since it ran for 24 minutes. Can you try to build the model with a smaller number of rows and see if it works? Just select the top 1000 rows from the training data: DECLARE @inquery nvarchar(max) = N'SELECT TOP 1000 * FROM [dbo].[LoanStatsTrain]'

      1. Giacomo says:

        Hi Ramkumar,
        thanks for your reply, it has been very helpful.

        I think I have a similar problem with Microsoft R Client, but I'm not sure.
        I'm trying to execute the R code of the modelling step in Microsoft R Client.
        My machine has 32 GB of RAM, which has been sufficient for the computation with SQL Server 2016 R Services.
        When processing with Microsoft R Client I receive the error: "Dimensions are too big: 451720, 35186".

        Is there a difference, in terms of memory usage, between the computation with SQL Server 2016 R Services and Microsoft R Client? Maybe I'm missing something? Thanks

        1. Hi Giacomo,

          As per our documentation at https://msdn.microsoft.com/en-us/microsoft-r/r-client-get-started
          R Client allows you to work with production data locally using the full set of ScaleR functions, but there are some constraints. On its own, the data to be processed must fit in local memory, and processing is capped at two threads for RevoScaleR functions.

          - Prabhat

  7. JacqueC says:

    Hello - Thank you for such a great tutorial; for what has worked thus far - I have really enjoyed it. I appreciate that it uses both a large data set and that the data is relatively current. Everything worked for me until the Scoring step and then I received the following error:

    "Warning: In DataSet::DeleteColsInPlace, attempting to select a column that does not exist: .rxis_bad_Pred.
    Warning: In DataSet::DeleteColsInPlace, attempting to select a column that does not exist: is_bad_Sum."

    Spent some time searching the Net for possible solutions and found nothing really helpful.

    What do you think is my issue?

    Thank you,
    JacqueC

    1. Hi JacqueC, in the scoring step, make sure the column names of the prediction results match the column names of [dbo].[LoanStatsPredictions]. Here is the code to print the column names of the prediction results; can you execute the following code and share the output:

      EXEC sp_execute_external_script
      @language = N'R',
      @script = N'
      rfModel <- unserialize(as.raw(model));
      OutputDataSet<-rxPredict(rfModel, data = InputDataSet, extraVarsToWrite = c("id"))
      rxGetVarInfo(OutputDataSet)
      ',
      @input_data_1 = @inquery,
      @params = N'@model varbinary(max)',
      @model = @model

      1. JacqueC says:

        Hello Ramkumar,

        Thank you for your response! I started the “loan classification” tutorial from the beginning (by the way, I am using an Azure Virtual Machine, Standard DS4 v2 (8 cores, 28 GB memory)). The first error I experienced was with a memory issue when attempting to create the ggplot distributions. So I re-ran the selection with the TOP 1000 rows. This addressed that issue.

        Unfortunately, when I ran the code you posted in your reply – I received the exact same error:

        ---
        STDOUT message(s) from external script:
        Rows Read: 0, Total Rows Processed: 0

        Warning: In DataSet::DeleteColsInPlace, attempting to select a column that does not exist: .rxis_bad_Pred.
        Warning: In DataSet::DeleteColsInPlace, attempting to select a column that does not exist: is_bad_Sum.

        , Total Chunk Time: 0.080 seconds
        (0 row(s) affected)
        ---

        Thus I have no output to share with you at this time. I’ll keep working on this though – I really like what you have shared with us here and I truly appreciate that you respond to the posted comments/questions. In many ways the comments/replies are as informative as the details shared in the tutorial.

        ~JacqueC

  8. Ravi says:

    Hi Ram,
    Thank you for this great article.
    When I am trying to execute PlotDistribution stored proc
    I am getting following error, Kindly help

    Msg 39004, Level 16, State 20, Line 69
    A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
    Msg 39019, Level 16, State 1, Line 69
    An external script error occurred:
    Error: cannot allocate vector of size 1024.0 Mb

    Error in ScaleR. Check the output for more information.
    Error in eval(expr, envir, enclos) :
    Error in ScaleR. Check the output for more information.
    Calls: source -> withVisible -> eval -> eval -> .Call
    Execution halted
    STDOUT message(s) from external script:
    [1] Creating output plot files:
    [1] C:\\temp\\plots\\ggplot_599c2a245b7e.jpg

    1. Hi Ravi,
      Use the following guide : https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/server-memory-server-configuration-options
      to provide more memory for R in SQL Server.

      Other thing you can try is to reduce the height, width, resolution of image by changing the parameters in jpeg function call :
      jpeg(filename=dest_filename, height=3900, width = 6400, res=300);

      1. Ravi says:

        Thanks for your reply. The server has 54 GB memory and I have allocated 40 GB memory for SQL Server. Unfortunately, the same error "An external script error occurred: Error: cannot allocate vector of size 1024.0 Mb" is reappearing. As per your suggestion, I reduced the image resolution. I would greatly appreciate any inputs toward a resolution.

      2. Ravi says:

        Bingo. I got it working. I followed the steps to create a resource pool for R using the following URL:
        https://docs.microsoft.com/en-us/sql/advanced-analytics/r/how-to-create-a-resource-pool-for-r
        Thank you

  9. The data from Lending Club has been updated with 11 new columns since this blog post was published. The new data can be imported by adding the following columns to the table definition:

    [revol_bal_joint] [float] NULL,
    [sec_app_earliest_cr_line] [nvarchar](max) NULL,
    [sec_app_inq_last_6mths] [int] NULL,
    [sec_app_mort_acc] [int] NULL,
    [sec_app_open_acc] [int] NULL,
    [sec_app_revol_util] [float] NULL,
    [sec_app_open_il_6m] [int] NULL,
    [sec_app_num_rev_accts] [int] NULL,
    [sec_app_chargeoff_within_12_mths] [int] NULL,
    [sec_app_collections_12_mths_ex_med] [int] NULL,
    [sec_app_mths_since_last_major_derog] [int] NULL

  10. Javier Recasens says:

    Ramkumar, thank you so much for your example. I did something similar: I was able to create an .XDF file from all the .CSV files and now I'm trying to use "rxDataStep" to store it in SQL Server. I'm getting a "[Microsoft][ODBC SQL Server Driver]COUNT field incorrect or syntax error". I wonder if something similar has happened to you? Regards
