*This post is authored by Surendra Tipparaju and Durga Prasad Chappidi at Microsoft*

Credit risk prediction is one of the most common modeling problems, and yet one of the most revisited. The quality of the risk assessment depends on the dataset and on the number of features that can be included in the model. We implemented an initial model a few months ago in open source R with good results. In our recent exploration of Microsoft R, we revisited the model using the ScaleR functions (the rxXXX functions), with a bigger data set and faster results. The goal of this blog is to share both implementations and to reinforce how simple and effective it is to develop in Microsoft R with very little effort. In the following sections we present the code samples for open source R and Microsoft R, in that order, so you can see the differences. At the end you can see the model evaluation results for both implementations.

**Credit Risk (Open source R)** - *Load required libraries*: library(e1071), library(ROCR), library(car)

**Credit Risk (Microsoft R)** - *Load required libraries*: library(dplyrXdf), library(RevoScaleR)

The next step is loading the data and performing a descriptive analysis of it.

**Credit Risk (Open source R)**

*str(creditData)*: Displays the structure of an R object. From this we can identify the type of every variable, along with a preview of its values.

*summary(creditData)*: Gives a brief summary of the data, including the minimum, maximum, mean, median, quartiles, and NA count of each variable.

*boxplot(creditData)*: The box plot gives a graphical view of each variable's median and spread, which makes outliers easy to identify.

boxplot(creditData, las = 2, ylab = "Value", main = "Box Plot for credit Risk Data")

*Correlation*: Correlation measures the pairwise linear relationship between the variables in the dataset.

cor(creditData, method = "pearson")
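The descriptive steps above can be tried end to end on a small synthetic data frame. This is a minimal sketch, not the authors' script: the column names are borrowed from the normalization code in this post, and the values are randomly generated, so the numbers themselves carry no meaning.

```r
# Synthetic stand-in for creditData (assumption: made-up values, illustration only)
set.seed(42)
creditData <- data.frame(
  Account_Balance = sample(1:4, 200, replace = TRUE),
  Duration_of_Credit_month = sample(4:72, 200, replace = TRUE),
  Payment_Status_of_Previous_Credit = sample(0:4, 200, replace = TRUE)
)

# Variable types and a preview of the values
str(creditData)

# Min, quartiles, median, mean, and max per variable
print(summary(creditData))

# Pearson correlation matrix between all pairs of variables
corMatrix <- cor(creditData, method = "pearson")
print(round(corMatrix, 2))

# Range of each variable, to show how differently the columns are scaled
ranges <- sapply(creditData, range)
print(ranges)
```

Comparing the rows of `ranges` is a quick numeric way to spot the scale differences that the box plot shows graphically.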

From the steps above we observed large differences in scale among the variables, so we have to normalize the data to keep any one variable from dominating.

*Normalizing Data*: Normalizing puts all variables on a common scale; it does not change the shape of their distributions. Depending on the technique, high-valued variables are rescaled to a fixed range such as [0,1] (min-max normalization, used below) or to zero mean and unit variance (standardization).

normalize <- function(x) {

return((x - min(x)) / (max(x) - min(x)))

}

creditData <- as.data.frame(sapply(creditData,normalize))

After normalization, the box plot is much easier to read, and all the variables range over [0,1].
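To check the effect of min-max normalization on a tiny example, the sketch below applies the same formula to a toy data frame. The guard for constant columns and the use of `lapply` instead of `sapply` are additions for illustration, not part of the original code; `toy` and `toyNorm` are hypothetical names.

```r
# Min-max normalization, with a guard for constant columns (an addition,
# since (max - min) would be zero and the division would produce NaN)
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data with two very differently scaled columns
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 50, 30, 20))

# lapply returns a list of columns, which as.data.frame turns back into a data frame
toyNorm <- as.data.frame(lapply(toy, normalize))
print(toyNorm)

# Every column now spans exactly [0, 1]
print(sapply(toyNorm, range))
```

Column `b` becomes 0, 1, 0.5, 0.25: the ordering and relative spacing of the values are preserved, only the scale changes.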

**Credit Risk (Microsoft R)**

*Generating XDF:*

creditData_xdf <- rxImport(inData = fileInPath, outFile = "creditData.xdf",stringsAsFactors = TRUE, rowsPerRead = 10000, overwrite = TRUE)

*Descriptive Analysis:*

rxGetInfo(data = creditData_xdf,getVarInfo = TRUE)

This gives us the data type of each variable in the dataset.

*rxSummary(creditData)*: Gives a brief summary of the data, including the mean, standard deviation, minimum, maximum, and the counts of valid and missing observations for each variable.

rxSummary(formula = form , creditData_xdf)

Microsoft R has no dedicated box plot function. One can use the open source R function instead, by converting the XDF data into a data frame and calling boxplot on it.
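A possible way to do that conversion is sketched below. It assumes the RevoScaleR package is installed and that the `creditData.xdf` file from the import step exists in the working directory; if either is missing, the block simply skips the plot. Calling `rxDataStep` without an output file returns a data frame, which suits data that fits in memory.

```r
# Path written by the rxImport call shown earlier (assumption: same working directory)
xdf_path <- "creditData.xdf"
plotted <- FALSE

if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(xdf_path)) {
  # With no outFile, rxDataStep reads the XDF file into an in-memory data frame
  credit_df <- RevoScaleR::rxDataStep(inData = xdf_path)

  # Box plots need numeric columns; drop factors created by stringsAsFactors = TRUE
  numeric_df <- credit_df[sapply(credit_df, is.numeric)]

  boxplot(numeric_df, las = 2, ylab = "Value",
          main = "Box Plot for credit Risk Data")
  plotted <- TRUE
}
```

For data too large to pull into memory, plotting a sample of rows is a common compromise.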

*Correlation*: Correlation measures the pairwise linear relationship between the variables in the dataset.

rxCor(formula = paste("~", paste(allvars, collapse = "+")) , creditData_xdf)

*Normalizing Data*: In Microsoft R we normalized each feature explicitly, since we did not have a function like 'sapply' to apply across the columns of an XDF file.

normalize <- function(data) {

# Min-max scale each feature to the range [0,1]

data$Account_Balance <- (data$Account_Balance - min(data$Account_Balance))/(max(data$Account_Balance) - min(data$Account_Balance))

data$Duration_of_Credit_month <- (data$Duration_of_Credit_month - min(data$Duration_of_Credit_month))/(max(data$Duration_of_Credit_month) - min(data$Duration_of_Credit_month))

data$Payment_Status_of_Previous_Credit <- (data$Payment_Status_of_Previous_Credit - min(data$Payment_Status_of_Previous_Credit))/(max(data$Payment_Status_of_Previous_Credit) - min(data$Payment_Status_of_Previous_Credit))

return(data)

}
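One caveat is worth checking. The post does not show how this function is applied, but if it runs block by block over an XDF file (for example as a transform in rxDataStep), then min() and max() inside the function see only the current block. The base R sketch below simulates two blocks to show the difference; `normalizeWith`, `values`, and `chunks` are hypothetical names for illustration.

```r
# Normalize a vector using min/max supplied by the caller
normalizeWith <- function(x, lo, hi) {
  (x - lo) / (hi - lo)
}

values <- c(2, 4, 6, 8, 10, 12)

# Simulate two blocks of three rows each
chunks <- split(values, rep(1:2, each = 3))

# Per-chunk min/max: each block is rescaled on its own, so results are inconsistent
perChunk <- unlist(
  lapply(chunks, function(ch) normalizeWith(ch, min(ch), max(ch))),
  use.names = FALSE
)

# Global min/max: computed once and reused for every block
lo <- min(values)
hi <- max(values)
global <- unlist(lapply(chunks, normalizeWith, lo = lo, hi = hi), use.names = FALSE)

print(perChunk)  # 0.0 0.5 1.0 0.0 0.5 1.0
print(global)    # 0.0 0.2 0.4 0.6 0.8 1.0
```

With RevoScaleR, the global minimum and maximum could be taken from the rxSummary output beforehand and handed to the transform, for instance through the transformObjects argument of rxDataStep.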

**Model Identification**

In this business scenario, we need to assess credit risk, that is, to classify the creditworthiness (Credibility) of a loan to a given customer based on the customer's spending patterns and debts.

Because the outcome is a class label, we concluded that a regression model suited to classification, logistic regression, is a good fit for this kind of problem.

**Credit Risk (Open source R)**

*Building Logistic Regression Model:*

model_glm <- glm(formula = form1, data = creditData[creditData_train,], family = binomial(link = logit))

summary(model_glm)
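The post does not show how `form1` and `creditData_train` were built, so the sketch below fills those in with assumptions: a synthetic data set, a hypothetical binary target named `Creditability`, and a 70/30 random split. It is meant only to show the shape of the workflow around the glm call above.

```r
# Synthetic, already-normalized features (assumption: values are made up)
set.seed(123)
n <- 500
creditData <- data.frame(
  Account_Balance = runif(n),
  Duration_of_Credit_month = runif(n)
)

# Hypothetical binary target generated from a known logistic relationship
linpred <- -1 + 2 * creditData$Account_Balance -
  1.5 * creditData$Duration_of_Credit_month
creditData$Creditability <- rbinom(n, 1, plogis(linpred))

# Row indices for a 70% training sample
creditData_train <- sample(seq_len(n), size = round(0.7 * n))

form1 <- Creditability ~ Account_Balance + Duration_of_Credit_month

model_glm <- glm(formula = form1,
                 data = creditData[creditData_train, ],
                 family = binomial(link = "logit"))
print(summary(model_glm))

# Score the held-out rows and tabulate predictions at a 0.5 cutoff
testData <- creditData[-creditData_train, ]
predProb <- predict(model_glm, newdata = testData, type = "response")
predClass <- ifelse(predProb > 0.5, 1, 0)
print(table(Predicted = predClass, Actual = testData$Creditability))
```

The 0.5 cutoff is an arbitrary choice here; in a credit setting the threshold would normally reflect the relative cost of the two kinds of error.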

**Credit Risk (Microsoft R)**

model_glm_xdf <- rxGlm(formula = form , data = data_XDF_train,family = binomial(link = logit))

**Validate the Model**

We evaluated the model using ROC charts, with the GLM and linear regression functions. The results from both implementations are below. In fact, the accuracy from Microsoft R is higher than from open source R.
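For readers who want to see what an ROC evaluation computes, here is a minimal base R sketch. It is not the authors' evaluation code: it derives the area under the curve from the rank-sum identity instead of using the ROCR package loaded earlier, and the scores and labels are made up. `aucFromScores` and `rocPoints` are hypothetical helper names, and `rocPoints` assumes no tied scores.

```r
# AUC via the rank-sum (Mann-Whitney) identity:
# the fraction of (positive, negative) pairs where the positive scores higher
aucFromScores <- function(scores, labels) {
  pos <- labels == 1
  nPos <- sum(pos)
  nNeg <- sum(!pos)
  r <- rank(scores)  # average ranks, so tied scores count as half
  (sum(r[pos]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# ROC curve points: sweep the threshold from the highest score downward
rocPoints <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  lab <- labels[ord]
  data.frame(
    fpr = c(0, cumsum(lab == 0) / sum(lab == 0)),
    tpr = c(0, cumsum(lab == 1) / sum(lab == 1))
  )
}

# Made-up predicted probabilities and true classes
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
labels <- c(1, 1, 0, 1, 0, 0)

auc <- aucFromScores(scores, labels)
roc <- rocPoints(scores, labels)
print(auc)
print(roc)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. With ROCR, the usual route is `prediction()` followed by `performance()` with the "tpr" and "fpr" measures.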

**Credit Risk (Open source R)**

**Credit Risk (Microsoft R)**

Looking at the results for the same data set with open source R and Microsoft R, it was evident that both implementations match in terms of performance and accuracy. The parallelized functions that handle the XDF data load are very similar to their open source R counterparts and simple to use. We extended our test to more data sets to observe how performance improves when using the rxXXX functions in Microsoft R, and here are the results:

If you look at the time taken for model execution with the small dataset (highlighted in yellow), Microsoft R takes slightly more time. This is because Microsoft R Server (MRS) is optimized for large data sets, to address the limitations of open source R, and hence performs better on large data sets than on smaller ones. Once we loaded a large credit risk dataset with 294K records, we observed impressive performance improvements (highlighted in green) for the Microsoft R model, on the scale of 1.5x to 5x in comparison to the open source R model.
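To run a similar comparison on your own machine, a simple timing harness like the one below may be a starting point. It is a sketch under stated assumptions, not the benchmark used for the results above: the data is synthetic, the row counts are arbitrary, and only base R glm is timed. `timeGlm` is a hypothetical helper; an rxGlm call on an XDF file could be timed with system.time in the same way.

```r
# Time a logistic regression fit on n synthetic rows; returns elapsed seconds
timeGlm <- function(n) {
  set.seed(1)
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.25 * d$x2))
  elapsed <- system.time(
    glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))
  )["elapsed"]
  unname(elapsed)
}

# Arbitrary data set sizes for illustration
sizes <- c(1000, 10000, 100000)
timings <- data.frame(rows = sizes, seconds = sapply(sizes, timeGlm))
print(timings)
```

Timings vary from run to run, so repeating each measurement a few times and taking the median gives a steadier comparison.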

We are using the above observations to drive the value proposition of Microsoft R with our customers, and we hope these results encourage you to take advantage of Microsoft R in solving your big data problems.

Feel free to download the complete source code for both models discussed above here: Source Code