Goal:
Understand and apply logistic regression and decision trees to predict the probability of loan default and to minimize loss for the company.

Exploratory Data Analysis

{r setup, include=TRUE}
knitr::opts_chunk$set(echo = TRUE)
# Load packages (dplyr and readr are part of the tidyverse)
install.packages("tidyverse")
library(dplyr)
library(readr)
# Read the loan data and inspect its structure and summary statistics
loanData <- readRDS("/Users/mengshuzhang/Desktop/Durham/Spring1/fraud/assignment 3/LoanData.rds")
str(loanData)
summary(loanData)
# Cross-tabulate credit grade against default status to compare default rates by grade
install.packages("gmodels")
library(gmodels)
CrossTable(loanData$creditGrade, loanData$isLoanDefault)
The letter grade A is the best (highest) credit rating: it has the lowest default rate (0.059), and the default rate increases with each subsequent letter grade.
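
As a cross-check, the same default rates can be computed directly with dplyr (a minimal sketch, assuming isLoanDefault is coded as a 0/1 numeric):

# Default rate per credit grade (assumes isLoanDefault is a 0/1 indicator)
loanData %>%
  group_by(creditGrade) %>%
  summarise(nLoans = n(), defaultRate = mean(isLoanDefault))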

Modeling

Logistic Regression

Run a logistic regression on loanTrain data and store the result in glmBase.

# Baseline logistic regression using all available predictors
glmBase <- glm(isLoanDefault ~ ., data = loanTrain, family = "binomial")
summary(glmBase)
Most significant features: creditGrade, incomeAnnual, naEmploymentYears.
Mildly significant features: interestRate, employmentYears.
Least significant features: loanAmount, homeLiving, ageYears, naInterestRate.
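
These rankings can be read off directly by sorting the coefficient table from summary() by p-value (a minimal sketch; the relevant column of the glm summary is Pr(>|z|)):

# Rank coefficients of glmBase by p-value (smallest = most significant)
coefTable <- summary(glmBase)$coefficients
coefTable[order(coefTable[, "Pr(>|z|)"]), ]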
# Predicted default probabilities for the test set
predictionsBase <- predict(glmBase, newdata = loanTest, type = "response")
loanTest <- cbind(loanTest, predictionsBase)
# Summary of the predicted probabilities
summary(loanTest[, "predictionsBase"])
# Summary of prediction errors (predicted probability minus actual outcome)
summary(predictionsBase - loanTest$isLoanDefault)
rbind(predictionsBase = summary(loanTest[, "predictionsBase"]),
      isLoanDefault = summary(loanTest[, "isLoanDefault"]),
      difference = summary(predictionsBase - loanTest$isLoanDefault))
A mean close to zero and a small median are a good early sign. However, the minimum difference of -0.997 shows that the model badly missed at least one bad loan, assigning a probability of only about 0.003 (1 - 0.997) to a loan that actually defaulted. The maximum of 0.466 is also not ideal: the model assigned a fairly high default probability to a loan that was repaid.
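
To look at those worst cases directly, the rows with the largest absolute difference between prediction and outcome can be pulled out (a minimal sketch; absError is a helper name introduced here, not part of the original analysis):

# Ten test loans where the predicted probability is furthest from the actual outcome
absError <- abs(predictionsBase - loanTest$isLoanDefault)
head(loanTest[order(absError, decreasing = TRUE), c("isLoanDefault", "predictionsBase")], 10)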

Decision Tree

To address the class imbalance, we trick R by giving it a rebalanced training set, loanTrain_balanced, in which 1/3 of the cases are defaults and 2/3 are non-defaults.
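The construction of loanTrain_balanced is not shown above; one way to build it is to keep all defaults and undersample the non-defaults so that defaults make up one third of the result (a minimal sketch; the seed and the exact sampling used in the original analysis may differ):

set.seed(42)  # arbitrary seed, only for reproducibility of the undersampling
defaults <- loanTrain[loanTrain$isLoanDefault == 1, ]
nonDefaults <- loanTrain[loanTrain$isLoanDefault == 0, ]
# Keep all defaults and sample twice as many non-defaults (1/3 defaults, 2/3 non-defaults)
nonDefaultsSample <- nonDefaults[sample(nrow(nonDefaults), 2 * nrow(defaults)), ]
loanTrain_balanced <- rbind(defaults, nonDefaultsSample)
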

library(rpart)
# Classification tree on the balanced data; the loss matrix makes missing an
# actual default twice as costly as flagging a good loan as a default
tree_defaults_balanced_parms_lossmatrix <- rpart(isLoanDefault ~ ., method = "class", data = loanTrain_balanced, control = rpart.control(cp = 0.001), parms = list(loss = matrix(c(0, 2, 1, 0), ncol = 2)))
plot(tree_defaults_balanced_parms_lossmatrix, uniform = TRUE)
text(tree_defaults_balanced_parms_lossmatrix, cex = 0.25)
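
The chunk above only plots the tree; to actually use it, class predictions on the test set can be generated with predict() (a small sketch; this is not part of the original evaluation below, which uses the logistic model):

# Predicted classes from the loss-matrix tree on the test set
treePredictions <- predict(tree_defaults_balanced_parms_lossmatrix, newdata = loanTest, type = "class")
table(predicted = treePredictions, actual = loanTest$isLoanDefault)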

Evaluation

I based my cutoff on the fact that 11.21% of the test cases are bad loans. Specifically, I sorted the predictionsBase vector in non-increasing order (using sort() with decreasing = TRUE) and then set the cutoff equal to the element at position as.integer(0.1121 * length(sortedPredictionBase)) of the sorted vector.

# Sort predicted probabilities from highest to lowest
sortedPredictionBase <- sort(predictionsBase, decreasing = TRUE)
# Compute the cutoff so that about 11.21% of test cases are flagged as defaults
cutoff <- sortedPredictionBase[as.integer(0.1121 * length(sortedPredictionBase))]
# Flag loans whose predicted probability exceeds the cutoff
isPrediction <- as.integer(cutoff < predictionsBase)
loanTest <- cbind(isPrediction, loanTest)
summary(loanTest)

Confusion Matrix

# Confusion matrix counts: true/false negatives and positives
tneg <- sum((0 == loanTest$isPrediction) & (0 == loanTest$isLoanDefault))
fneg <- sum((0 == loanTest$isPrediction) & (1 == loanTest$isLoanDefault))
fpos <- sum((1 == loanTest$isPrediction) & (0 == loanTest$isLoanDefault))
tpos <- sum((1 == loanTest$isPrediction) & (1 == loanTest$isLoanDefault))
library(caret)
confusionMatrix(data = as.factor(loanTest$isPrediction), reference = as.factor(loanTest$isLoanDefault))
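
The counts computed above can also be turned into the usual rates by hand, which makes explicit what is being reported (a small sketch using the variables defined in the chunk above; here defaults (1) are treated as the positive class, whereas confusionMatrix() without a positive argument takes the first factor level, "0", as positive):

# Manual metrics from the confusion-matrix counts
accuracy    <- (tpos + tneg) / (tpos + tneg + fpos + fneg)
sensitivity <- tpos / (tpos + fneg)  # share of actual defaults that were caught
specificity <- tneg / (tneg + fpos)  # share of good loans correctly left alone
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)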

Feature Engineering

Top 5 predictors of bad loans

# Logistic regression restricted to the five strongest predictors
glmTop5 <- glm(isLoanDefault ~ interestRate + creditGrade + employmentYears + incomeAnnual + naEmploymentYears, family = "binomial", data = loanTrain)
summary(glmTop5)
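
To compare this reduced model with glmBase, its test-set probabilities can be summarised the same way as before (a minimal sketch following the pattern used earlier; predictionsTop5 is a name introduced here):

# Predicted default probabilities from the reduced model on the test set
predictionsTop5 <- predict(glmTop5, newdata = loanTest, type = "response")
rbind(predictionsTop5 = summary(predictionsTop5),
      predictionsBase = summary(loanTest[, "predictionsBase"]))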