Kaggle DataSet - Give Me Some Credit

Description:

A brief look at the Give Me Some Credit dataset from Kaggle. Incorrect data were imputed and outliers removed. A 2-fold cross-validation was then applied and various models were fitted, including Logistic Regression, Random Forest and XGBoost. An Ensemble and a Stacking approach were also tried.

Introduction

The 3-month contest run by Kaggle in 2011, called Give Me Some Credit (GMSC), involves predicting the probability that a borrower will, within 2 years, experience a serious delinquency, i.e. a payment 90 days or more past the due date. There are 11 columns of historical data covering about 250,000 anonymous borrowers, occupying 15MB of hard drive space (5MB compressed). The dataset is split into a labelled training set of 150,000 examples (10,026 positive and 139,974 negative) and a test set of 101,503 rows. The following table summarises the numeric dataset, with one response (target) variable, SeriousDlqin2yrs, and 10 explanatory variables (features).
| Variable Name | Description | Data Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse. | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans) divided by the sum of credit limits. | percentage |
| age | Age of borrower in years. | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income. | percentage |
| MonthlyIncome | Monthly income. | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards). | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit. | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.). | integer |
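
As a point of reference, the sketch below loads the two Kaggle files and confirms the shapes and class counts quoted above; the file names cs-training.csv / cs-test.csv and the leading unnamed index column are assumptions about the standard Kaggle download.

```python
import pandas as pd

# Assumed file names for the standard Kaggle download of GMSC.
train = pd.read_csv("cs-training.csv", index_col=0)
test = pd.read_csv("cs-test.csv", index_col=0)

print(train.shape)                               # expect (150000, 11)
print(train["SeriousDlqin2yrs"].value_counts())  # ~139,974 zeros vs ~10,026 ones
```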

Since class labels are provided for the single response variable SeriousDlqin2yrs, taking values 0 (no default) or 1 (default), supervised-learning classification models are a natural initial approach: logistic regression as the base model, improved upon with an ensemble of two-class boosted decision trees and stacking techniques involving neural networks.

Given the nature of the problem, the following analysis approach makes sense.

Exploratory Data Analysis

Summary descriptive statistics of the features.

| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 120,269.00 | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 146,076.00 |
| mean | 0.07 | 6.05 | 52.30 | 0.42 | 353.01 | 6,670.22 | 8.45 | 0.27 | 1.02 | 0.24 | 0.76 |
| std | 0.25 | 249.76 | 14.77 | 4.19 | 2,037.82 | 14,384.67 | 5.15 | 4.17 | 1.13 | 4.16 | 1.12 |
| min | - | - | - | - | - | - | - | - | - | - | - |
| 25% | - | 0.03 | 41.00 | - | 0.18 | 3,400.00 | 5.00 | - | - | - | - |
| 50% | - | 0.15 | 52.00 | - | 0.37 | 5,400.00 | 8.00 | - | 1.00 | - | - |
| 75% | - | 0.56 | 63.00 | - | 0.87 | 8,249.00 | 11.00 | - | 2.00 | - | 1.00 |
| max | 1.00 | 50,708.00 | 109.00 | 98.00 | 329,664.00 | 3,008,750.00 | 58.00 | 98.00 | 54.00 | 98.00 | 20.00 |
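
A minimal sketch of reproducing this table with pandas, using the same assumed file name as above:

```python
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)  # assumed file name

# count, mean, std, min, quartiles and max for every column, as in the table above
summary = train.describe()
print(summary.round(2).to_string())
```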

To examine potential feature reduction, the following shows a correlation heatmap of the original data; red implies values are correlated.
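
A sketch of how such a heatmap can be produced; seaborn and matplotlib are assumptions here, since the plotting library is not stated.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

train = pd.read_csv("cs-training.csv", index_col=0)  # assumed file name

# Pairwise correlations of all numeric columns, drawn as a diverging heatmap.
plt.figure(figsize=(10, 8))
sns.heatmap(train.corr(), cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Correlation heatmap of the GMSC training data")
plt.tight_layout()
plt.show()
```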

The variable SeriousDlqin2yrs is labelled with 0's and 1's only, with no other values. The class imbalance is 6.7% (a ratio of about 1:14), as shown by the following pie chart. As the majority class of 0's could dominate the prediction, increasing the relative proportion of the minority class of 1's was also tried, by downsampling the dominant class and upsampling the minority class.
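
A sketch of the rebalancing idea with sklearn.utils.resample; the target sample sizes below are purely illustrative, since the ratios actually used are not stated.

```python
import pandas as pd
from sklearn.utils import resample

train = pd.read_csv("cs-training.csv", index_col=0)  # assumed file name
majority = train[train["SeriousDlqin2yrs"] == 0]
minority = train[train["SeriousDlqin2yrs"] == 1]

# Downsample the dominant 0 class and upsample the minority 1 class.
majority_down = resample(majority, replace=False, n_samples=40_000, random_state=42)
minority_up = resample(minority, replace=True, n_samples=20_000, random_state=42)

rebalanced = pd.concat([majority_down, minority_up]).sample(frac=1, random_state=42)
print(rebalanced["SeriousDlqin2yrs"].value_counts(normalize=True))
```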

For the age variable, the minimum age is expected to be 21; there is only one value below this, an age of 0, which can be adjusted to the average value of the distribution. For the variable "NumberOfTime30-59DaysPastDueNotWorse", two years divided into 60-day increments suggests a maximum value of roughly 12-24, and this is what the dataset exhibits except for a cluster of points at the values 96 and 98. These are either strange mistakes or possibly dataset-specific codes with some other meaning such as 'unknown'. The variables "NumberOfTimes90DaysLate" and "NumberOfTime60-89DaysPastDueNotWorse" also contain these 96 and 98 pillars of data, which showed up in the correlation between the three variables. To accommodate this, Winsorizing the values down to the previous largest value was considered, but the median value was chosen instead, with the aim of having less impact on the sample distribution. After altering the 96 and 98 values, the significant inter-correlation between NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse drops away.
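
A sketch of these cleaning steps, assuming the same file name as above:

```python
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)

# The single age of 0 is treated as an error and replaced by the mean age.
train.loc[train["age"] == 0, "age"] = round(train.loc[train["age"] > 0, "age"].mean())

# The 96/98 codes in the three past-due counters are replaced by each column's median.
past_due_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
    "NumberOfTime60-89DaysPastDueNotWorse",
]
for col in past_due_cols:
    median = train.loc[~train[col].isin([96, 98]), col].median()
    train.loc[train[col].isin([96, 98]), col] = median
```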

The feature MonthlyIncome has 29,731 missing values, or 19.8% of the 150,000 examples, with the bulk of the distribution around 5,000 and a long right tail.

NumberOfDependents has 3,924 missing values, or 2.6% of the dataset; these were set to zero. There is a jump from 6 to 20 dependents, with 244 examples in that 99.9th-percentile bin. All values are whole numbers, so there are no fractional dependents.
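
A sketch of the missing-value handling; NumberOfDependents is set to zero as described, while the imputation rule for MonthlyIncome is not stated, so the median is assumed here.

```python
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)  # assumed file name

train["NumberOfDependents"] = train["NumberOfDependents"].fillna(0)
train["MonthlyIncome"] = train["MonthlyIncome"].fillna(train["MonthlyIncome"].median())

print(train.isna().sum())  # no missing values should remain
```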

Model Estimations

The training data were split for 2-fold cross-validation, with 75% of samples drawn at random for training and the remainder used for validation/testing. Logistic regression is taken as the base model against which the others are compared. The out-of-sample fit on unseen data gave an AUC of 0.8117, with the following ROC curve.
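
A sketch of the baseline fit; the 75/25 split matches the text, while the simplified imputation and the feature scaling are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("cs-training.csv", index_col=0).fillna(0)  # simplified imputation
X = train.drop(columns="SeriousDlqin2yrs")
y = train["SeriousDlqin2yrs"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print("Validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```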


The important features according to the Logistic model are shown in the following figure.

The XGBoost estimation provided an improvement over logistic regression, AdaBoost and a standard decision tree, generating an AUC of 0.86146 and the following ROC curve.
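
A sketch of an XGBoost fit on the same split; the hyperparameters shown are illustrative, not the values used in the analysis.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

train = pd.read_csv("cs-training.csv", index_col=0).fillna(0)  # assumed file name
X = train.drop(columns="SeriousDlqin2yrs")
y = train["SeriousDlqin2yrs"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Illustrative hyperparameters; a tuned model would search over these.
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="auc")
xgb.fit(X_tr, y_tr)
print("Validation AUC:", roc_auc_score(y_val, xgb.predict_proba(X_val)[:, 1]))
```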

Summary table of the fitted models, with training and validation (out-of-sample) AUC from the 2-fold cross-validation.

| Model | AUC - Training | AUC - Validation |
|---|---|---|
| Logistic Regression | 0.8161 | 0.8161 |
| Decision Tree | 0.61 | 0.601 |
| Random Forest | 0.8628 | 0.862 |
| Gradient Boosting | 0.8628 | 0.861 |
| XGBoost | 0.878 | 0.8615 |
| Ensemble | 0.8316 | |
| Stacking | | 0.6462 |
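
The Ensemble and Stacking rows could, for instance, be reproduced with scikit-learn's VotingClassifier and StackingClassifier; the base learners, the soft-voting rule and the MLP meta-learner below are assumptions, since the exact configurations are not given.

```python
import pandas as pd
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

train = pd.read_csv("cs-training.csv", index_col=0).fillna(0)  # assumed file name
X = train.drop(columns="SeriousDlqin2yrs")
y = train["SeriousDlqin2yrs"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Assumed base learners shared by both the voting ensemble and the stack.
base = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]
ensemble = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)
stacking = StackingClassifier(
    estimators=base, final_estimator=MLPClassifier(max_iter=500, random_state=42)
).fit(X_tr, y_tr)

for name, model in [("Ensemble (soft voting)", ensemble), ("Stacking", stacking)]:
    print(name, roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```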

Conclusions

XGBoost performs quite well out of the box, as does Random Forest. It could make sense to further investigate combining these models in a well-tuned ensemble or stacking method, and even to explore the impact of a deep-learning algorithm. From the 150,000 samples, the population estimate of credit default is 6.7%. If the average loan size is US$1,000 per customer, then per 100,000 customers, assuming the amounts are fully unrecoverable, this equates to a full potential loss of US$6.7mn. Being able to accurately predict default probabilities, both at the outset and for existing customers, therefore directly impacts a loan firm's bottom line. Even a 50% improvement in default estimation, from 6.7% to 3.35%, would reduce the loss to US$3.35mn. In practice the implementation is more likely to reduce loan sizes depending on the probability of default, so as not to turn customers away, and revenues can still be generated before a default occurs.
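
The back-of-the-envelope figures above amount to the following calculation:

```python
customers = 100_000
avg_loan = 1_000      # US$ per customer, assumed fully unrecoverable on default
default_rate = 0.067  # observed 6.7% delinquency rate

print(f"Potential loss at 6.70%: US${customers * avg_loan * default_rate / 1e6:.2f}mn")
print(f"Potential loss at 3.35%: US${customers * avg_loan * default_rate / 2 / 1e6:.2f}mn")
```
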
A model from this analysis could be used in an online credit-line approval process, as well as provided to product and client relationship managers to detect ongoing potential credit threats. Complex implementations, such as those involving stacking with deep-learning models, require time for hyperparameter tuning. Further, the production implementation of these models steps somewhat away from the intuitive approach of regression models such as logistic regression.

Future Analysis

The extreme examples may be the ones that default more, and this extreme part of the distribution may be worth estimating separately from the rest. Traditionally, ensemble techniques, including ideas such as stacking, win Kaggle competitions. These techniques require hyper-parameter tuning, which can occupy considerable time and was out of scope for this exercise, but could be something to consider in the future.
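
As an illustration of the kind of tuning meant here, a minimal GridSearchCV sketch over an assumed XGBoost grid:

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

train = pd.read_csv("cs-training.csv", index_col=0).fillna(0)  # assumed file name
X = train.drop(columns="SeriousDlqin2yrs")
y = train["SeriousDlqin2yrs"]

param_grid = {  # illustrative grid, not the one used in the analysis
    "max_depth": [3, 4, 6],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="auc"),
    param_grid,
    scoring="roc_auc",
    cv=2,       # matching the 2-fold setup used above
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```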

Requirements

Python (>3.4), Jupyter, scikit-learn and XGBoost.
