Variable Name | Description | Data Type |
---|---|---|
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse. | Y/N |
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits. | percentage |
Age | Age of borrower in years. | integer |
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income. | percentage |
MonthlyIncome | Monthly income. | real |
NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards). | integer |
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit. | integer |
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
NumberOfDependents | Number of dependents in the family, excluding the borrower themselves (spouse, children, etc.). | integer |
Class labels are provided for the single response variable SeriousDlqin2yrs, which takes the value 0 (no default) or 1 (default). Supervised-learning classification models are therefore a natural initial approach: logistic regression as the base model, improved upon with an ensemble of two-class boosted decision trees and stacking techniques involving neural networks.
Given the nature of the problem, the following analysis approach was taken.
Summary descriptive statistics of the features.
Statistic | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 120,269.00 | 150,000.00 | 150,000.00 | 150,000.00 | 150,000.00 | 146,076.00 |
mean | 0.07 | 6.05 | 52.30 | 0.42 | 353.01 | 6,670.22 | 8.45 | 0.27 | 1.02 | 0.24 | 0.76 |
std | 0.25 | 249.76 | 14.77 | 4.19 | 2,037.82 | 14,384.67 | 5.15 | 4.17 | 1.13 | 4.16 | 1.12 |
min | - | - | - | - | - | - | - | - | - | - | - |
25% | - | 0.03 | 41.00 | - | 0.18 | 3,400.00 | 5.00 | - | - | - | - |
50% | - | 0.15 | 52.00 | - | 0.37 | 5,400.00 | 8.00 | - | 1.00 | - | - |
75% | - | 0.56 | 63.00 | - | 0.87 | 8,249.00 | 11.00 | - | 2.00 | - | 1.00 |
max | 1.00 | 50,708.00 | 109.00 | 98.00 | 329,664.00 | 3,008,750.00 | 58.00 | 98.00 | 54.00 | 98.00 | 20.00 |
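A minimal pandas sketch of how a summary like the one above can be produced. The file name `cs-training.csv` is an assumption about the local data path.

```python
import pandas as pd

# Load the training data; the file name is an assumption, adjust as needed.
df = pd.read_csv("cs-training.csv", index_col=0)

# Summary statistics in the same layout as the table above
# (count, mean, std, min, quartiles, max per feature).
print(df.describe().round(2).to_string())
```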
To examine the scope for feature reduction, the following shows a correlation heatmap of the original data. Red implies values are correlated.
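A sketch of how such a heatmap can be drawn, continuing from the `df` loaded above; the figure size and palette are illustrative choices.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation of the raw features; red cells in the "coolwarm"
# palette indicate positive correlation.
corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Correlation heatmap of the original data")
plt.tight_layout()
plt.show()
```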
The variable SeriousDlqin2yrs is labelled with 0s and 1s only, with no other values. The class imbalance is 6.7% (a ratio of roughly 1:14), as shown by the following pie chart. Because the majority class of 0s could dominate the prediction, increasing the relative proportion of the minority class of 1s was also tried, combining downsampling of the dominant class with upsampling of the minority class.
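A small sketch of checking the class balance and one simple rebalancing option, continuing from `df`. The 4:1 target ratio below is purely illustrative, not the ratio used in the original analysis.

```python
import pandas as pd
from sklearn.utils import resample

# Class balance: roughly 6.7% of rows carry the label 1 (default).
counts = df["SeriousDlqin2yrs"].value_counts()
print(counts / counts.sum())

# Raise the relative weight of the minority class by downsampling
# the 0s to a chosen multiple of the 1s (illustrative 4:1 ratio).
majority = df[df["SeriousDlqin2yrs"] == 0]
minority = df[df["SeriousDlqin2yrs"] == 1]
majority_down = resample(majority, replace=False,
                         n_samples=4 * len(minority), random_state=42)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
```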
For the age variable, the minimum age should plausibly be 21; there is only one value below this, an age of 0, and it can be adjusted to the average of the distribution. For the variable "NumberOfTime30-59DaysPastDueNotWorse", two years divided into 30-60 day increments suggests a maximum value of about 12-24, and this is what the data set exhibits except for a cluster of points at the values 96 and 98. These are either data-entry mistakes or values specific to the dataset with some other meaning, such as 'unknown'. The variables "NumberOfTimes90DaysLate" and "NumberOfTime60-89DaysPastDueNotWorse" also contain these 96 and 98 pillars of data, which showed up as correlation between the three variables. To accommodate this, Winsorizing the values down to the previous largest value was considered, but the median value was chosen instead, with the aim of having less impact on the sample distribution. After altering the 96 and 98 values, the significant inter-correlation between NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse drops away.
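A sketch of the adjustments described above, continuing from `df` (the age column is lowercase `age` in the summary table).

```python
# Adjust the single implausible age (0) to the average of the distribution.
mean_age = df.loc[df["age"] >= 21, "age"].mean()
df.loc[df["age"] < 21, "age"] = round(mean_age)

# Replace the 96/98 sentinel values in the three delinquency counters
# with the median of the remaining (plausible) values.
late_cols = ["NumberOfTime30-59DaysPastDueNotWorse",
             "NumberOfTimes90DaysLate",
             "NumberOfTime60-89DaysPastDueNotWorse"]
for col in late_cols:
    median_valid = df.loc[df[col] < 96, col].median()
    df.loc[df[col] >= 96, col] = median_valid
```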
The feature MonthlyIncome has 29,731 missing values, or 19.8% of the 150,000 examples, with the bulk of the distribution around 5,000 and a long right tail.
NumberOfDependents had 3,924 missing values, or 2.6% of the dataset; these values were set to zero. There is a jump from 6 to 20 dependents, with 244 examples in that 99.9th-percentile bin. All values were whole numbers, so there were no fractional dependents.
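A sketch of the missing-value handling, continuing from `df`. The text above only specifies that NumberOfDependents was set to zero; median imputation for MonthlyIncome is shown purely as an assumed placeholder.

```python
# Count the gaps in the two affected columns.
print(df[["MonthlyIncome", "NumberOfDependents"]].isnull().sum())

# NumberOfDependents: missing values set to zero, as described above.
df["NumberOfDependents"] = df["NumberOfDependents"].fillna(0)

# MonthlyIncome: imputation method not specified above; median used here
# only as an assumed placeholder.
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())
```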
The training data was split into a 2-fold cross-validation scheme: a random 75% sample for training and the remaining 25% for validation/test. Logistic regression is taken as the base model against which the others are compared. The out-of-sample fit on unseen data gave an AUC of 0.8117, with the following ROC curve.
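A minimal sketch of the split and the base logistic regression model, continuing from `df`. Feature scaling and the stratified split are added assumptions to keep the example well behaved, not details stated above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="SeriousDlqin2yrs")
y = df["SeriousDlqin2yrs"]

# Random 75% / 25% split for training vs. validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Logistic regression as the base model.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

val_prob = logit.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_prob))
fpr, tpr, _ = roc_curve(y_val, val_prob)  # points for plotting the ROC curve
```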
The important features according to the Logistic model are shown in the following figure.
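One way to produce such a ranking is from the fitted coefficients, continuing from the pipeline above; with standardized inputs, the absolute coefficient size gives a rough importance ordering.

```python
import numpy as np
import pandas as pd

# Rank features by absolute coefficient size in the logistic model.
coefs = logit.named_steps["logisticregression"].coef_[0]
importance = pd.Series(np.abs(coefs), index=X.columns).sort_values(ascending=False)
print(importance)
```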
The XGBoost estimation provided an improvement over logistic regression, AdaBoost and a standard decision tree, generating an AUC of 0.86146 and the following ROC curve.
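A sketch of the XGBoost fit on the same split; the hyperparameters below are illustrative defaults, not the tuned values behind the 0.86146 figure.

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Gradient-boosted trees via XGBoost with illustrative settings.
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    eval_metric="auc", random_state=42)
xgb.fit(X_train, y_train)
print("Validation AUC:", roc_auc_score(y_val, xgb.predict_proba(X_val)[:, 1]))
```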
The following table summarises the fitted models with their training and validation (out-of-sample) AUC on the 2-fold cross-validation data.
Model | AUC - Training | AUC - Validation |
---|---|---|
Logistic Regression | 0.8161 | 0.8161 |
Decision Tree | 0.61 | 0.601 |
Random Forest | 0.8628 | 0.862 |
Gradient Boosting | 0.8628 | 0.861 |
XGBoost | 0.878 | 0.8615 |
Ensemble | 0.8316 | |
Stacking | 0.6462 | |
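A compact sketch of fitting the remaining tree-based models from the table on the same split and reporting both AUC columns; the hyperparameters are illustrative rather than the settings used for the figures above.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Fit each model and report training vs. validation AUC for comparison.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: train AUC {auc_train:.4f}, validation AUC {auc_val:.4f}")
```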
A model from this analysis could be used in an online credit-line approval process, and could also be provided to product and client relationship managers to detect ongoing potential credit threats. Complex implementations, such as those involving stacking with deep learning models, require time for hyperparameter tuning. Further, the production implementation of such models steps somewhat away from the intuitive appeal of regression models such as logistic regression. The extreme examples may be the ones that default more often, and this extreme part of the distribution may be worth estimating separately from the rest. Traditionally, ensemble techniques, including ideas such as stacking, win Kaggle competitions. These techniques require hyperparameter tuning, which can occupy considerable time and was out of scope for this exercise, but could be something to consider in the future.
Python (>3.4), Jupyter and scikit-learn.