Predicting the probability of an applicant paying back a loan. This repository analyzes data from different types of personal loans and applies machine learning algorithms to build a credit risk predictor.
The goal is to predict whether a loan application should be approved, based on the probability of credit default. We use the following models:
- Random Forest
- XGBoost with Incremental Learning
To install all required Python packages, run the following command in a Linux terminal:

```bash
pip install -r requirements.txt
```
The dataset contains over 300,000 personal home loans.
- Each row represents one loan.
Data preprocessing is an essential step in preparing the data for analysis and modeling: it transforms the raw data into a format suitable for machine learning algorithms. In this project, we followed the preprocessing steps below:
- Handling outliers: we used the Z-score to identify outliers in the numerical features. The corresponding module can be imported with:

```python
from feature_engineering import outliers
```
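The repository's `outliers` module is project-specific, so the snippet below is only a minimal sketch of what Z-score filtering typically looks like; the function name, threshold, and column are illustrative, not the actual implementation:

```python
import numpy as np
import pandas as pd
from scipy import stats

def flag_outliers_zscore(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.Series:
    """Mark rows whose absolute z-score in `column` exceeds the threshold."""
    z = np.abs(stats.zscore(df[column], nan_policy="omit"))
    return pd.Series(z > threshold, index=df.index)

# Toy usage: rows with |z| > 3 would be dropped before modeling.
loans = pd.DataFrame({"annual_income": [40_000, 55_000, 62_000, 1_000_000]})
loans_clean = loans.loc[~flag_outliers_zscore(loans, "annual_income")]
```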
- Handling missing values: data can be missing for a reason, so it is important to understand the properties of the missing values.
We used the missingno Python library to analyze and visualize the missing values. For some features with missing values, we created an extra column indicating whether a value is missing.
We then compared model performance across different imputation techniques (a sketch follows the list below). Imputation techniques used:
- Median and Mode Imputation
- MICE (Multivariate Imputation by Chained Equation)
- Median and Mode Imputation combined with Mean Imputation
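Below is a minimal sketch of these techniques on a toy frame; the column names are made up, and scikit-learn's `IterativeImputer` stands in here for MICE:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy frame; the real loan columns differ.
df = pd.DataFrame({
    "loan_amount": [10_000, 25_000, np.nan, 15_000],
    "employment_length": [2.0, np.nan, 7.0, 4.0],
    "home_ownership": ["RENT", "OWN", np.nan, "RENT"],
})

# Indicator column: keep the fact of "missingness" as its own feature.
df["employment_length_missing"] = df["employment_length"].isna().astype(int)

num_cols = ["loan_amount", "employment_length"]
cat_cols = ["home_ownership"]

# Option A: median for numerical features, mode for categorical features.
df_simple = df.copy()
df_simple[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df_simple[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Option B: MICE-style imputation; IterativeImputer models each incomplete
# numerical feature as a function of the others in round-robin fashion.
df_mice = df.copy()
df_mice[num_cols] = IterativeImputer(max_iter=10, random_state=0).fit_transform(df[num_cols])
```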
We checked the target variable for class imbalance (a quick check is sketched below).
You can find all the steps we implemented in this section in `feature-engeering.ipynb`.
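A quick way to inspect the class balance (the target name below is made up):

```python
import pandas as pd

# "default" stands in for the actual target column.
y = pd.Series([0, 0, 0, 0, 1, 0, 0, 1], name="default")
print(y.value_counts(normalize=True))  # fraction of each class
```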
First, we used One-Hot Encoding for categorical data that has no hierarchical structure. For categorical data with a hierarchy, we used Ordinal Encoding.
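For example, a sketch with made-up columns, where `home_ownership` is nominal and `grade` has a natural order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "grade": ["B", "A", "C", "A"],
})

# One-hot encoding for nominal categories (no implied order).
df = pd.get_dummies(df, columns=["home_ownership"])

# Ordinal encoding with an explicit category order (A < B < C).
encoder = OrdinalEncoder(categories=[["A", "B", "C"]])
df["grade"] = encoder.fit_transform(df[["grade"]]).ravel()
```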
We applied the following algorithms separately to select important features (an example follows the list):
- Recursive Feature Elimination (RFE)
- Univariate Feature Selection: ANOVA F-value
- Information Value (IV) and Weight of Evidence (WoE)
- Correlation
- Variance Threshold
- Boruta
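As an example, here is a sketch of the first two methods on synthetic data; the remaining methods follow the same fit-then-filter pattern:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# RFE: repeatedly fit the estimator and drop the weakest features
# until only n_features_to_select remain.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8).fit(X, y)
print(rfe.support_)  # boolean mask of retained features

# Univariate selection scored by the ANOVA F-value.
anova = SelectKBest(score_func=f_classif, k=8).fit(X, y)
print(anova.get_support())
```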
We used the ROC AUC score as the main metric to evaluate model performance.
To find the best classification threshold, we calculated the threshold that maximizes Youden's J statistic (J = TPR - FPR).
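A small sketch of both steps with scikit-learn (the labels and probabilities are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.20, 0.80])

print("ROC AUC:", roc_auc_score(y_true, y_prob))

# Youden's J statistic: J = TPR - FPR, maximized over all thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("Best threshold:", thresholds[np.argmax(tpr - fpr)])
```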
We used three different methods to evaluate feature importance (a sketch follows the list):
- Scikit-learn's feature importance: averaging the decrease in impurity over trees
- Permutation Feature Importance: based on how randomly re-shuffling each predictor influences model performance
- SHAP
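A sketch of the first two methods on synthetic data; SHAP values work analogously through the `shap` package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importance: mean decrease in impurity, averaged over trees.
print(model.feature_importances_)

# Permutation importance: the drop in ROC AUC when each feature is shuffled.
result = permutation_importance(model, X_test, y_test,
                                scoring="roc_auc", n_repeats=10, random_state=0)
print(result.importances_mean)
```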
- The best model achieved a ROC AUC score of 0.69.
- The PR AUC score is 140% better than the baseline model (because of the class imbalance in the target variable, we also computed the Precision-Recall curve; the baseline PR AUC for a random classifier equals the positive-class prevalence).