Mortgages, student and auto loans, and debt consolidation are just a few examples of credit and loans that people seek online. Peer-to-peer lending services such as Loans Canada and Mogo let investors lend money to people without using a bank. However, because investors always want to mitigate risk, a client has asked me to help them predict credit risk with machine learning techniques.
In this assignment, I built and evaluated several machine learning models to predict credit risk using data you'd typically see from peer-to-peer lending services. Credit risk is an inherently imbalanced classification problem (the number of good loans is much larger than the number of at-risk loans), so I employed different techniques for training and evaluating models with imbalanced classes. I used the imbalanced-learn and scikit-learn libraries to build and evaluate models using the following two techniques:
In this Jupyter Notebook, I used the imbalanced-learn library to resample the LendingClub data, then built and evaluated logistic regression classifiers using the resampled data.
The notebook consists of:
- Reading the CSV into a DataFrame.
- Splitting the data into training and testing sets.
- Scaling the training and testing data using the `StandardScaler` from `sklearn.preprocessing`.
- Using the provided code to run a simple logistic regression:
  - Fitting the logistic regression classifier.
  - Calculating the balanced accuracy score.
  - Displaying the confusion matrix.
  - Printing the imbalanced classification report.
It also includes:
- Oversampling the data using the Naive Random Oversampler and `SMOTE` algorithms.
- Undersampling the data using the Cluster Centroids algorithm.
- Over- and undersampling using the combination `SMOTEENN` algorithm.
For each of the above, I've:
- Trained a logistic regression classifier from `sklearn.linear_model` using the resampled data.
- Calculated the balanced accuracy score from `sklearn.metrics`.
- Displayed the confusion matrix from `sklearn.metrics`.
- Printed the imbalanced classification report from `imblearn.metrics`.
In this section, I trained and compared two different ensemble classifiers to predict loan risk and evaluated each model. I used the Balanced Random Forest Classifier and the Easy Ensemble Classifier.
I began by:
- Reading the data into a DataFrame.
- Splitting the data into training and testing sets.
- Scaling the training and testing data using the `StandardScaler` from `sklearn.preprocessing`.
Then, I completed the following steps for each model:
- Training the model using the quarterly data from LendingClub provided in the `Resource` folder.
- Calculating the balanced accuracy score from `sklearn.metrics`.
- Displaying the confusion matrix from `sklearn.metrics`.
- Generating a classification report using `classification_report_imbalanced` from imbalanced-learn.
- For the balanced random forest classifier only, printing the feature importances sorted in descending order (most important feature to least important), along with each feature's score.