Machine Learning - Risky Business

Background

Mortgages, student and auto loans, and debt consolidation are just a few examples of credit and loans that people seek online. Peer-to-peer lending services such as Loans Canada and Mogo let investors loan people money without using a bank. However, because investors always want to mitigate risk, a client has asked me to help them predict credit risk with machine learning techniques.

In this assignment, I built and evaluated several machine learning models to predict credit risk using data you'd typically see from peer-to-peer lending services. Credit risk is an inherently imbalanced classification problem (the number of good loans is much larger than the number of at-risk loans), so I employed different techniques for training and evaluating models with imbalanced classes. I used the imbalanced-learn and scikit-learn libraries to build and evaluate models using the following two techniques:

Resampling
Ensemble Learning
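To illustrate why imbalanced classes call for a different evaluation metric, here is a minimal sketch with made-up labels (95 good loans and 5 at-risk loans, not taken from the LendingClub data). A hypothetical model that labels every loan as good looks strong on plain accuracy but scores no better than chance on balanced accuracy, which averages recall over both classes.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels for illustration only: 0 = good loan, 1 = at-risk loan.
y_true = [0] * 95 + [1] * 5

# A hypothetical model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"Accuracy:          {acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
```

Balanced accuracy is the mean of the per-class recall values, so missing every at-risk loan pulls it down even though at-risk loans are rare.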

Resampling

In this Jupyter Notebook, I used the imbalanced-learn library to resample the LendingClub data, then built and evaluated logistic regression classifiers on the resampled data.

The notebook consists of:

Reading the CSV into a DataFrame.
Splitting the data into training and testing sets.
Scaling the training and testing data using the StandardScaler from sklearn.preprocessing.
Using the provided code to run a simple logistic regression:
- Fitting the logistic regression classifier.
- Calculating the balanced accuracy score.
- Displaying the confusion matrix.
- Printing the imbalanced classification report.

It also includes:

Oversampling the data using the Naive Random Oversampler and SMOTE algorithms.
Undersampling the data using the Cluster Centroids algorithm.
Over- and undersampling using the combined SMOTEENN algorithm.

For each of the above, I've:

Trained a logistic regression classifier from sklearn.linear_model using the resampled data.
Calculated the balanced accuracy score from sklearn.metrics.
Displayed the confusion matrix from sklearn.metrics.
Printed the imbalanced classification report from imblearn.metrics.

Ensemble Learning

In this section, I trained and compared two ensemble classifiers to predict loan risk, and evaluated each model. I used the Balanced Random Forest Classifier and the Easy Ensemble Classifier.

I began by:

Reading the data into a DataFrame.
Splitting the data into training and testing sets.
Scaling the training and testing data using the StandardScaler from sklearn.preprocessing.

Then, I completed the following steps for each model:

Training the model using the quarterly data from LendingClub provided in the Resources folder.
Calculating the balanced accuracy score from sklearn.metrics.
Displaying the confusion matrix from sklearn.metrics.
Generating a classification report using classification_report_imbalanced from imblearn.metrics.
For the balanced random forest classifier only, I printed the feature importance sorted in descending order (most important feature to least important) along with the feature score.

RawnakMahjabib/ML-risky-business
