As a data science intern at Home Credit, I was assigned to perform a credit risk analysis and build a credit scoring model. My objective is to build a prediction model that classifies whether a client will have payment difficulties. I used logistic regression to predict credit risk and applied Weight of Evidence (WoE) and Information Value (IV) for feature selection. The metrics used in this project are the ROC-AUC score and the KS statistic. My goals here are:
- build a prediction model with a good ROC-AUC score (>0.7) and a good KS statistic (>0.3)
- build a credit score (scorecard) for each borrower and a threshold recommendation list
- This dataset contains client application records for credit loans
- The training dataset consists of 121 features and 0.37 million records; the testing dataset consists of 120 features and 38k records
- 50 features contained >20% missing values; we dropped them and imputed the missing values in the remaining features
- The target feature indicates the client's payment difficulties; 1 stands for a client with payment difficulties
- The target feature is highly imbalanced (91:9); we handled this with the SMOTE technique
- We used only the training application dataset for modeling; the additional datasets are used only for gaining insights
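To illustrate the SMOTE step mentioned above, here is a minimal sketch of the core idea: each synthetic minority sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. The write-up does not say which implementation was used, so the function name `smote`, the parameter `k`, and the toy data are all assumptions made for illustration. It is written in plain Python so that it runs without extra packages.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (simplified illustration, not a library implementation)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Made-up toy data: 91 majority rows and 9 minority rows (a 91:9 ratio).
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(91)]
minority = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(9)]

# Generate enough synthetic rows to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Oversampling of this kind is normally applied to the training split only, so that the testing data keeps the original class ratio.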
- Data preprocessing: dropped unused features, missing values imputation, datetime feature engineering, and feature encoding
- Feature selection: using Weight of Evidence (WoE) and Information Value (IV)
- Feature encoding and feature binning
- Split for training and testing
- Handling imbalanced target feature
- We ended up with 79 features ready for training the machine learning model
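The Weight of Evidence and Information Value step can be sketched as follows. For each bin of a feature, WoE is the log of the ratio between that bin's share of "good" clients (target 0) and its share of "bad" clients (target 1), and IV sums the share differences weighted by WoE. This is a sketch under assumptions: the sign convention (good over bad), the function name `woe_iv`, the toy data, and the 0.02 cut-off (a commonly quoted rule of thumb) are not taken from the project.

```python
import math
from collections import Counter

def woe_iv(values, targets):
    """Return ({bin: WoE}, IV) for one binned or categorical feature.

    targets: 1 = payment difficulties ("bad"), 0 = no difficulties ("good").
    Assumes every bin contains at least one good and one bad record.
    """
    good = Counter(v for v, t in zip(values, targets) if t == 0)
    bad = Counter(v for v, t in zip(values, targets) if t == 1)
    total_good, total_bad = sum(good.values()), sum(bad.values())

    woe, iv = {}, 0.0
    for bin_ in sorted(set(values)):
        share_good = good[bin_] / total_good  # bin's share of all goods
        share_bad = bad[bin_] / total_bad     # bin's share of all bads
        woe[bin_] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[bin_]
    return woe, iv

# Made-up toy feature: 100 "M" clients (20 bad) and 100 "F" clients (10 bad).
values = ["M"] * 100 + ["F"] * 100
targets = [1] * 20 + [0] * 80 + [1] * 10 + [0] * 90

woe, iv = woe_iv(values, targets)
print({k: round(v, 3) for k, v in woe.items()}, round(iv, 3))

IV_CUTOFF = 0.02  # assumed rule-of-thumb threshold, not the project's value
keep_feature = iv >= IV_CUTOFF
```

With this convention, a negative WoE marks a bin with a larger share of bad clients than good clients, and features whose IV falls below the chosen cut-off are candidates for removal.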
We used logistic regression to predict whether a borrower will be a good borrower or not, with the AUC score and the KS statistic as our evaluation metrics. After performing hyperparameter tuning and feature selection, the results are:
- AUC training score: 0.72
- AUC testing score: 0.72
- KS statistic: 0.33
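For readers unfamiliar with the two metrics, the sketch below computes them from scratch. ROC-AUC is the fraction of (bad, good) pairs in which the bad client receives the higher predicted risk, with ties counted as half. The KS statistic is the largest gap between the cumulative score distributions of the two classes. The function names and the toy labels and scores are assumptions for illustration, not project outputs.

```python
def split_scores(y_true, scores):
    """Separate predicted scores by class (1 = payment difficulties)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return pos, neg

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos, neg = split_scores(y_true, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Maximum distance between the cumulative score distributions
    of the negative and positive classes."""
    pos, neg = split_scores(y_true, scores)
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Made-up labels and predicted probabilities of payment difficulties.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.8]

auc = roc_auc(y_true, scores)
ks = ks_statistic(y_true, scores)
meets_goals = auc > 0.7 and ks > 0.3  # targets stated in the project goals
print(round(auc, 3), round(ks, 3), meets_goals)
```

The pairwise AUC loop is quadratic in the number of records, so it is only suitable for small illustrations like this one.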
- The odds of payment difficulties for male borrowers are 1.003 times those of female borrowers
- Borrowers who live in an area with population relative 4 have about the same odds of payment difficulties (1 times) as borrowers in other areas
- The odds of payment difficulties for borrowers who have a car are 0.95 times those of borrowers with no car
- The odds of payment difficulties for borrowers with no children are 0.93 times those of the other CNT_FAM_MEMBER groups
- The odds of payment difficulties for borrowers with 4 or more children are 0.92 times those of the other CNT_FAM_MEMBER groups
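In logistic regression, an odds ratio is the exponential of a coefficient, so each figure in the list above corresponds to a coefficient of `ln(odds ratio)`. The sketch below converts the reported ratios back into implied coefficients and into a percentage change in the odds. It assumes each predictor is a 0/1 indicator, which the write-up does not confirm. The helper name `describe` and the shortened labels are hypothetical.

```python
import math

def describe(odds_ratio):
    """Return (implied coefficient, % change in odds, direction)."""
    coefficient = math.log(odds_ratio)      # beta such that exp(beta) = ratio
    pct_change = (odds_ratio - 1) * 100     # change in odds relative to baseline
    if odds_ratio > 1:
        direction = "higher"
    elif odds_ratio < 1:
        direction = "lower"
    else:
        direction = "unchanged"
    return coefficient, pct_change, direction

# Odds ratios as reported in the list above (labels shortened here).
reported = {
    "male vs female": 1.003,
    "population relative 4 vs other areas": 1.0,
    "has a car vs no car": 0.95,
    "no children vs other groups": 0.93,
    "4 or more children vs other groups": 0.92,
}

for label, ratio in reported.items():
    beta, pct, direction = describe(ratio)
    print(f"{label}: beta = {beta:+.4f}, odds {abs(pct):.1f}% {direction}")
```

A ratio above 1 means higher odds than the comparison group, a ratio below 1 means lower odds, and a ratio of exactly 1 means no difference.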
full credit score file: Download
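One common way to turn predicted probabilities into scorecard points is log-odds scaling: the score is a linear function of the log odds of being a good borrower, calibrated so that a chosen base score corresponds to chosen base odds and a fixed number of points doubles the odds. The write-up does not state the scaling it used, so the base score of 600, base odds of 50:1, 20 points to double the odds, and the cut-off of 580 below are purely hypothetical values for illustration.

```python
import math

def make_scorer(base_score=600, base_odds=50, pdo=20):
    """Build a function mapping P(payment difficulties) to a credit score.

    base_score, base_odds and pdo (points to double the odds) are
    hypothetical scaling choices, not values from the project.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)

    def score(p_bad):
        odds_good = (1 - p_bad) / p_bad  # odds of NOT having difficulties
        return offset + factor * math.log(odds_good)

    return score

score = make_scorer()

CUTOFF = 580  # hypothetical approval threshold
for p_bad in (0.01, 0.05, 0.20):
    s = score(p_bad)
    decision = "approve" if s >= CUTOFF else "review"
    print(f"p_bad = {p_bad:.2f} -> score {s:.0f} ({decision})")
```

A threshold recommendation list can then be built by evaluating several candidate cut-offs and comparing the approval rate and the share of clients with payment difficulties at each one.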
- Gender and car ownership are among the top 3 most important features; demographically, male clients with car ownership have a higher probability of being clients with payment difficulties
- We suggest that the marketing department target female clients and/or male prospective clients with no car ownership, and personalize campaigns using the demographic information in the following recommendations
- Having no children and having >4 children are among the top 6 most important features
- We recommend that the marketing department focus on clients with fewer children and be cautious with clients who have either no children or too many children
- It makes sense that families with no dependents and families with too many dependents are more likely to have payment difficulties
- Goods price amounts for loans of <300k and >900k are among the top 10 most important features
- We recommend that the marketing department focus on clients whose goods price amount is between 300k and 900k, because this range has a lower probability of payment difficulties
- The company should be cautious with loans whose goods price amount is very low and/or very high