Assignment: Data Scientist
Overview: For this dataset, I built the model using the Random Forest Classifier algorithm. I carried out exploratory analysis, preprocessing, and feature selection before building the model. I then chose the Random Forest Classifier because it performed well across all the performance metrics considered: accuracy, true positive rate (TPR), false positive rate (FPR), precision, F1 score, and ROC AUC.
Steps followed in the code
a) Removing unwanted columns
b) Checking for nulls
c) Data type check
d) Class imbalance check: checking the distribution of the Y (target) variable
e) Checking for constant columns (columns with a single unique value) and removing them, using sklearn's VarianceThreshold
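The steps above can be sketched roughly as follows. This is a minimal illustration, not the assignment's actual code: the toy DataFrame, the column names `id`, `X1` to `X3`, and the choice of `id` as the unwanted column are all assumptions made for the example. Only the target name `Y` and the use of `VarianceThreshold` come from the description.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data standing in for the real dataset; the column
# names "id" and "X1".."X3" are assumptions made for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "X1": [0.1, 0.4, 0.3, 0.9, 0.5, 0.7],
    "X2": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # constant column
    "X3": [3, 1, 4, 1, 5, 9],
    "Y": [0, 1, 0, 0, 1, 0],
})

# a) Remove unwanted columns (here, an identifier column).
df = df.drop(columns=["id"])

# b) Count nulls per column.
null_counts = df.isnull().sum()

# c) Inspect the data type of each column.
dtypes = df.dtypes

# d) Class imbalance check: share of each class in the target.
class_share = df["Y"].value_counts(normalize=True)

# e) Drop constant columns: threshold=0.0 removes zero-variance features.
X = df.drop(columns=["Y"])
y = df["Y"]
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
kept_columns = X.columns[selector.get_support()]
X_reduced = X[kept_columns]

print(null_counts.to_dict())
print(dtypes.to_dict())
print(class_share.to_dict())
print(list(kept_columns))
```

Using `get_support()` to index the original columns keeps the result as a DataFrame with its column names, which makes the later feature-importance step easier to read.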
Feature Selection using Random Forest Feature Importance
The top five features are: X52, X55, X21, X16, X56
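One common way to obtain such a ranking is to fit a random forest and sort its impurity-based `feature_importances_`. The sketch below assumes that approach and uses synthetic data. The feature names `X0` to `X7`, the forest settings, and the data-generating rule are hypothetical and are not taken from the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): only X0 and X1 drive the label,
# the remaining six columns are pure noise.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(500, 8)),
                 columns=[f"X{i}" for i in range(8)])
y = (X["X0"] + X["X1"] > 0).astype(int)

# Fit the forest; n_estimators and random_state are illustrative choices.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Pair each importance with its column name and keep the five largest.
importances = pd.Series(forest.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction across the forest.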
Using the Random Forest Classifier, the performance metrics were:
- Good confusion matrix : [[455, 26], [26, 275]]
- High Accuracy Score : 93.35038363171356 %
- High True Positive Rate (TPR) : 0.9136212624584718
- Low False Positive Rate (FPR) : 0.05405405405405406
- High Precision : 0.9136212624584718
- High F1 Score : 0.9136212624584718
- High ROC AUC : 0.9297836042022088
Because of this good performance, the final predictions were made using the Random Forest Classifier.
Library versions used:
- Pandas Version : 1.2.3
- Numpy Version : 1.19.2
- Matplotlib Version : 3.3.2
- Sklearn Version : 0.24.2
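As a check on the performance figures reported above, the metrics can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, with the positive class in the second row). The variable names are illustrative.

```python
# Confusion matrix as reported, assumed to be in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]
confusion = [[455, 26],
             [26, 275]]
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)            # true positive rate (recall)
fpr = fp / (fp + tn)            # false positive rate
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)

# Mean of TPR and TNR; this equals ROC AUC when the score is computed
# from hard class predictions rather than predicted probabilities.
auc_from_labels = (tpr + (1 - fpr)) / 2

for name, value in [("accuracy", accuracy), ("tpr", tpr), ("fpr", fpr),
                    ("precision", precision), ("f1", f1),
                    ("auc_from_labels", auc_from_labels)]:
    print(f"{name}: {value:.6f}")
```

These values agree with the reported figures. TPR, precision, and F1 coincide because the matrix has the same number of false positives and false negatives (26 each). The reported ROC AUC matches the mean of TPR and TNR, which suggests, though the write-up does not say so, that it was computed from predicted labels. Computing it from predicted probabilities (for example from `predict_proba`) would generally give a different value.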