Side Project: Synthetic Fraud Detection

Overview

The goal of this project is to identify fraudulent transactions.

Because the dataset is highly imbalanced, it mimics the problems we face with real-world data.

The best-performing model so far on this task is XGBoost.

I apply a log transformation to the features to give the model better inputs.
After removing highly correlated features, the model performs better.
Different decision thresholds yield significantly different precision, recall, and F1 scores, even across similar models whose hyperparameters differ only slightly. In practice, the threshold has to be chosen according to the goals of the task.
Slightly different XGBoost hyperparameter settings (n_estimators, learning_rate, and max_depth) can give quite different results, which is well worth studying further.
A random forest model may give better results, since it performs similarly to XGBoost when fitting the training set; however, the memory limit on my computer kept me from trying it.
Imbalanced data is still hard to handle and calls for more experimentation beyond undersampling.
Another idea: build models on multiple subsets, each combining a different sample of negatives from the training set with the same positives, and then combine the predictions of all the models into a final decision. This could perform better, since the models collectively see more of the negative data and can learn more of its patterns. (2017/5/8)
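
The log transformation mentioned above can be illustrated with a minimal sketch. The transaction amounts are made up for illustration, and log(1 + x) is an assumed choice of transform (it handles zeros); the project may have used a different variant.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x) to each non-negative value."""
    return [math.log1p(v) for v in values]


def spread_ratio(values):
    """Ratio of the maximum to the median; a rough indicator of right skew."""
    return max(values) / statistics.median(values)


# Hypothetical transaction amounts: mostly small, with a few very large ones.
amounts = [12.0, 35.5, 8.2, 60.0, 25.0, 15000.0, 42.0, 980000.0]
log_amounts = log_transform(amounts)

print(f"max/median before: {spread_ratio(amounts):.1f}")
print(f"max/median after:  {spread_ratio(log_amounts):.1f}")
```

The transform compresses the long right tail, so the few very large amounts no longer dominate the feature's range.
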
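
Removing highly correlated features, as described above, could look roughly like the following. This is a dependency-free sketch, not the project's actual code: the column names, the data, and the 0.95 cutoff are all assumptions. With pandas, `DataFrame.corr` would provide the correlation matrix directly.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as 0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep a column only if |r| with every kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical feature columns; "amount_cents" is an exact multiple of "amount".
columns = {
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    "amount_cents": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "hour": [3.0, 14.0, 9.0, 22.0, 1.0],
}
selected = drop_correlated(columns)
print(list(selected))
```

The greedy pass keeps the first column of each correlated group, so the column order decides which of two redundant features survives.
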
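
The effect of the decision threshold on precision, recall, and F1 can be seen in a small sketch. The labels and scores below are invented for illustration; in the project they would come from the model's predicted probabilities.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 when predicting 1 for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels (1 = fraud) and model scores for ten transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.70, 0.40, 0.65, 0.90]

for threshold in (0.3, 0.5, 0.8):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

A low threshold catches every fraud case at the cost of many false alarms, while a high one does the opposite, which is why the choice depends on what the task values more.
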
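
The multiple-subset idea above can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic and one-dimensional, the "model" is a toy midpoint rule standing in for a real classifier such as XGBoost, each subset draws its negatives by random sampling (so subsets may overlap), and the final decision is a simple majority vote.

```python
import random


def train_threshold_model(positives, negatives):
    """Toy stand-in for a real classifier: split at the midpoint of the class means.

    Assumes positives tend to have larger feature values than negatives.
    """
    pos_mean = sum(positives) / len(positives)
    neg_mean = sum(negatives) / len(negatives)
    cutoff = (pos_mean + neg_mean) / 2
    return lambda x: 1 if x >= cutoff else 0


def train_undersampled_ensemble(positives, negatives, n_models=5, seed=0):
    """Train one model per subset: all positives plus a fresh sample of negatives."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = rng.sample(negatives, k=len(positives))  # balanced subset
        models.append(train_threshold_model(positives, subset))
    return models


def predict_majority(models, x):
    """Final decision: 1 if more than half of the models vote 1."""
    votes = sum(model(x) for model in models)
    return 1 if votes * 2 > len(models) else 0


# Hypothetical imbalanced data: 1000 negatives near 0, 20 positives near 4.
data_rng = random.Random(42)
negatives = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
positives = [data_rng.gauss(4.0, 1.0) for _ in range(20)]

models = train_undersampled_ensemble(positives, negatives)
print(predict_majority(models, 5.0), predict_majority(models, -1.0))
```

Each model sees a balanced training set, while the ensemble as a whole covers more of the negative class than a single undersampled model would.
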
Further work:
- More feature engineering
- More hyperparameter tuning
- Try more ways to deal with imbalanced data
- Try ensemble techniques such as stacking

Link to the work here