Authors: Mykyta Alekseiev, Elizaveta Barysheva, Joao Melo, Thomas Schneider, Harshit Shangari and Maria Stoelben
The goal of this project is to predict a binary variable using white and black box models. Subsequently, the performance and fairness of the models with respect to certain protected features will be analysed. The protected attributes that will be focused on here are gender and race. Moreover, the models' predictions will be analysed with methods for interpretability.
For this project a dataset of traffic violations in Maryland, USA, was selected. You can download the data here. The `.arff` file should be placed in a `data/` folder in the root of your repository.
The processed data contains 65,203 instances with 15 columns, of which 5 are categorical and the rest binary or numeric. The target column is `Citation`, which equals 1 when the officer issued a citation and 0 when only a warning was given.
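As a quick sanity check on the target, one can inspect the class balance of `Citation`. The sketch below uses a tiny hypothetical stand-in frame (the real processed data has 65,203 rows); the column values shown are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the processed data.csv; in practice you would
# read the real file, e.g. pd.read_csv("data/data.csv").
df = pd.DataFrame({
    "Citation": [1, 0, 0, 1, 0],   # 1 = citation issued, 0 = warning only
    "Gender":   ["M", "F", "F", "M", "F"],
})

# Share of stops that ended in a citation (class balance of the target)
citation_rate = df["Citation"].mean()
print(f"Citation rate: {citation_rate:.2f}")
```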
Create a virtual environment and install the requirements:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
pre-commit install
```
Check out the Jupyter notebooks to understand the data and the preprocessing decisions.
To run the data preprocessing and generate the `data.csv` output used in the following parts, run:

```bash
python -m spacy download en_core_web_sm
python src/data_preprocessing/data_preprocessor.py
```
The parameters can be changed in `config/config_modeling.py`. By default, the data is separated into 60% training, 20% validation, and 20% test.
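The default 60/20/20 split can be sketched with two consecutive `train_test_split` calls from scikit-learn. The toy arrays below are stand-ins for the real features and target, and the `random_state` values are illustrative, not taken from the project config:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary target standing in for the processed data
X = np.arange(100).reshape(-1, 1)
y = np.random.RandomState(0).randint(0, 2, size=100)

# First hold out 40% of the data, then halve that remainder into
# validation and test sets (20% of the total each).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```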
Run the training with MLflow tracking using the following command:

```bash
python src/modeling/main.py
```
Model selection was performed on the validation data. The results for the white box and black box models are displayed below.
Model | Train AUC | Val AUC | Test AUC | Test Accuracy | Test F1 Score |
---|---|---|---|---|---|
XGB | 0.898 | 0.866 | 0.860 | 0.778 | 0.748 |
Random Forest | 0.873 | 0.849 | 0.843 | 0.764 | 0.728 |
Decision Tree | 0.825 | 0.818 | 0.818 | 0.742 | 0.703 |
GAM | 0.805 | 0.814 | 0.805 | 0.730 | 0.705 |
Logistic Regression | 0.645 | 0.652 | 0.641 | 0.600 | 0.559 |
ANN | 0.641 | 0.649 | 0.637 | 0.537 | 0.097 |
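The metrics reported above (AUC, accuracy, F1 score) can all be computed with scikit-learn. A minimal sketch on hypothetical labels and predicted probabilities, not the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Hypothetical ground truth and predicted citation probabilities
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6])

# AUC is threshold-free and uses the probabilities directly;
# accuracy and F1 need hard labels, here thresholded at 0.5.
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"AUC={auc:.3f} Acc={acc:.3f} F1={f1:.3f}")
```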
If you are interested in our conclusions on how the model works and whether it is fair with respect to the protected attributes, check the explanation and fairness subfolders within the notebooks folder, respectively.
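One common fairness check for protected attributes such as gender is demographic parity: comparing positive-prediction rates across groups. The sketch below is a generic illustration on made-up predictions, not the project's actual fairness methodology:

```python
import pandas as pd

# Hypothetical model predictions alongside a protected attribute
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "F", "M"],
    "pred":   [1,   0,   1,   0,   0,   1],   # 1 = predicted citation
})

# Positive-prediction rate per group; a large gap suggests the model
# flags one group for citations more often than the other.
rates = df.groupby("gender")["pred"].mean()
parity_gap = abs(rates["M"] - rates["F"])
print(rates.to_dict(), f"gap={parity_gap:.3f}")
```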