by Lennart Wallentin, lennartwallentin@gmail.com
In 'passenger_satisfaction_stacking_anova_lennart_wallentin.ipynb' I classify if a person flying with an unnamed north american airline company is satisfied or neutral/dissatisfied with their flight, but the classifying per se is not the end goal with this project instead the main objectives of this project are:
- Is a Stacked Generalization (Stacking) model’s AUC score higher than two standalone (Logistic regression and XGBoost) machine learning model’s AUC scores?
- Compare the three models through one-way analysis of variance (ANOVA) tests to see if they are different from each other.
In addition to that is my focus with this project to show that I have good knowledge regarding statistics, machine learning and business knowledge. I demonstrate that by explaining and using:
-
Statistics - More advanced statistical topics as chi-square and construct a Cramer’s V tests for association between categorical features. GLM, the logistic function and odds ratio in conjunction with logistic regression. And as already mentioned one-way analysis of variance (ANOVA) tests. In addition basic statistics such as boxplots, Pearson correlation coefficient, histograms, probabilities etc is also part of this project.
-
Machine learning - Stacking using base and meta learners as well as Logistic regression and XGboost models and how they fit the data in regard to bias and variance. I cross-validate and do feature selection on all three models, and also hyperparameter tuning with Bayesian optimization on the standalone XGBoost and the XGBoost stacked base and meta learners.
A range of evaluation metrics are displayed with each model and since AUC is my main evaluation metric for this project, I construct graphs for each model that show the different cut off thresholds for the predicted probabilities in regard to True Positive Rate/Sensitivity and False Positive Rate. -
Business knowledge - For example noticed that some of the survey questions are wrongly labeled as 0 instead of NA-value and fixed that. As for outliers I use business sense in regard to flight distance and departure delay outliers. Working in an organization I would of course talk to stakeholders to get even more information regarding any questions about the data.
The xlsx-file, ‘bayesian_optimization_iterations.xlsx’ contains all the Bayesian optimization iterations.
The dataset that is used for this project is in the csv-file, 'data_passenger_satisfaction_stacking_anova_lennart_wallentin.csv'