In this project, I work with the Wine Quality Data Set from the UCI Machine Learning Repository. This notebook documents my approach to finding the best way to predict wine quality from this dataset.
For this project, I took inspiration from other people's work on the Wine Quality dataset on Kaggle. Please note that the Kaggle dataset differs slightly from the one in the UCI repository. I hope anyone looking at this finds some value in my work. :)
Citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
- Preparing Data: We read the dataset, add a column for wine type, and scale the features.
- EDA: We check for null values, inspect the details of all attributes, and draw basic visualisations to get insights from the dataset.
- Solving Class Imbalance: We try to address the class imbalance in the dataset using class weights, oversampling, and aggregation of classes.
- Spot-Checking Algorithms: We check which algorithm suits our dataset best by cross-validating various classification algorithms. For this, we use a Spot-Check framework.
- Hyperparameter Tuning: From the results of Spot-Checking, we pick the three best algorithms, compare them with a Deep Learning model, and go ahead with the best-performing model. We then tune the hyperparameters of that model and analyse the results.
- Conclusion: We conclude which model(s) would be useful and why, and mention further work to be done.
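
To make the class imbalance step in the outline above more concrete, here is a minimal sketch of two of the ideas it names: aggregating classes and computing class weights. It is an illustration under stated assumptions, not the exact code used later in the notebook. The cut-offs in `aggregate_quality`, the helper names, and the toy `scores` list are all hypothetical. The weights follow the common "balanced" heuristic, `n_samples / (n_classes * class_count)`.

```python
from collections import Counter


def aggregate_quality(score):
    """Map a raw quality score to a coarser label.

    The cut-offs here are hypothetical, chosen only for illustration.
    """
    if score <= 5:
        return "low"
    if score == 6:
        return "medium"
    return "high"


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses the "balanced" heuristic: n_samples / (n_classes * class_count),
    so rarer classes receive larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy quality scores, made up for illustration (not taken from the dataset).
scores = [3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 8, 9]

labels = [aggregate_quality(s) for s in scores]
weights = balanced_class_weights(labels)

for label in ("low", "medium", "high"):
    print(label, labels.count(label), round(weights[label], 3))
```

With weights like these, each class contributes equally in total: a class's weight multiplied by its count is the same for every class. A dictionary of this shape is what many classifiers accept as a class-weight argument, though the exact parameter depends on the library.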