A capstone project of my Data Science course at Codecademy.
The Jupyter notebook is based on Python 3.7 and relies mainly on pandas, numpy, matplotlib and seaborn for data preparation and visualization. From scikit-learn I imported several machine learning models: a neural network classifier, a K-Neighbors classifier, a Random Forest classifier and Support Vector Machines.
The task set by Codecademy was to apply machine learning techniques to predict a variable from the OKCupid dataset. Codecademy's example predicted a user's zodiac sign from their responses to different questions.
My approach was to predict the body type (e.g. average, fit, curvy) based on the user's diet (e.g. vegetarian, everything) and their drinking, smoking and drug-use habits.
As this data was provided by Codecademy during a paid course, I am not able to share it.
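Since the real data cannot be shared, here is a minimal sketch of the approach on a handful of made-up rows. The column names (`diet`, `drinks`, `smokes`, `drugs`, `body_type`), the answer values and all model parameters are assumptions chosen for illustration and are not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Made-up rows for illustration only; this is NOT the OKCupid data.
toy = pd.DataFrame({
    "diet": ["vegetarian", "anything", "anything", "vegan",
             "anything", "vegetarian", "anything", "vegan"],
    "drinks": ["socially", "often", "socially", "not at all",
               "often", "socially", "rarely", "not at all"],
    "smokes": ["no", "yes", "no", "no", "yes", "no", "no", "no"],
    "drugs": ["never", "sometimes", "never", "never",
              "sometimes", "never", "never", "never"],
    "body_type": ["fit", "average", "average", "fit",
                  "curvy", "fit", "average", "fit"],
})

feature_cols = ["diet", "drinks", "smokes", "drugs"]  # assumed column names

# One-hot encode the categorical answers so the models get numeric input.
X = pd.get_dummies(toy[feature_cols]).astype(float)
y = toy["body_type"]

# Hyperparameters here are arbitrary placeholders, not tuned values.
models = {
    "k-neighbors": KNeighborsClassifier(n_neighbors=3),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
    "SVC (linear)": SVC(kernel="linear"),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    # Accuracy on the same toy rows the model was fitted on, so it says
    # nothing about how well the model generalizes.
    scores[name] = model.score(X, y)
    print(f"{name}: {scores[name]:.3f}")
```

With real data, the rows would be split into training and test sets (for example with `train_test_split`) before scoring.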
The maximum score of 0.561 was achieved by the neural network classifier, K-Neighbors and SVC (linear kernel). I used all features for this prediction. The result might be improved by selecting features (e.g. leaving out the smoking habits).
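The feature selection idea could be explored roughly as follows: score one model on every subset of the four features and compare. This is only a sketch on randomly generated placeholder data. The column names, the answer values and the choice of K-Neighbors with 5-fold cross-validation are assumptions, and because the labels are random the scores stay near chance level.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Random placeholder data; this is NOT the OKCupid data.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "diet": rng.choice(["anything", "vegetarian", "vegan"], size=n),
    "drinks": rng.choice(["not at all", "socially", "often"], size=n),
    "smokes": rng.choice(["no", "yes"], size=n),
    "drugs": rng.choice(["never", "sometimes"], size=n),
    "body_type": rng.choice(["average", "fit", "curvy"], size=n),
})

feature_cols = ["diet", "drinks", "smokes", "drugs"]  # assumed column names
y = toy["body_type"]

# Mean cross-validated accuracy for every non-empty feature subset.
results = {}
for k in range(1, len(feature_cols) + 1):
    for subset in combinations(feature_cols, k):
        X = pd.get_dummies(toy[list(subset)]).astype(float)
        model = KNeighborsClassifier(n_neighbors=5)
        results[subset] = cross_val_score(model, X, y, cv=5).mean()

best_subset = max(results, key=results.get)
print(best_subset, round(results[best_subset], 3))
```

An exhaustive search is cheap with four features (15 subsets). With many more columns, a selector such as scikit-learn's `SelectKBest` would scale better.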
- OKCupid data - correlation between users body type and users diet.ipynb: Jupyter notebook with all my code.
- instructions_codecademy.md: The instructions from Codecademy which I found quite useful when first approaching the dataset.
Every idea and contribution is welcome.
Thanks to Codecademy and OKCupid for providing the data. Also thanks to the developers of all those useful libraries like pandas, numpy, matplotlib, seaborn and sklearn.
Maximilian Müller, Business Development Manager in the Renewable Energy sector. Now diving into the field of data analysis.