To view the entire project, open the full notebook here.
Objective: Identify clusters of survey respondents in the NHANES dataset.
Data: NHANES 2017-March 2020 Pre-Pandemic Questionnaire Data
Model Type: k-means clustering (unsupervised machine learning)
Tools/Libraries: Pandas, Scikit-Learn, Matplotlib, Seaborn
For this unsupervised machine learning project, I built a k-means model that clusters individuals based on medical and demographic data. NHANES is a good data source for this because it is publicly available and captures many aspects of survey participants' health, including demographics, physical exam and blood test results, dietary observations, and lifestyle questionnaires. The dataset also suits a k-means model because much of the data consists of continuous numeric values, and many of the categorical variables are coded as numbers with an ordinal meaning, so the distances k-means relies on stay reasonably interpretable.
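To make the approach concrete, here is a minimal sketch of the general workflow: standardize the numeric features, fit k-means, and summarize each cluster. It is not the notebook's actual code. The data is synthetic, and the column names (`age_years`, `bmi`, `activity_level`) and the choice of two clusters are assumptions made for illustration, not real NHANES variables or results.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for respondent data. Column names and values are
# hypothetical and are NOT actual NHANES variables.
rng = np.random.default_rng(0)
n = 100
group_a = pd.DataFrame({
    "age_years": rng.normal(30, 3, n),
    "bmi": rng.normal(22, 1.5, n),
    "activity_level": rng.integers(3, 5, n),  # ordinal code, 3 or 4
})
group_b = pd.DataFrame({
    "age_years": rng.normal(65, 3, n),
    "bmi": rng.normal(31, 1.5, n),
    "activity_level": rng.integers(1, 3, n),  # ordinal code, 1 or 2
})
df = pd.concat([group_a, group_b], ignore_index=True)

# k-means uses Euclidean distance, so features are standardized first to
# keep large-scale columns (such as age) from dominating the result.
X = StandardScaler().fit_transform(df)

# Fit k-means; n_clusters=2 is an assumption that matches the synthetic data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# Describe each cluster by its average feature values (original units).
cluster_means = df.groupby("cluster").mean()
print(cluster_means)
```

On real survey data the number of clusters is usually not known in advance, so it would typically be chosen by comparing several values of `n_clusters`, for example with an elbow plot of `kmeans.inertia_`.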