This project uses computational analytics and machine learning to analyze and predict birthweight, a key indicator of neonatal health. The analysis identifies the factors most strongly correlated with birthweight and builds classification models that flag low-birthweight cases, with the aim of informing public health decisions.
The study employs data-driven methods to address the following:
- Identifying correlations with birthweight, including strong positive and negative relationships.
- Transforming birthweight data to evaluate improvements in correlation metrics.
- Implementing classification models to predict low birthweight and evaluate their performance.
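The correlation and transformation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names (`mother_age`, `prenatal_visits`, `cigarettes_per_day`, `bwght`) are hypothetical, and the log transform is an assumed choice, since the transformation used in the project is not stated here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for birthweight.csv; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "mother_age": rng.integers(18, 45, size=n),
    "prenatal_visits": rng.integers(0, 20, size=n),
    "cigarettes_per_day": rng.integers(0, 25, size=n),
})

# Birthweight in grams, constructed to rise with prenatal visits and
# fall with smoking; clipped so the log below is always defined.
df["bwght"] = (
    3000
    + 40 * df["prenatal_visits"]
    - 30 * df["cigarettes_per_day"]
    + rng.normal(0, 300, size=n)
).clip(lower=500)

# Assumed transformation: natural log of the target.
df["log_bwght"] = np.log(df["bwght"])

# Pearson correlation of each feature with the raw and transformed target.
features = ["mother_age", "prenatal_visits", "cigarettes_per_day"]
corr = pd.DataFrame({
    "raw": df[features].corrwith(df["bwght"]),
    "log": df[features].corrwith(df["log_bwght"]),
}).round(3)

print(corr.sort_values("raw", ascending=False))
```

Comparing the `raw` and `log` columns side by side shows whether the transformation strengthens or weakens each relationship.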
Key deliverables include exploratory data analysis (EDA), feature engineering, multiple model implementations, and the final predictions submitted to Kaggle.
- Exploratory Data Analysis (EDA): Insights into the dataset through descriptive statistics, histograms, and correlation matrices.
- Feature Engineering: Designed and transformed features to enhance predictive power and address data challenges.
- Modeling Techniques: Developed and evaluated multiple classification models, including Logistic Regression, Ridge Classifier, and Random Forest.
- Confusion Matrix Analysis: Analyzed model performance and errors to prioritize correct predictions for low birthweight cases.
- Actionable Insights: Provided recommendations based on model results to inform public health strategies.
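As a rough sketch of the modeling and confusion matrix steps listed above, the example below fits a logistic regression on synthetic data and reads the four confusion matrix cells. The features, the label-generating process, and the use of `class_weight="balanced"` to favor catching low-birthweight cases are all assumptions made for illustration; the notebook's actual settings may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic features and a hypothetical binary label (1 = low birthweight).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
logits = -1.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratify so both splits keep the same share of low-birthweight cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights the minority class, which tends to
# reduce missed low-birthweight cases at the cost of more false alarms.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
recall = recall_score(y_test, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Recall for low birthweight: {recall:.3f}")
```

Recall on the positive class is the share of true low-birthweight cases the model catches, which is the quantity to watch when missed cases matter more than false alarms.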
|-- Compagnone_Stefano_A2.ipynb # Jupyter Notebook containing analysis and model development
|-- Compagnone_Stefano_A2.html # HTML version of the notebook for easy viewing
|-- birthweight.csv # Dataset used for training and analysis
|-- submission.csv # Final predictions submitted to Kaggle
|-- Images/
|   |-- correlation_matrix.png # Heatmap of feature correlations
|   |-- confusion_matrix.png # Confusion matrix of the final model
|   |-- feature_histograms.png # Histogram visualizations for continuous variables
- Correlation Analysis: Identified strong correlations between birthweight and factors such as parental age, education level, and health-related behaviors (e.g., smoking and drinking).
- Threshold Analysis: Explored birthweight thresholds that separate healthy from unhealthy categories, supported by public health research.
- Model Interpretability: Highlighted impactful features, such as maternal education and prenatal visits, to provide actionable insights for healthcare professionals.
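The threshold step can be illustrated by turning a continuous birthweight into a binary label. The sketch below assumes the World Health Organization cutoff of 2,500 g for low birthweight; whether the notebook uses this exact threshold is not stated here, and the `bwght` name and sample values are hypothetical.

```python
import pandas as pd

# Assumed cutoff: WHO defines low birthweight as under 2,500 grams.
LOW_BWGHT_GRAMS = 2500

# Hypothetical birthweights in grams.
weights = pd.Series([1900, 2499, 2500, 3200, 4100], name="bwght")

# 1 = low birthweight, 0 = otherwise; exactly 2,500 g is not low.
low_bwght = (weights < LOW_BWGHT_GRAMS).astype(int)

print(low_bwght.tolist())  # [1, 1, 0, 0, 0]
```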
- Programming Language: Python
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, PHIK
- Models: Logistic Regression, Ridge Classifier, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machine (GBM)
For inquiries or further collaboration, please contact:
- Stefano Compagnone
- Email: stefanocompagnone98@gmail.com