This project investigates the relationships between patient health metrics and the occurrence of stroke through Exploratory Data Analysis (EDA). The goal is to uncover patterns and significant correlations that can inform predictive models and healthcare decisions.
Key steps include:
- Data cleaning and preprocessing to handle missing values and outliers.
- Statistical analysis to identify meaningful relationships between features.
- High-quality visualizations to effectively communicate insights.
This project serves as a foundation for future machine learning applications in healthcare analytics.
## Data Preprocessing
- Missing Value Handling: Strategies for imputing or removing missing data.
- Feature Encoding: Conversion of categorical data using techniques like Label Encoding and Ordinal Encoding.
- Scaling: MinMaxScaler to normalize numerical features to a fixed range, and StandardScaler to standardize them to zero mean and unit variance.
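
The preprocessing steps listed above (missing value handling, encoding, scaling) could look roughly like the sketch below. It is a minimal illustration, not the project's actual pipeline: the toy DataFrame, its column names, the median imputation, and the category order are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 52.0],
    "bmi": [36.6, np.nan, 32.5, 28.1],
    "avg_glucose_level": [228.7, 95.1, 105.9, 171.2],
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "never smoked"],
})

# Missing value handling: fill the numeric gap with the column median
# (one possible strategy; dropping the row is another).
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Feature encoding: map categories to integers using an explicit,
# assumed ordering (never < formerly < smokes).
order = [["never smoked", "formerly smoked", "smokes"]]
encoder = OrdinalEncoder(categories=order)
df["smoking_status_enc"] = encoder.fit_transform(df[["smoking_status"]]).ravel()

# Scaling: MinMaxScaler rescales to [0, 1]; StandardScaler centers on
# zero mean with unit variance.
df["age_minmax"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()
df["glucose_std"] = StandardScaler().fit_transform(df[["avg_glucose_level"]]).ravel()

print(df)
```

If the data is later split for modeling, scalers and encoders are normally fitted on the training portion only and then applied to the rest, to avoid leaking information.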
## Exploratory Data Analysis
- Univariate Analysis: Histograms and boxplots for individual feature distributions.
- Bivariate Analysis: Correlation heatmaps and scatterplots to identify relationships between features.
- Statistical Testing: Chi-squared tests and other hypothesis tests to check whether associations between features are statistically significant.
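
As an illustration of the statistical testing step, the sketch below runs a chi-squared test of independence between two categorical features with `scipy.stats.chi2_contingency`. The data is made up for the example and the column names are assumptions; a sample this small is far too small for real inference.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 12 patients with binary hypertension and stroke flags.
df = pd.DataFrame({
    "hypertension": [0] * 6 + [1] * 6,
    "stroke": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Build the contingency table of observed counts.
table = pd.crosstab(df["hypertension"], df["stroke"])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected counts under independence. For 2x2 tables it
# applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value suggests the two features are not independent. It says nothing about the direction or strength of the association, so it is usually read alongside the contingency table itself.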
## Data Visualization
- Heatmaps for correlation analysis.
- Pair plots and scatterplots to visualize trends.
- Customized plots with Matplotlib and Seaborn for better clarity.
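
A correlation heatmap of the kind mentioned above can be sketched with Seaborn as follows. The data is synthetic and the column names, colormap, and output filename are assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: glucose is constructed to correlate positively with age.
rng = np.random.default_rng(42)
age = rng.uniform(20, 85, size=100)
df = pd.DataFrame({
    "age": age,
    "bmi": rng.normal(28, 5, size=100),
    "avg_glucose_level": 80 + 0.8 * age + rng.normal(0, 15, size=100),
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the color scale to [-1, 1] so colors are comparable across plots.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()
fig.savefig("correlation_heatmap.png", dpi=150)  # hypothetical output path
```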
## Key Results
- Processed and analyzed health-related features like BMI, glucose levels, and hypertension.
- Performed statistical testing to confirm significant relationships between features.
- Generated insights that can inform machine learning models for predicting stroke risk.
## Technologies Used
- Python: Core programming language for analysis.
- pandas/NumPy: Data manipulation and preprocessing.
- Matplotlib/Seaborn: Visualization libraries for data insights.
- SciPy/Scikit-learn: Statistical testing and feature scaling.
## Future Work
- Extend the project to include predictive modeling with machine learning algorithms.
- Explore advanced visualization techniques to communicate findings better.
- Implement additional statistical methods to validate results.
## Visualizations
This project uses several visualizations, including:
- Correlation Heatmaps: Identify relationships between features.
- Histograms and Boxplots: Examine feature distributions.
- Scatterplots and Pair Plots: Visualize trends and feature interactions.
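
For the distribution plots, a histogram and a boxplot of a single feature can be drawn side by side with Matplotlib. This is only a sketch: the BMI values are randomly generated, and the bin count and output filename are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic BMI values; not taken from the project's dataset.
rng = np.random.default_rng(0)
bmi = rng.normal(loc=28.0, scale=6.0, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the overall shape of the distribution.
ax_hist.hist(bmi, bins=20, edgecolor="black")
ax_hist.set_title("BMI histogram (synthetic)")
ax_hist.set_xlabel("BMI")
ax_hist.set_ylabel("Count")

# Boxplot: highlights the median, the quartiles, and potential outliers.
ax_box.boxplot(bmi)
ax_box.set_title("BMI boxplot (synthetic)")
ax_box.set_ylabel("BMI")

fig.tight_layout()
fig.savefig("bmi_distribution.png", dpi=150)  # hypothetical output path
```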
## Installation
Ensure you have Python installed along with the required libraries. You can install dependencies with:
pip install pandas numpy matplotlib seaborn scikit-learn scipy
## Clone the Repository
git clone https://github.com/Danial-Ghofrani/Stroke_Exploratory_Data_Analysis.git