Stroke Prediction Machine Learning Model

Project Description

Stroke is a serious medical condition that occurs when the blood supply to part of the brain is interrupted or reduced, leading to brain damage and potential long-term disability or death. The risk of stroke is affected by a wide range of factors, including age, gender, hypertension, heart disease, obesity, and smoking.

Detecting a stroke in its early stages brings numerous advantages, including timely medical intervention, reduced brain damage, prevention of long-term disabilities, identification of underlying causes, and the facilitation of swift medical decision-making to optimize patient outcomes.

For this project, our objective is to create a machine learning model for early stroke detection that draws on a range of known risk factors. The model is intended to help clinical teams assess a patient's risk of stroke early enough to act on it, so that the benefits of early detection described above can be realized in practice.

Data Overview

The stroke dataset comprises a compilation of patients' medical records. It encompasses a wide range of information, including patient demographics, medical history, lifestyle factors, and the presence or absence of a stroke for each patient.

Here is a snippet of the dataset:

Built With

Python and packages (e.g. scikit-learn, matplotlib, seaborn, pickle, etc.)
Flask
HTML
CSS

Getting Started

Prerequisites

Make sure you have all of the following prerequisites installed on your development machine:

Git
Python, with a virtual environment set up
Flask, installed globally with pip install flask
Your favorite code editor (e.g. VS Code, etc.)
Your favorite browser (e.g. Google Chrome, etc.)

Installation

Clone this repo into a local directory. To clone with the URL, run the following command in a terminal:

git clone https://github.com/yeyanwang/stroke_classifier.git

Start the Flask app by running the following command in a terminal, from the repository root:

python app.py

Visit localhost:5000 in your browser and enjoy!

Machine Learning Pipeline

Data Collection

The data was collected from Kaggle.
Model Selection

The Random Forest Classifier was selected for this problem because ensembles of decision trees tend to achieve high accuracy on tabular classification tasks and are less prone to overfitting than a single tree. It is also a popular choice in the healthcare and medical industry, where precise and reliable predictions are crucial.
Exploratory Data Analysis (EDA)

The snippets below show some of the notable observations from our analysis:
- Data distribution, examined with histograms
- A closer look at the target variable
- Missing values in the 'bmi' column
  
  Filled in the missing values with imputed values
- A singleton record in the 'gender' column
  
  Dropped the singleton record
- Uneven distribution of 'bmi' values
  
  Binned the 'bmi' values (Model 3 only)
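
The three cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's actual code: the sample records, the "Other" gender value, the use of the median as the imputed value, and the bin edges (the common underweight/normal/overweight/obese cut-offs) are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import median

# Hypothetical records; None marks a missing 'bmi' value.
records = [
    {"gender": "Male", "bmi": 22.5},
    {"gender": "Female", "bmi": None},
    {"gender": "Female", "bmi": 31.0},
    {"gender": "Other", "bmi": 27.4},
    {"gender": "Male", "bmi": None},
    {"gender": "Female", "bmi": 18.2},
    {"gender": "Male", "bmi": 41.3},
]


def impute_bmi(rows):
    """Fill missing 'bmi' values with the median of the observed ones (assumed strategy)."""
    fill = median(r["bmi"] for r in rows if r["bmi"] is not None)
    return [{**r, "bmi": fill if r["bmi"] is None else r["bmi"]} for r in rows]


def drop_singletons(rows, column):
    """Drop rows whose value in `column` occurs only once in the data."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] > 1]


def bin_bmi(bmi):
    """Map a numeric BMI to a category (assumed bin edges, upper bound exclusive)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


cleaned = drop_singletons(impute_bmi(records), "gender")
cleaned = [{**r, "bmi_bin": bin_bmi(r["bmi"])} for r in cleaned]
```

With pandas, the same steps would typically be written with fillna, boolean filtering, and cut; the plain-Python version is used here only to keep the sketch dependency-free.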
Data Preprocessing

The data was cleaned and prepared for modeling. This included:
- Applied oversampling methods to handle imbalanced data
- Applied encoding to categorical variables
- Applied feature scaling to transform numerical features into a consistent range
- Divided data into training and testing sets using scikit-learn's train_test_split function
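
To make the encoding and scaling steps above concrete, here is a minimal dependency-free sketch of what one-hot encoding and min-max scaling compute. It is an assumption that these particular variants were used; the project may rely on different scikit-learn transformers. The column values below are hypothetical.

```python
def one_hot(values):
    """Return (sorted categories, one-hot rows) for a list of categorical values."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def min_max_scale(values):
    """Rescale numeric values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical categorical and numerical columns.
smoking = ["never smoked", "smokes", "formerly smoked", "never smoked"]
ages = [20, 80, 50, 35]

categories, encoded = one_hot(smoking)
scaled_ages = min_max_scale(ages)
```

In practice the scaling parameters (here the minimum and maximum) should be computed on the training set only and then reused for the test set, so that no information from the test data leaks into training.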
Model Training

Trained 3 models using different resampling methods:
- Model 1 used oversampled data with the RandomOverSampler technique
- Model 2 used oversampled data with the SMOTE technique
- Model 3 used oversampled data with RandomOverSampler and applied additional binning to the 'bmi' column
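
For readers unfamiliar with random oversampling, the sketch below shows the idea behind it in plain Python: minority-class rows are duplicated at random until every class is as large as the majority class. It is a hand-rolled stand-in for illustration, not the imbalanced-learn RandomOverSampler used in the project; the function name, the seed parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sampling with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: 8 non-stroke rows (label 0) and 2 stroke rows (label 1).
rows = [[age, 0] for age in range(30, 38)] + [[71, 1], [79, 1]]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
```

SMOTE differs in that it does not copy existing rows; it synthesizes new minority-class points by interpolating between a minority sample and its nearest minority neighbors.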
Evaluation
- Used accuracy scores, confusion matrices, and classification reports to compare and assess the performance of all 3 models
- We ultimately selected Model 1 as the final model due to its overall superior performance.
Although Model 1 has a false positive rate of 1.82%, a false positive is the less costly error when predicting stroke occurrence: a patient wrongly flagged as likely to have a stroke simply receives more attention and care from clinicians. A high false negative rate would be far more concerning, because patients who are at risk of stroke would receive no intervention. In other words, it is preferable to flag some patients unnecessarily than to miss true cases.
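
As a reference for how these error rates are usually derived from a confusion matrix, here is a small sketch using the standard definitions (false positive rate = FP / (FP + TN), false negative rate = FN / (FN + TP)). The labels below are made up for illustration and are not the project's results; the README does not state which denominator was used for the reported figure.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 means stroke."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical ground truth and predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
false_positive_rate = fp / (fp + tn)  # healthy patients wrongly flagged
false_negative_rate = fn / (fn + tp)  # at-risk patients missed
```

For a screening-style model, the false negative rate is the figure to watch most closely, since it counts the at-risk patients the model fails to flag.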

See more detail about our final model
Model Deployment with Flask
- Implemented the model logic in our home route to handle incoming requests
- Created Web-based UI with HTML and CSS
Check out the snippets below:
- Landing Page:
- Result Page
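
A home route that both serves the form and handles the submitted values might look roughly like the sketch below. This is an assumption about the general shape, not the contents of app.py: the form fields, the inline template, and the predict_stroke stub (which stands in for the pickled Random Forest model) are all hypothetical.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical inline template; the real project keeps its HTML in templates/.
PAGE = """
<form method="post">
  <input name="age" type="number" required>
  <select name="hypertension">
    <option value="0">No</option>
    <option value="1">Yes</option>
  </select>
  <button type="submit">Predict</button>
</form>
{% if result is not none %}<p id="result">{{ result }}</p>{% endif %}
"""


def predict_stroke(age, hypertension):
    """Hypothetical stub standing in for the trained model's predict call."""
    return 1 if age >= 60 and hypertension == 1 else 0


@app.route("/", methods=["GET", "POST"])
def home():
    result = None
    if request.method == "POST":
        # Parse the submitted form values before passing them to the model.
        age = float(request.form["age"])
        hypertension = int(request.form["hypertension"])
        prediction = predict_stroke(age, hypertension)
        result = "Higher risk" if prediction == 1 else "Lower risk"
    return render_template_string(PAGE, result=result)
```

To serve this sketch locally, call app.run() at the end of the file; a real deployment would load the trained model once at startup and apply the same encoding and scaling used during training before calling it.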

Credits

Kevin Lee
Kaggle
UC Berkeley Extension Data Analytics Bootcamp

