Stroke is a serious medical condition that occurs when the blood supply to part of the brain is interrupted or reduced, leading to brain damage and potential long-term disability or death. The risk of stroke is affected by a wide range of factors, including age, gender, hypertension, heart disease, obesity, and smoking.
Detecting a stroke in its early stages brings numerous advantages, including timely medical intervention, reduced brain damage, prevention of long-term disabilities, identification of underlying causes, and the facilitation of swift medical decision-making to optimize patient outcomes.
For this project, our objective is to create a machine learning model for early stroke detection based on these risk factors. The model is intended to help clinical teams predict or assess a patient's risk of stroke, so that the benefits of early detection listed above can be realized in practice.
The stroke dataset comprises a compilation of patients' medical records. It encompasses a wide range of information, including patient demographics, medical history, lifestyle factors, and the presence or absence of a stroke for each patient.
Here is a snippet of the dataset:
- Python and packages (e.g. scikit-learn, matplotlib, seaborn, pickle)
- Flask
- HTML
- CSS
Prerequisites
Make sure you have installed all of the following prerequisites on your development machine:
- Git
- Python, with a virtual environment set up
- Flask, installed globally with
pip install flask
- Your favorite code editor (e.g. VS Code)
- Your favorite browser (e.g. Google Chrome)
Installation
- Clone this repo and save it in your local directory. To clone with the URL, run the following command in a terminal:
git clone https://github.com/yeyanwang/stroke_classifier.git
- Start the Flask app by running the following command in a terminal:
python app.py
- Visit localhost:5000 in your browser and enjoy!
Data Collection
Data was collected from Kaggle
Model Selection
The Random Forest Classifier was selected for this problem due to its reputation for achieving high accuracy in classification tasks. It is a popular choice in the healthcare and medical industry, where precise and reliable predictions are crucial.
Exploratory Data Analysis (EDA)
The snippets below show some interesting observations we made:
- Data distribution with histogram analysis
- Closer examination of the target variable
- Missing values in the 'bmi' column: filled in with imputed values
- Singleton record in the 'gender' column: dropped
- Uneven distribution of 'bmi' values: binned (Model 3 only)
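The three cleaning steps noted above (imputing 'bmi', dropping the singleton 'gender' record, and binning 'bmi') can be sketched with pandas roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the median imputation strategy, the "Other" category, the bin edges, and the bin labels are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset (values are made up for illustration).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Male", "Female"],
    "bmi": [28.1, np.nan, 22.4, 31.0, np.nan, 36.5],
    "stroke": [0, 0, 1, 0, 0, 1],
})

# 1. Impute missing 'bmi' values. Using the median is an assumption;
#    the README does not say which imputation strategy was used.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# 2. Drop any 'gender' category that appears only once (the singleton record).
counts = df["gender"].value_counts()
singletons = counts[counts == 1].index
df = df[~df["gender"].isin(singletons)].reset_index(drop=True)

# 3. Bin 'bmi' into ordered categories (Model 3 only).
#    These edges and labels are illustrative, not taken from the project.
bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]
df["bmi_bin"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

print(df)
```

Binning trades some precision for robustness to a skewed distribution, which may be why it was tried only as a variant (Model 3) rather than applied to every model.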
Data Preprocessing
Data was cleaned and prepared for further analysis. This included:
- Applying oversampling methods to handle imbalanced data
- Applying encoding to categorical variables
- Applying feature scaling to transform numerical features into a consistent range
- Dividing data into training and testing sets using scikit-learn's train_test_split function
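As a rough sketch of the encoding, scaling, and splitting steps listed above, the example below uses one-hot encoding via pandas and scikit-learn's StandardScaler. Both are assumptions, since the README does not name the encoder or scaler; the column names and values are also made up for illustration. The scaler here is fit on the training split only, which is a choice of this sketch rather than something the README specifies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned dataset (hypothetical columns and values).
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 33.0, 59.0, 71.0, 25.0, 52.0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 88.2, 171.2, 186.2, 70.1, 110.5],
    "smoking_status": ["smokes", "never", "formerly", "never",
                       "smokes", "formerly", "never", "smokes"],
    "stroke": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Encode categorical variables as one-hot columns (assumed encoding).
X = pd.get_dummies(df.drop(columns="stroke"), columns=["smoking_status"])
y = df["stroke"]

# Split into training and testing sets, keeping the class ratio in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numerical features; fit on the training split, reuse on the test split.
num_cols = ["age", "avg_glucose_level"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```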
Model Training
We trained 3 models using different resampling methods:
- Model 1 used data oversampled with the RandomOverSampler technique
- Model 2 used data oversampled with the SMOTE technique
- Model 3 used data oversampled with RandomOverSampler, with additional binning applied to the 'bmi' column
Evaluation
- Used accuracy scores, confusion matrices, and classification reports to assess and compare the performance of all 3 models
- We ultimately selected Model 1 as the final model due to its overall superior performance.
Although Model 1 has a false positive rate of 1.82%, a false positive is less costly than a false negative when predicting stroke occurrence. Flagging a patient who turns out not to be at risk lets clinicians allocate more attention and care to that patient, whereas a false negative means a patient at risk of stroke receives no intervention. In other words, it is preferable to mistakenly flag a patient as likely to have a stroke than to miss a true case.
See more detail about our final model
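For readers who want to reproduce this kind of comparison, the sketch below shows how accuracy, the confusion matrix, and the false positive and false negative rates can be computed with scikit-learn. The labels are made up for illustration, and the rates use the standard definitions FP / (FP + TN) and FN / (FN + TP); the README does not state how its 1.82% figure was derived.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration: 1 = stroke, 0 = no stroke.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Share of healthy patients wrongly flagged as at risk.
false_positive_rate = fp / (fp + tn)
# Share of at-risk patients the model missed.
false_negative_rate = fn / (fn + tp)

print(f"accuracy: {accuracy:.2f}")
print(f"false positive rate: {false_positive_rate:.3f}")
print(f"false negative rate: {false_negative_rate:.3f}")
print(classification_report(y_true, y_pred, zero_division=0))
```

On imbalanced data, accuracy alone can look strong even when most positive cases are missed, which is why the false negative rate deserves separate attention here.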
Model Deployment with Flask
- Implemented the model logic in our home route to handle incoming requests
- Created a web-based UI with HTML and CSS
Check out the snippets below:
- Landing Page
- Result Page
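A home route that serves the form and returns a prediction might look roughly like the sketch below. Everything specific in it is hypothetical: the create_app factory, the form field names, the inline template, the result wording, and the ThresholdModel stand-in are inventions for illustration, not the contents of the project's app.py. In the real app, the model would presumably be a trained classifier loaded with pickle.

```python
from flask import Flask, render_template_string, request

# Inline template so the sketch is self-contained; the real app uses HTML/CSS files.
PAGE = """
<form method="post">
  <input name="age" type="number" step="any" required>
  <input name="avg_glucose_level" type="number" step="any" required>
  <button type="submit">Predict</button>
</form>
{% if result is not none %}<p id="result">{{ result }}</p>{% endif %}
"""


def create_app(model):
    """Build a Flask app around any object with a scikit-learn style predict()."""
    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def home():
        result = None
        if request.method == "POST":
            # Hypothetical feature order; it must match what the model was trained on.
            features = [[
                float(request.form["age"]),
                float(request.form["avg_glucose_level"]),
            ]]
            prediction = int(model.predict(features)[0])
            result = "Elevated stroke risk" if prediction == 1 else "Low stroke risk"
        return render_template_string(PAGE, result=result)

    return app


class ThresholdModel:
    """Stand-in for the pickled classifier; not a trained model."""

    def predict(self, rows):
        return [1 if age >= 65 else 0 for age, _ in rows]


# Exercise the route without starting a server.
app = create_app(ThresholdModel())
client = app.test_client()
response = client.post("/", data={"age": "70", "avg_glucose_level": "110"})
print(response.status_code)
```

Calling `app.run(port=5000)` on the returned app would serve it at localhost:5000, matching the address given in the installation steps.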