🧠⚡️Random Forest Classification Model for Stroke Prediction

yeyanwang/stroke_classifier

Stroke Prediction Machine Learning Model

Project Description

Stroke is a serious medical condition that occurs when the blood supply to part of the brain is interrupted or reduced, leading to brain damage and potential long-term disability or death. The risk of stroke is affected by a wide range of factors, including age, gender, hypertension, heart disease, obesity, and smoking.

Detecting a stroke in its early stages brings numerous advantages, including timely medical intervention, reduced brain damage, prevention of long-term disabilities, identification of underlying causes, and the facilitation of swift medical decision-making to optimize patient outcomes.

The objective of this project is to build a machine learning model for early stroke detection based on these risk factors. The model is intended to help clinical teams assess a patient's risk of stroke so that they can intervene in time and realize the benefits described above.

Data Overview

The stroke dataset comprises a compilation of patients' medical records. It encompasses a wide range of information, including patient demographics, medical history, lifestyle factors, and the presence or absence of a stroke for each patient.

Here is a snippet of the dataset: image

Built With

  • Python and packages (e.g., scikit-learn, matplotlib, seaborn, pickle)
  • Flask
  • HTML
  • CSS

Getting Started

Prerequisites

Make sure you have installed all of the following prerequisites on your development machine:

  • Git
  • Python, with a virtual environment set up
  • Flask (install it with pip install flask)
  • Your favorite code editor (e.g., VS Code)
  • Your favorite browser (e.g., Google Chrome)

Installation

  1. Clone this repo to a local directory. To clone with the URL, run the following command in a terminal:

    git clone https://github.com/yeyanwang/stroke_classifier.git

  2. Start the Flask app by running the following command in a terminal, from inside the cloned directory:

    python app.py

  3. Visit localhost:5000 in your browser and enjoy!

Machine Learning Pipeline

  1. Data Collection

    Data was collected from Kaggle

  2. Model Selection

    The Random Forest Classifier was selected for this problem due to its reputation for achieving high accuracy in classification tasks. It is a popular choice in the healthcare and medical industry, where precise and reliable predictions are crucial.

  3. Exploratory Data Analysis (EDA)

    The snippets below show some of the interesting observations we made:

    • Data Distribution with Histogram Analysis

      image

    • Closer Examination of the Target Variable

      image

    • Missing Values in the 'bmi' Column

      image

      Filled in Missing Values with Imputed values

      image

    • Singleton Record in the 'gender' Column

      image

      Dropped Singleton Record

      image

    • Uneven Distribution of 'bmi' Values

      image

      Binning 'bmi' values (Model 3 only)

      image
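The three cleaning steps noted above (imputing 'bmi', dropping the singleton 'gender' record, and binning 'bmi') can be sketched with pandas. This is a minimal illustration on a toy frame, not the project's actual code: the column values, the use of the median for imputation, and the bin edges and labels are all assumptions, since the README does not state them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset; the values are made up for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Male", "Female"],
    "age": [67.0, 61.0, 80.0, 26.0, 49.0, 79.0],
    "bmi": [36.6, np.nan, 32.5, 22.4, 17.0, 24.0],
    "stroke": [1, 1, 1, 0, 0, 0],
})

# Fill missing 'bmi' values. Using the median is an assumption; the README
# only says that imputed values were used.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Drop rows whose 'gender' category occurs exactly once (the singleton record).
counts = df["gender"].value_counts()
singletons = counts[counts == 1].index
df = df[~df["gender"].isin(singletons)].reset_index(drop=True)

# Bin 'bmi' into categories (Model 3 only). These cut points and labels are
# illustrative, not taken from the project.
bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]
df["bmi_group"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)
```

With right=False each bin includes its lower edge, so a 'bmi' of exactly 25 would fall in the "overweight" group.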

  4. Data Preprocessing

    Data was cleaned and prepared for further analysis. This includes:

    • Applied oversampling methods to handle imbalanced data
    • Applied encoding to categorical variables
    • Applied feature scaling to transform numerical features into a consistent range
    • Divided data into training and testing sets using scikit-learn's train_test_split function
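As a rough sketch of the encoding, scaling, and splitting steps listed above, the snippet below uses pandas and scikit-learn on a toy frame. The column names, one-hot encoding, MinMaxScaler, the 25% test size, and the choice to fit the scaler on the training rows only are assumptions made for illustration; the README does not say which encoder, scaler, or split ratio the project used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data; column names and values are assumed for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "age": [float(a) for a in range(30, 50)],
    "avg_glucose_level": [80.0 + 3 * i for i in range(20)],
    "stroke": [0] * 16 + [1] * 4,
})

X = df.drop(columns="stroke")
y = df["stroke"]

# One-hot encode the categorical column (one of several possible encodings).
X = pd.get_dummies(X, columns=["gender"], drop_first=True)

# Hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts,
# so that no information from the test set influences the scaling.
num_cols = ["age", "avg_glucose_level"]
scaler = MinMaxScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

After this step the numeric training columns lie in the range 0 to 1, while test values can fall slightly outside it if they exceed the training minimum or maximum.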
  5. Model Training

    Trained 3 models using different resampling methods

    • Model 1 used oversampled data with the RandomOverSampler technique
    • Model 2 used oversampled data with the SMOTE technique
    • Model 3 used oversampled data with RandomOverSampler and applied additional binning to the 'bmi' column
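To illustrate the idea behind Model 1, the sketch below balances the training set by random oversampling and then fits a random forest. It is not the project's code: to stay within scikit-learn it uses sklearn.utils.resample as a stand-in for imbalanced-learn's RandomOverSampler, the data is synthetic, and the hyperparameters are assumptions. Splitting before oversampling is also a choice made here, so that duplicated rows cannot appear in both the training and the test set; the README does not state the order used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (roughly 10% positives); not the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (rng.random(400) < 0.1).astype(int)
X[y == 1] += 1.5  # shift the positives so the classes can be told apart

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random oversampling: draw minority rows with replacement until both
# classes have the same number of training rows.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])

# Hyperparameters are illustrative defaults, not the project's settings.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_bal, y_bal)
test_accuracy = model.score(X_test, y_test)
```

SMOTE (Model 2) differs in that it synthesizes new minority rows by interpolating between existing ones instead of duplicating them.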
  6. Evaluation

    • Used accuracy scores, confusion matrices, and classification reports to assess and compare the performance of all 3 models
    • We ultimately selected Model 1 as the final model due to its overall superior performance.

    Model 1 has a false positive rate of 1.82%. In the context of stroke prediction, a false positive is less costly than a false negative: flagging a patient who turns out not to be at risk lets clinicians allocate more attention and care to that patient, whereas a false negative means a patient who is at risk of stroke receives no intervention. In other words, it is preferable to mistakenly flag a patient as likely to have a stroke than to miss a true case.

    See more detail about our final model
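For readers unfamiliar with the metrics mentioned above, here is a small worked example of how the false positive and false negative rates fall out of a confusion matrix with scikit-learn. The labels are made up for illustration and have no connection to the project's reported 1.82%.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration (1 = stroke, 0 = no stroke).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

false_positive_rate = fp / (fp + tn)  # share of healthy patients flagged as at risk
false_negative_rate = fn / (fn + tp)  # share of at-risk patients the model missed
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, zero_division=0)
```

Here one of eight negatives is flagged (false positive rate 0.125) and one of two positives is missed (false negative rate 0.5), even though accuracy is 0.8. This is why accuracy alone is a weak guide on imbalanced data.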

  7. Model Deployment with Flask

    • Implemented the model logic in our home route to handle incoming requests
    • Created Web-based UI with HTML and CSS

    Check out the snippets below:

    • Landing Page:

    Landing Page

    • Result Page

    Result Page
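A home route that serves the form and returns a prediction could look roughly like the sketch below. It is a simplified illustration, not the repository's app.py: the form field names ('age', 'bmi'), the response strings, and the DummyModel class are hypothetical. In the real app the trained classifier would presumably be loaded from a pickle file and the responses rendered from HTML templates.

```python
from flask import Flask, request

app = Flask(__name__)


class DummyModel:
    """Hypothetical placeholder for the trained classifier.

    It flags anyone aged 65 or over, purely so the route has something to call.
    """

    def predict(self, rows):
        return [1 if row[0] >= 65 else 0 for row in rows]


model = DummyModel()


@app.route("/", methods=["GET", "POST"])
def home():
    if request.method == "POST":
        # Field names are assumptions; the real form likely has more inputs.
        age = float(request.form["age"])
        bmi = float(request.form["bmi"])
        prediction = model.predict([[age, bmi]])[0]
        return "High stroke risk" if prediction == 1 else "Low stroke risk"
    # A GET request would normally render the landing page template.
    return "Submit the form with 'age' and 'bmi' to get a prediction."
```

Calling app.run(port=5000) at the end of such a file would serve it locally, which matches the python app.py step under Installation.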

Credits
