Whether we live in the peaceful countryside or amid the hustle and bustle of a big city, we are all subjected to stress at different levels. Keeping your body in good shape isn't enough to cope with daily tasks; your heart needs to LIVE through the day too! As part of my assessment I have built a Streamlit app that gives a binary output (Positive/Negative) predicting the risk of cardiovascular disease (CVD) based on several features:
- `'thalachh'` (maximum heart rate achieved)
- `'oldpeak'` (ST depression induced by exercise relative to rest)
- `'caa'` (number of major vessels)
- `'cp'` (type of chest pain)
- `'thall'` (Thallium Stress Test result)
I would also like to thank Rashik Rahman for providing the dataset to work on; a description of the data can be obtained here.
With the help of Streamlit I have managed to create my first web app, in my favourite clean n' crisp setting, so here it is:
This app is built on an ML model with 87% accuracy. The model's performance is summarised below:
Classification Report | Confusion Matrix |
---|---|
Prior to deploying the model, I also tested it on several new observations, as shown below:
On these new observations the model achieves 80% accuracy.
Classification Report | Confusion Matrix |
---|---|
The dataset consists of 303 observations and 14 columns, i.e. a shape of `(303, 14)`, all of which are numerically encoded:
The dataset is fairly balanced:
Upon inspection, 1 duplicate observation and 2 null values in `'thall'` are observed:
Null values are represented by `0` and are therefore filled using median imputation.
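The cleaning step above could look roughly like the sketch below. It is only an illustration on a made-up toy frame (the values are not from the real dataset), and it assumes the data is held in a pandas DataFrame named `df`, which is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
df = pd.DataFrame({
    "thall": [2, 3, 0, 2, 1, 2, 0],
    "thalachh": [150, 160, 172, 150, 140, 155, 168],
    "output": [1, 0, 1, 1, 0, 1, 0],
})

# Drop exact duplicate rows (row 3 repeats row 0 in this toy frame).
df = df.drop_duplicates().reset_index(drop=True)

# Treat 0 in 'thall' as missing, then fill it with the median of the valid values.
df["thall"] = df["thall"].replace(0, np.nan)
thall_median = df["thall"].median()
# The median is a whole number here, so casting back to int is safe.
df["thall"] = df["thall"].fillna(thall_median).astype(int)

print(df)
```

Computing the median after marking the zeros as missing keeps the placeholder values from dragging the median down.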
Logistic regression is used to infer the correlation of each selected column with the target feature `'output'`:
Numeric features with an R-squared value lower than 0.65 are filtered out.
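One way this kind of per-feature screening can be sketched is shown below. It assumes the score is the one returned by scikit-learn's `LogisticRegression.score`, which is mean accuracy rather than a true R-squared; the data and the feature names `strong_feature` and `weak_feature` are synthetic and hypothetical, and only the 0.65 cut-off comes from the text above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary target and two hypothetical numeric features.
idx = np.arange(100)
y = idx % 2
features = {
    # Tracks the target closely, so a one-feature model separates the classes.
    "strong_feature": y * 2.0 + 0.3 * np.sin(idx),
    # Every value is shared equally by both classes, so it carries no signal.
    "weak_feature": ((idx // 2) % 2).astype(float),
}

THRESHOLD = 0.65  # cut-off quoted in the text

scores = {}
for name, values in features.items():
    X = values.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
    model = LogisticRegression().fit(X, y)
    scores[name] = model.score(X, y)  # mean accuracy, an assumption about the metric

# Keep only the features whose single-feature score reaches the threshold.
selected = [name for name, score in scores.items() if score >= THRESHOLD]
print(scores, selected)
```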
Cramér's V is used to study the correlation of the categorical features with the target feature `'output'`:
Since most categorical features show low correlation with the target column, only `'cp'` and `'thall'` are selected.
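For reference, Cramér's V can be computed from a contingency table as in the minimal sketch below. It uses the plain (uncorrected) formula, which may differ from the exact variant used for the results above, and the helper name `cramers_v` is hypothetical.

```python
import numpy as np

def cramers_v(x, y):
    """Cramér's V between two categorical sequences (no bias correction)."""
    # Encode each variable as integer category codes.
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]

    # Build the contingency table of observed counts.
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1)

    # Chi-squared statistic against the independence expectation.
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()

    # Normalise by the smaller table dimension minus one.
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

print(cramers_v([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly associated
print(cramers_v([0, 0, 1, 1], [0, 1, 0, 1]))  # independent
```

A value near 1 indicates strong association with the target, while a value near 0 indicates almost none.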
The dataset is split into training and test sets at a 7:3 ratio.
Pipelines with different combinations of scalers and classification models are built and tested, as summarised below:
Upon testing, the optimum pipeline is found to be the one built on `StandardScaler()` with `LogisticRegression()`:
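A comparison of this kind could be sketched as follows. The data here is synthetic, and `MinMaxScaler` and `RandomForestClassifier` are stand-ins chosen for illustration: the text above only confirms `StandardScaler()` and `LogisticRegression()`, so the other candidates are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the five selected features (303 rows, like the dataset).
X, y = make_classification(
    n_samples=303, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# 7:3 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate scalers and classifiers (the second entry of each is an assumption).
scalers = {"standard": StandardScaler, "minmax": MinMaxScaler}
models = {
    "logreg": lambda: LogisticRegression(),
    "forest": lambda: RandomForestClassifier(random_state=42),
}

# Fit every scaler/model combination and record its test accuracy.
results = {}
for scaler_name, make_scaler in scalers.items():
    for model_name, make_model in models.items():
        pipe = Pipeline([("scaler", make_scaler()), ("clf", make_model())])
        pipe.fit(X_train, y_train)
        results[f"{scaler_name}+{model_name}"] = pipe.score(X_test, y_test)

best_name = max(results, key=results.get)
print(results, best_name)
```

Wrapping the scaler and the classifier in one `Pipeline` means the scaler is fitted on the training split only, which avoids leaking test-set statistics.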
`GridSearchCV` with 5-fold cross-validation (`cv=5`) is applied to fine-tune the optimum pipeline. The hyperparameters `C` and `penalty` are tested:
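The tuning step can be sketched as below. The grid values, the step names `scaler` and `clf`, the `liblinear` solver (picked because it supports both `l1` and `l2` penalties) and the synthetic data are all assumptions for illustration; only `cv=5` and the two tuned hyperparameters come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same number of rows as the dataset.
X, y = make_classification(
    n_samples=303, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],   # hypothetical grid values
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refitted best pipeline
```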
Optimum hyperparameters:
The model achieves 87% accuracy in the best-pipeline evaluation and 80% accuracy when tested on new observations. Suggestions to improve the model:
- Train the model with a larger dataset.
- Use ensembling methods.