PredictMD: Symptom-Based Disease Predictor

Overview

This repository hosts a machine learning project aimed at predicting diseases from a set of symptoms. The project applies several classification techniques, such as Decision Tree, Random Forest, SVM, and XGBoost, to a dataset of symptoms and their corresponding diseases.

Dataset

The dataset contains 132 symptoms as features and a target variable, prognosis, that maps to 42 different diseases. It is split into two CSV files: one for training the models and one for testing their performance. The features underwent preprocessing and feature selection to identify those most relevant for disease classification.
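As a minimal sketch of how such a file could be separated into a symptom matrix and an encoded target, assuming binary symptom columns and a target column named `prognosis`. The inline rows, the three symptom columns, and the helper `split_features_target` are illustrative stand-ins, not the real CSV contents or the notebook's code:

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the training CSV (hypothetical rows; the real file has 132 symptom columns).
SAMPLE_CSV = """itching,skin_rash,chills,prognosis
1,1,0,Fungal Infection
0,0,1,Allergy
1,1,0,Fungal Infection
0,1,1,Drug Reaction
"""


def split_features_target(df, target_col="prognosis"):
    """Return the symptom matrix X, integer labels y, and the fitted label encoder."""
    X = df.drop(columns=[target_col])
    encoder = LabelEncoder()
    y = encoder.fit_transform(df[target_col])
    return X, y, encoder


train_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
X_train, y_train, label_encoder = split_features_target(train_df)
print(X_train.shape, list(label_encoder.classes_))
```

Keeping the fitted encoder around makes it possible to map predicted integers back to disease names with `label_encoder.inverse_transform`.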

Top 5 Features for each Target Variable

Number of target class variables

Data Insights

Top Features for Disease Prediction

| Target Label | Top 5 Contributing Features |
| --- | --- |
| Chronic Cholestasis | `malaise`, `chest_pain`, `excessive_hunger`, `dizziness`, `blurred_and_distorted_vision` |
| Drug Reaction | `irritability`, `muscle_pain`, `loss_of_balance`, `swelling_joints`, `stiff_neck` |
| Fungal Infection | `vomiting`, `chills`, `skin_rash`, `joint_pain`, `itching` |
| GERD | `nausea`, `loss_of_appetite`, `abdominal_pain`, `yellowing_of_eyes`, `yellowish_skin` |
| Peptic Ulcer Disease | `family_history`, `painful_walking`, `red_sore_around_nose`, `stomach_bleeding`, `coma` |
| Allergy | `fatigue`, `high_fever`, `headache`, `sweating`, `cough` |

Methodology

Data Preprocessing: The raw data was cleaned and preprocessed to prepare it for analysis. This included handling missing values, normalizing features, and encoding categorical variables.
Exploratory Data Analysis (EDA): Univariate and multivariate analyses were performed to understand the relationships between features and the prognosis.
Feature Selection: Recursive Feature Elimination (RFE) was utilized to reduce the number of features, focusing on those most impactful for predicting the outcome.
Model Training: The models were trained on the training dataset, using cross-validation techniques to ensure robustness.
Model Evaluation: The trained models were evaluated on a separate testing dataset. The performance metrics include accuracy, precision, recall, and F1-score.
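The selection, training, and evaluation steps above can be sketched roughly as follows. This is an illustration on synthetic data from `make_classification`, not the project's notebook. The choice of a Decision Tree as the RFE estimator, the target of 8 features, 5 folds, and macro averaging are all assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 20 features, 3 classes (not the project's dataset).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE repeatedly drops the weakest feature until 8 remain (8 is an arbitrary choice here).
pipeline = Pipeline([
    ("select", RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# 5-fold cross-validation on the training split only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Fit on the full training split, then evaluate once on the held-out split.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
n_selected = int(pipeline.named_steps["select"].support_.sum())

print(f"Selected features: {n_selected}")
print(f"CV accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Placing RFE inside the pipeline means feature selection is refit within each cross-validation fold, so the held-out fold never influences which features are kept.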

Models Implemented

Decision Tree
Random Forest
Support Vector Machine (SVM)
XGBoost
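For illustration, the four model families could be set up and compared side by side as below. The hyperparameters are library defaults or arbitrary small values rather than the project's settings, the data is synthetic, and XGBoost is imported optionally so the sketch still runs when that package is not installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative settings only; the project's actual hyperparameters are not stated in this README.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM Linear": SVC(kernel="linear"),
}
try:
    # XGBoost ships as a separate package; skip it if it is unavailable.
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(n_estimators=50, random_state=0)
except ImportError:
    pass

# Synthetic stand-in data with integer class labels 0..2.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit each model on the same split and record its held-out accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, score in test_accuracy.items():
    print(f"{name}: {score:.3f}")
```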

Requirements

This project uses the following Python libraries:

Collections (Python standard library)
Matplotlib
NumPy
Pandas
Seaborn
Scikit-learn
Warnings (Python standard library)
XGBoost

Results and Comparison

The following images show the comparison of all models based on their performance metrics and feature importances as assessed by mutual information:
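As a rough sketch of how mutual information can rank binary symptom features against a label, here is a small example using scikit-learn's `mutual_info_classif`. The two features (`symptom_a`, `symptom_b`) and the generated data are hypothetical and only show the mechanics:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 400
y = rng.integers(0, 2, size=n_samples)

# Hypothetical binary symptoms: symptom_a agrees with the label about 90% of the time,
# symptom_b is independent noise.
symptom_a = np.where(rng.random(n_samples) < 0.9, y, 1 - y)
symptom_b = rng.integers(0, 2, size=n_samples)
feature_names = ["symptom_a", "symptom_b"]
X = np.column_stack([symptom_a, symptom_b])

# discrete_features=True tells the estimator that the inputs are categorical (0/1) values.
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Sort features from most to least informative.
ranking = np.argsort(mi_scores)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {mi_scores[idx]:.3f}")
```

A feature that carries no information about the label scores close to zero, so sorting by score gives a simple importance ordering.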

Conclusions

Model Performance and Comparison: All models, namely Decision Tree (DT), Random Forest (RF), SVM Linear, and XGBoost, achieve almost identical accuracy on the test data, indicating that they perform equally well in predicting the target variable.

Overfitting: Overfitting scores are low for all models except XGBoost: SVM Linear has the lowest score, while XGBoost has an overfitting score of 3.086. This indicates that the DT, RF, and SVM Linear models are not overfitting, but XGBoost is severely overfitting.
Training Accuracy: All models achieved a training accuracy of 1.0, indicating perfect fitting to the training data. However, this may not necessarily translate to performance on unseen data.
Model Complexity: The DT model is the simplest, with RF and XGBoost being more complex. SVM Linear has intermediate complexity. Balancing model complexity and performance is crucial to prevent overfitting and ensure good generalization.
Model Selection: Although all models perform similarly on the given data, XGBoost's overfitting indicates it may not generalize well. Therefore, careful evaluation using appropriate metrics is important before selecting the final model.
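This README does not define how the overfitting score is computed. One common way to quantify the gap between training and test performance is the relative drop in accuracy, sketched below. The function `overfitting_gap_percent`, its formula, and the example accuracies are assumptions for illustration, not the project's definition or results:

```python
def overfitting_gap_percent(train_accuracy, test_accuracy):
    """Relative drop from training to test accuracy, in percent (assumed definition)."""
    if not 0 < train_accuracy <= 1 or not 0 <= test_accuracy <= 1:
        raise ValueError("accuracies must lie in [0, 1], with train_accuracy > 0")
    return (train_accuracy - test_accuracy) / train_accuracy * 100


# Hypothetical accuracies, not the project's reported numbers.
examples = {"model_a": (1.0, 0.98), "model_b": (1.0, 0.90)}
gaps = {
    name: overfitting_gap_percent(train_acc, test_acc)
    for name, (train_acc, test_acc) in examples.items()
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% gap")
```

With a measure like this, two models that both reach a training accuracy of 1.0 can still be ranked by how much accuracy they lose on unseen data.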

Future Work

Further hyperparameter tuning could improve model performance.
Investigating additional features and engineering new ones may provide better insights.
Expanding the dataset could enhance the models' ability to generalize to new data.
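As one possible starting point for the hyperparameter tuning mentioned above, a small grid search with scikit-learn's `GridSearchCV` might look like this. The grid values, the 3-fold setting, and the synthetic data are arbitrary choices for the sketch, not recommendations derived from this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the project's dataset).
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=5, n_classes=3, random_state=0
)

# A deliberately tiny grid so the search finishes quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The same pattern applies to the other models by swapping the estimator and the parameter grid.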

