This repository hosts a machine learning project aimed at predicting diseases from a set of symptoms. The project applies several classification techniques, such as Decision Tree, Random Forest, SVM, and XGBoost, to a dataset of symptoms and their corresponding diseases.
The dataset contains 132 symptoms as features and a target variable, prognosis, which takes one of 42 diseases as its value. It is split into two CSV files: one for training the models and one for testing their performance. The features underwent preprocessing and feature selection to identify those most relevant for disease classification.
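As a rough sketch of how the two CSV files might be loaded, the helper below reads both files and separates the symptom columns from the target. The function name `load_split`, the file paths passed to it, and the exact column name `prognosis` are assumptions made for illustration; adjust them to match the actual files.

```python
import pandas as pd

TARGET = "prognosis"  # assumed name of the target column


def load_split(train_path, test_path, target=TARGET):
    """Load the training and testing CSVs and split features from the target."""
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    # Drop columns that are entirely empty (e.g. caused by a trailing comma).
    train_df = train_df.dropna(axis=1, how="all")
    test_df = test_df.dropna(axis=1, how="all")

    X_train, y_train = train_df.drop(columns=[target]), train_df[target]
    # Reorder the test columns so both sets share one feature order.
    X_test, y_test = test_df[X_train.columns], test_df[target]
    return X_train, y_train, X_test, y_test
```

Aligning the test columns to the training columns guards against the two files listing symptoms in a different order.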
| Target Label | Top 5 Contributing Features |
|---|---|
| Chronic Cholestasis | malaise, chest_pain, excessive_hunger, dizziness, blurred_and_distorted_vision |
| Drug Reaction | irritability, muscle_pain, loss_of_balance, swelling_joints, stiff_neck |
| Fungal Infection | vomiting, chills, skin_rash, joint_pain, itching |
| GERD | nausea, loss_of_appetite, abdominal_pain, yellowing_of_eyes, yellowish_skin |
| Peptic Ulcer Disease | family_history, painful_walking, red_sore_around_nose, stomach_bleeding, coma |
| Allergy | fatigue, high_fever, headache, sweating, cough |
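The README does not state how the per-disease rankings in the table were produced. One plausible approach, sketched here purely as an assumption, is to score each symptom by its mutual information with a one-vs-rest indicator for each disease. The function `top_features_per_class` is a hypothetical helper, not code from this repository.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def top_features_per_class(X, y, feature_names, k=5, random_state=0):
    """Rank features per class by mutual information with a one-vs-rest label."""
    X = np.asarray(X)
    y = np.asarray(y)
    ranking = {}
    for label in np.unique(y):
        # Binary target: 1 where the sample has this disease, 0 otherwise.
        is_label = (y == label).astype(int)
        scores = mutual_info_classif(
            X, is_label, discrete_features=True, random_state=random_state
        )
        # Indices of the k highest-scoring features, best first.
        top = np.argsort(scores)[::-1][:k]
        ranking[label] = [feature_names[i] for i in top]
    return ranking
```

Setting `discrete_features=True` matches binary symptom indicators; continuous features would need the default estimator instead.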
- Data Preprocessing: The raw data was cleaned and preprocessed to prepare for the analysis. This included handling missing values, normalizing, and encoding categorical variables.
- Exploratory Data Analysis (EDA): Univariate and multivariate analyses were performed to understand the relationships between features and the prognosis.
- Feature Selection: Recursive Feature Elimination (RFE) was utilized to reduce the number of features, focusing on those most impactful for predicting the outcome.
- Model Training: The models were trained on the training dataset, using cross-validation techniques to ensure robustness.
- Model Evaluation: The trained models were evaluated on a separate testing dataset. The performance metrics include accuracy, precision, recall, and F1-score.
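The feature selection, training, and evaluation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the choice of a decision tree as the RFE estimator, the number of retained features, the 5-fold setting, and macro averaging are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the symptom data (the real CSVs are not reproduced here).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=10, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE sits inside the pipeline so feature selection is refit on every CV fold.
pipeline = make_pipeline(
    RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10),
    DecisionTreeClassifier(random_state=0),
)

# Cross-validate on the training split only, then fit and score on the held-out split.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test: accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Keeping RFE inside the pipeline avoids selecting features with knowledge of the validation folds, which would otherwise inflate the cross-validation scores.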
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- XGBoost
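A minimal sketch of how these four classifiers might be instantiated is shown below. The helper `build_models` and all hyperparameters (library defaults plus a fixed seed) are assumptions for illustration; only the linear SVM kernel is taken from the results discussion in this README.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state=0):
    """Return the candidate classifiers keyed by display name."""
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        "SVM Linear": SVC(kernel="linear", random_state=random_state),
    }
    try:
        # XGBoost is treated as optional here. Its scikit-learn wrapper expects
        # integer class labels, so string prognosis labels need encoding first
        # (for example with sklearn.preprocessing.LabelEncoder).
        from xgboost import XGBClassifier

        models["XGBoost"] = XGBClassifier(random_state=random_state)
    except ImportError:
        pass
    return models
```

Keeping the models in one dictionary makes it straightforward to loop over them with the same training and evaluation code.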
This project uses the following Python libraries:
- Collections
- Matplotlib
- NumPy
- Pandas
- Seaborn
- Scikit-learn
- Warnings
- XGBoost
The following images show the comparison of all models based on their performance metrics and feature importances as assessed by mutual information:
- Model Performance and Comparison: All models, namely Decision Tree (DT), Random Forest (RF), SVM Linear, and XGBoost, have almost identical accuracy scores on the test data, indicating that they perform about equally well in predicting the target variable.
- Overfitting: SVM Linear has the lowest overfitting score among DT, RF, and SVM Linear, whereas XGBoost has an overfitting score of 3.086. This indicates that the DT, RF, and SVM Linear models are not overfitting, while XGBoost is overfitting severely.
- Training Accuracy: All models achieved a training accuracy of 1.0, indicating a perfect fit to the training data. However, this does not necessarily translate to performance on unseen data.
- Model Complexity: The DT model is the simplest, RF and XGBoost are the most complex, and SVM Linear has intermediate complexity. Balancing model complexity against performance is crucial to prevent overfitting and ensure good generalization.
- Model Selection: Although all models perform similarly on the given data, XGBoost's overfitting indicates that it may not generalize well. Careful evaluation with appropriate metrics is therefore important before selecting the final model.
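The README does not define how its overfitting score is computed. One common proxy, shown here only as an assumption, is the gap between training and test accuracy expressed in percentage points. The helper `train_test_gap` is hypothetical, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_test_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap in percentage points)."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, 100.0 * (train_acc - test_acc)


# Synthetic data only, so that the sketch is runnable on its own.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc, test_acc, gap = train_test_gap(model, X_train, y_train, X_test, y_test)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.2f} pp")
```

A larger gap suggests the model has memorized the training data more than it has learned patterns that carry over to new samples.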
- Further hyperparameter tuning could improve model performance.
- Investigating additional features and engineering new ones may provide better insights.
- Expanding the dataset could enhance the models' ability to generalize to new data.