# Python Data Mining code implementing Naive Bayes, Decision Tree and K-Nearest Neighbours (KNN) Big Data Analysis Algorithms
Note: See the 'DOC' PDF file for detailed documentation of this project, including an explanation of how the code works and visualizations of its results. The program and dataset are in the 'DataMiningCode' folder of this repository.

This program implements a machine learning pipeline that classifies data from the SaYoPillow.csv dataset. It trains three models (Naive Bayes, Decision Tree, and K-Nearest Neighbors, or KNN) and compares their performance to determine which one classifies the data best. The dataset appears to contain physiological and sleep-related measurements, with a target variable divided into several categories. Accuracy, precision, recall, and F1 score are calculated manually to demonstrate how each metric works.

Comparing several models on the same dataset is essential because no single model works best for all types of data. Each model has strengths and weaknesses that depend on the data's characteristics, such as its distribution, the presence of noise, or class imbalance. This program demonstrates classification with three models, but more models can be implemented, trained, and tested in the same way when a better classifier is needed. Python provides efficient implementations of these models, primarily in the scikit-learn library. The three classes used here are:

Naive Bayes:
- Library: sklearn.naive_bayes
- Class: GaussianNB
- Description: The GaussianNB class implements the Gaussian Naive Bayes algorithm, which assumes that the features follow a normal distribution within each class. It works well on small datasets and handles continuous data.

Decision Tree:
- Library: sklearn.tree
- Class: DecisionTreeClassifier
- Description: The DecisionTreeClassifier class builds a decision tree from the features of the dataset, using a criterion such as "entropy" or "gini" to split nodes. Decision trees are easy to interpret and, as an algorithm, can handle both numerical and categorical data, although scikit-learn's implementation expects categorical features to be encoded as numbers first.

K-Nearest Neighbors (KNN):
- Library: sklearn.neighbors
- Class: KNeighborsClassifier
- Description: The KNeighborsClassifier class implements the KNN algorithm, in which a sample is classified by the majority vote of its nearest neighbors. KNN is simple and effective for smaller datasets, especially where the decision boundary is not linear.
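
As a rough illustration of how the three classes listed above share the same `fit`/`predict` interface, here is a minimal sketch. The synthetic data from `make_classification` is a stand-in chosen for this example, not the SaYoPillow data, and the `random_state` values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 8 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# The three classifiers, configured as described in this README:
# entropy criterion for the tree, 5 neighbors for KNN.
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X, y)  # every scikit-learn classifier exposes fit()
    print(f"{name}: training accuracy = {model.score(X, y):.3f}")
```

Training accuracy is printed only to show that the models run; a fair comparison needs a held-out test set.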

The program proceeds through the following steps:
- Import Libraries: The program imports libraries for data manipulation (pandas, numpy), machine learning (scikit-learn, imblearn), and visualization (matplotlib, mlxtend).
- Load Dataset: The dataset is loaded into a Pandas DataFrame, and the first few rows are displayed.
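
The loading step might look like the sketch below. The helper name `load_dataset` and its `target_column` parameter are hypothetical; in particular, treating the last column as the target when none is given is an assumption, not something this README states.

```python
import pandas as pd


def load_dataset(source, target_column=None):
    """Read a CSV into a DataFrame and split it into features and target.

    `source` is a file path or file-like object. If `target_column` is
    omitted, the last column is assumed to be the target (an assumption).
    """
    df = pd.read_csv(source)
    print(df.head())  # display the first few rows
    if target_column is None:
        target_column = df.columns[-1]
    X = df.drop(columns=[target_column])
    y = df[target_column]
    return X, y
```

For this project the call would be something like `X, y = load_dataset("SaYoPillow.csv")`, adjusted to the actual file location and target column.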

Data Preprocessing:
- Feature Scaling: The features (independent variables) are scaled to a range of 0 to 1 using MinMaxScaler.
- Handling Imbalanced Data: SMOTE (Synthetic Minority Over-sampling Technique) is applied to balance the dataset, generating synthetic samples for the minority class.
- Train-Test Split: The dataset is split into training and testing sets with an 80-20 ratio.
- Feature Scaling (Standardization): The features are standardized using StandardScaler to have a mean of 0 and a standard deviation of 1.
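
The four preprocessing steps above can be sketched as a single function. This is one possible arrangement under stated assumptions: the function name `preprocess` is hypothetical, the oversampler is passed in as any object with a `fit_resample` method (for example `SMOTE` from `imblearn.over_sampling`), and the `StandardScaler` is fitted on the training split only, which the README does not specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def preprocess(X, y, resampler=None, test_size=0.2, random_state=42):
    """Scale to [0, 1], optionally oversample, split 80/20, then standardize."""
    # Step 1: min-max scaling of every feature to the range [0, 1].
    X_scaled = MinMaxScaler().fit_transform(X)

    # Step 2: balance the classes if a resampler was supplied,
    # e.g. imblearn.over_sampling.SMOTE(random_state=42).
    if resampler is not None:
        X_scaled, y = resampler.fit_resample(X_scaled, y)

    # Step 3: 80-20 train-test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=test_size, random_state=random_state
    )

    # Step 4: standardize to mean 0 and standard deviation 1.
    # Assumption: statistics come from the training split only.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```

With imbalanced-learn installed, a call such as `preprocess(X, y, resampler=SMOTE(random_state=42))` would include the oversampling step; without a resampler the data is only scaled and split.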

Model Training and Evaluation:
- Naive Bayes Classifier:
  - The GaussianNB classifier is trained on the standardized data.
  - Predictions are made on the test set.
  - The confusion matrix and classification report are generated and visualized.
  - Accuracy, precision, recall, and F1 score are manually calculated.
- Decision Tree Classifier:
  - A Decision Tree with entropy as the criterion is trained and evaluated similarly.
- K-Nearest Neighbors Classifier:
  - A KNN classifier with 5 neighbors is trained and evaluated similarly.
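
To illustrate what calculating the metrics manually can involve, the sketch below builds a confusion matrix with NumPy and derives the four scores from it. The function names are hypothetical, and macro averaging (an unweighted mean over classes) is an assumption; the project's own code may average differently.

```python
import numpy as np


def confusion_matrix_manual(y_true, y_pred):
    """Return (matrix, labels); rows are true classes, columns are predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm, labels


def safe_divide(num, den):
    """Element-wise num / den, returning 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def manual_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumption)."""
    cm, _ = confusion_matrix_manual(y_true, y_pred)
    tp = np.diag(cm)                             # correct predictions per class
    precision = safe_divide(tp, cm.sum(axis=0))  # TP / predicted as that class
    recall = safe_divide(tp, cm.sum(axis=1))     # TP / actually that class
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }
```

Calling `manual_metrics(y_test, model.predict(X_test))` for each trained model yields one dictionary of scores per model, which can then be compared side by side.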

Comparison of Models:
- Bar charts are created to compare the accuracy, precision, recall, and F1 score of the three models. (For more details, see the 'Doc' PDF in this repository.)

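
One way to draw such a comparison is a grouped bar chart, sketched below with matplotlib. The function name `plot_model_comparison` and the layout of the `scores` dictionary are hypothetical, and the project may instead draw one chart per metric.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import numpy as np


def plot_model_comparison(scores, metrics=("accuracy", "precision", "recall", "f1")):
    """Draw grouped bars, one group per metric and one bar per model.

    `scores` maps a model name to a dict of metric values in [0, 1],
    e.g. {"KNN": {"accuracy": ..., "precision": ..., ...}, ...}.
    """
    model_names = list(scores)
    x = np.arange(len(metrics))
    width = 0.8 / len(model_names)  # share 80% of each group among the models

    fig, ax = plt.subplots(figsize=(8, 4))
    for i, name in enumerate(model_names):
        values = [scores[name][metric] for metric in metrics]
        ax.bar(x + i * width, values, width, label=name)

    # Center the tick labels under each group of bars.
    ax.set_xticks(x + width * (len(model_names) - 1) / 2)
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 1)
    ax.set_ylabel("Score")
    ax.set_title("Model comparison")
    ax.legend()
    return fig
```

The returned figure can be written to disk with `fig.savefig("comparison.png")`, where the file name is only an example.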
This program provides a comprehensive approach to classification, addressing class imbalance, standardizing features, and evaluating multiple models using key performance metrics. It concludes with a visual comparison of model performance, which aids in selecting the most suitable model for the given dataset.