A Streamlit web application that performs Exploratory Data Analysis (EDA), Data Preprocessing, and Supervised Machine Learning to classify Iris species (Setosa, Versicolor, and Virginica) from the Iris dataset using a Decision Tree Classifier and a Random Forest Regressor.
- Dataset: Brief description of the Iris Flower dataset used in this dashboard.
- EDA: Exploratory Data Analysis of the Iris Flower dataset, highlighting the distribution of Iris species and the relationships between the features. Includes graphs such as a Pie Chart, Scatter Plots, and a Pairwise Scatter Plot Matrix.
- Data Cleaning / Pre-processing: Data cleaning and pre-processing steps such as encoding the species column and splitting the dataset into training and testing sets.
- Machine Learning: Training two supervised classification models: Decision Tree Classifier and Random Forest Regressor. Includes model evaluation, feature importance, and tree plot.
- Prediction: Prediction page where users can input values to predict the Iris species using the trained models.
- Conclusion: Summary of the insights and observations from the EDA and model training.
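The pre-processing, training, and prediction steps listed above can be sketched roughly as follows. This is a minimal sketch, not the dashboard's actual code: loading the data through scikit-learn's `load_iris`, the 70/30 stratified split, `random_state=42`, and the helper name `predict_species` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Assumption: load Iris from scikit-learn; the dashboard may read a CSV instead.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = FEATURES
df["species"] = iris.target_names[iris.target.to_numpy()]

# Encode the species column as integer codes (alphabetical order).
species_names = sorted(df["species"].unique())
species_to_code = {name: code for code, name in enumerate(species_names)}
df["species_encoded"] = df["species"].map(species_to_code)

# Split into training and testing sets (70/30 and the seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES],
    df["species_encoded"],
    test_size=0.3,
    random_state=42,
    stratify=df["species_encoded"],
)

# Train the classifier and evaluate it on both splits.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)


def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    """Hypothetical helper: map four measurements to a species name."""
    row = pd.DataFrame(
        [[sepal_length, sepal_width, petal_length, petal_width]],
        columns=FEATURES,
    )
    return species_names[int(model.predict(row)[0])]


print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
print(predict_species(5.1, 3.5, 1.4, 0.2))
```

In a Streamlit page, the four arguments would typically come from input widgets such as `st.number_input` or `st.slider`.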
Through exploratory data analysis and training of two classification models (Decision Tree Classifier and Random Forest Regressor) on the Iris Flower dataset, the key insights and observations are:
- The dataset shows moderate variation across the sepal and petal features. `petal_length` and `petal_width` have higher variability than the sepal features, which suggests that these features are more likely to distinguish between the three Iris flower species.
- All three Iris species have a balanced class distribution, which eliminates the need to rebalance the dataset.
- Pairwise Scatter Plot analysis indicates that Iris Setosa forms a distinct cluster based on petal features, which makes it easily distinguishable from Iris Versicolor and Iris Virginica.
- Petal Length emerged as the most discriminative feature, especially for distinguishing Iris Setosa from the other Iris species.
- The Decision Tree Classifier achieved 100% accuracy on the training data, which suggests that the relatively simple and structured dataset allowed this model to perform strongly. However, this could also indicate overfitting, since the model is highly sensitive to the specific training samples.
- In the feature importance results from the Decision Tree model, `petal_length` was the dominant predictor with an importance value of 89%, followed by `petal_width` with 8.7%.
- The Random Forest Regressor achieved an accuracy of 98.58% on training and 99.82% on testing, which is slightly lower than the performance of the Decision Tree Classifier model.
- Feature importance analysis also highlighted `petal_length` as the primary predictor with an importance value of 58%, followed by `petal_width` with 39%.
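The feature importance comparison in the observations above could be reproduced along these lines. This is a hedged sketch rather than the dashboard's code: the data source, the 70/30 split, and the seeds are assumptions, so the exact percentages will differ from the figures reported above. One thing worth keeping in mind is that scikit-learn's `score` method returns mean accuracy for classifiers and the R² coefficient for regressors, so the two models' scores are not measured on the same scale; how the reported figures were computed is not stated.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Assumption: scikit-learn's bundled copy of Iris stands in for the app's data.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=FEATURES)
y = iris.target  # species already encoded as 0, 1, 2

# Assumption: 70/30 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# feature_importances_ is normalized so each model's values sum to 1.
importances = pd.DataFrame(
    {
        "decision_tree": tree.feature_importances_,
        "random_forest": forest.feature_importances_,
    },
    index=FEATURES,
)

# (train, test) scores: mean accuracy for the classifier, R^2 for the regressor.
scores = {
    "decision_tree": (tree.score(X_train, y_train), tree.score(X_test, y_test)),
    "random_forest": (forest.score(X_train, y_train), forest.score(X_test, y_test)),
}

print(importances.round(3))
print(scores)
```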
Throughout this data science activity, it is evident that the Iris dataset is a good dataset to use for classification despite its simplicity. Because of its balanced distribution of the 3 Iris flower species and its 0 null values, further data cleansing techniques were not used. The two models trained were able to leverage the features found in the dataset, which resulted in high accuracy in both models' predictions. Despite the slight overlap between Iris Versicolor and Iris Virginica, the two models achieved high accuracy and were able to learn patterns from the dataset.
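The data quality checks behind this conclusion (null values, class balance, and per-feature spread) can be verified with a few lines of pandas. This is a minimal sketch that assumes scikit-learn's bundled copy of Iris matches the dataset used by the dashboard; the column names are renamed here to match the ones used in this README.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the app's dataset.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target_names[iris.target.to_numpy()]

# Total number of missing values across all columns.
null_count = int(df.isnull().sum().sum())

# Number of samples per species, to check class balance.
class_counts = df["species"].value_counts()

# Standard deviation of each numeric feature, as one measure of variability.
feature_std = df.drop(columns="species").std()

print(f"null values: {null_count}")
print(class_counts)
print(feature_std.round(3))
```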