A comprehensive machine learning project implementing and comparing Decision Trees and k-Nearest Neighbors (k-NN) algorithms for classifying Iris flowers. This project focuses on binary classification between Versicolor and Virginica species using their petal measurements.
## Table of Contents

- Project Overview
- Key Features
- 📂 Project Structure
- Installation
- 📊 Results and Analysis
- Usage
- 🔬 Technical Details
- 🤝 Contributing
- 📄 License
## Project Overview

This project implements and analyzes two fundamental machine learning algorithms:

- k-Nearest Neighbors (k-NN) with various distance metrics
- Decision Trees with two different splitting strategies (brute-force and binary entropy)

The implementation uses the Iris dataset, focusing on distinguishing between the Versicolor and Virginica species using only their second and third features (the petal measurements).
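The snippet below is a minimal sketch of this data selection using scikit-learn; the variable names and loading code are illustrative and may differ from what `data_utils.py` actually does.

```python
# Minimal sketch of the data setup described above (illustrative only;
# data_utils.py may load and filter the data differently).
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Keep only Versicolor (label 1) and Virginica (label 2).
mask = np.isin(y, [1, 2])
X, y = X[mask], y[mask]

# Keep only the petal measurements (feature indices 2 and 3:
# petal length and petal width).
X = X[:, [2, 3]]

print(X.shape)  # (100, 2): 50 Versicolor and 50 Virginica samples
```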
## Key Features

- Advanced k-NN Implementation:
  - Multiple k values (1, 3, 5, 7, 9)
  - Different distance metrics (L1, L2, L∞), sketched below
  - Comprehensive error analysis across parameters
- Dual Decision Tree Approaches:
  - Brute-force approach constructing all possible trees
  - Binary entropy-based splitting strategy
  - Visualizations of tree structures and decision boundaries
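As a rough illustration of the three distance metrics listed above, the helpers below show one common way to compute them; `models/knn.py` may implement them differently.

```python
# Illustrative definitions of the three distance metrics named above
# (a sketch only; models/knn.py may implement them differently).
import numpy as np

def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

def linf_distance(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return np.max(np.abs(a - b))
```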
## 📂 Project Structure

```text
.
├── models/                  # Core ML model implementations
│   ├── __init__.py
│   ├── decision_trees.py    # Decision tree algorithms
│   └── knn.py               # k-NN implementation
├── results/                 # Generated visualizations
│   ├── decision_tree_errors.png
│   ├── decision_tree_figure1_visualization.png
│   ├── decision_tree_figure2_visualization.png
│   └── k-NN_errors.png
├── data_utils.py            # Data handling utilities
├── main.py                  # Main execution script
├── metrics.py               # Evaluation metrics
└── visualization.py         # Visualization tools
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/iris-classification.git
   cd iris-classification
   ```

2. Set up a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## 📊 Results and Analysis

### k-Nearest Neighbors

The k-NN implementation was tested with various parameters (a sketch of the parameter sweep follows the findings below):

- k values: 1, 3, 5, 7, 9
- Distance metrics: L1 (Manhattan), L2 (Euclidean), L∞ (Chebyshev)

💡 Key Findings:

- Higher k values generally resulted in more stable predictions
- The L2 distance metric showed slightly better performance
- The best performance was achieved with k=9 using the L2 distance
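The loop below sketches one way to run such a sweep, using scikit-learn's `KNeighborsClassifier` and a single random train/test split for brevity; the project's own `models/knn.py` and `main.py` drive the actual experiments.

```python
# Hypothetical parameter sweep over the k values and distance metrics above,
# using scikit-learn for brevity; the project uses its own k-NN implementation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X, y = X[y != 0][:, [2, 3]], y[y != 0]  # Versicolor vs. Virginica, petal features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for metric in ("manhattan", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    for k in (1, 3, 5, 7, 9):
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        error = 1.0 - clf.score(X_test, y_test)
        print(f"metric={metric:<10} k={k}: test error = {error:.2%}")
```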
### Decision Trees

Two decision tree implementations were compared:

- Brute-Force Approach 🔍:
  - Error rate: 5.00%
- Entropy-Based Approach 🎯:
  - Error rate: 7.00%
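For reference, the snippet below sketches the binary entropy criterion that the entropy-based splitting strategy relies on; the helper names are hypothetical and `models/decision_trees.py` may organize this differently.

```python
# Sketch of the binary entropy criterion behind entropy-based splitting
# (hypothetical helpers; models/decision_trees.py may differ in detail).
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) label distribution."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def split_entropy(y_left, y_right):
    """Weighted entropy after splitting 0/1 labels into two child nodes."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    h_left = binary_entropy(np.mean(y_left)) if n_left else 0.0
    h_right = binary_entropy(np.mean(y_right)) if n_right else 0.0
    return (n_left / n) * h_left + (n_right / n) * h_right

# An entropy-based tree greedily picks the split minimizing split_entropy,
# while the brute-force variant enumerates candidate trees exhaustively.
```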
## Usage

Run the main analysis script:

```bash
python main.py
```

This will:

- 📥 Load and preprocess the Iris dataset
- 📊 Perform k-NN analysis with various parameters
- 🌳 Generate decision trees using both approaches
- 📈 Create visualizations and error analysis
## 🔬 Technical Details

- k-Nearest Neighbors:
  - Custom implementation with multiple distance metrics
  - Parameter evaluation framework
  - Cross-validation with 100 iterations
- Decision Trees:
  - Brute-force tree construction
  - Entropy-based splitting
  - Visualization of tree structures and decision boundaries
The project employs several metrics for evaluation (sketched after the list below):
- Classification error rates
- Training vs. Test set performance
- Error difference analysis
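A minimal sketch of these metrics is shown below; the function names are hypothetical and the project's `metrics.py` may expose a different interface.

```python
# Hypothetical helpers illustrating the evaluation metrics listed above;
# the project's metrics.py may expose a different interface.
import numpy as np

def classification_error(y_true, y_pred):
    """Fraction of misclassified samples."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def error_difference(train_error, test_error):
    """Gap between test and training error, a rough overfitting indicator."""
    return test_error - train_error

# Example: training error 0.00, test error 0.25 -> difference 0.25.
train_err = classification_error([1, 2, 1, 2], [1, 2, 1, 2])
test_err = classification_error([1, 2, 2, 1], [1, 2, 1, 1])
print(error_difference(train_err, test_err))
```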
## 🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes:

- 🍴 Fork the repository.
- 🌿 Create a new branch (`git checkout -b feature-branch`).
- 💡 Commit your changes (`git commit -m 'Add new feature'`).
- 📤 Push to the branch (`git push origin feature-branch`).
- 🔍 Open a Pull Request.
## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.