This project provides a flexible command-line tool for running a variety of clustering algorithms on tabular data. It supports preprocessing, optional cleaning, and evaluation with clear, user-friendly feedback.
- Multiple clustering algorithms (KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Autoencoder, etc.)
- Optional data cleaning (skipped by default; enable with `--clean`)
- Easy parameterization via the command line
- Outputs cluster assignments to a CSV file
- Helpful error messages and documentation references
- Install dependencies: `pip install -r requirements.txt`
- Run clustering: `python main.py <csv_file> [algorithm] [param1=value1 ...] [--clean]`
  - `<csv_file>`: Path to your data CSV (required)
  - `[algorithm]`: Clustering algorithm key (optional, defaults to `agglomerative_average`)
  - `[param1=value1 ...]`: Optional algorithm parameters
  - `[--clean]`: Enable cleaning of unrealistic values (skipped by default)
- Examples
  - KMeans: `python main.py data.csv kmeans n_clusters=4`
  - DBSCAN: `python main.py data.csv dbscan eps=0.5 min_samples=5`
  - Default (hierarchical average, no cleaning): `python main.py data.csv`
  - With cleaning: `python main.py data.csv --clean`
- See `GUIDE.md` for full documentation, the algorithm list, and troubleshooting.
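The argument layout in the usage line (`<csv_file> [algorithm] [param1=value1 ...] [--clean]`) could be parsed roughly as follows. This is a minimal sketch, not the code in `main.py`: the function name `parse_cli_args` and the use of `ast.literal_eval` to convert values are assumptions made for illustration.

```python
import ast


def parse_cli_args(argv, default_algorithm="agglomerative_average"):
    """Split README-style arguments into (csv_path, algorithm, params, clean).

    Hypothetical helper: the real main.py may parse its arguments differently.
    """
    clean = "--clean" in argv
    positional = [arg for arg in argv if arg != "--clean"]
    if not positional:
        raise ValueError("a CSV path is required")

    csv_path = positional[0]
    algorithm = default_algorithm
    params = {}
    for arg in positional[1:]:
        if "=" in arg:
            key, _, raw = arg.partition("=")
            try:
                # Turn "4" into 4 and "0.5" into 0.5.
                params[key] = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                # Anything that is not a Python literal stays a string.
                params[key] = raw
        else:
            # A bare word after the CSV path is taken as the algorithm key.
            algorithm = arg
    return csv_path, algorithm, params, clean
```

Called with `["data.csv", "dbscan", "eps=0.5", "min_samples=5"]`, the sketch yields the CSV path, the key `dbscan`, a dict with a float and an int, and `clean` set to `False`.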
- `main.py`: Main entry point for clustering and evaluation
- `preprocessing.py`: Data cleaning and preprocessing utilities
- `clustering_algorithms/`: Contains implementations for each clustering method
- `requirements.txt`: Python dependencies
- `.gitignore`: Ignores CSV files, cache, and system files
- `GUIDE.md`: Complete usage guide and examples
- All output cluster assignments are saved as `<algorithm>_clusters.csv`.
- If you encounter errors, check the error message for a reference to `GUIDE.md`.
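Once a run has written `<algorithm>_clusters.csv`, a quick sanity check is to count how many rows landed in each cluster. The sketch below assumes the file has a header row and a label column named `cluster`; that column name is a guess, not something this README specifies, so pass the actual name via `label_column` if it differs.

```python
import csv
from collections import Counter


def cluster_sizes(path, label_column="cluster"):
    """Count rows per cluster label in an output CSV.

    `label_column` is a hypothetical default; check the header of your file.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(f"column {label_column!r} not found in {path}")
        # Labels are kept as strings, exactly as they appear in the file.
        return Counter(row[label_column] for row in reader)
```

For example, `cluster_sizes("kmeans_clusters.csv")` would return a `Counter` keyed by label. With DBSCAN-style output, a large count under a noise label usually suggests revisiting `eps` or `min_samples`.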
Happy clustering!