Problem-oriented AutoML in Clustering (PoAC) is a flexible framework for automating clustering tasks within the AutoML landscape. PoAC uses meta-learning and surrogate modeling to optimize clustering pipelines, and it lets users customize both the meta-features and the Clustering Validation Indices (CVIs).
- Problem Space Generation: Synthesize labeled clustering datasets through combinatorial analysis of dataset archetype parameters.
- Clustering Simulations: Create partitionings at multiple noise levels, then calculate CVIs and similarity metrics to simulate clustering performance.
- Feature Space Construction: Extract meta-features from the problem space datasets and combine them with the CVIs and similarity metrics to build a comprehensive meta-database.
- Surrogate Modeling: Train a regression model as a surrogate to predict the quality of clustering pipelines, enabling task-agnostic optimization across various clustering scenarios.
- Clustering Pipeline Synthesis: Integrate the trained surrogate model with popular AutoML frameworks like TPOT to enhance clustering evaluations.
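The clustering simulation step can be illustrated with a minimal, standard-library sketch. It is not PoAC's implementation: `perturb_labels` and `adjusted_rand_index` are hypothetical helpers written for this example, and the Adjusted Rand Index stands in for the similarity metrics mentioned above. The idea is to take a ground-truth labeling, corrupt a growing fraction of it, and record how the similarity to the ground truth degrades.

```python
import random
from collections import Counter
from math import comb


def perturb_labels(labels, noise, n_clusters, rng):
    """Reassign a fraction `noise` of the points to a different cluster (hypothetical helper)."""
    noisy = list(labels)
    n_flip = round(noise * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        choices = [c for c in range(n_clusters) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same points."""
    n = len(a)
    # Count point pairs that share a cluster in both labelings, in `a`, and in `b`.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


rng = random.Random(0)
truth = [i % 3 for i in range(300)]  # toy ground truth: 3 balanced clusters
scores = {
    noise: adjusted_rand_index(truth, perturb_labels(truth, noise, 3, rng))
    for noise in (0.0, 0.1, 0.3, 0.5)
}
for noise, ari in scores.items():
    print(f"noise={noise:.1f}  ARI={ari:.3f}")
```

Each (noise level, score) pair is the kind of record that, combined with meta-features and CVIs, could feed a surrogate regression model. The noise levels and the toy labeling here are arbitrary choices for illustration.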
To get started with PoAC, follow these steps:
- Clone the repository:
git clone git@github.com:Mcamilo/poac.git
cd poac
It’s recommended to use a virtual environment to manage dependencies.
- Create a virtual environment:
python3 -m venv poac-env
source poac-env/bin/activate  # On Windows, use `poac-env\Scripts\activate`
- Install the required packages:
pip install -r requirements.txt
We have divided the PoAC framework into two main stages: Training of the Surrogate Model and the Pipeline Synthesis. While the framework is designed to guide users through these stages sequentially, it is flexible enough to allow users to execute individual modules based on their specific needs. Additionally, PoAC comes with a pre-trained default surrogate model, enabling users to quickly start synthesizing and optimizing clustering pipelines without the need for training a new model.
import poac
import joblib
surrogate = poac.Surrogate()
# Start by defining the problem space, where you synthesize clustering datasets:
surrogate.populate_problem_space(sample_size=5, keep=False)
# Simulate clustering partitionings with varying levels of noise:
surrogate.simulate_solutions()
# Extract meta-features and combine with CVIs and similarity metrics
surrogate.extract_metafeatures()
# Train the surrogate model
surrogate_model = surrogate.build_model()
# Optionally, save the surrogate model
joblib.dump(surrogate_model, 'optimization/tpot/models/random_forest_model.joblib')
import poac
from sklearn.datasets import load_breast_cancer
# Example of using PoAC with TPOT
data = load_breast_cancer().data
optimizer = poac.Optimizer(data)
sv6light_meta_features = [
    'attr_ent.sd', 'sparsity.sd', 'cov.mean', 'var.mean', 'eigenvalues.mean',
    'sparsity.mean', 'wg_dist.sd', 'iq_range.mean', 'sil', 'dbs',
]
code, pipeline, labels = optimizer.synthesize(
    generations=3,
    population_size=5,
    meta_features=sv6light_meta_features,
)
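Once a pipeline has been synthesized, a quick look at the resulting partition is often useful. The sketch below assumes that `labels` is a sequence of cluster assignments, one per data point. The source does not document the return type, so this is an assumption, and `summarize_partition` is a hypothetical helper rather than part of PoAC.

```python
from collections import Counter


def summarize_partition(labels):
    """Return (number of clusters, {cluster label: size}) for a labeling."""
    sizes = Counter(labels)
    return len(sizes), dict(sorted(sizes.items()))


# Stand-in for the `labels` returned by optimizer.synthesize(...)
example_labels = [0, 0, 1, 1, 1, 2]
n_clusters, sizes = summarize_partition(example_labels)
print(n_clusters, sizes)
```

Very small or very unbalanced clusters in this summary can be a hint to rerun the synthesis with more generations or a larger population.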
In our experiments, integrating the PoAC surrogate model into TPOT achieved a mean Adjusted Rand Index (ARI) of 70% across 100 synthetic datasets. The model's flexibility and robustness make it suitable for a wide range of clustering tasks and AutoML applications.
We welcome contributions to PoAC! Please fork the repository, create a new branch, and submit a pull request. For major changes, please open an issue to discuss your proposed changes.
This project is licensed under the MIT License - see the LICENSE file for details.