Amazon DenseClus

DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. Because it handles both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

python3 -m pip install amazon-denseclus

Quick Start

DenseClus requires a pandas DataFrame as input with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call fit and then retrieve the clusters!

from denseclus import DenseClus
from denseclus.utils import make_dataframe


df = make_dataframe()
clf = DenseClus()
clf.fit(df)

scores = clf.evaluate()
print(scores[0:10])
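
If you want to try DenseClus on your own data instead of the output of make_dataframe, the input only needs to mix numerical and categorical columns. The following is a minimal sketch of such a frame built with pandas and NumPy; the column names and value ranges are hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical columns, for illustration only: two numerical, two categorical.
df = pd.DataFrame(
    {
        "age": rng.integers(18, 80, size=n_rows),
        "income": rng.normal(50_000, 15_000, size=n_rows),
        "region": rng.choice(["north", "south", "east", "west"], size=n_rows),
        "plan": rng.choice(["basic", "premium"], size=n_rows),
    }
)

# Sanity check: the frame should contain both kinds of columns.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

A frame like this can then be passed to fit in the same way as in the Quick Start.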

Usage

Prediction

DenseClus supports a predict method when umap_combine_method is set to ensemble. Results are returned in a 2D array, with the first part being the labels and the second part the probabilities.

from denseclus import DenseClus
from denseclus.utils import make_dataframe

RANDOM_STATE = 10

df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)
clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method='ensemble')
clf.fit(train)

predictions = clf.predict(test)
print(predictions) # labels, probabilities
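
To work with the labels and probabilities separately, the array can be sliced. The sketch below uses a small stand-in array rather than real DenseClus output, and it assumes a layout of one row per sample with the label in the first column and the probability in the second; check predictions.shape on your own output before relying on this. The 0.5 threshold is an arbitrary illustrative value.

```python
import numpy as np

# Stand-in for the array returned by clf.predict(test).
# Assumption: shape (n_samples, 2), column 0 = label, column 1 = probability.
predictions = np.array(
    [
        [0, 0.92],
        [1, 0.75],
        [-1, 0.00],
        [0, 0.40],
    ]
)

labels = predictions[:, 0].astype(int)
probabilities = predictions[:, 1]

# HDBSCAN marks noise points with the label -1.
# Keep only non-noise points whose membership probability is at least 0.5.
confident = (labels != -1) & (probabilities >= 0.5)
print(labels[confident])
```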

On Combination Method

For slower but more stable results, select intersection_union_mapper to combine embedding layers via a third UMAP, which gives equal weight to both numerical and categorical columns. By default, the random seed is set, which eliminates the ability for UMAP to run in parallel but helps circumvent some of the randomness of the algorithm.

clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)

To Use with GPU with Ensemble

To use a GPU, first have RAPIDS installed. You can do this at setup by providing the CUDA version:

pip install amazon-denseclus[gpu-cu12]

Then to run:

clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True
)
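
If the same script has to run on machines with and without a GPU stack, one option is to decide the use_gpu flag at runtime. This is only a sketch: it assumes that a working RAPIDS install exposes the cuml module, and it does not verify that a usable GPU is actually present.

```python
import importlib.util

# Assumption: a working RAPIDS install exposes the `cuml` module.
has_rapids = importlib.util.find_spec("cuml") is not None

# Keyword arguments mirroring the example above; use_gpu falls back to False
# when RAPIDS is not importable.
denseclus_kwargs = {
    "umap_combine_method": "ensemble",
    "use_gpu": has_rapids,
}
print(denseclus_kwargs)
```

The dictionary can then be unpacked into the constructor with DenseClus(**denseclus_kwargs).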

Advanced Usage

For advanced users, it's possible to exert more fine-grained control over the underlying algorithms by passing dictionaries into the DenseClus class for either UMAP or HDBSCAN.

For example:

from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # no seed set, so UMAP can run in parallel
)

clf.fit(df)

Examples

Notebooks

A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.

Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook

Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook.

Blogs

AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data

TDS Blog: How To Tune HDBSCAN

TDS Blog: On the Validation of UMAP

References

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}