clusterval

For validating clustering results

Motivation

This package was made to facilitate the process of clustering of a dataset. The user needs only to specifiy the dataset or the pairwise distances in the form of a list-like structure and clustering will be performed as well as evaluation of the results, through the use of CVIs (Clustering Validation Indices). clusterval outputs the best partition of the data, a dendrogram and k, the number of clusters. Clustering algorithms available are: 'single', 'complete', 'ward', 'centroid', 'average' and 'kmeans.

Installation

You can get the stable version from PyPI:

pip install clusterval

Or the development version from GitHub:

pip install git+https://github.com/Nuno09/clusterval.git

Basic usage

1. Load libraries.

from clusterval import Clusterval
from sklearn.datasets import load_iris, make_blobs

2. Let's use the iris dataset

data = load_iris()['data']

3. Use clusterval to determine the optimal number of clusters. The code below will create a Clusterval object that for an input dataset will partition the data into 2-8 clusters using hierarchical aglomerative clustering, with ward criteria, then calculate various CVIs across the results.

c = Clusterval()
c.evaluate(data)

Outupt:
    Clusterval(min_k=2, max_k=8, algorithm=ward, bootstrap_samples=250, index=['all'])
    final_k = 2

4. If user wishes more information on the resulting clustering just type below command.

print(c.long_info)

Long output:

* Minimum number of clusters to test: 2
* Maximum number of clusters to test: 8
* Number of bootstrap samples generated: 250
* Clustering algorithm used: ward

* Validation Indices calculated: ['AR', 'FM', 'J', 'AW', 'VD', 'H', 'F', 'VI', 'K', 'Phi', 'RT', 'SS', 'CVNN', 'XB', 'SDbw', 'DB', 'S', 'SD', 'PBM', 'Dunn']

* Among all indices: 



* According to the majority rule, the best number of clusters is 2


* 15 proposed 2 as the best number of clusters 

* 3 proposed 3 as the best number of clusters 

* 1 proposed 6 as the best number of clusters 

* 1 proposed 8 as the best number of clusters 

			***** Conclusion *****			
         AR        FM         J        AW        VD         H         F        VI         K           Phi        RT        SS      CVNN        XB      SDbw        DB         S         SD        PBM  \
2  0.999426  0.526738  0.356955  0.000331  0.385982  0.000283  0.526045  1.312383  0.527433  2.974318e-06  0.333611  0.217291  1.000000  0.261539  0.752460  0.191376  0.722234  10.714331  20.485220   
3  1.000172  0.379244  0.233104 -0.000791  0.525643 -0.000891  0.378001  1.972355  0.380492 -9.809481e-06  0.356957  0.131954  0.974122  0.414386  0.791627  0.388819  0.560392  10.764615  25.473608   
4  1.000080  0.285500  0.166032 -0.000607  0.586732 -0.000677  0.284666  2.511133  0.286337 -8.672063e-06  0.418338  0.090562  1.233481  0.610347  0.798949  0.448981  0.458409  12.500304  19.111932   
5  1.000070  0.263834  0.150461 -0.000815  0.591750 -0.000972  0.261470  2.676828  0.266223 -1.296032e-05  0.434558  0.081374  1.552875  0.578900  0.699964  0.527533  0.436093  15.188085  18.932200   
6  1.000042  0.192872  0.106553 -0.000079  0.675911 -0.000065  0.192512  3.096166  0.193233 -8.572647e-07  0.524317  0.056289  1.518029  0.767859  0.695073  0.580975  0.367109  18.297956  16.657177   
7  1.000036  0.170423  0.092730 -0.001502  0.687232 -0.001626  0.169647  3.246090  0.171204 -2.903045e-05  0.554676  0.048633  1.479109  0.884422  0.743185  0.737307  0.335528  21.942174  13.619411   
8  1.000032  0.155010  0.083478 -0.000446  0.694161 -0.000496  0.154014  3.349153  0.156014 -9.595088e-06  0.581493  0.043571  1.455532  0.884422  0.701268  0.681691  0.370146  22.229196  11.636157   

       Dunn  
2  0.338909  
3  0.112795  
4  0.123508  
5  0.123508  
6  0.131081  
7  0.131081  
8  0.150756  

* The best partition is:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]

4. It's also possible to visualize the hierarchical clustering. Note: in linux systems installation of library "python3-tk" might be needed.

c.plot()

5. The user can also change some execution parameters. For example, clustering algorithm, range of k to test, bootstrap simulations and CVI to use.

data, _ = make_blobs(n_samples=700, centers=10, n_features=5, random_state=0)
c = Clusterval(min_k=5, max_k=15, algorithm='kmeans', bootstrap_samples=200, index='CVNN')
c.evaluate(data)

Output:
    Clusterval(min_k=5, max_k=15, algorithm=kmeans, bootstrap_samples=200, index=['CVNN'])
    final_k = 10

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.idea		.idea
clusterval		clusterval
docs		docs
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

clusterval

Motivation

Installation

Basic usage

About

Releases

Packages

Languages

License

Nuno09/clusterval

Folders and files

Latest commit

History

Repository files navigation

clusterval

Motivation

Installation

Basic usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages