There are many ways to divide your data into subgroups. This is a short overview of some of these methods, enriched with a few comments on each method. In the second table I list several methods of calculating pairwise distances between records in your data.
There are many nuances to each of the methods described here; they are just a quick DuckDuckGo search away, on Stack Overflow and in myriad textbooks. After each table I include a short, hedged R sketch showing a few of the listed functions in action.
Name | Input | Particularly useful for | Pros | Cons | R function |
---|---|---|---|---|---|
K-means | Continuous variables only (standardized) | Continuous data sets, quick exploration of data. Can be used on already-transformed data (e.g. unit activities in neural networks, principal components) | Simple, widely used, many tools to compute optimal clusters | Variables need to be standardized; clusters need to span similar size in space and be spherical. Sensitive to outliers and to the initial choice of centres, so results can vary between runs. | stats::kmeans |
K-medoids | Dissimilarity matrix | Data with some outliers (more robust than k-means), mixed data that can be converted to distance | Works with any distance matrix. Slightly more robust than k-means. Cluster centres are actual exemplars from data. | Like k-means, not a model-based method so no metrics of model fit. | cluster::pam |
Decision tree | Any mix of variable types | Segmentations with one (specific/composite) outcome variable to guide the process; quick test of most strongly predictive variables | Segment construction very transparent and splitting rules are explicit. Leaves should be profiled to enrich description of each segment beyond few variables in decision tree. | Population split by most predictive variables only, so requires relevant and reliable outcome. Requires additional work to achieve reliable tree. | rpart::rpart |
Latent class analysis | Categorical variables only | Data sets that are categorical or have been converted to categorical. | One of few methods that can deal with categorical variables. Results simple to interpret (just class probabilities). As mixture model, outputs metrics to evaluate optimal class number. | Lose information as continuous distributions are discretized. | poLCA::poLCA |
Normal mixture models | Multivariate normally distributed data | Continuous data where you can reasonably assume subpopulations have normal distributions on each variable. | Mclust package very powerful in performing cluster selection, dimensionality reduction, and other useful functions. | Unclear if useful if continuous variables have non-normal distributions, which is often the case in real-world data. | mclust::Mclust |
Hierarchical clustering | Dissimilarity matrix | Quick explorations, visualizing clustering solutions with dendrograms. | Versatile technique, dendrogram gives useful intuition on number of clusters. | Commits to joining most adjacent records without reconsidering groupings – might require e.g. k-means to refine clusters. | stats::hclust |
Two-step clustering | Continuous and categorical variables | Large data (solves some memory issues), hands-off solution. Similar to hierarchical clustering. Sequentially groups records that are similar, then performs hierarchical clustering across these groups. | SPSS provides list of relevance of each variable to clustering solution. Does the heavy lifting from start to finish including optimal solution suggestion. | Order of records matters for final solution. Bit of a black box. | SPSS-native; possibly also available in the prcr package or the deprecated birch package |
Factor segmentation | | Reducing variables to a limited set of factors, then assigning each record to the cluster for which it has the highest factor score. | | | |
Manual segmentation | Any data | Situations in which you have strong priors on what is important and common sense can tell you what a relevant subgroup is (e.g. in ecommerce, everyone who hasn’t been to the shop in 30+ days) | Fast, leverages domain knowledge, easy to implement, not subject to vagaries of algorithms. | Might miss subgroups, will only take into account 1 or 2 variables. | N/A |
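To make the first table a bit more concrete, here is a minimal sketch comparing k-means (stats::kmeans) and k-medoids (cluster::pam) on the same standardized continuous data. The built-in iris measurements, the column selection, and the choice of three clusters are stand-ins for illustration only, not a recommendation.

```r
# Minimal sketch: k-means vs. k-medoids (PAM) on standardized continuous data.
# iris is used purely as a stand-in data set; swap in your own dataframe.
library(cluster)

df <- scale(iris[, 1:4])                     # standardize continuous variables

km <- kmeans(df, centers = 3, nstart = 25)   # k-means on the scaled values
d  <- dist(df, method = "euclidean")         # pairwise dissimilarity matrix
pm <- pam(d, k = 3)                          # k-medoids on that distance matrix

table(kmeans = km$cluster, pam = pm$clustering)  # compare the two assignments
```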
Some of the more common clustering methods – including partitioning around medoids (PAM) and hierarchical clustering – work with "distance" or "dissimilarity" matrices. These are square matrices with the number of rows and columns equal to the number of records in your data. Each element in this matrix represents the dissimilarity between two records. The diagonal will be 0 (each record is identical to itself), and the units of the off-diagonal elements depend on the method used to calculate distance.
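To see that structure in practice, here is a short sketch that builds a small dissimilarity matrix with stats::dist and inspects it. The mtcars data and the three selected columns are placeholders for your own data.

```r
# A dissimilarity matrix is square and symmetric with zeros on the diagonal.
df <- scale(mtcars[, c("mpg", "hp", "wt")])  # placeholder continuous variables
d  <- dist(df, method = "euclidean")         # object of class "dist"
m  <- as.matrix(d)

dim(m)        # n x n: one row and one column per record
diag(m)       # all zeros: every record is identical to itself
m[1:3, 1:3]   # pairwise distances between the first three records
```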
There are many such methods and there is no need to cover them all. Some of the more important and common ones are described here.
Name | Input | Particularly useful for | Pros | Cons | R function |
---|---|---|---|---|---|
Euclidean distance | Continuous variables only, each on the same scale (e.g. standard deviations from standardization) | Run-of-the-mill clustering with continuous variables | Literally the shortest distance between two points in the space of the variables. | All distances approximate the same value in high-dimensional space (great Stack Exchange post), so be careful if you have many variables. | stats::dist(df, method='euclidean') |
Manhattan distance | As Euclidean. | Working with k-medoids, high-dimensional data, and data with outliers | Less influenced by outliers than Euclidean distance. Degrades less in high-dimensional settings. | | stats::dist(df, method='manhattan') |
Gower distance | Any mix of variables | Mixed dataframes. Gower distance will use an appropriate distance metric on each variable independently, then combine them across variables. No standardization needed. | Hugely flexible and allows virtually any data set to be segmented by PAM and hierarchical clustering. | Still need to consider outliers and skewed distributions as with most other methods. | cluster::daisy(df, metric='gower') |
Random forest distance | Any mix of variables | Taking a raw dataframe and brute-forcing it into a distance matrix by building 1000s of decision trees and seeing how often each pair of records ends up in the same leaf (or leaves close to one another) | Rather elegant one-size-fits-all approach. Can be used for both directed and undirected distance calculation (by setting the target of each tree to a random variable or a non-random target variable). R's randomForest package is very good. | Takes substantial computation and has some knobs to tweak. | randomForest::randomForest(~., data=df, ntree=10000, proximity=TRUE, oob.prox=TRUE) |
Correlation-based distance | Any mix of variables | Clustering based on patterns of values rather than the actual values themselves. | Places individuals close together that exhibit a similar pattern (e.g. high-high-low-low). Helps you avoid some of the pitfalls of other distance metrics in high dimensions. | Depending on your variables, the actual values might be much more important than patterns of responses. Do patterns matter in your data? | r = cor(t(df)) then dist = as.dist(1-r); for mixed variable types, psych::mixedCor() can supply the correlation matrix |
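To tie the two tables together, here is a hedged sketch that computes a Gower dissimilarity matrix for a small mixed dataframe and feeds it to both hierarchical clustering and PAM. The toy dataframe, its column names, and the choice of two clusters are invented purely for illustration.

```r
# Sketch: Gower distance on mixed data, then hclust and PAM on the result.
# The toy dataframe below is invented for illustration only.
library(cluster)

df <- data.frame(
  spend   = c(120, 15, 300, 42, 8, 55),                          # continuous
  visits  = c(10, 2, 25, 5, 1, 7),                               # count
  channel = factor(c("web", "app", "web", "app", "web", "app"))  # categorical
)

d <- daisy(df, metric = "gower")              # mixed-type dissimilarities in [0, 1]

hc <- hclust(as.dist(d), method = "ward.D2")  # hierarchical clustering on the matrix
plot(hc)                                      # dendrogram to eyeball a sensible cut

pm <- pam(d, k = 2)                           # refine/confirm with k-medoids
pm$clustering                                 # cluster assignment per record
```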