There are many ways to divide your data into subgroups. This is a short overview of some of these methods, enriched with a few comments on each method. In the second table I list several methods of calculating pairwise distances between records in your data.
There are many nuances to each of the methods described here; they are just a quick DuckDuckGo search away, on Stack Overflow and in myriad textbooks. After each table I include a short, hedged R sketch showing a few of the listed functions in action.
Name | Input | Particularly useful for | Pros | Cons | R function |
---|---|---|---|---|---|
K-means | Continuous variables only (standardized) | Continuous data sets, quick exploration of data. Can be used on already-transformed data (e.g. unit activities in neural networks, principal components) | Simple, widely used, many tools to compute optimal clusters | Variables need to be standardized; clusters need to span similar size in space and be spherical. Sensitive to outliers and to the initial choice of centres, so results can vary between runs. | stats::kmeans |
K-medoids | Dissimilarity matrix | Data with some outliers (more robust than k-means), mixed data that can be converted to distance | Works with any distance matrix. Slightly more robust than k-means. Cluster centres are actual exemplars from data. | Like k-means, not a model-based method so no metrics of model fit. | cluster::pam |
Decision tree | Any mix of variable types | Segmentations with one (specific/composite) outcome variable to guide the process; quick test of most strongly predictive variables | Segment construction very transparent and splitting rules are explicit. Leaves should be profiled to enrich description of each segment beyond few variables in decision tree. | Population split by most predictive variables only, so requires relevant and reliable outcome. Requires additional work to achieve reliable tree. | rpart::rpart |
Latent class analysis | Categorical variables only | Data sets that are categorical or have been converted to categorical. | One of few methods that can deal with categorical variables. Results simple to interpret (just class probabilities). As mixture model, outputs metrics to evaluate optimal class number. | Lose information as continuous distributions are discretized. | poLCA::poLCA |
Normal mixture models | Multivariate normally distributed data | Continuous data where you can reasonably assume subpopulations have normal distributions on each variable. | Mclust package very powerful in performing cluster selection, dimensionality reduction, and other useful functions. | Unclear if useful if continuous variables have non-normal distributions, which is often the case in real-world data. | mclust::Mclust |
Hierarchical clustering | Dissimilarity matrix | Quick explorations, visualizing clustering solutions with dendrograms. | Versatile technique, dendrogram gives useful intuition on number of clusters. | Commits to joining most adjacent records without reconsidering groupings – might require e.g. k-means to refine clusters. | stats::hclust |
Two-step clustering | Continuous and categorical variables | Large data (solves some memory issues), hands-off solution. Similar to hierarchical clustering. Sequentially groups records that are similar, then performs hierarchical clustering across these groups. | SPSS provides list of relevance of each variable to clustering solution. Does the heavy lifting from start to finish including optimal solution suggestion. | Order of records matters for final solution. Bit of a black box. | SPSS-native; possibly also available in the prcr package or the deprecated birch package |
Factor segmentation | | Reducing variables to a limited set of factors, then assigning each record to the cluster for which it has the highest factor score. | | | |
Manual segmentation | Any data | Situations in which you have strong priors on what is important and common sense can tell you what a relevant subgroup is (e.g. in ecommerce, everyone who hasn’t been to the shop in 30+ days) | Fast, leverages domain knowledge, easy to implement, not subject to vagaries of algorithms. | Might miss subgroups, will only take into account 1 or 2 variables. | N/A |
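To make the first table a bit more concrete, here is a minimal sketch comparing k-means (stats::kmeans) and k-medoids (cluster::pam) on the same standardized continuous data. The built-in iris measurements, the column selection, and the choice of three clusters are stand-ins for illustration only, not a recommendation.

```r
# Minimal sketch: k-means vs. k-medoids (PAM) on standardized continuous data.
# iris is used purely as a stand-in data set; swap in your own dataframe.
library(cluster)

df <- scale(iris[, 1:4])                     # standardize continuous variables

km <- kmeans(df, centers = 3, nstart = 25)   # k-means on the scaled values
d  <- dist(df, method = "euclidean")         # pairwise dissimilarity matrix
pm <- pam(d, k = 3)                          # k-medoids on that distance matrix

table(kmeans = km$cluster, pam = pm$clustering)  # compare the two assignments
```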
Some of the more common clustering methods – including partitioning around medoids (PAM) and hierarchical clustering – work with "distance" or "dissimilarity" matrices. These are square matrices with the number of rows and columns equal to the number of records in your data. Each element in this matrix represents the dissimilarity between two records. The diagonal will be 0 (each record is identical to itself), and the units of the off-diagonal elements depend on the method used to calculate distance.
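To see that structure in practice, here is a short sketch that builds a small dissimilarity matrix with stats::dist and inspects it. The mtcars data and the three selected columns are placeholders for your own data.

```r
# A dissimilarity matrix is square and symmetric with zeros on the diagonal.
df <- scale(mtcars[, c("mpg", "hp", "wt")])  # placeholder continuous variables
d  <- dist(df, method = "euclidean")         # object of class "dist"
m  <- as.matrix(d)

dim(m)        # n x n: one row and one column per record
diag(m)       # all zeros: every record is identical to itself
m[1:3, 1:3]   # pairwise distances between the first three records
```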
There are many such methods and there is no need to cover them all. Some of the more important and common ones are described here.
Name | Input | Particularly useful for | Pros | Cons | R function |
---|---|---|---|---|---|
Euclidean distance | Continuous variables only, each on the same scale (e.g. standard deviations from standardization) | Run-of-the-mill clustering with continuous variables | Literally the shortest distance between two points in the space of the variables. | All distances approximate the same value in high-dimensional space (great Stack Exchange post), so be careful if you have many variables. | stats::dist(df, method='euclidean') |
Manhattan distance | As Euclidean. | Working with k-medoids, high-dimensional data, and data with outliers | Less influenced by outliers than Euclidean distance. Degrades less in high-dimensional settings. | | stats::dist(df, method='manhattan') |
Gower distance | Any mix of variables | Mixed dataframes. Gower distance will use an appropriate distance metric on each variable independently, then combine them across variables. No standardization needed. | Hugely flexible and allows virtually any data set to be segmented by PAM and hierarchical clustering. | Still need to consider outliers and skewed distributions as with most other methods. | cluster::daisy(df, metric='gower') |
Random forest distance | Any mix of variables | Taking a raw dataframe and brute-forcing it into a distance matrix by building 1000s of decision trees and seeing how often each pair of records ends up in the same leaf (or leaves close to one another) | Rather elegant one-size-fits-all approach. Can be used for both directed and undirected distance calculation (by setting the target of each tree to a random variable or a non-random target variable). R's randomForest package is very good. | Takes substantial computation and has some knobs to tweak. | randomForest::randomForest(~., data=df, ntree=10000, proximity=TRUE, oob.prox=TRUE) |
Correlation-based distance | Any mix of variables | Clustering based on patterns of values rather than the actual values themselves. | Places individuals close together that exhibit a similar pattern (e.g. high-high-low-low). Helps you avoid some of the pitfalls of other distance metrics in high dimensions. | Depending on your variables, the actual values might be much more important than patterns of responses. Do patterns matter in your data? | r = cor(t(df)) then dist = as.dist(1-r); for mixed variable types, psych::mixedCor() can supply the correlation matrix |
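To tie the two tables together, here is a hedged sketch that computes a Gower dissimilarity matrix for a small mixed dataframe and feeds it to both hierarchical clustering and PAM. The toy dataframe, its column names, and the choice of two clusters are invented purely for illustration.

```r
# Sketch: Gower distance on mixed data, then hclust and PAM on the result.
# The toy dataframe below is invented for illustration only.
library(cluster)

df <- data.frame(
  spend   = c(120, 15, 300, 42, 8, 55),                          # continuous
  visits  = c(10, 2, 25, 5, 1, 7),                               # count
  channel = factor(c("web", "app", "web", "app", "web", "app"))  # categorical
)

d <- daisy(df, metric = "gower")              # mixed-type dissimilarities in [0, 1]

hc <- hclust(as.dist(d), method = "ward.D2")  # hierarchical clustering on the matrix
plot(hc)                                      # dendrogram to eyeball a sensible cut

pm <- pam(d, k = 2)                           # refine/confirm with k-medoids
pm$clustering                                 # cluster assignment per record
```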