Skip to content

Latest commit

 

History

History
36 lines (27 loc) · 10.2 KB

README.md

File metadata and controls

36 lines (27 loc) · 10.2 KB

Common segmentation algorithms and distance metrics

There are many ways to divide your data into subgroups. This is a short overview of some of these methods, enriched with a few comments on each method. In the second table I list several methods of calculating pairwise distances between records in your data.

There are many nuances to each of the methods described here - they are just a quick duckduckgo search away on stack overflow and myriad textbooks. Towards Data Science has a great post with animations. There are many interactive applications to see clustering in action, e.g. Stanford class, or ClusterEng which can use your own data or sample (genomic) data.

Lastly, real-world segmentation rarely looks like textbook examples. To quote this excellent blogpost:

If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions?

Name Input Particularly useful for Pros Cons R function Example
K-means Continuous variables only (standardized) Continuous data sets, quick exploration of data. Can be used on already-transformed data (e.g. unit activities in neural networks, principal components) Simple, widely used, many tools to compute optimal clusters Variables need to be standardized; clusters need to span similar size in space and be spherical. Breaks down in so many ways stats::kmeans Maibach et al.: Translating Health Psychology into Effective Health Communication: The American Healthstyles Audience Segmentation Project
K-medoids Dissimilarity matrix Data with some outliers (more robust than k-means), mixed data that can be converted to distance Works with any distance matrix. Slightly more robust than k-means. Cluster centres are actual exemplars from data. Like k-means, not a model-based method so no metrics of model fit. cluster::pam Kenyon et al.: Classification of incidence and prevalence of certain sexually transmitted infections by world regions
Decision tree Any mix of variable types Segmentations with one (specific/composite) outcome variable to guide the process; quick test of most strongly predictive variables Segment construction very transparent and splitting rules are explicit. Leaves should be profiled to enrich description of each segment beyond few variables in decision tree. Population split by most predictive variables only, so requires relevant and reliable outcome. Requires additional work to achieve reliable tree. rpart::rpart Ozgulbas et al.: Financial profiling of public hospitals: an application by data mining
Latent class analysis Categorical variables only Data sets that are categorical or have been converted to categorical. One of few methods that can deal with categorical variables. Results simple to interpret (just class probabilities). As mixture model, outputs metrics to evaluate optimal class number. Lose information as continuous distributions are discretized. poLCA::poLCA Kemperman et al.: Heterogeneity in Urban Park Use of Aging Visitors: A Latent Class Analysis
Gaussian mixture models Multivariate gaussian distributed data Continuous data where you can reasonably assume subpopulations have gaussian distributions on each variable (fits a distribution described by mean and standard deviation). Where k-means assumes circular distribution, GMMs can fit any ellipse-shaped data. Mclust package very powerful in performing cluster selection, dimensionality reduction, and other useful functions. Inaccurate if continuous variables for a subgroup have non-normal distributions in some variables, which is often the case in real-world data. mclust::Mclust Raaijmakers et al.: Consumer segmentation based on health-related motive orientations and fruit and vegetable consumption
Hierarchical clustering Dissimilarity matrix Quick explorations, visualizing clustering solutions with dendrograms. Versatile technique, dendrogram gives useful intuition on number of clusters. Commits to joining most adjacent records without reconsidering groupings – might require e.g. k-means to refine clusters. stats::hclust Martins et al.: Clustering symptoms of non-severe malaria in semi-immune Amazonian patients
TwoStep clustering Continuous and categorical variables Large data (solves some memory issues), hands-off solution. Similar to hierarchical clustering. Sequentially groups records that are similar, then performs hierarchical clustering across these groups. SPSS provides list of relevance of each variable to clustering solution. Does the heavy lifting from start to finish including optimal solution suggestion. Order of records matters for final solution. Bit of a black box. SPSS only. SPSS-native, possibly also available in prcr package or deprecated birch:: Helm et al: Subgrouping outpatients of an environmental medicine unit using SCL-90-R and cluster analysis
Manual segmentation Based on existing knowledge or eyeballing distributions, set manual cut-offs Situations in which you have strong priors on what is important and common sense can tell you what a relevant subgroup is (e.g. in ecommerce, everyone who hasn’t been to the shop in 30+ days) Fast, leverages domain knowledge, easy to implement, not subject to vagaries of algorithms. Will never be better than your intuition, human mind can only take into account handful of variables. Any intervention where e.g. some age cut-off is set manually.

Calculating distances between records

Some of the more common clustering methods – including partitioning around medoids (PAM) and hierarchical clustering – work with "distance" or "dissimilarity" matrices (as does k-nearest neighbours). These are square matrices with number of rows and columns equal to number of records in your data. Each element in this matrix represents the dissimilarity between two records. The diagonal will be 0 (each record is identical to itself), and the unit on the non-diagonal depends on the method of calculating distance.

More advanced applications calculate distance in some abstract space, such as the feature space spanned by the activations of all neurons in the final layer of a neural net, or in a space generated for word embeddings.

There are many such methods and there is no need to cover them all. Some of the more important and common ones are described here.

Name Input Particularly useful for Pros Cons R function
Euclidean distance Continuous variables only, each on the same scale (e.g. standard deviations from standardization) Run-of-the-mill clustering with continuous variables Literally the shortest distance between two points in the space of the variables. All distances approximate the same value in high-dimensional space (great Stack Exchange post), so be careful if you have many variables. stats::dist(df, method=’euclidean’)
Manhattan distance As Euclidean. Working with k-medoids, high-dimensional data, and for outlier data Less influenced by outliers than Euclidean distance. Less bad in high-dimensional settings. stats::dist(df, method='manhattan')
Gower distance Any mix of variables Mixed dataframes. Gower distance will use an appropriate distance metric on each variable independently, then combine them across variables. No standardization needed. Hugely flexible and allows virtually any data set to be segmented by PAM and hierarchical clustering Still need to consider outliers and skewed distributions as with most other methods. cluster::daisy(df, metric='gower')
Random forest distance Any mix of variables Taking a raw dataframe and brute-forcing it into a distance matrix by building 1000s of decision trees and seeing how often each pair of records ends up in the same leaf (or leaves close to one another) Rather elegant one-size-fits-all approach. Can be used for both directed and undirected distance calculation (by setting target of each tree to random variable or non-random target variable). R’s randomforest package is very good. Takes substantial computation and has some knobs to tweak. randomForest::randomForest(~., data=df, ntree=10000, proximity=TRUE, oob.proximity=TRUE)
Correlation-based distance Continuous variables on same scale only Clustering based on patterns of values rather than actual values themselves. Places individuals close together that exhibit a similar pattern (e.g. high-high-low-low). Helps you avoid some of the pitfalls of other distance metrics in high dimensions. Depending on your variables, the actual values of variables might be much more important than patterns of responses. Only works if you can get each variable in the same space (e.g. z-score first) r = cor(t(df)) then dist = as.dist(1-r) link
Cosine (dis)similarity Continuous variables on the same scale High-dimensional data, popular in natural language processing Reduce arbitrarily dimensional space down to single number without some of the unexpected issues with euclidean distance not sure cos_angle = lsa::cosine(t(df)) then dist = as.dist(1 - cos_angle) (to be checked as cos_angle can be negative)