
# Fitting mixtures of t-distributions -- a tutorial


## Introduction

Like all other clustering techniques, mixture models have advantages and disadvantages and are only well suited to certain kinds of problems. The most common mixture model is the Gaussian mixture, which assumes that every cluster is approximately normally distributed. The number of components can be chosen with an information-criterion heuristic (fitting a separate finite mixture for each candidate number of components and comparing, e.g., BIC), or by fitting either a finite or an infinite mixture using a variational mean-field approximation, which automatically selects the number of clusters best supported by the data and a user-specified prior. Alternatively, mixture models can be fitted using MCMC sampling techniques, at considerably greater computational expense.
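To make the information-criterion heuristic concrete, here is a minimal sketch (using scikit-learn's `GaussianMixture` rather than this package, and a synthetic dataset invented for illustration) that fits finite mixtures with increasing numbers of components and keeps the one with the lowest BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated clusters (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(200, 2)),
])

# Fit finite mixtures with 1..5 components and pick the number
# of components that minimizes BIC.
bics = []
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gm.bic(X))
best_k = int(np.argmin(bics)) + 1
print(best_k)  # the data were generated with 2 clusters
```

The same loop works with any information criterion (AIC is also available via `gm.aic(X)`); BIC penalizes extra components more heavily and so tends to choose more parsimonious models.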

The assumption that all clusters are normal distributions might seem arbitrary and fairly problematic. One does, however, encounter a fair number of datasets where this assumption is approximately true, or at least reasonable, and hence a Gaussian mixture is sometimes a useful approach. Its usefulness is limited, however, by the fact that it can easily be thrown off by a sprinkling of outliers. Consider the toy dataset below, clustered using either a mixture of Student's t-distributions or a Gaussian mixture. Even with 5 restarts, the presence of a few outliers was enough to make the Gaussian mixture worse than useless.
(Figure: mixture_comp -- the toy dataset clustered with a Student's t-mixture vs. a Gaussian mixture.)
The moral of the story is that most of the time, if you're going to cluster data using a mixture model, a mixture of Student's t-distributions is a better bet than a Gaussian mixture. If there are no outliers, you'll get a clustering similar to what a Gaussian mixture would give you. If there are outliers, the Gaussian mixture may be useless, while the Student's t-mixture can still perform well.
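The robustness comes from the E-step of the t-mixture's EM algorithm, which down-weights points that lie far from a component's centre. The effect is visible even in the one-dimensional, single-component case; the sketch below (a hand-rolled illustration, not this package's API) estimates a Student's t location and scale by EM on contaminated data and compares the result with the ordinary sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 inliers around 0, plus 5 gross outliers at 50.
x = np.concatenate([rng.normal(0.0, 1.0, 100), np.full(5, 50.0)])

# EM for a single Student's t with fixed degrees of freedom nu.
# The E-step weights shrink toward 0 for points far from mu,
# so the outliers barely influence the M-step updates.
nu = 4.0
mu, sigma2 = x.mean(), x.var()
for _ in range(50):
    w = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)   # E-step weights
    mu = np.sum(w * x) / np.sum(w)                   # M-step: location
    sigma2 = np.sum(w * (x - mu) ** 2) / len(x)      # M-step: scale

print(x.mean())  # pulled toward the outliers, roughly 2.4
print(mu)        # stays close to the true centre of 0
```

A Gaussian fit corresponds to constant weights (all points count equally), which is exactly why a few extreme points can drag its estimates arbitrarily far off.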

This package does not provide functions or classes for MCMC-based fitting at this time, because the excellent PyMC3 package already provides good tools for that purpose. Rather, it focuses on fitting using either EM or variational mean-field. These approaches are much less computationally expensive and, while less well adapted to full posterior inference, are significantly more suitable for making predictions. Just like scikit-learn, then, these classes focus on prediction rather than inference.
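To illustrate the variational mean-field alternative, here is a hedged sketch using scikit-learn's `BayesianGaussianMixture` (a Gaussian rather than Student's t variational mixture, used here only because its API is widely known, on a synthetic dataset): deliberately over-specify the number of components and let the variational fit drive the weights of surplus components toward zero:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters.
X = np.vstack([
    rng.normal(-4.0, 1.0, size=(200, 2)),
    rng.normal(4.0, 1.0, size=(200, 2)),
])

# Ask for up to 6 components; a small weight-concentration prior
# encourages the variational fit to use as few as the data support.
bgm = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,
    max_iter=500,
    random_state=0,
).fit(X)

# Nearly all of the probability mass ends up on two components;
# the remaining weights are driven close to zero.
print(np.sort(bgm.weights_)[::-1].round(3))
```

The single fit replaces the loop over candidate component counts from the information-criterion approach, which is what makes variational selection attractive when the number of clusters is unknown.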