Classification methods often attempt to assign labels to a new data point from a set of possible labels. When different classification methods assign different sets of labels to the same data point, it can be unclear whether the labels shared among the assigned sets are due to chance or to some property of the new data point recognized by all the methods. To help answer this question, chyper is an R package for working with conditional hypergeometric distributions, which describe the number of overlapping labels when sets of labels (of fixed but arbitrary size) are randomly assigned by a fixed but arbitrary number of classification methods.
The chyper package implements conditional hypergeometric distributions in R to aid in comparing classification methods. While conditional or hierarchical hypergeometric models have been described in the bioinformatics literature (Kim 2018), an implementation in R has not previously been available. Like other R packages built for distributions, chyper includes a probability mass function, a cumulative distribution function, a quantile function, and a random number generator. In addition, it implements functions to give the mean of a conditional hypergeometric distribution, a p-value from observing a particular number of overlapping labels, a maximum likelihood estimator (MLE) for an unknown database overlap size, MLE and method of moments (MOM) estimators for an unknown database non-overlap size, and MLE and MOM estimators for an unknown label set size.
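To make the probability mass function and p-value concrete, the following is a minimal base-R sketch for the two-method case. It does not call chyper; the function names overlap_pmf2 and overlap_pval2 are hypothetical, and the model is an assumption: s labels are shared by both databases, n[i] labels are unique to database i, and method i assigns m[i] labels without replacement. The p-value is taken here as an upper-tail probability, which is one natural reading and may not match the package's default.

```r
# Hypothetical helpers for illustration only; these are NOT chyper functions.
# Assumed model: s shared labels, n[i] labels unique to database i,
# m[i] labels assigned without replacement by method i (i = 1, 2).

# PMF of the number of labels assigned by both methods.
overlap_pmf2 <- function(k, s, n, m) {
  i <- 0:min(s, m[1])                      # shared labels picked by method 1
  p_first <- dhyper(i, s, n[1], m[1])      # P(method 1 picks i shared labels)
  # Given i, method 2 draws m[2] labels from s + n[2], of which i would match.
  sapply(k, function(kk) sum(p_first * dhyper(kk, i, s + n[2] - i, m[2])))
}

# Upper-tail probability of seeing at least k_obs overlapping labels.
overlap_pval2 <- function(k_obs, s, n, m) {
  k_max <- min(s, m)                       # overlap cannot exceed s or any m[i]
  if (k_obs > k_max) return(0)
  sum(overlap_pmf2(max(k_obs, 0):k_max, s, n, m))
}

# Example with made-up sizes: 100 shared labels, 900 and 400 unique labels,
# and label sets of size 50 and 30.
s <- 100; n <- c(900, 400); m <- c(50, 30)
cat("P(overlap = 2):", overlap_pmf2(2, s, n, m), "\n")
cat("P(overlap >= 2):", overlap_pval2(2, s, n, m), "\n")
```

A small upper-tail probability under this model would suggest that the observed overlap is larger than random assignment alone would typically produce.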
This package was designed to be used in comparing classification methods for problems where each data point can receive multiple labels and each classification method can assign multiple labels to each data point. For example, in microbiome analysis, a single metagenomic sample can have many taxa assigned to it by methods with databases containing thousands of taxa, but the assignments will likely differ by method, and the databases will also differ by method (Sun et al. 2021). Alternatively, image classification often involves methods assigning multiple tags to an image, but different methods might assign different tags, and those different methods might differ in which tags are contained in their databases (Czerniawski and Leite 2020). This package provides a mathematically sound way to quantify the probability that labels assigned to some data point by multiple methods are due to some actual feature of the data rather than chance.
The package is available on CRAN, and can be installed as follows:
install.packages("chyper")
Consider two overlapping populations with
This solution can be extended to the case of
since
Two dynamic programming optimizations speed this computation significantly. First, calculating this PMF can be optimized by storing all the
Second, the calculation can be optimized by storing the
In the equation
This holds for any level; that is, multiplying
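The idea of storing intermediate results rather than recomputing them can be sketched in base R as a forward pass over the methods. This is an illustration under assumptions, not the package's implementation: the function name overlap_pmf is hypothetical, and the model assumed is s shared labels, n[j] labels unique to database j, and m[j] labels drawn without replacement by method j, with each level conditioned on the number of shared labels that survived the previous level. The product formula used for the mean check follows from the tower property under that assumed model.

```r
# Hypothetical sketch; names are illustrative and not taken from chyper.
# Assumed model: s shared labels, n[j] labels unique to database j,
# m[j] labels drawn without replacement by method j.
overlap_pmf <- function(s, n, m) {
  stopifnot(length(n) == length(m), all(m <= s + n))
  support <- 0:s
  # Distribution of the number of shared labels picked by method 1.
  dist <- dhyper(support, s, n[1], m[1])
  for (j in seq_along(m)[-1]) {
    # trans[i + 1, k + 1] = P(method j picks k of the i labels still shared).
    trans <- outer(support, support,
                   function(i, k) dhyper(k, i, s + n[j] - i, m[j]))
    # Reuse the stored distribution instead of re-summing over earlier levels.
    dist <- as.vector(dist %*% trans)
  }
  names(dist) <- support
  dist
}

# Example with made-up sizes for three methods.
s <- 20; n <- c(30, 50, 10); m <- c(10, 15, 12)
pmf <- overlap_pmf(s, n, m)

# Compare the mean of the computed PMF with the product form s * prod(m / (s + n)).
cat("mean from PMF:    ", sum(0:s * pmf), "\n")
cat("mean from product:", s * prod(m / (s + n)), "\n")
```

Each level costs one (s + 1) by (s + 1) table of hypergeometric probabilities, so the work grows linearly in the number of methods rather than with the number of nested sums.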
I’d like to thank Srihari Ganesh and Skyler Wu for their help in reviewing and improving the two-population case; Max Li for his help in optimizing the PMF calculation; and Meghan Short and Kelsey Thompson from the Huttenhower Lab at the Harvard T.H. Chan School of Public Health and the Broad Institute for their help in reviewing the math behind the arbitrary-population-number extension.