forked from hail-is/hail
[query] add hl.pgenchisq (hail-is#12605)
* [query] add hl.pgenchisq

CHANGELOG: Add `hl.pgenchisq`, the cumulative distribution function of the generalized chi-squared distribution.

The [Generalized Chi-Squared Distribution](https://en.wikipedia.org/wiki/Generalized_chi-squared_distribution) arises from weighted sums of sums of squares of independent normally distributed variables and is used by `hl.skat` to generate p-values. The simplest formulation I know for it is this:

    w : R^n
    k : Z^n
    lam : R^n
    mu : R
    sigma : R

    x ~ N(mu, sigma^2)
    y_i ~ NonCentralChiSquared(k_i, lam_i)

    Z = x + w y^T
      = x + sum_i{ w_i y_i }

    Z ~ GeneralizedNonCentralChiSquared(w, k, lam, mu, sigma)

The non-central chi-squared distribution arises from a sum of squares of independent normally distributed variables with non-zero mean and unit variance. The non-centrality parameter, lambda, is defined as the sum of the squares of the means of each component normal random variable.

Although the non-central chi-squared distribution has a closed form implementation (indeed, Hail implements its right-tail probability as `hl.pchisqtail`), the generalized chi-squared distribution does not have a closed form. There are at least four distinct algorithms for evaluating the CDF. To my knowledge, the oldest one is by Robert Davies:

    Davies, Robert. "The distribution of a linear combination of chi-squared random variables." Applied Statistics 29 323-333. 1980.

The [original publication](http://www.robertnz.net/pdf/lc_chisq.pdf) includes a Fortran implementation. Davies' [website](http://www.robertnz.net/QF.htm) also includes a C version. Hail includes a copy of the C version as `davies.cpp`. I suspect this code contains undefined behavior. Moreover, it is not supported on Apple M1 machines because we don't ship binaries for that platform. It seemed to me that the simplest solution is to port this algorithm to Scala.

This PR is that port. I tested against the 39 test cases that Davies provides with the source code. I also added some doctests based on the CDF plots from Wikipedia. The same 39 test cases are tested in Scala and in Python.

I am open to suggestions for the name. `pgenchisq` seems to strike a balance between clarity and brevity.

I believe this is the first CDF which can fail to converge. I included some relevant debugging information. I think we should standardize on a schema, but I need more examples before I am certain of the right standard.

I am open to critique of `GeneralizedChiSquaredDistribution.scala` but I will strongly argue against significant refactoring. I worry that we will subtly break this algorithm.

I directly reached out to Robert Davies to clarify the licensing of this algorithm. It appears to have been released at least under both GPL2 and MIT by unaffiliated third parties (who, really, have no right to apply a license to it). Do not remove WIP until I resolve this.

With this PR in place, `hl.skat` can be implemented entirely in Python.

* clarify license
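To make the formulation above concrete, here is a minimal Monte Carlo sketch that estimates the CDF of Z directly from the definition. It is not the Davies algorithm and not the code in this PR; the function names `sample_gen_chisq` and `monte_carlo_cdf` are hypothetical, and the sketch assumes every `k_i` is at least 1 and every `lam_i` is non-negative. It may be useful as an independent sanity check on values returned by a numerical implementation.

```python
import math
import random


def sample_gen_chisq(w, k, lam, mu, sigma, rng):
    """Draw one sample of Z = x + sum_i w_i * y_i (hypothetical helper).

    x ~ N(mu, sigma^2) and y_i ~ NonCentralChiSquared(k_i, lam_i).
    Assumes k_i >= 1 and lam_i >= 0.
    """
    # The normal term x; with sigma == 0 it degenerates to the constant mu.
    z = rng.gauss(mu, sigma) if sigma > 0 else mu
    for w_i, k_i, lam_i in zip(w, k, lam):
        # y_i is a sum of k_i squared unit-variance normals. Only the sum of
        # the squared means matters, so all of the non-centrality is placed
        # on the first component.
        y_i = 0.0
        for j in range(k_i):
            mean = math.sqrt(lam_i) if j == 0 else 0.0
            y_i += rng.gauss(mean, 1.0) ** 2
        z += w_i * y_i
    return z


def monte_carlo_cdf(x, w, k, lam, mu, sigma, n_samples=20000, seed=0):
    """Estimate P(Z <= x) as the fraction of samples that fall at or below x."""
    rng = random.Random(seed)
    hits = sum(
        sample_gen_chisq(w, k, lam, mu, sigma, rng) <= x
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    # A single central chi-squared component with one degree of freedom:
    # the estimate should be close to erf(1 / sqrt(2)).
    print(monte_carlo_cdf(1.0, [1.0], [1], [0.0], 0.0, 0.0))
    print(math.erf(1 / math.sqrt(2)))
```

The estimate converges slowly (the error shrinks roughly as one over the square root of `n_samples`), so it is only suited to coarse comparisons, not to the tail probabilities that `hl.skat` needs.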
Showing 11 changed files with 1,544 additions and 2 deletions.