generalize BaldingNichols to PritchardStephensDonnelly #3206

Merged 1 commit on Mar 22, 2018


@jbloom22
Contributor

jbloom22 commented Mar 21, 2018

A simple but powerful extension requested by @alexb-3 and Christina to allow for synthetic genotypes with very general and realistic-looking PCA plots with [redacted]. Alex pointed out that BaldingNichols is a special case of PritchardStephensDonnelly in a degenerate sense, just as the one-hot encoded Categorical(p_1, ..., p_k) is the distributional limit of Dirichlet(a * p_1, ..., a * p_k) as a goes to 0. So the substantive changes took about 10 lines.
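The limiting statement above can be checked numerically. The following is a minimal NumPy sketch, not part of the PR: the helper name `dirichlet_summary`, the example vector `p`, and the chosen values of `a` are illustrative assumptions. As `a` shrinks, the largest weight in each Dirichlet(a * p) draw approaches 1 (the draw looks one-hot), while the average draw stays close to `p`.

```python
import numpy as np

def dirichlet_summary(p, a, n_draws=20000, seed=0):
    """Draw from Dirichlet(a * p) and summarize how one-hot the draws look.

    Hypothetical helper for illustration only; it is not part of Hail.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    draws = rng.dirichlet(a * p, size=n_draws)  # shape (n_draws, K)
    mean_largest = draws.max(axis=1).mean()     # close to 1.0 means nearly one-hot
    mean_vector = draws.mean(axis=0)            # the expectation is p for every a > 0
    return mean_largest, mean_vector

p = [0.2, 0.3, 0.5]  # assumed example probabilities
for a in (100.0, 1.0, 0.1):
    mean_largest, mean_vector = dirichlet_summary(p, a)
    print(f"a={a}: mean largest weight={mean_largest:.3f}, "
          f"mean draw={np.round(mean_vector, 3)}")
```

For large `a` the draws cluster around `p` itself; for small `a` almost all of the weight in each draw sits on a single component, chosen with probability roughly `p_k`.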

It's turned on by the mixture parameter, which defaults to False and is marked as experimental. True means pop_dist is treated as the parameters of a Dirichlet distribution rather than a Categorical one. @alexb-3, it'd be great if you and Christina could experiment with it and extend the documentation accordingly. Once we have that, I'll add tests and remove "experimental". The plots below are already quite convincing.

import hail as hl
import matplotlib.pyplot as plt

mt = hl.balding_nichols_model(3, 500, 50, pop_dist=[0.01, 0.02, 0.05], fst=[.2, .3, .5])
_, pcs, _ = hl.hwe_normalized_pca(mt, 3)
plt.scatter(pcs.PC1.collect(), pcs.PC2.collect())

[Plot: ex0]

mt = hl.balding_nichols_model(3, 500, 50, pop_dist=[0.01, 0.02, 0.05], fst=[.2, .3, .5], mixture=True)

[Plot: ex1]

mt = hl.balding_nichols_model(3, 500, 50, pop_dist=[0.1, 0.2, 0.5], fst=[.2, .3, .5], mixture=True)

[Plot: ex2]

@tpoterba
Contributor

This is pretty awesome.

Also, this:

mt = hl.balding_nichols_model(3, 500, 50, pop_dist=[0.01, 0.02, 0.05], fst=[.2, .3, .5])
_, pcs, _ = hl.hwe_normalized_pca(mt, 3)
plt.scatter(pcs.PC1.collect(), pcs.PC2.collect())

is exactly what I want to write. 👍

@jbloom22
Contributor Author

Yeah, amazing how far the Python interface has come!

Here are the essential changes:

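// Per-sample population structure: a K x N matrix of mixture proportions when
// `mixture` is true, otherwise a 1 x N matrix of population indices.
val popOfSample_n = DenseMatrix.zeros[Double](if (mixture) K else 1, N)
if (mixture) {
  // pop_dist is used directly as the Dirichlet parameters
  val popDistRV = Dirichlet(popDist_k)
  (0 until N).foreach(j => popOfSample_n(::, j) := popDistRV.draw())
} else {
  // normalize pop_dist to probabilities before drawing a population index
  popDist_k :/= sum(popDist_k)
  val popDistRV = Multinomial(popDist_k)
  (0 until N).foreach(j => popOfSample_n(0, j) = popDistRV.draw())
}

// Allele frequency for sample i: a weighted average of the population allele
// frequencies under `mixture`, otherwise the frequency of the sample's population.
val p =
  if (mixture)
    popOfSample_nBc.value(::, i) dot popAF_k
  else
    popAF_k(popOfSample_nBc.value(0, i).toInt)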
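For readers who do not use Scala or Breeze, the same two steps can be sketched in NumPy. This is an illustrative translation under assumptions, not Hail's implementation: the function names `draw_pop_of_sample` and `allele_freq` are hypothetical, and `pop_af` stands in for the vector of K population allele frequencies at one variant (`popAF_k` in the Scala excerpt).

```python
import numpy as np

def draw_pop_of_sample(pop_dist, n_samples, mixture, rng):
    """Mirror of the sampling step: K x N proportions, or 1 x N population indices."""
    pop_dist = np.asarray(pop_dist, dtype=float)
    if mixture:
        # Column j holds sample j's mixture proportions, drawn from Dirichlet(pop_dist).
        return rng.dirichlet(pop_dist, size=n_samples).T
    # Normalize to probabilities, then draw one population index per sample.
    probs = pop_dist / pop_dist.sum()
    labels = rng.choice(len(probs), size=n_samples, p=probs)
    # Stored as floats in a 1 x N matrix, as in the Scala DenseMatrix[Double].
    return labels.astype(float).reshape(1, n_samples)

def allele_freq(pop_of_sample, pop_af, i, mixture):
    """Allele frequency for sample i given the K population frequencies pop_af."""
    pop_af = np.asarray(pop_af, dtype=float)
    if mixture:
        # Weighted average of population frequencies.
        return float(pop_of_sample[:, i] @ pop_af)
    # Frequency of the single population the sample belongs to.
    return float(pop_af[int(pop_of_sample[0, i])])
```

A one-hot column fed through the mixture branch gives the same value as the index lookup, which is the sense in which the non-mixture case is a degenerate special case.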

@tpoterba
Contributor

actually, this is what I want to write, I think:

mt = hl.balding_nichols_model(3, 500, 50, pop_dist=[0.01, 0.02, 0.05], fst=[.2, .3, .5])
_, pcs, _ = hl.hwe_normalized_pca(mt, 3)
hl.plot.scatter(pcs.PC1, pcs.PC2)

@patrick-schultz
Collaborator

Would it make sense to expose the a parameter, to make it easier to move between the three examples you showed? The mixture parameter could just be floating point rather than boolean, treating the default mixture=0 case specially.
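One way to picture this suggestion is the NumPy sketch below. It is hypothetical and does not reflect the merged API: the function name `draw_mixture_weights` and the float parameter `a` are assumptions, and here `a` scales the normalized `pop_dist`, with `a == 0` handled as the one-hot Categorical case.

```python
import numpy as np

def draw_mixture_weights(pop_dist, n_samples, a, rng):
    """Return a K x N matrix whose column j holds sample j's population weights.

    `a` is a hypothetical float-valued mixture parameter, not the merged API.
    """
    if a < 0:
        raise ValueError("a must be non-negative")
    p = np.asarray(pop_dist, dtype=float)
    p = p / p.sum()  # normalize so that `a` alone controls the concentration
    k = len(p)
    if a == 0:
        # Degenerate case: one-hot columns from Categorical(p).
        labels = rng.choice(k, size=n_samples, p=p)
        return np.eye(k)[labels].T
    # General case: Dirichlet(a * p); smaller a gives columns closer to one-hot.
    return rng.dirichlet(a * p, size=n_samples).T
```

Returning a K x N matrix in both cases would let the downstream allele-frequency step use a single dot product, at the cost of storing one-hot columns when `a == 0`.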

@patrick-schultz patrick-schultz merged commit cdfb75b into hail-is:master Mar 22, 2018
@cseed
Collaborator

cseed commented Mar 22, 2018

actually, this is what I want to write, I think

I started to write this and then I noticed you had already said it. Hailers are the best.

jbloom22 added a commit to jbloom22/hail that referenced this pull request Mar 22, 2018
konradjk pushed a commit to konradjk/hail that referenced this pull request Jun 12, 2018
jackgoldsmith4 pushed a commit to jackgoldsmith4/hail that referenced this pull request Jun 25, 2018