Implemented a variational inference (VI) algorithm to infer the coefficients of a Bayesian linear regression with a spike-and-slab prior, reparameterized as a Bernoulli-Gaussian prior.
Variational inference is a technique used in Bayesian statistics and machine learning to approximate complex posterior distributions when exact inference is computationally infeasible. It is particularly useful when dealing with models that involve high-dimensional latent variables or complex dependencies.
Polygenic risk score (PRS) regression is a statistical method used in genetics to predict complex traits or diseases based on the aggregate effects of multiple genetic variants. It involves calculating a polygenic risk score for an individual by summing the effects of genetic variants, each weighted by its effect size.
A spike-and-slab prior is a type of prior distribution commonly used in Bayesian regression to handle variable selection. It consists of two parts:
- Spike: This part of the prior assigns high probability mass to a specific value of the regression coefficient, often set to zero, representing the "sparsity" assumption, meaning that many coefficients are expected to be exactly zero.
- Slab: This part of the prior assigns a broader distribution to the coefficients, allowing them to take non-zero values.

By combining the spike and slab components, the spike-and-slab prior encourages sparsity in the model, effectively performing variable selection by setting some coefficients to exactly zero.
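To make the two components concrete, here is a minimal sketch of drawing coefficients from the Bernoulli-Gaussian form of the prior. The names `pi` (prior inclusion probability) and `tau_beta` (slab precision), and the values passed to them, are illustrative assumptions rather than the settings used in this project; only M = 100 comes from the text.

```python
import numpy as np

def sample_spike_and_slab(m, pi, tau_beta, rng):
    """Draw m coefficients from a Bernoulli-Gaussian (spike-and-slab) prior.

    pi       : assumed prior probability that a coefficient is non-zero
    tau_beta : assumed precision (inverse variance) of the slab
    """
    # s_j ~ Bernoulli(pi): 1 means the coefficient is drawn from the slab.
    s = rng.binomial(1, pi, size=m)
    # b_j ~ N(0, 1 / tau_beta): the slab draw.
    b = rng.normal(0.0, 1.0 / np.sqrt(tau_beta), size=m)
    # Coefficients with s_j = 0 are exactly zero (the spike).
    return s * b, s

rng = np.random.default_rng(0)
beta, s = sample_spike_and_slab(m=100, pi=0.05, tau_beta=1.0, rng=rng)
print(f"{int(s.sum())} of {beta.size} coefficients are non-zero")
```

Writing each coefficient as the product of a binary indicator and a Gaussian draw is what makes the prior convenient for variational updates: the indicator and the effect size can each be given their own simple approximate posterior.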
When applying variational inference to Bayesian polygenic risk score regression with a spike-and-slab prior, the main goal is to approximate the posterior distribution over the regression coefficients and other model parameters. This approximation is achieved by transforming the inference problem into an optimization problem. Variational inference optimizes a variational lower bound on the log marginal likelihood (evidence) of the model.
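For reference, the generic form of this bound can be written as follows. Here $\theta$ is a placeholder for all latent quantities (effect sizes and inclusion indicators) and $q$ is the variational distribution; both symbols are assumptions introduced for this illustration and are not the model-specific notation used below.

$$
\log p(y) = \underbrace{\mathbb{E}_{q}\left[\log p(y, \theta)\right] - \mathbb{E}_{q}\left[\log q(\theta)\right]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\left(q(\theta)\,\|\,p(\theta \mid y)\right) \;\ge\; \mathrm{ELBO}(q)
$$

Because the KL term is non-negative and $\log p(y)$ does not depend on $q$, maximizing the ELBO over $q$ is equivalent to minimizing the KL divergence between $q$ and the true posterior.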
In particular, the model is specified as below:
There are M = 100 SNPs,
In the expectation step (E-step), we cycle through the SNPs one at a time, updating each SNP's posterior precision, posterior mean effect size, and posterior inclusion probability (PIP) while holding all the other SNPs fixed.
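The coordinate-wise cycle described above can be sketched as follows. This is an illustration under stated assumptions, not the project's exact code: it uses the standard mean-field updates for a Bernoulli-Gaussian linear model, and the names `tau_eps` (noise precision), `tau_beta` (slab precision), `pi` (prior inclusion probability), and `pip_bounds` are hypothetical. The default cap and the starting values in the demo are placeholders, not the values used here.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function: 1 / (1 + exp(-x)).
    return np.exp(-np.logaddexp(0.0, -x))

def e_step(X, y, gamma, mu, tau_eps, tau_beta, pi, pip_bounds=(0.01, 0.99)):
    """One full coordinate-ascent sweep over all SNPs (assumed update form)."""
    gamma, mu = gamma.copy(), mu.copy()
    m = X.shape[1]
    tau = np.empty(m)
    xtx = np.einsum("ij,ij->j", X, X)  # x_j^T x_j for every SNP
    # Residual after removing the expected effects of all SNPs.
    resid = y - X @ (gamma * mu)
    for j in range(m):
        # Add SNP j back so the residual excludes only the other SNPs.
        resid += X[:, j] * (gamma[j] * mu[j])
        # Posterior precision and mean of the effect size, given inclusion.
        tau[j] = tau_eps * xtx[j] + tau_beta
        mu[j] = tau_eps * (X[:, j] @ resid) / tau[j]
        # Posterior log-odds of inclusion, mapped to a PIP.
        log_odds = (np.log(pi / (1.0 - pi))
                    + 0.5 * np.log(tau_beta / tau[j])
                    + 0.5 * tau[j] * mu[j] ** 2)
        gamma[j] = sigmoid(log_odds)
        # Remove SNP j again using its refreshed estimates.
        resid -= X[:, j] * (gamma[j] * mu[j])
    # Cap the PIPs after the full cycle (placeholder bounds).
    gamma = np.clip(gamma, *pip_bounds)
    return gamma, mu, tau

# Toy data: SNP 0 is the only causal variant (simulated, not the real data).
rng = np.random.default_rng(0)
n, m = 200, 20
X = rng.normal(size=(n, m))
y = X[:, 0] * 1.0 + rng.normal(size=n)
gamma, mu = np.full(m, 0.1), np.zeros(m)  # placeholder initial values
for _ in range(5):
    gamma, mu, tau = e_step(X, y, gamma, mu, tau_eps=1.0, tau_beta=1.0, pi=0.1)
print(np.argmax(gamma), gamma[0], mu[0])
```

Keeping a running residual avoids recomputing the full prediction for every SNP, so one sweep costs O(NM) rather than O(NM²).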
Set the initial values for the posterior estimates for all SNPs to the following values:
Set the initial values for the hyperparameters to the following values:
For numerical stability, after a full cycle of the E-step, cap the resulting PIP
In the maximization step (M-step), we didn't update
The evidence lower bound (ELBO) of the model is:
More specifically,
where $γ_j^* $ , $μ_j^* $ , and
I ran the EM updates for 10 iterations and plotted the ELBO as a function of the iteration number.
I then predicted the PRS for both the 439 training patients and the 50 testing patients from their genotypes:
where
I then calculated the Pearson correlation coefficient (PCC) between the predicted phenotypes and the true phenotypes, and generated the following scatter plots of the PRS predictions on the training and testing sets.
Inferred PIP γ. The causal SNPs rs9482449, rs7771989, and rs2169092 are colored in red.
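The PRS prediction and PCC evaluation steps described above can be sketched as below. This assumes the common form in which each SNP's posterior mean effect is weighted by its PIP before being multiplied by the genotype matrix. The function names `predict_prs` and `pearson_r` and the toy arrays are hypothetical and are not the real genotype or phenotype data.

```python
import numpy as np

def predict_prs(X, gamma, mu):
    """Predicted score: genotypes times PIP-weighted posterior mean effects."""
    return X @ (gamma * mu)

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# Toy example with 3 individuals and 2 SNPs (made-up numbers).
X_toy = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [2.0, 0.0]])
gamma_toy = np.array([1.0, 0.5])   # PIPs
mu_toy = np.array([2.0, 4.0])      # posterior mean effect sizes
prs_toy = predict_prs(X_toy, gamma_toy, mu_toy)

y_toy = np.array([5.5, 2.5, 4.0])  # made-up phenotypes
print(prs_toy, pearson_r(prs_toy, y_toy))
```

The same two calls would be applied separately to the training and testing genotype matrices, reusing the `gamma` and `mu` estimated on the training set.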