Total genetic contribution assessment (TGCA) based on genome-wide summary association statistics.
The TGCA analysis models the Z statistics from genome-wide association studies (GWAS) for each single genetic variant as drawn from a mixture of:
and thereafter makes inference on the TGCA parameter:
assessing the total genetic contribution of the variant on the analyzed set of phenotypes.
Start by installing the TGCA R package (v 1.0-0 tested), which works on all popular platforms including Windows, macOS, and Linux. We recommend installing directly from GitHub via:
require(devtools)
install_github('https://github.com/xiashen/TGCA')
Installation normally takes less than one minute on a reasonably fast internet connection. The package can then be loaded via:
require(TGCA)
or
library(TGCA)
In the R package, there is an embedded example dataset, which can be loaded as:
data(tgca)
Once loaded, three objects exist in the workspace: tstat, MAF, and N. They contain a matrix of GWAS Z-scores for 1,996 SNPs and 122 traits, the minor allele frequencies of these SNPs, and the sample sizes of these phenotypes, respectively (see also ?tgca). All three inputs are mandatory. The low-MAF SNPs are used to obtain a good estimate of the phenotypic correlation matrix from the GWAS Z-scores (Li, Ning, and Shen 2020). N is used to determine the weight on each phenotype, so that the total genetic contribution assessment is not dominated by a few highly powered GWAS traits. The tstat matrix must have row names, i.e., the SNP names.
tstat[1:5,1:5]
##                    134        135 2966_irnt 3761_irnt      2443
## rs375826246 -0.5767310  1.7772500  0.585063  0.885373  1.076920
## rs117619944  1.0435500  0.0123881 -1.835130  0.279999  1.144640
## rs1789089   -0.7954500 -0.5381990  0.710191  0.209710 -0.295909
## rs186756650 -2.9892300  1.0387600 -1.111360 -1.022240  0.123447
## rs1789088   -0.0372302 -1.1471900 -0.222836 -0.486009 -0.767688
head(MAF)
## rs375826246 rs117619944   rs1789089 rs186756650   rs1789088   rs1789087
##  0.00404353  0.03348060  0.48774500  0.00256649  0.05275140  0.34970100
head(N)
##       134       135 2966_irnt 3761_irnt      2443      2453
##    361136    361141     87024     72232    360192    359981
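Before running the analysis on your own summary statistics, it may help to confirm that the three inputs line up. The sketch below is not part of the TGCA package: check_tgca_inputs is a hypothetical helper, the toy data are simulated, and it assumes that MAF and N are ordered like the rows and columns of tstat, as in the embedded example. It also counts the SNPs that fall below the two MAF cut-offs discussed for TGCA.decorrelate (5e-4 and 0.05).

```r
# Hypothetical helper (NOT part of the TGCA package): basic consistency
# checks for user-supplied inputs shaped like tstat, MAF and N.
check_tgca_inputs <- function(tstat, MAF, N, maf.cuts = c(5e-4, 0.05)) {
  stopifnot(is.matrix(tstat), is.numeric(tstat))
  # SNP names are required as row names of the Z-score matrix
  stopifnot(!is.null(rownames(tstat)), !anyDuplicated(rownames(tstat)))
  # Assumption: one MAF per SNP (row) and one sample size per trait (column)
  stopifnot(length(MAF) == nrow(tstat), length(N) == ncol(tstat))
  stopifnot(all(MAF > 0 & MAF <= 0.5), all(N > 0))
  # Number of SNPs available below each candidate MAF cut-off
  n.low <- vapply(maf.cuts, function(cut) sum(MAF < cut), integer(1))
  names(n.low) <- paste0("MAF<", maf.cuts)
  n.low
}

# Simulated toy inputs: 20 SNPs, 3 traits
set.seed(1)
toy.z <- matrix(rnorm(20 * 3), nrow = 20,
                dimnames = list(paste0("snp", 1:20), c("t1", "t2", "t3")))
toy.maf <- c(rep(1e-4, 5), rep(0.01, 5), rep(0.3, 10))
toy.n <- c(t1 = 1000, t2 = 2000, t3 = 1500)

n.low <- check_tgca_inputs(toy.z, toy.maf, toy.n)
print(n.low)
```

With the embedded example, the same call would be check_tgca_inputs(tstat, MAF, N). A small count at the stricter cut-off suggests trying the looser one.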
The matrix of GWAS Z-scores can be decorrelated via:
decor <- TGCA.decorrelate(tstat, N, MAF)
which uses all the SNPs with MAF < 5e-4 to estimate the phenotypic correlations (see also ?TGCA.decorrelate). If you do not have a sufficient number of SNPs with such low MAFs, a threshold of MAF < 0.05 is also fine for most analyses and can be set via the argument maf.cut = 0.05. The returned object decor is a list with two elements. z.decorrelated is the decorrelated Z-score matrix, whose number of columns is chosen so that the retained eigenvectors capture 90% of the information in the original Z-score matrix. cor.pheno is the estimated phenotypic correlation matrix.
image(cor(decor$z.decorrelated))
image(decor$cor.pheno)
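To give a feel for what decorrelation does, here is a minimal sketch on simulated data. It is an illustration under stated assumptions, not the implementation inside TGCA.decorrelate: it ignores the sample-size weights from N, uses a plain eigendecomposition of the correlation estimated from low-MAF SNPs, and all object names (z, maf, cor.hat, z.dec) are made up for the example.

```r
# Illustrative sketch only; TGCA.decorrelate may differ (e.g. it also uses N).
set.seed(42)
n.snp <- 20000
n.trait <- 6

# Simulate null Z-scores whose traits share a known correlation structure
true.cor <- 0.5 ^ abs(outer(1:n.trait, 1:n.trait, "-"))
z <- matrix(rnorm(n.snp * n.trait), n.snp) %*% chol(true.cor)
maf <- runif(n.snp, 1e-4, 0.5)

# Step 1: estimate the trait correlation from low-MAF SNPs only
low <- maf < 0.05
cor.hat <- cor(z[low, ])

# Step 2: eigendecomposition; keep the leading components that together
# account for at least 90% of the total variance
eig <- eigen(cor.hat, symmetric = TRUE)
k <- which(cumsum(eig$values) / sum(eig$values) >= 0.9)[1]

# Step 3: rotate onto the retained eigenvectors and rescale each component,
# so that the resulting columns are approximately uncorrelated
z.dec <- z %*% eig$vectors[, 1:k, drop = FALSE] %*%
  diag(1 / sqrt(eig$values[1:k]), nrow = k)

# Off-diagonal correlations should now be close to zero
print(round(cor(z.dec), 2))
```

The number of retained columns k depends on how strongly the traits are correlated: the more redundant the traits, the fewer components are needed to reach 90%.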
With the decorrelated Z-score matrix, we can apply the TGCA mixture model to each SNP with MAF >= 5e-4 across the included set of phenotypes:
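# Drop SNPs with MAF < 5e-4. Logical indexing is used because
# x[-which(cond), ] would drop every row if no SNP matched.
low.maf <- !is.na(MAF) & MAF < 5e-4
zmat <- decor$z.decorrelated[!low.maf, ]
res <- TGCA.scan(zmat)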
This will take 2-3 minutes to complete. The returned object res is a data frame containing the estimated TGCA statistic theta (column theta.est) and every individual parameter of the mixture model. Additional columns in the same data frame provide the corresponding standard errors (.se) and p-values (.P). Note that we emphasise the use of the TGCA statistic theta itself rather than its statistical inference based on the p-values.
head(res)[,1:6]
##                 theta.P   theta.est    theta.se         pi0.P       pi0.est     pi0.se
## rs375826246 0.844227512 0.421103262 2.143140101 7.1202925e-01 4.8024344e-01 1.30100692
## rs117619944 0.615133975 0.188594321 0.375119005 9.9999785e-01 4.1456785e-06 1.53765096
## rs1789089   0.813586488 0.085283405 0.361674140 1.5681319e-03 7.5724755e-01 0.23950112
## rs35305608  0.092087544 0.223738862 0.132823126 2.2767430e-01 3.8919665e-01 0.32261759
## rs34783785  0.018746482 0.208043503 0.088509028 6.0087478e-05 5.8482376e-01 0.14575167
## rs538737363 0.912468872 0.393878832 3.583160387 7.8154267e-01 5.1246755e-01 1.84800031
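Since the emphasis is on theta itself, one simple way to read the output is to sort SNPs by theta.est while keeping theta.se in view. The sketch below does this on a toy copy of the six rows printed above (values rounded to four decimals). The object names res.toy and ranked, and the rule used to flag noisy estimates, are illustrative assumptions and not recommendations from the package.

```r
# Toy copy of the first rows of `res` shown above (values rounded)
res.toy <- data.frame(
  theta.P   = c(0.8442, 0.6151, 0.8136, 0.0921, 0.0187, 0.9125),
  theta.est = c(0.4211, 0.1886, 0.0853, 0.2237, 0.2080, 0.3939),
  theta.se  = c(2.1431, 0.3751, 0.3617, 0.1328, 0.0885, 3.5832),
  row.names = c("rs375826246", "rs117619944", "rs1789089",
                "rs35305608", "rs34783785", "rs538737363")
)

# Rank SNPs by estimated theta, largest first
ranked <- res.toy[order(res.toy$theta.est, decreasing = TRUE), ]

# Flag estimates whose standard error exceeds the estimate itself
# (an arbitrary, illustrative rule)
ranked$noisy <- ranked$theta.se > ranked$theta.est

print(ranked)
```

On the full result, the same two lines can be applied to res directly. In this toy subset the two largest theta estimates also carry the largest standard errors, which is why it is worth looking at both columns together.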
Links to our whole-genome TGCA analysis results for 5 different UK Biobank trait domains are provided in our paper (Li et al. 2021).
Li T, Ning Z, Yang Z, Zhai R, Xu W, Ying K, Wang Y, Chen Y, Shen X (2021). Total genetic contribution assessment across the human genome. Nature Communications, 12, 2845. https://www.nature.com/articles/s41467-021-23124-w
Li T, Ning Z, Shen X (2020). Improved estimation of phenotypic correlations using summary association statistics. Frontiers in Genetics, in press. On bioRxiv: https://www.biorxiv.org/content/10.1101/2020.12.10.419325v2
If you have questions, please feel free to email xia (dot) shen (at) ed (dot) ac (dot) uk.