Format of summary statistics

zhenin edited this page Feb 25, 2020 · 8 revisions

The input data file can have either the .rds or the .txt extension. It should include the following columns:

  • SNP, SNP ID;
  • A1, effect allele;
  • A2, reference allele;
  • N, sample size;
  • Z, z-score.

If Z is not available, you may instead provide:

  • b, estimate of the marginal effect in GWAS;
  • se, standard error of the estimate of the marginal effect in GWAS.

The summary statistics should look like this (b and se can be absent when Z is available):

##          SNP A1 A2      N        b       se      Z
## 1  rs3131962  G  A 205475 0.001004 0.004590 0.2187
## 2 rs12562034  A  G 205475 0.005382 0.005011 1.0740
## 3 rs11240779  A  G 205475 0.002259 0.003691 0.6119
## 4 rs57181708  G  A 205475 0.005401 0.005114 1.0562
## 5  rs4422948  G  A 205475 0.005368 0.003604 1.4893
## 6  rs4970383  A  C 205475 0.004685 0.003582 1.3080
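
The example rows above are consistent with Z = b / se (for instance, 0.001004 / 0.004590 ≈ 0.2187). If your file has b and se but no Z and you would like to add the column yourself, the following is a minimal sketch using awk. It assumes a whitespace-delimited .txt file with a header line and with b and se in columns 5 and 6; the file names and the temporary directory are hypothetical and used for illustration only.

```shell
# Hypothetical input: a small whitespace-delimited file with b and se but no Z.
tmpdir=$(mktemp -d)
cat > "$tmpdir/gwas.txt" <<'EOF'
SNP A1 A2 N b se
rs3131962 G A 205475 0.001004 0.004590
rs12562034 A G 205475 0.005382 0.005011
EOF

# Append a Z column computed as b / se (columns 5 and 6 in this assumed layout).
awk 'NR == 1 { print $0, "Z"; next }          # header: add the new column name
     { printf "%s %.4f\n", $0, $5 / $6 }      # data rows: append the z-score
    ' "$tmpdir/gwas.txt" > "$tmpdir/gwas.with_z.txt"

cat "$tmpdir/gwas.with_z.txt"
```

With the two rows shown, the appended Z values are 0.2187 and 1.0740, matching the table above. Adding Z by hand is optional, since b and se are accepted in place of Z.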

How to transform raw GWAS summary statistics into HDL input

If you know how to transform your GWAS into the above format, you can do it yourself. However, we have prepared an R script, HDL.data.wrangling.R, to make the transformation easier. It (i) extracts the SNPs that overlap between your GWAS and the HDL reference panel, and (ii) extracts the columns that HDL needs. Depending on the source of the GWAS, you may use HDL.data.wrangling.R in one of two ways:

The GWAS is downloaded from a typical analysis/consortium

HDL.data.wrangling.R has some built-in functions for transforming GWASs from typical analyses and consortia. However, the current version supports only the Neale Lab round 2 GWAS of UK Biobank. The performance of HDL.data.wrangling.R on other typical GWASs is still being tested.

Here we take the GWAS results for birth weight as an example (file name: 20002_1223.gwas.imputed_v3.both_sexes.tsv, see this page about downloading the data). You can run HDL.data.wrangling.R as follows:

Rscript HDL.data.wrangling.R \
gwas.path=/Path/to/gwas/20002_1223.gwas.imputed_v3.both_sexes.tsv \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
GWAS.type=UKB.Neale \
output.file=/Path/to/output/test

In this case, you should pass several arguments to HDL.data.wrangling.R. Please note that there must be no space on either side of = when you specify an argument.

  • Mandatory arguments
    • gwas.path, the path to the downloaded GWAS results;
    • LD.path, the path to the directory where linkage disequilibrium (LD) information is stored (i.e. where the reference panel is located);
    • GWAS.type, the analysis/consortium that the GWAS is from. The current version of HDL.data.wrangling.R supports only UKB.Neale (the Neale Lab round 2 GWAS of UK Biobank).
  • Optional arguments
    • output.file, the path and file name where the transformed data should be saved. If specified, the transformed data will be saved as output.file.hdl.rds. If not specified, the transformed data will be saved as gwas.path.hdl.rds.
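
The rule about spaces around = matters because the shell splits a command line on whitespace, so gwas.path = /some/file reaches the script as three separate arguments rather than one. As a rough illustration, the sketch below defines a helper, check_hdl_args, that flags arguments which are not of the form key=value. The helper is hypothetical: it is not part of HDL, and it only mirrors the rule stated above.

```shell
# Hypothetical helper (not part of HDL): succeeds only if every argument
# has the form key=value with a non-empty key and a non-empty value.
check_hdl_args() {
    for arg in "$@"; do
        case "$arg" in
            =*|*=) echo "malformed argument (space around '='?): $arg" >&2; return 1 ;;
            *=*)   ;;  # looks like key=value
            *)     echo "malformed argument (space around '='?): $arg" >&2; return 1 ;;
        esac
    done
}

# Well-formed: each key=value pair is a single shell word.
check_hdl_args gwas.path=/Path/to/gwas/your.gwas.txt GWAS.type=UKB.Neale \
    && echo "arguments look fine"

# Malformed: the spaces make the shell pass "gwas.path", "=" and the path separately.
check_hdl_args gwas.path = /Path/to/gwas/your.gwas.txt \
    || echo "arguments rejected"
```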

During data wrangling, for the above example, these messages will be printed:

Program starts on Tue Feb 25 12:23:39 2020
Loading GWAS summary statistics from  /Path/to/gwas/20002_1223.gwas.imputed_v3.both_sexes.tsv
Data is loaded successfully. Data wrangling starts.
Data wrangling completed.
1029876 out of 1029876 (100%) SNPs in reference panel are available in GWAS.
The output is saved to /Path/to/output/test.hdl.rds

As the last line of the above message suggests, the transformed GWAS is saved as /Path/to/output/test.hdl.rds and is ready as input for HDL.
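
The naming rule for the output file can be written down as a small shell helper, which may be handy when scripting several wrangling runs. This is only a sketch: the function name hdl_output_path is hypothetical, and it assumes that ".hdl.rds" is appended literally to output.file, or to gwas.path when output.file is not given, as described for the optional argument above.

```shell
# Hypothetical helper mirroring the naming rule described above:
#   output.file given  -> <output.file>.hdl.rds
#   output.file absent -> <gwas.path>.hdl.rds
hdl_output_path() {
    gwas_path=$1
    output_file=${2:-}
    if [ -n "$output_file" ]; then
        printf '%s.hdl.rds\n' "$output_file"
    else
        printf '%s.hdl.rds\n' "$gwas_path"
    fi
}

out=$(hdl_output_path /Path/to/gwas/20002_1223.gwas.imputed_v3.both_sexes.tsv /Path/to/output/test)
echo "$out"

# Check that the file exists before passing it on to HDL.
if [ -f "$out" ]; then
    echo "found $out"
else
    echo "not found: $out (has HDL.data.wrangling.R finished?)" >&2
fi
```

For the example above, the helper prints /Path/to/output/test.hdl.rds, the same path reported in the last line of the log.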

The GWAS is from other sources

In this case, instead of specifying GWAS.type, you need to tell HDL.data.wrangling.R explicitly how to interpret the column names in the GWAS. Other than this, the syntax is the same as in the previous section. For example, if your GWAS looks like this:

##         rsid alt ref  tstat n_complete_samples     beta       se
## 1  rs3131962   G   A 0.2187             205475 0.001004 0.004590
## 2 rs12562034   A   G 1.0740             205475 0.005382 0.005011
## 3 rs11240779   A   G 0.6119             205475 0.002259 0.003691
## 4 rs57181708   G   A 1.0562             205475 0.005401 0.005114
## 5  rs4422948   G   A 1.4893             205475 0.005368 0.003604
## 6  rs4970383   A   C 1.3080             205475 0.004685 0.003582

You should use the following command to run HDL.data.wrangling.R:

Rscript HDL.data.wrangling.R \
gwas.path=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples Z=tstat \
output.file=/Path/to/output/test

or, using b and se instead of Z:

Rscript HDL.data.wrangling.R \
gwas.path=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples b=beta se=se \
output.file=/Path/to/output/test
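
If you prefer to do the column mapping by hand, the same renaming can be sketched with awk. The example below assumes a whitespace-delimited .txt file with the header shown above, looks columns up by name, and writes them out under the names HDL expects. The file names are hypothetical. Note that this only selects and renames columns: it does not extract the SNPs that overlap with the HDL reference panel, which HDL.data.wrangling.R also does.

```shell
# Hypothetical input with the column names from the example above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/your.gwas.txt" <<'EOF'
rsid alt ref tstat n_complete_samples beta se
rs3131962 G A 0.2187 205475 0.001004 0.004590
rs12562034 A G 1.0740 205475 0.005382 0.005011
EOF

# Map rsid->SNP, alt->A1, ref->A2, n_complete_samples->N, tstat->Z.
awk 'NR == 1 {
         for (i = 1; i <= NF; i++) col[$i] = i   # remember each column position by name
         print "SNP", "A1", "A2", "N", "Z"
         next
     }
     {
         print $(col["rsid"]), $(col["alt"]), $(col["ref"]), \
               $(col["n_complete_samples"]), $(col["tstat"])
     }' "$tmpdir/your.gwas.txt" > "$tmpdir/your.gwas.hdl.txt"

cat "$tmpdir/your.gwas.hdl.txt"
```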