Format of summary statistics

Zheng Ning edited this page Feb 28, 2021 · 8 revisions

The input data file can have either a .rds or a .txt extension. It should include the following columns:

  • SNP, the SNP ID;
  • A1, the effect allele;
  • A2, the reference allele;
  • N, the sample size;
  • Z, the z-score.

If Z is not available, you may instead provide:

  • b, the estimate of the marginal effect in the GWAS;
  • se, the standard error of that estimate.

If the GWAS is based on logistic regression, b should be the logarithm of the odds ratio (OR) and se the standard error of log(OR). The summary statistics should look like this (b and se could be omitted in this example, since Z is available):

##          SNP A1 A2      N        b       se      Z
## 1  rs3131962  G  A 205475 0.001004 0.004590 0.2187
## 2 rs12562034  A  G 205475 0.005382 0.005011 1.0740
## 3 rs11240779  A  G 205475 0.002259 0.003691 0.6119
## 4 rs57181708  G  A 205475 0.005401 0.005114 1.0562
## 5  rs4422948  G  A 205475 0.005368 0.003604 1.4893
## 6  rs4970383  A  C 205475 0.004685 0.003582 1.3080
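
If your file provides b and se but no Z, the z-score is conventionally the effect estimate divided by its standard error, which is consistent with the example rows above (0.001004 / 0.004590 ≈ 0.2187). The following is a minimal shell sketch, assuming a whitespace-delimited .txt file whose header names the columns b and se. The function name add_z_column and the temporary file are illustrative and not part of HDL.

```shell
# Sketch: append a Z column (Z = b / se) to a whitespace-delimited file.
# add_z_column is a hypothetical helper name, not part of HDL.
add_z_column() {
  awk '
    NR == 1 {
      # Locate the b and se columns by name in the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "b")  b_col = i
        if ($i == "se") se_col = i
      }
      if (!b_col || !se_col) {
        print "b or se column not found" > "/dev/stderr"
        exit 1
      }
      print $0, "Z"
      next
    }
    # Data rows: effect estimate divided by its standard error.
    { printf "%s %.4f\n", $0, $b_col / $se_col }
  ' "$1"
}

# Example with two rows from the table above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
SNP A1 A2 N b se
rs3131962 G A 205475 0.001004 0.004590
rs12562034 A G 205475 0.005382 0.005011
EOF
add_z_column "$tmp"
rm -f "$tmp"
```

For these two rows the appended values are 0.2187 and 1.0740, matching the Z column shown above. The sketch does not guard against se being zero, and for a GWAS that reports odds ratios, b must already be log(OR) as described above.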

How to transform raw GWAS summary statistics into HDL input

If you know how to transform your GWAS into the above format, you can do it yourself. However, we have prepared an R script, HDL.data.wrangling.R, to make the transformation easier. It (i) extracts the SNPs shared by your GWAS and the HDL reference panel, and (ii) extracts the columns that HDL needs. Depending on the source of the GWAS, you may use HDL.data.wrangling.R in two ways:
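
To make step (i) concrete, here is a conceptual shell sketch of keeping only the GWAS rows whose SNP ID appears in a reference list. It is not how HDL.data.wrangling.R is implemented. It assumes a hypothetical plain-text file with one reference SNP ID per line (the real reference panel is not stored this way) and a GWAS file whose first column is the SNP ID. The function name keep_reference_snps is illustrative.

```shell
# Sketch: keep the header plus the rows whose first column is a reference SNP.
# Usage: keep_reference_snps REF_SNP_LIST GWAS_FILE
# Both the helper name and the one-ID-per-line list format are assumptions.
keep_reference_snps() {
  awk '
    FILENAME == ARGV[1] { ref[$1] = 1; next }  # first file: remember reference IDs
    FNR == 1            { print; next }        # second file: always keep the header
    ($1 in ref)                                # keep rows whose SNP is in the list
  ' "$1" "$2"
}

# Small demonstration with temporary files.
ref_list=$(mktemp)
gwas=$(mktemp)
printf 'rs3131962\nrs11240779\n' > "$ref_list"
cat > "$gwas" <<'EOF'
SNP A1 A2 N Z
rs3131962 G A 205475 0.2187
rs12562034 A G 205475 1.0740
rs11240779 A G 205475 0.6119
EOF
keep_reference_snps "$ref_list" "$gwas"
rm -f "$ref_list" "$gwas"
```

The demonstration prints the header and the rows for rs3131962 and rs11240779. The sketch matches on SNP ID only and does not compare alleles.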

The GWAS is downloaded from a typical analysis/consortium

HDL.data.wrangling.R has some built-in functions for transforming GWASs from typical analyses and consortia. However, the current version supports only the Neale Lab round 2 GWAS of UK Biobank. The performance of HDL.data.wrangling.R for other typical GWASs is still under testing.

Before we start, you need to download dictionary files to "translate" the variants in the Neale Lab's GWAS into SNPs which can be understood by HDL. You can use wget:

wget "https://www.dropbox.com/s/9x44r5lxy5oqz6s/snp.dictionary.imputed.rda?dl=0" \
-O /Path/to/reference/snp.dictionary.imputed.rda

Alternatively, you can download the file directly here. Note: if you download it manually, please make sure that the dictionary file is in the directory where the reference panel files are located.

Here we take the GWAS results for birth weight as an example (file name: 20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz; see this page about downloading the data). You can run HDL.data.wrangling.R as follows:

Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
GWAS.type=UKB.Neale \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1

In this case, there are several arguments you should pass to HDL.data.wrangling.R. Please note that when you specify arguments, there must be no space on either side of =.

  • Mandatory arguments
    • gwas.file, the path to the downloaded GWAS results;
    • LD.path, the path to the directory where linkage disequilibrium (LD) information is stored (i.e. where the reference panel is located);
    • GWAS.type, the analysis or consortium the GWAS comes from. The current version of HDL.data.wrangling.R supports one value:
      • UKB.Neale, the Neale Lab round 2 GWAS of UK Biobank.
  • Optional arguments
    • output.file, the path and file name where the transformed data should be saved. If specified, the transformed data will be saved as output.file.hdl.rds. If not specified, the transformed data will be saved as gwas.file.hdl.rds.
    • log.file, the path and file name where the log should be saved. If specified, the log will be saved as log.file.txt. If not specified, the log will not be saved.
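
The argument rules above can be checked before launching the script. The following is a minimal shell sketch, assuming only what is stated above: every argument has the form key=value with no spaces around =, and the output is saved as output.file.hdl.rds, or as gwas.file.hdl.rds when output.file is not given. The helper names check_hdl_args and hdl_output_path are hypothetical and not part of HDL.

```shell
# Sketch: pre-flight checks for an HDL.data.wrangling.R command line.
# Both helper names are hypothetical; they are not provided by HDL.

# Succeeds only if every argument looks like key=value. Writing
# "gwas.file = /path" makes the shell pass three separate words
# ("gwas.file", "=", "/path"), none of which passes this check.
check_hdl_args() {
  for arg in "$@"; do
    case "$arg" in
      ?*=?*) ;;  # non-empty key and non-empty value
      *)
        printf 'Malformed argument: "%s" (expected key=value, no spaces)\n' "$arg" >&2
        return 1
        ;;
    esac
  done
}

# Prints the file name the transformed data should end up in, following
# the naming rule described above.
# $1 = value of gwas.file, $2 = value of output.file (may be empty)
hdl_output_path() {
  if [ -n "$2" ]; then
    printf '%s.hdl.rds\n' "$2"
  else
    printf '%s.hdl.rds\n' "$1"
  fi
}

check_hdl_args gwas.file=/Path/to/gwas/your.gwas.txt GWAS.type=UKB.Neale \
  && echo "arguments look well-formed"
hdl_output_path /Path/to/gwas/your.gwas.txt /Path/to/gwas/gwas1
```

The check only looks at the shape of each argument. It does not verify that the keys are ones HDL.data.wrangling.R accepts or that the paths exist.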

During data wrangling, for the above example, these messages will be printed:

Program starts on Tue Feb 25 12:23:39 2020
Loading GWAS summary statistics from /Path/to/gwas/20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
Data is loaded successfully. Data wrangling starts.
Data wrangling completed.
1029876 out of 1029876 (100%) SNPs in reference panel are available in GWAS.
The output is saved to /Path/to/gwas/gwas1.hdl.rds
The log is saved to /Path/to/log/gwas1.txt 

As the last lines of the above message suggest, the transformed GWAS is saved as /Path/to/gwas/gwas1.hdl.rds and is ready as input for HDL.

The GWAS is from other sources

In this case, instead of specifying GWAS.type, you need to tell HDL.data.wrangling.R explicitly how to interpret the column names in the GWAS. Otherwise, the syntax is the same as in the previous section. For example, if your GWAS looks like this:

##         rsid alt ref  tstat n_complete_samples     beta       se
## 1  rs3131962   G   A 0.2187             205475 0.001004 0.004590
## 2 rs12562034   A   G 1.0740             205475 0.005382 0.005011
## 3 rs11240779   A   G 0.6119             205475 0.002259 0.003691
## 4 rs57181708   G   A 1.0562             205475 0.005401 0.005114
## 5  rs4422948   G   A 1.4893             205475 0.005368 0.003604
## 6  rs4970383   A   C 1.3080             205475 0.004685 0.003582

You should then run HDL.data.wrangling.R with the following command:

Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples Z=tstat \
output.file=/Path/to/gwas/test \
log.file=/Path/to/log/test

or, if you want to supply b and se instead of Z:

Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples b=beta se=se \
output.file=/Path/to/gwas/test \
log.file=/Path/to/log/test
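
Before writing the SNP=, A1=, A2=, N=, and Z= (or b= and se=) arguments, it can help to list the column names your GWAS file actually uses. The following is a small shell sketch, assuming a tab- or space-delimited file with a header row, optionally gzip-compressed (.gz or .bgz). The helper name show_columns is hypothetical and not part of HDL.

```shell
# Sketch: print the header columns of a GWAS file, one per line, numbered.
# show_columns is a hypothetical helper, not part of HDL.
show_columns() {
  case "$1" in
    # Read compressed files through stdin so gzip ignores the file suffix.
    *.gz|*.bgz) gzip -dc < "$1" ;;
    *)          cat "$1" ;;
  esac | awk 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }'
}

# Example using the header of the table above.
tmp=$(mktemp)
printf 'rsid\talt\tref\ttstat\tn_complete_samples\tbeta\tse\n' > "$tmp"
show_columns "$tmp"
rm -f "$tmp"
```

Each printed name can then be placed on the right-hand side of the matching argument, for example SNP=rsid. Column names that contain spaces would be split by this sketch.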