Format of summary statistics
The extension of the input data file can be either .rds or .txt.
It should include the following columns:
- SNP, SNP ID;
- A1, effect allele;
- A2, reference allele;
- N, sample size;
- Z, z-score.
If Z is not available, you may alternatively provide b, the estimate of the marginal effect in the GWAS, and se, the standard error of that estimate. If the GWAS is based on logistic regression, b should be the logarithm of the odds ratio, log(OR), and se should be the standard error of log(OR).
The summary statistics should look like this (b and se can be absent in this example since Z is available):
##          SNP A1 A2      N        b       se      Z
## 1  rs3131962  G  A 205475 0.001004 0.004590 0.2187
## 2 rs12562034  A  G 205475 0.005382 0.005011 1.0740
## 3 rs11240779  A  G 205475 0.002259 0.003691 0.6119
## 4 rs57181708  G  A 205475 0.005401 0.005114 1.0562
## 5  rs4422948  G  A 205475 0.005368 0.003604 1.4893
## 6  rs4970383  A  C 205475 0.004685 0.003582 1.3080
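When only b and se are available, the z-score is conventionally their ratio, Z = b / se. The sketch below checks this against two rows copied from the example table. It is only an illustration, not part of HDL; the variable names `gwas_rows` and `z_scores` are made up for the example, and the rows are assumed to be whitespace-separated in the order SNP, A1, A2, N, b, se.

```shell
# Two rows copied from the example table (columns: SNP A1 A2 N b se).
gwas_rows='rs3131962 G A 205475 0.001004 0.004590
rs12562034 A G 205475 0.005382 0.005011'

# Assumption: Z is computed as b / se. For a logistic regression GWAS,
# b would be log(OR) and se the standard error of log(OR).
z_scores=$(printf '%s\n' "$gwas_rows" | awk '{ printf "%s %.4f\n", $1, $5 / $6 }')

echo "$z_scores"
```

The two ratios round to 0.2187 and 1.0740, matching the Z column of the table.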
If you know how to transform your GWAS into the above format, you can do it yourself.
However, we have prepared an R script, HDL.data.wrangling.R, to make the transformation easier. It (i) extracts the SNPs that overlap between your GWAS and the HDL reference panel, and (ii) extracts the columns that HDL needs. Depending on the source of the GWAS, you may use HDL.data.wrangling.R in two ways:
For GWASs from typical analyses and consortia, HDL.data.wrangling.R has built-in functions for the transformation. However, the current version supports only the Neale Lab round 2 GWAS of UK Biobank; the performance of HDL.data.wrangling.R on other typical GWASs is still being tested.
Before we start, you need to download dictionary files to "translate" the variants in the Neale Lab's GWAS into SNPs that HDL can understand. You can use wget:
wget "https://www.dropbox.com/s/9x44r5lxy5oqz6s/snp.dictionary.imputed.rda?dl=0" \
-O /Path/to/reference/snp.dictionary.imputed.rda
Or you can directly download them here. Note: if you download a dictionary file manually, please make sure that it is in the directory where the reference panel files are located.
Here we take the GWAS results for birth weight as an example (file name: 20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz; see this page about downloading the data). You can run HDL.data.wrangling.R as below:
Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
GWAS.type=UKB.Neale \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1
In this case, there are several arguments you should pass to HDL.data.wrangling.R. Please note that when you specify arguments, there must be no space on either side of =.
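One way to see why spaces matter: the shell splits a command line into words at unquoted spaces, so a name=value pair written with spaces no longer arrives as a single argument. A minimal sketch, where `count_args` is a hypothetical helper that only reports how many arguments it received:

```shell
# Hypothetical helper: print how many arguments the shell passed in.
count_args() { echo "$#"; }

count_args gwas.file=/Path/to/gwas/your.gwas.txt      # prints 1: one name=value word
count_args gwas.file = /Path/to/gwas/your.gwas.txt    # prints 3: the pair is split apart
```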
- Mandatory arguments
  - gwas.file, the path to the downloaded GWAS results;
  - LD.path, the path to the directory where linkage disequilibrium (LD) information is stored (i.e. where the reference panel is located);
  - GWAS.type, which analysis/consortium the GWAS is from. The values supported by the current version of HDL.data.wrangling.R are:
    - UKB.Neale: The Neale Lab round 2 GWAS of UK Biobank
- Optional arguments
  - output.file, the path and file name where the transformed data should be saved. If specified, the transformed data will be saved as output.file.hdl.rds. If not specified, the transformed data will be saved as gwas.file.hdl.rds.
  - log.file, the path and file name where the log should be saved. If specified, the log will be saved as log.file.txt. If not specified, the log will not be saved.
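The naming rule for the transformed data can be sketched as a small shell helper. `expected_output` is a hypothetical function, not part of HDL; it assumes that .hdl.rds is appended to the value of output.file or, when output.file is not given, to the value of gwas.file, as described in the list above.

```shell
# Hypothetical helper: print where the transformed data is expected to be saved.
# $1 = value of gwas.file, $2 = value of output.file (may be empty or omitted).
expected_output() {
  if [ -n "${2:-}" ]; then
    echo "$2.hdl.rds"    # output.file given: suffix is appended to it
  else
    echo "$1.hdl.rds"    # assumption: otherwise the suffix is appended to gwas.file
  fi
}

expected_output /Path/to/gwas/your.gwas.txt /Path/to/gwas/gwas1
expected_output /Path/to/gwas/your.gwas.txt
```

With output.file=/Path/to/gwas/gwas1, the first call prints /Path/to/gwas/gwas1.hdl.rds.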
During data wrangling, for the above example, these messages will be printed:
Program starts on Tue Feb 25 12:23:39 2020
Loading GWAS summary statistics from /Path/to/gwas/20022_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
Data is loaded successfully. Data wrangling starts.
Data wrangling completed.
1029876 out of 1029876 (100%) SNPs in reference panel are available in GWAS.
The output is saved to /Path/to/gwas/gwas1.hdl.rds
The log is saved to /Path/to/log/gwas1.txt
As the last lines of the above messages suggest, the transformed GWAS is saved as /Path/to/gwas/gwas1.hdl.rds and is ready as input for HDL.
If your GWAS is not from a supported analysis or consortium, then instead of specifying GWAS.type you need to tell HDL.data.wrangling.R explicitly how to interpret the variable names in the GWAS. Otherwise, the syntax is the same as above. For example, if your GWAS looks like this:
##         rsid alt ref  tstat n_complete_samples     beta       se
## 1  rs3131962   G   A 0.2187             205475 0.001004 0.004590
## 2 rs12562034   A   G 1.0740             205475 0.005382 0.005011
## 3 rs11240779   A   G 0.6119             205475 0.002259 0.003691
## 4 rs57181708   G   A 1.0562             205475 0.005401 0.005114
## 5  rs4422948   G   A 1.4893             205475 0.005368 0.003604
## 6  rs4970383   A   C 1.3080             205475 0.004685 0.003582
You can run HDL.data.wrangling.R with the following command:
Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples Z=tstat \
output.file=/Path/to/gwas/test \
log.file=/Path/to/log/test
or, if you provide b and se instead of Z:
Rscript /Path/to/HDL/HDL.data.wrangling.R \
gwas.file=/Path/to/gwas/your.gwas.txt \
LD.path=/Path/to/reference/UKB_imputed_SVD_eigen99_extraction \
SNP=rsid A1=alt A2=ref N=n_complete_samples b=beta se=se \
output.file=/Path/to/gwas/test \
log.file=/Path/to/log/test
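Before running either command, it can help to confirm that the column names you pass really appear in the header of your GWAS file. The following is a minimal sketch and not part of HDL.data.wrangling.R; the variables `header`, `wanted` and `missing` are hypothetical, and the header is assumed to be space-separated as in the example above.

```shell
# Header line of the example GWAS
# (in practice, something like: header=$(head -n 1 /Path/to/gwas/your.gwas.txt)).
header='rsid alt ref tstat n_complete_samples beta se'

# Column names passed as SNP=, A1=, A2=, N= and Z= in the first command.
wanted='rsid alt ref n_complete_samples tstat'

missing=''
for col in $wanted; do
  case " $header " in
    *" $col "*) ;;                    # column name is present in the header
    *) missing="$missing $col" ;;     # remember any name that is not found
  esac
done

if [ -z "$missing" ]; then
  echo "all columns found"
else
  echo "missing columns:$missing"
fi
```

For a tab-separated file, the header would first need its tabs converted to spaces (for example with tr) for this simple match to work.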