
Syntax and results of HDL

Zheng Ning edited this page Jul 12, 2020 · 17 revisions

Overview

This document gives a brief look at the syntax and results of HDL, and shows how to use its parallel version to speed up the analysis. For a quick illustration, we use two sets of cleaned UKB GWAS summary statistics for array SNPs as examples. More details about this example can be found later.

Please note that although we use array SNPs as the reference panel here, it is recommended to use imputed SNPs as the reference panel for more precise estimates (see Reference panels). More real examples and applications can be found in the other pages of the wiki.

Estimating genetic correlation using HDL

Command line user

You can run HDL.run.R as shown below to use the HDL tool:

Rscript /Path/to/HDL/HDL.run.R \
gwas1.df=/Path/to/gwas1/gwas1.array.example.rds \
gwas2.df=/Path/to/gwas2/gwas2.array.example.rds \
LD.path=/Path/to/reference/UKB_array_SVD_eigen90_extraction \
output.file=/Path/to/output/test.Rout

There are several arguments you should pass to HDL.run.R. Please note that when you specify arguments, there must be no space on either side of =.

  • Mandatory arguments
    • gwas1.df, the path to the file containing the GWAS results for trait 1. Most common file extensions are supported, including .gz files. If a GWAS file fails to load, converting it to .txt or .rds is recommended;
    • gwas2.df, the path to the file containing the GWAS results for trait 2;
    • LD.path, the path to the directory where linkage disequilibrium (LD) information is stored (i.e. where the reference panel is located).
  • Optional arguments
    • output.file, where the log and results should be written. If you do not specify a file, the log will be printed in the console;
    • Nref, the sample size of the reference sample where LD is computed. If the default UK Biobank reference sample is used, Nref = 335265;
    • N0, the number of individuals included in both cohorts. The estimated genetic correlation is usually robust against a misspecified N0. If not given, N0 defaults to the minimum sample size across all SNPs in cohort 1 and cohort 2;
    • eigen.cut, which eigenvalues and eigenvectors in each LD score matrix should be used for HDL. Users may specify a numeric value between 0 and 1 for eigen.cut. For example, eigen.cut = 0.99 means using the leading eigenvalues that explain 99% of the variance, together with their corresponding eigenvectors. If the default 'automatic' is used, the eigen.cut that gives the most stable heritability estimates will be used;
    • jackknife.df, logical, FALSE by default. Whether the block-jackknife estimates should be returned. With jackknife.df=TRUE in the command-line version, the block-jackknife estimates are written to a file whose name is output.file followed by "_jackknife.df.txt".
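
To make the meaning of a numeric eigen.cut concrete, here is a minimal sketch in base R of picking the leading eigenvalues that explain a given share of the variance. It is an illustration only: the helper n_leading and the toy eigenvalues are hypothetical and are not taken from the HDL source, which may apply the cut differently.

```r
# Toy eigenvalues of one LD matrix, sorted in decreasing order (made-up values).
eigenvalues <- c(60, 20, 10, 5, 3, 2)

# Hypothetical helper (not part of HDL): the number of leading eigenvalues
# needed to explain at least `eigen.cut` of the total variance.
n_leading <- function(lam, eigen.cut) {
  prop <- cumsum(lam) / sum(lam)   # cumulative proportion of variance explained
  which(prop >= eigen.cut)[1]      # first position that reaches the cut
}

k90 <- n_leading(eigenvalues, 0.90)  # leading eigenvalues kept for a cut of 0.90
k99 <- n_leading(eigenvalues, 0.99)  # leading eigenvalues kept for a cut of 0.99
```

With these toy values, a higher cut keeps more eigenvalues, which is the trade-off that eigen.cut controls.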

R user

HDL.rg is the function that performs the HDL method. The arguments for HDL.rg are the same as the above arguments for the command-line implementation:

data(gwas1.example)
data(gwas2.example)
LD.path <- "/Path/to/reference/UKB_array_SVD_eigen90_extraction"
res.HDL <- HDL.rg(gwas1.example, gwas2.example, LD.path)
res.HDL

A list is returned with

  • rg, the estimation of genetic correlation.
  • rg.se, the standard error of estimated genetic correlation.
  • P, the Wald test P-value for rg.
  • estimates.df, a detailed matrix that includes the estimates and standard errors of heritabilities, genetic covariance and genetic correlation.
  • eigen.use, the eigen.cut used in computation.
  • jackknife.df, returned only if the argument jackknife.df is TRUE. A matrix that includes the block-jackknife estimates of heritabilities, genetic covariance and genetic correlation.
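
As a rough guide to how block-jackknife estimates are typically turned into a standard error, the sketch below applies the standard delete-one-block jackknife formula in base R. It assumes that each value is an estimate obtained with one block left out; the numbers are made up, and the exact layout of the jackknife.df matrix returned by HDL is not specified here.

```r
# Made-up leave-one-block-out estimates of a genetic correlation.
theta.loo <- c(-0.21, -0.23, -0.22, -0.20, -0.24)

# Standard delete-one jackknife standard error:
# sqrt((n - 1) / n * sum((theta_i - mean(theta))^2))
n <- length(theta.loo)
jk.se <- sqrt((n - 1) / n * sum((theta.loo - mean(theta.loo))^2))
jk.se
```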

Reading HDL results

The first section shows the specified arguments:

Function arguments:
gwas1.df=/opt/working/wilson/projects/prj_990_ldsc_enrich/hdl_test/HDL/gwas1.array.example.rds
gwas2.df=/opt/working/wilson/projects/prj_990_ldsc_enrich/hdl_test/HDL/gwas2.array.example.rds
LD.path=/opt/storage/wilson/projects/prj_994_UKB_ldscore/UKB_array_SVD_eigen90_extraction/
output.file=/opt/working/wilson/projects/prj_990_ldsc_enrich/hdl_test/test.out

Followed by some basic information about the installed version of HDL:

HDL: High-definition likelihood inference of genetic correlations (HDL)
Version 1.3.2 (2020-06-06) installed
Author: Zheng Ning, Xia Shen
Maintainer: Zheng Ning <zheng.ning@ki.se>
Tutorial: https://github.com/zhenin/HDL
Use citation("HDL") to know how to cite this work.

The next section reports the proportion of reference-panel SNPs that are also available in each set of GWAS summary statistics. Because a low SNP overlap leads to poor estimation, HDL will generate a warning if the overlap rate is lower than 99% (i.e. more than 3,075 SNPs missing for the array reference panel or more than 10,299 SNPs missing for the imputed reference panel).

Analysis starts on Sat Jun  6 22:51:47 2020
307519 out of 307519 (100%) SNPs in reference panel are available in GWAS 1.  
307519 out of 307519 (100%) SNPs in reference panel are available in GWAS 2.  
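
The 99% threshold can be checked by hand. The following base R sketch reproduces the arithmetic for the array reference panel, using the SNP count from the log above; the helper overlap.rate is hypothetical and only illustrates the check, not HDL's internal code.

```r
n.ref <- 307519                       # SNPs in the array reference panel (from the log)
max.missing <- floor(0.01 * n.ref)    # largest number of missing SNPs that stays at or above 99%

# Hypothetical helper: share of reference-panel SNPs found in a GWAS.
overlap.rate <- function(n.available, n.ref) n.available / n.ref

ok.rate  <- overlap.rate(n.ref - max.missing, n.ref)        # exactly at the limit
low.rate <- overlap.rate(n.ref - (max.missing + 1), n.ref)  # one more SNP missing

warn.ok  <- ok.rate < 0.99    # FALSE: no warning expected
warn.low <- low.rate < 0.99   # TRUE: a warning would be expected
```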

The last section gives the genetic correlation, its standard error, and P-value based on the Wald test. The estimates and standard errors of heritabilities and genetic covariance are also provided.

Heritability of phenotype 1:  0.1609 (0.0075) 
Heritability of phenotype 2:  0.0131 (0.0012) 
Genetic Covariance:  -0.0101 (0.0018) 
Genetic Correlation:  -0.2206 (0.0391) 
P:  1.70e-08
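
The P-value can be approximately reproduced from the printed estimate and standard error. This sketch assumes a two-sided Wald test against a standard normal distribution; because the printed values are rounded, the result is close to, but not exactly, the reported 1.70e-08.

```r
rg    <- -0.2206   # genetic correlation from the output above
rg.se <-  0.0391   # its standard error

z <- rg / rg.se              # Wald statistic
p <- 2 * pnorm(-abs(z))      # two-sided P-value under a standard normal
p
```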

Note: Although estimates of heritabilities and genetic covariance are also provided, they should be interpreted with caution. For LDSC, there have been some concerns about potential bias when estimating these quantities. However, the estimate of genetic correlation is much more robust due to its ratio form. As HDL is a natural extension of LDSC, we suggest focusing the application of HDL on estimating genetic correlations. Please see more details and discussions on this in the HDL paper.
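
The "ratio form" refers to the usual definition of genetic correlation as the genetic covariance divided by the geometric mean of the two heritabilities. As a quick check with the rounded values printed above (so the result only approximately matches the reported -0.2206):

```r
h2.1   <-  0.1609   # heritability of phenotype 1
h2.2   <-  0.0131   # heritability of phenotype 2
gencov <- -0.0101   # genetic covariance

# Genetic correlation as a ratio: covariance over sqrt(h2.1 * h2.2).
rg.check <- gencov / sqrt(h2.1 * h2.2)
rg.check
```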

Using parallel computing to speed up HDL

If multiple cores are available on your machine or server, they can be used to greatly speed up HDL. We have prepared a function HDL.rg.parallel to make parallelism very simple.

Command line user

You can run HDL.parallel.run.R as shown below to use the parallel version of HDL with two cores.

Rscript /Path/to/HDL/HDL.parallel.run.R \
gwas1.df=/Path/to/gwas1/gwas1.array.example.rds \
gwas2.df=/Path/to/gwas2/gwas2.array.example.rds \
LD.path=/Path/to/reference/UKB_array_SVD_eigen90_extraction \
output.file=/Path/to/output/test.Rout \
numCores=2

Compared with a non-parallel HDL run, there are only two changes in syntax:

  1. You should use HDL.parallel.run.R instead of HDL.run.R;
  2. The number of cores to be used should be specified with the argument numCores.

R user

HDL.rg.parallel is the function that performs parallel HDL. As in the command-line version, the only extra argument is numCores, which specifies the number of cores to be used:

data(gwas1.example)
data(gwas2.example)
LD.path <- "/Path/to/reference/UKB_array_SVD_eigen90_extraction"
res.HDL <- HDL.rg.parallel(gwas1.example, gwas2.example, LD.path, numCores = 2)
res.HDL
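
If you are unsure how many cores to request, one option is to query the machine with detectCores() from R's bundled parallel package. The sketch below leaves one core free for the rest of the system; that is a common convention and an assumption here, not a recommendation from HDL.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to a single core.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free, but always use at least one.
numCores <- max(1L, available - 1L)
numCores
```

The resulting value can then be passed as the numCores argument.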

Estimating heritability using HDL

Command line user

You can run HDL.run.R as shown below with only one GWAS to estimate heritability:

Rscript /Path/to/HDL/HDL.run.R \
gwas.df=/Path/to/gwas1/gwas1.array.example.rds \
LD.path=/Path/to/reference/UKB_array_SVD_eigen90_extraction \
output.file=/Path/to/output/test.Rout

The arguments are almost identical to those for estimating genetic correlation. The only difference is the use of gwas.df instead of gwas1.df and gwas2.df. Please note that when you specify arguments, there must be no space on either side of =.

  • Mandatory arguments
    • gwas.df, the path to the file containing the GWAS results for the trait. Most common file extensions are supported, including .gz files. If a GWAS file fails to load, converting it to .txt or .rds is recommended;
    • LD.path, the path to the directory where linkage disequilibrium (LD) information is stored (i.e. where the reference panel is located).
  • Optional arguments
    • output.file, where the log and results should be written. If you do not specify a file, the log will be printed in the console;
    • Nref, the sample size of the reference sample where LD is computed. If the default UK Biobank reference sample is used, Nref = 335265;
    • eigen.cut, which eigenvalues and eigenvectors in each LD score matrix should be used for HDL. Users may specify a numeric value between 0 and 1 for eigen.cut. For example, eigen.cut = 0.99 means using the leading eigenvalues that explain 99% of the variance, together with their corresponding eigenvectors. If the default 'automatic' is used, the eigen.cut that gives the most stable heritability estimates will be used.
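
The rule about spaces around = follows from how a shell splits a command line: gwas.df=/Path/to/file is passed as one argument, whereas gwas.df = /Path/to/file arrives as three separate ones. The base R sketch below mimics a simple key=value parser to show the effect. The function parse.args is hypothetical and is not the parser used by HDL.run.R.

```r
# Hypothetical key=value parser, for illustration only.
parse.args <- function(args) {
  parts  <- strsplit(args, "=", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  values <- vapply(parts, function(p) paste(p[-1], collapse = "="), character(1))
  setNames(as.list(values), keys)
}

# Written without spaces: each argument is one key=value token.
ok <- parse.args(c("gwas.df=/Path/to/gwas1.rds", "LD.path=/Path/to/reference"))

# Written with spaces: the shell hands over three tokens, so the key has no value.
bad <- parse.args(c("gwas.df", "=", "/Path/to/gwas1.rds"))

ok$gwas.df    # the path is recovered
bad$gwas.df   # empty string: the path was split off into a separate token
```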

R user

HDL.h2 is the function to estimate heritability with HDL. The arguments for HDL.h2 are the same as the above arguments for the command-line implementation:

data(gwas1.example)
LD.path <- "/Path/to/reference/UKB_array_SVD_eigen90_extraction"
res.HDL <- HDL.h2(gwas.df = gwas1.example, LD.path = LD.path)
res.HDL

A list is returned with

  • h2, the estimated heritability.
  • h2.se, the standard error of estimated heritability.
  • P, the Wald test P-value for h2.
  • eigen.use, the eigen.cut used in computation.