Skip to content

Tutorial 1: The Basics

Jonathan Jala edited this page Sep 14, 2023 · 10 revisions

Introduction

This introductory tutorial is meant to give some practice in using the mtag from the command line, but it is not intended to be comprehensive. It would be helpful take a look at the mtag's available options by navigating to its directory and entering:

python mtag.py -h

in the command line before beginning.

Getting Started

You should have already downloaded the mtag repository. For simplicity, all commands are executed from the directory including both the mtag folder and the input summary statistics, but this is not at all necessary. You can clone the repository with the command:

git clone https://github.com/omeed-maghzian/mtag.git

Sample GWAS Results and Data Format

Performing MTAG is useful when applied to the summary statistics of multiple GWAS. The GWAS summary statistics must be formatted in a way that mtag can read. This preprocessing has been done in the sample data used below; all that you need to do is unzip the files in the folder containing the mtag directory and you will be able to directly execute the mtag commands below.

We demonstrate the use of mtag with GWAS results on neuroticism and subjective well-being from Okbay et. al. (2016). We use a subset of the original GWAS results from a random subsample of SNPs found in Hapmap3. The neuroticism and subjective well-being data summary statistics can be found here (Note: requires registering for an account).

These sample files are already formatted in the default output accepted by the command line tool. That is, they are whitespace-delimited .txt files with the following column names (the specific order is not necessary):

snpid    chr    bpos    a1    a2    freq    z    pval    n

The following columns are necessary for MTAG to run:

  • snpid The unique snp identifier, typically an rsid number. The multiple summary statistics files are matched together using this column. This is the also the SNP identifier that is passed to ldsc when munging the summary statistics and estimating the residual covariance matrix. Other column names may be passed via the --snp_name option.
  • a1/a2: Alleles observed at the particular locus. a1 is considered to be the effect allele, which should also be reflected in the signs of the Z-scores. These columns are also passed to the ldsc routine. mtag checks and flips the a1 and a2 alleles so that they are identical across input files. Other column names may be passed via the --a1_name and --a2_name options.
  • freq: The effect allele frequency (that is, the frequency of a1), used to filter rare variants and weight convert MTAG results to unstandardized effect sizes. Other column names may be passed via the --eaf_name option.
  • z: The Z-scores associated with the SNP effect sizes for the GWAS. Along with the sample size column, they are used in virtually every essential step in the mtag program. Other column names may be passed with the --z_name option. Note: If both the effect sizes and the standard errors are present in the input GWAS summary statistics, z is not required by MTAG. To use this option, check that the repository is up-to-date and specify the flag --use_beta_se. The default column names are beta and se, which can be specified via the flags --beta_name and --se_name.
  • n: The SNP sample sizes. Like z, they form an essential component of the analysis. Other column names may be passed with the --n_name option.

The other columns are not directly used by mtag.py but are part of the munging procedure implemented via ldsc. Therefore, at minimum, they should be in a format recognized by the ldsc see here for details. The chromosome number chr and base pair position bpos of SNPs are also used as reference columns printed in the results files. Other names can be passed with the --chr_name and --bpos_name options, respectively.

Running MTAG with the defaults

Using mtag with the default options implements the following steps: 1. Read in the input GWAS summary statistics and filter the SNPs by minor allele frequency (MAF) >= 0.01 and sample size N >= (2/3) * 90th percentile. 2. Merge the filtered GWAS summary statistics results together, taking the intersection of available SNPs. 3. Estimate the residual covariance matrix via LD Score regression. 4. Estimate the genetic covariance matrix Omega. 5. Perform MTAG and output results.

We apply MTAG by simply calling the mtag.py script via python, with minimal options:

python /[path]/mtag.py  \
	--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
	--out ./tutorial_results_1.1NS \
	--n_min 0.0 \
      --stream_stdout &

Setting --sumstats is the only necessary option when calling mtag.py, but it is highly recommended to also set --out. The lower sample size bound is set to 0 via --n_min to avoid dropping more SNPs in this example. --stream_stdout allows the log of the analysis to be streamed to the console during runtime (off by default).

  • --sumstats should be given a whitespace-delimited list of GWAS summary statistics. The order of the files does not affect the analysis, but it does correspond to the index used for the results files.
  • --out should be given a string in the form of [DIR]/[TAG], where [DIR] is the output directory and [TAG] is the prefix attached to all files written by mtag. The default is ./mtag_results.

mtag output

Letting [TAG] and [DIR] denote the components of the --out option above, running mtag should have produced five files in your current directory:

  • [TAG].log timestamps the different steps taken by mtag.py.
  • [TAG]_sigma_hat.txt stores the estimated residual covariance matrix.
  • [TAG]_omega_hat.txt stores the estimated genetic covariance matrix.
  • [TAG]_trait_1.txt and [TAG]_trait_2.txt are tab-delimited results files corresponding to the MTAG-adjusted effect sizes and standard errors for the neuroticism and subjective well-being summary statistics, respectively. Note that the files are numbered in the order they were presented in the list provided to the --sumstats flag.

Log file

Opening up tutorial_results_1.log, we see that the header lists all the arguments passed to mtag.py.

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MTAG: Multi-trait Analysis of GWAS
<> Version: 1.0.7
<> (C) 2017 Omeed Maghzian, Raymond Walters, and Patrick Turley
<> Harvard University Department of Economics / Broad Institute of MIT and Harvard
<> GNU General Public License v3
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Calling ./mtag.py \
  --stream-stdout \
--n-min 0.0 \
--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
--out ./tutorial_results_1.1NS 

Most of the generic filters based on sample size, allele frequency, etc., are implemented through calling ldsc's munge_sumstats.py in the local version of ldsc contained with the mtag repository; users should consult the ldsc documentation to understand what is going on under the hood. The output of munge_sumstats is redirected into the mtag log file.

Apart from providing time stamps for the estimating the different matrices needed for MTAG, the log file also displays the calculated values of Omega and Sigma along with a summary output of the results:

Summary of MTAG results:
------------------------
Trait                 N (max)  N (mean)  # SNPs used  GWAS mean chi^2  MTAG mean chi^2  GWAS equiv. (max) N
1  ...hm3samp_NEUR.txt  111111   55913     167467       1.243            1.278            127214
2  ..._hm3samp_SWB.txt  111111   71824     167467       1.135            1.221            180879

Estimated Omega:
[[  4.677e-06  -2.302e-06]
[ -2.302e-06   1.959e-06]]

Estimated Sigma:
[[ 1.052 -0.167]
[-0.167  1.016]]


MTAG results saved to file.
MTAG complete. Time elapsed: 44.8766591549s

All values for each column are computed from the set of SNPs that passed all mtag filters. The GWAS mean chi^2 column is adjusted by the diagonal terms of Sigma. MTAG mean chi^2 is calculated using the effect sizes and standard errors produced by mtag. A clear way to assess the increases in power from MTAG is to approximate the sample size needed to achieve the same mean chi2 value in a standard GWAS. GWAS equiv. (max) N provides this measure by multiplying the maximum sample size (N (max)) by the ratio (chi2MTAG, mean-1)/(chi2GWAS, mean-1).

Results files

tutorial_results_1.1NS_trait_1.txt provides the MTAG-adjusted effect sizes for the neuroticism GWAS:

snpid    chr    bpos    a1    a2    z    n    freq    mtag_beta    mtag_se    mtag_z    mtag_pval
rs2736372    8    11106041    T    C    -7.71614161262    111111.111111    0.4179    -0.0324880486907    0.00419105765062    -7.7517541869    9.06317063823e-15
rs2060465    8    11162609    T    C    7.69444599845    62500.0    0.6194    0.038971244976    0.00536428475564    7.26494709944    3.73184288437e-13
rs10096421    8    10831868    T    G    -7.561098219    111111.111111    0.4646    -0.0306226260845    0.00414457338635    -7.38860751877    1.48374420103e-13
rs2409722    8    11039816    T    G    -7.38261601808    111111.111111    0.4627    -0.0321870047337    0.00414572461396    -7.76390323306    8.23547416667e-15
rs11991118    8    10939273    T    G    7.32202915636    111111.111111    0.5056    0.0306039239803    0.0041344320288    7.40220755042    1.33938936247e-13

The first eight columns are copied from the corresponding input file. mtag_beta and mtag_se provide the unstandardized weights and standard errors calculated by mtag, yielding the corresponding z-scores mtag_z and p-values mtag_pval. For standardized results, simply include the --std_betas flag (the p-values are unaffected).

The input and output files both treat the effect size as increasing in a1. The program uses the a1 and a2 columns of the first file as the reference for all other traits: the signs of the Z-scores for an input file are reversed if its a2 column is the first trait's a1 column. The allele columns in the results files are printed after this reversal, so you may need to be careful with the signs of mtag_beta when comparing them to your original GWAS results.

A word of caution about using MTAG with SNP sample sizes that are small: Generally, applications of mtag with a small final list of SNPs may not generate positive semidefinite estimates of the covariance matrices and may even lead to runtime errors in mtag.

Specifying matrix paths

If for some reason you would like to rerun MTAG but already have an estimate of Omega in hand, you can use the --gencov_path flag to directly pass an estimate of Omega to mtag. Likewise, estimates of Sigma may be passed with --residcov_path. In both cases, the typical estimation routine is skipped. It is best to manually pass these matrices only when you are confident that they are the likelihood-optimal (or close to optimal) values for the given set of SNPs used in MTAG. As a result, it is strongly recommended to only pass file paths to --gencov_path and --residcov_path that have been produced from an earlier application of mtag.

Without the need to estimate these covariance matrices, mtag's runtime is primarily limited by the reading andwriting of files (which is unusually small for our example set of summary statistics).

python /[path]/mtag.py  \
	--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
	--out ./tutorial_results_1.2SN \
	--n_min 0 \
	--gencov_path tutorial_results_1.1NS_omega_hat.txt \
	--residcov_path tutorial_results_1.1NS_sigma_hat.txt &

Summary of MTAG results:
------------------------
Trait                 N (max)  N (mean)  # SNPs used  GWAS mean chi^2  MTAG mean chi^2  GWAS equiv. (max) N
1  ...hm3samp_NEUR.txt  111111   55913     167467       1.243            1.278            127214
2  ..._hm3samp_SWB.txt  111111   71824     167467       1.135            1.221            180879

Estimated Omega:
[[  4.677e-06  -2.302e-06]
[ -2.302e-06   1.959e-06]]

Estimated Sigma:
[[ 1.052 -0.167]
[-0.167  1.016]]


MTAG results saved to file.
MTAG complete. Time elapsed: 23.8759078979s