---
title: "MTAG_analysis"
format: html
editor: visual
---

## MTAG Analysis Workflow

This is roughly the methodology I used in my MTAG analysis. It is written as a Quarto document simply so that code chunks from different languages can be rendered in one place.

#### Step 1: Set up and run the jvfe-LDSC pipeline to get traits of interest

This is an optional step. If you want a broader view of the genetic correlations across a large number of traits, this Nextflow workflow can be quite handy. It calculates the genetic correlation between GWAS sumstats from [Neale's lab UKB GWAS](https://github.com/Nealelab/UK_Biobank_GWAS). More information is available at [jvfe's GitHub page](https://github.com/jvfe/jvfe-ldsc).

```{bash}
#!/usr/bin/env bash
git clone https://github.com/jvfe/jvfe-ldsc.git
conda install bioconda::nextflow
nextflow run jvfe/jvfe-ldsc \
--fasta subset.csv \
--input ukbb_samplesheet.csv \
--depression dep.sumstats.gz \
--variants variants.tsv.bgz \
--european_ref ./eur_w_ld_chr/ \
--weights ./1000G_weights/1000G_Phase3_weights_hm3_no_MHC/ \
--outdir resultados \
-profile docker \
-with-tower \
-r main \
-resume
```
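
Once the run finishes, the results land in the directory passed to `--outdir`. The exact file layout depends on the pipeline version, so the sketch below is just a quick way to confirm the run completed and see what was produced.

```{bash}
# List completed Nextflow runs and take a quick look at the pipeline output directory
nextflow log
ls -R resultados | head
```
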

#### Step 2: Set up LDSC on your machine

Here are all the files and setup steps needed to run LDSC on your machine. The download links can expire; if that happens, check [LDSC's repository issues](https://github.com/bulik/ldsc/issues) for more up-to-date links.

```{bash}
# Clone and install LDSC
git clone https://github.com/bulik/ldsc.git
cd ldsc
git checkout b02f2a6
conda env create --file environment.yml
source activate ldsc
# Get European reference files
wget https://zenodo.org/records/7768714/files/1000G_Phase3_baselineLD_v2.2_ldscores.tgz
tar -xvzf 1000G_Phase3_baselineLD_v2.2_ldscores.tgz
mkdir eur_w_ld_chr
mv baselineLD* eur_w_ld_chr
# Get the HapMap3 SNP list
wget https://ibg.colorado.edu/cdrom2021/Day06-nivard/GenomicSEM_practical/eur_w_ld_chr/w_hm3.snplist
```
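
Before moving on, a quick sanity check that the environment resolves correctly is to ask both scripts for their help text (this assumes you are still inside the cloned `ldsc` directory).

```{bash}
# Sanity check: both scripts should print their usage without import errors
conda activate ldsc
python ldsc.py -h | head
python munge_sumstats.py -h | head
```
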
#### Step 3: Format sumstats to LDSC format
This is an R script I used to convert my sumstats to the LDSC format; it may vary depending on the sumstats you have. Just keep in mind you need SNP, A1, A2, P, BETA, and N columns, with SNP in rsID format. A more general formatting script is under development.

```{r}
library(vroom)
library(dplyr)
# Get file names from environment variables (Sys.getenv() returns "" when unset)
sumstats_file <- ifelse(nzchar(Sys.getenv("SUMSTATS_FILE")), Sys.getenv("SUMSTATS_FILE"), "test_sumstats.tsv")
variants_file <- "variants.tsv"

# Read the raw sumstats and the variant annotation file
sumstats <- vroom(sumstats_file, col_select = c("variant", "beta", "se", "pval", "n_complete_samples"))
ref <- vroom(variants_file, col_select = c("variant", "rsid", "chr", "ref", "alt"))

# Join on the variant ID and rename columns to the LDSC convention
ldsc <- sumstats %>%
  inner_join(ref, by = "variant") %>%
  dplyr::select(SNP = rsid, CHR = chr, A1 = ref, A2 = alt, P = pval, BETA = beta, N = n_complete_samples)

# Write the result: third command-line argument, then OUTPUT_FILE, then a default
args <- commandArgs(trailingOnly = TRUE)
output_file <- if (length(args) >= 3) {
  args[3]
} else if (nzchar(Sys.getenv("OUTPUT_FILE"))) {
  Sys.getenv("OUTPUT_FILE")
} else {
  "test_sumstats_ldsc.tsv"
}
vroom_write(ldsc, output_file)
```
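
If you save the chunk above as a standalone script (the name `format_sumstats.R` below is just a placeholder, as is the input path), it can be driven through the environment variables it reads; here the output name matches the file munged in Step 4.

```{bash}
# Placeholder paths: point SUMSTATS_FILE at your raw sumstats
SUMSTATS_FILE=../data/sumstats/townsend_raw.tsv \
OUTPUT_FILE=../data/sumstats/townsend_hg19_sumstats.txt \
Rscript format_sumstats.R
```
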
#### Step 4: Munge sumstats with LDSC
LDSC is very strict about the input format, so it provides its own munging script. You have to run it on both sets of sumstats for which you want to calculate the genetic correlation. If you have an INFO column with the imputation score of the SNPs, remove the --merge-alleles option.

```{bash}
# Munge the sumstats. Each file must contain only the columns
# "SNP", "A1", "A2", "N", "P", and "BETA" (or "OR" or "Z"), using exactly these header names.
conda activate ldsc
./munge_sumstats.py \
--sumstats ../data/sumstats/townsend_hg19_sumstats.txt \
--merge-alleles w_hm3.snplist \
--out ../data/outputs/ldsc_munge/tow \
--chunksize 500000
# Repeat for the second trait, changing --sumstats and --out accordingly
```
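
Before running the genetic correlation, it is worth skimming the munge log for warnings and peeking at the munged output. The paths below assume the `--out` prefix used above, since munge_sumstats.py writes a `.log` and a `.sumstats.gz` file with that prefix.

```{bash}
# Check for warnings (e.g. dropped SNPs, allele mismatches) and inspect the munged file
grep -i "warning" ../data/outputs/ldsc_munge/tow.log || true
zcat ../data/outputs/ldsc_munge/tow.sumstats.gz | head -n 5
```
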
#### Step 5: Run LDSC with chosen traits
This is the LDSC run itself, using the munged sumstats. It will produce a log file containing the genetic correlation result. Make sure the input paths point to the right data.

```{bash}
conda activate ldsc
./ldsc.py \
--rg ../data/sumstats/pgcdep.sumstats.gz,../data/outputs/ldsc_munge/tow.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr 1000G_weights/1000G_Phase3_weights_hm3_no_MHC/ \
--out ../data/outputs/ldsc/tow_sdep
```
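
The estimate is easiest to read from the end of the log file; a quick way to pull it out is shown below (the exact wording may vary slightly between LDSC versions).

```{bash}
# Print the genetic correlation estimate, its SE, Z-score and p-value from the log
grep -A 3 "Genetic Correlation:" ../data/outputs/ldsc/tow_sdep.log
```
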

#### Step 6: Set up MTAG on your machine

Just clone the repository and create a [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) environment based on the recipe inside it.

```{bash}
git clone https://github.com/omeed-maghzian/mtag.git
cd mtag
conda env create -f environment.yml
```
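
A quick check that the environment was created correctly is to ask MTAG for its help text from inside the cloned directory.

```{bash}
# MTAG should print its usage without import errors
conda activate mtag
python mtag.py -h | head
```
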
#### Step 7: Run MTAG with all traits for Broad DEP and Strict DEP
This is the multi-trait analysis itself. Replace the inputs with the right paths to your sumstats. The --n_approx, --intervals, and --cores parameters are only needed if you are using a large number of traits (more than 5) and want MTAG to perform the maxFDR calculations.

```{bash}
conda activate mtag
python mtag.py \
--sumstats ../data/sumstats/depression_sumstats.txt,../data/sumstats/adhd_sumstats_mtag.txt,../data/sumstats/anxiety_sumstats_mtag.txt,../data/sumstats/bipolar_sumstats_mtag.txt,../data/sumstats/insomnia_sumstats_mtag.txt,../data/sumstats/neuroticism_sumstats.txt,../data/sumstats/wellbeing_sumstats.txt,../data/townsend_sumstats_mtag.txt \
--out ../data/outputs/mtag/dep_new \
--snp_name variant_id \
--a1_name effect_allele \
--a2_name other_allele \
--eaf_name effect_allele_frequency \
--z_name z_score \
--beta_name beta \
--se_name standard_error \
--n_name n \
--chr_name chromosome \
--bpos_name base_pair_location \
--stream_stdout \
--fdr \
--force \
--n_approx \
--intervals 5 \
--cores 8
```
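
Before moving on to FUMA, it is worth checking the MTAG output. MTAG writes one results file per input trait (the `_trait_1` naming below is assumed, with trait 1 being the first file passed to --sumstats, i.e. depression), and its columns should include mtag_beta, mtag_se, and mtag_pval, which are the column names used in the FUMA settings of Step 8. As far as I know, FUMA also accepts gzip-compressed uploads, so compressing the file keeps the upload small.

```{bash}
# Inspect the header and first rows of the MTAG results for the first trait (file name assumed)
head -n 3 ../data/outputs/mtag/dep_new_trait_1.txt
# Optionally compress it for upload to FUMA
gzip -c ../data/outputs/mtag/dep_new_trait_1.txt > ../data/outputs/mtag/dep_new_trait_1.txt.gz
```
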
#### Step 8: Put MTAG results on FUMA
The FUMA software can be found at <https://fuma.ctglab.nl/>. Create an account and start a new job, upload your sumstats, and use the following parameters:

| Parameter | Value |
|-------------|----------------|
| FUMA | v1.5.2 |
| MAGMA | v1.08 |
| GWASCatalog | e0_r2022-11-29 |
| ANNOVAR | 2017-07-17 |
| chrcol | CHR |
| poscol | BP |
| rsIDcol | SNP |
| pcol | mtag_pval |
| eacol | A1 |
| neacol | A2 |
| orcol | NA |
| becol | mtag_beta |
| secol | mtag_se |

: FUMA Parameters

FUMA will clump the data to find independent significant SNPs and lead SNPs, plot Manhattan and Q-Q plots, perform gene-set analysis with MAGMA, and run tissue expression analysis using GTEx data. It also annotates the variants with ANNOVAR and maps eQTLs and chromatin interactions.