---
title: "MTAG_analysis"
format: html
editor: visual
---

## MTAG Analysis Workflow

This is roughly the methodology I used in my MTAG analysis. It is written as a Quarto document simply so that code chunks from different languages can be rendered in one place.

#### Step 1: Set up and run the jvfe-LDSC pipeline to get traits of interest

This is an optional step. If you want a broader view of the genetic correlations across a large number of traits, this Nextflow workflow can be quite handy. It calculates the genetic correlation between GWAS sumstats from [Neale's lab UKB GWAS](https://github.com/Nealelab/UK_Biobank_GWAS). More information is available at [jvfe's GitHub page](https://github.com/jvfe/jvfe-ldsc).

```{bash}
#!/usr/bin/env bash
git clone https://github.com/jvfe/jvfe-ldsc.git
conda install bioconda::nextflow
nextflow run jvfe/jvfe-ldsc \
--fasta subset.csv \
--input ukbb_samplesheet.csv \
--depression dep.sumstats.gz \
--variants variants.tsv.bgz \
--european_ref ./eur_w_ld_chr/ \
--weights ./1000G_weights/1000G_Phase3_weights_hm3_no_MHC/ \
--outdir resultados \
-profile docker \
-with-tower \
-r main \
-resume
```
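
Once the run finishes, the results land in the directory passed to `--outdir`. The exact file layout depends on the pipeline version, so the sketch below is just a quick way to confirm the run completed and see what was produced.

```{bash}
# List completed Nextflow runs and take a quick look at the pipeline output directory
nextflow log
ls -R resultados | head
```
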

#### Step 2: Set up LDSC on your machine

Here are all the files and setup steps needed to run LDSC on your machine. The download links can expire; if that happens, check [LDSC's repository issues](https://github.com/bulik/ldsc/issues) for more up-to-date links.

```{bash}
# Clone and install LDSC
git clone https://github.com/bulik/ldsc.git
cd ldsc
git checkout b02f2a6
conda env create --file environment.yml
source activate ldsc
# Get European reference files
wget https://zenodo.org/records/7768714/files/1000G_Phase3_baselineLD_v2.2_ldscores.tgz
tar -xvzf 1000G_Phase3_baselineLD_v2.2_ldscores.tgz
mkdir eur_w_ld_chr
mv baselineLD* eur_w_ld_chr
# Get the HapMap3 SNP list
wget https://ibg.colorado.edu/cdrom2021/Day06-nivard/GenomicSEM_practical/eur_w_ld_chr/w_hm3.snplist
```
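
Before moving on, a quick sanity check that the environment resolves correctly is to ask both scripts for their help text (this assumes you are still inside the cloned `ldsc` directory).

```{bash}
# Sanity check: both scripts should print their usage without import errors
conda activate ldsc
python ldsc.py -h | head
python munge_sumstats.py -h | head
```
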
#### Step 3: Format sumstats to LDSC format
This is an R script I used to convert my sumstats to the LDSC format; it may vary depending on the sumstats you have. Just keep in mind you need SNP, A1, A2, P, BETA, and N columns, with SNP in rsID format. A more general formatting script is under development.

```{r}
library(vroom)
library(dplyr)
# Get file names from environment variables (Sys.getenv() returns "" when unset)
sumstats_file <- ifelse(nzchar(Sys.getenv("SUMSTATS_FILE")), Sys.getenv("SUMSTATS_FILE"), "test_sumstats.tsv")
variants_file <- "variants.tsv"

# Read the raw sumstats and the variant annotation file
sumstats <- vroom(sumstats_file, col_select = c("variant", "beta", "se", "pval", "n_complete_samples"))
ref <- vroom(variants_file, col_select = c("variant", "rsid", "chr", "ref", "alt"))

# Join on the variant ID and rename columns to the LDSC convention
ldsc <- sumstats %>%
  inner_join(ref, by = "variant") %>%
  dplyr::select(SNP = rsid, CHR = chr, A1 = ref, A2 = alt, P = pval, BETA = beta, N = n_complete_samples)

# Write the result: third command-line argument, then OUTPUT_FILE, then a default
args <- commandArgs(trailingOnly = TRUE)
output_file <- if (length(args) >= 3) {
  args[3]
} else if (nzchar(Sys.getenv("OUTPUT_FILE"))) {
  Sys.getenv("OUTPUT_FILE")
} else {
  "test_sumstats_ldsc.tsv"
}
vroom_write(ldsc, output_file)
```
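
If you save the chunk above as a standalone script (the name `format_sumstats.R` below is just a placeholder, as is the input path), it can be driven through the environment variables it reads; here the output name matches the file munged in Step 4.

```{bash}
# Placeholder paths: point SUMSTATS_FILE at your raw sumstats
SUMSTATS_FILE=../data/sumstats/townsend_raw.tsv \
OUTPUT_FILE=../data/sumstats/townsend_hg19_sumstats.txt \
Rscript format_sumstats.R
```
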
#### Step 4: Munge sumstats with LDSC
LDSC is very strict about the input format, so it provides its own munging script. You have to run it on both sets of sumstats for which you want to calculate the genetic correlation. If you have an INFO column with the imputation score of the SNPs, remove the --merge-alleles option.

```{bash}
# Munge the sumstats. Each file must contain only the columns
# "SNP", "A1", "A2", "N", "P", and "BETA" (or "OR" or "Z"), using exactly these header names.
conda activate ldsc
./munge_sumstats.py \
--sumstats ../data/sumstats/townsend_hg19_sumstats.txt \
--merge-alleles w_hm3.snplist \
--out ../data/outputs/ldsc_munge/tow \
--chunksize 500000
# Repeat for the second trait, changing --sumstats and --out accordingly
```
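
Before running the genetic correlation, it is worth skimming the munge log for warnings and peeking at the munged output. The paths below assume the `--out` prefix used above, since munge_sumstats.py writes a `.log` and a `.sumstats.gz` file with that prefix.

```{bash}
# Check for warnings (e.g. dropped SNPs, allele mismatches) and inspect the munged file
grep -i "warning" ../data/outputs/ldsc_munge/tow.log || true
zcat ../data/outputs/ldsc_munge/tow.sumstats.gz | head -n 5
```
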
#### Step 5: Run LDSC with chosen traits
This is the LDSC run itself, using the munged sumstats. It will produce a log file containing the genetic correlation result. Make sure the input paths point to the right data.

```{bash}
conda activate ldsc
./ldsc.py \
--rg ../data/sumstats/pgcdep.sumstats.gz,../data/outputs/ldsc_munge/tow.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr 1000G_weights/1000G_Phase3_weights_hm3_no_MHC/ \
--out ../data/outputs/ldsc/tow_sdep
```
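
The estimate is easiest to read from the end of the log file; a quick way to pull it out is shown below (the exact wording may vary slightly between LDSC versions).

```{bash}
# Print the genetic correlation estimate, its SE, Z-score and p-value from the log
grep -A 3 "Genetic Correlation:" ../data/outputs/ldsc/tow_sdep.log
```
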

#### Step 6: Set up MTAG on your machine

Just clone the repository and create a [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) environment based on the recipe inside it.

```{bash}
git clone https://github.com/omeed-maghzian/mtag.git
cd mtag
conda env create -f environment.yml
```
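
A quick check that the environment was created correctly is to ask MTAG for its help text from inside the cloned directory.

```{bash}
# MTAG should print its usage without import errors
conda activate mtag
python mtag.py -h | head
```
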
#### Step 7: Run MTAG with all traits for Broad DEP and Strict DEP
This is the multi-trait analysis itself. Replace the inputs with the right paths to your sumstats. The --n_approx, --intervals, and --cores parameters are only needed if you are using a large number of traits (more than 5) and want MTAG to perform the maxFDR calculations.

```{bash}
conda activate mtag
python mtag.py \
--sumstats ../data/sumstats/depression_sumstats.txt,../data/sumstats/adhd_sumstats_mtag.txt,../data/sumstats/anxiety_sumstats_mtag.txt,../data/sumstats/bipolar_sumstats_mtag.txt,../data/sumstats/insomnia_sumstats_mtag.txt,../data/sumstats/neuroticism_sumstats.txt,../data/sumstats/wellbeing_sumstats.txt,../data/townsend_sumstats_mtag.txt \
--out ../data/outputs/mtag/dep_new \
--snp_name variant_id \
--a1_name effect_allele \
--a2_name other_allele \
--eaf_name effect_allele_frequency \
--z_name z_score \
--beta_name beta \
--se_name standard_error \
--n_name n \
--chr_name chromosome \
--bpos_name base_pair_location \
--stream_stdout \
--fdr \
--force \
--n_approx \
--intervals 5 \
--cores 8
```
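
Before moving on to FUMA, it is worth checking the MTAG output. MTAG writes one results file per input trait (the `_trait_1` naming below is assumed, with trait 1 being the first file passed to --sumstats, i.e. depression), and its columns should include mtag_beta, mtag_se, and mtag_pval, which are the column names used in the FUMA settings of Step 8. As far as I know, FUMA also accepts gzip-compressed uploads, so compressing the file keeps the upload small.

```{bash}
# Inspect the header and first rows of the MTAG results for the first trait (file name assumed)
head -n 3 ../data/outputs/mtag/dep_new_trait_1.txt
# Optionally compress it for upload to FUMA
gzip -c ../data/outputs/mtag/dep_new_trait_1.txt > ../data/outputs/mtag/dep_new_trait_1.txt.gz
```
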
#### Step 8: Put MTAG results on FUMA
The FUMA software can be found at <https://fuma.ctglab.nl/>. Create an account and start a new job, upload your sumstats, and use the following parameters:

| Parameter | Value |
|-------------|----------------|
| FUMA | v1.5.2 |
| MAGMA | v1.08 |
| GWASCatalog | e0_r2022-11-29 |
| ANNOVAR | 2017-07-17 |
| chrcol | CHR |
| poscol | BP |
| rsIDcol | SNP |
| pcol | mtag_pval |
| eacol | A1 |
| neacol | A2 |
| orcol | NA |
| becol | mtag_beta |
| secol | mtag_se |

: FUMA Parameters

FUMA will clump the data to find independent significant SNPs and lead SNPs, plot Manhattan and Q-Q plots, perform gene-set analysis with MAGMA, and run tissue expression analysis using GTEx data. It also annotates the variants with ANNOVAR and maps eQTLs and chromatin interactions.