R
Installing packages:
if(!require("remotes"))
install.packages("remotes") # if necessary
library(remotes)
install_github("chr1swallace/coloc",build_vignettes=TRUE)
library(dplyr)
library(reshape2)
library(glmnet)
library(methods)
library(doMC)
library(doRNG)
library(tidyr)
library(tibble)
bash
Python
Installing packages:
import os
import os.path
import argparse
- The pipeline was written to only run one population at a time
- The user must cd into the 'Coloc' repo in order to properly run the pipeline
- If the phenotypes have dashes in the name, replace the dash with an underscore
- SNP annotation, genotype, and vcf files should be named with the convention 'chr#.' in relation to chromosome
Run GWAS-eQTL colocalization pipeline.
optional arguments:
-h, --help show this help message and exit
--gwas GWAS GWAS summary statistics directory path
--eqtl EQTL Exact eQTL txt file path
--vcf VCF Directory containing the vcf files from the eQTL
population
--snp_annot SNP_ANNOT
SNP annotation files directory path
--gene_annot GENE_ANNOT
Exact gene annotation file path
--geno GENO Genotype files directory path
--frq FRQ Exact frq file path
--out OUT Specify main output directory for all output files
--pop1 POP1 Population used for vcf files
--pop4 POP4 Population used for frq and eQTL files
--pop_size POP_SIZE Size of population
--phenotypes [PHENOTYPES [PHENOTYPES ...]]
Phenotypes to test
--chrs [CHRS [CHRS ...]]
Indicate what chromosomes to run
-gene_id [GENE_ID [GENE_ID ...]]
Input specific genes ids to run colocalization on,
default is set to False
- User must place all GWAS summary statistic, vcf, SNP annotation, and genotype files into their respective directories and use those directory paths for the respective flags
- The default is running COLOC on all genes in the chromosomes chosen. However, the user can input a list of gene IDs under the optional --gene_id flag to only run COLOC on a specific subset of genes.
cd /homes/anovak9/Coloc
nohup python3 main_wrapper.py --gwas ~/COLOC_input_data/GWAS_SS --eqtl ~/COLOC_input_data/cis_eQTLs_AFA_WG_all_cis.txt.gz
--vcf ~/COLOC_input_data/vcfs --snp_annot ~/COLOC_input_data/snp_annot --gene_annot ~/COLOC_input_data/annotation_all_aptamers_ENSG.txt
--geno ~/COLOC_input_data/genotypes --frq ~/COLOC_input_data/AFA_prot_hg38.frq --out ~/coloc_output --pop1 ASW --pop4 AFA --pop_size 183
--phenotypes BMI C_reactive HDL_cholesterol Total_cholesterol --chrs 1 2 3 &
- The wrapper should always be run as nohup (note beginning of command)
- In this example, the vcf files come from the population ASW and the frq and QTL files come from the population AFA and the pipeline will only run chromsomes 1-3. In addition, phenotypes BMI, C_reactive, HDL_cholesterol, and Total_cholesterol will be run.
nohup python3 main_wrapper.py --gwas ~/COLOC_input_data/GWAS_SS --eqtl ~/COLOC_input_data/cis_eQTLs_AFA_WG_all_cis.txt.gz
--vcf ~/COLOC_input_data/vcfs --snp_annot ~/COLOC_input_data/snp_annot --gene_annot ~/COLOC_input_data/annotation_all_aptamers_ENSG.txt
--geno ~/COLOC_input_data/genotypes --frq ~/COLOC_input_data/AFA_prot_hg38.frq --out ~/coloc_output --pop1 ASW --pop4 AFA --pop_size 183
--phenotypes BMI --chrs 1 2 3 -gene_id SL000001_ENSG00000185499.16 SL000020_ENSG00000084674.14 SL000024_ENSG00000117525.14
SL000027_ENSG00000073756.12 SL000048_ENSG00000115718.17 SL000049_ENSG00000184500.15 SL000051_ENSG00000132693.12 &
- This example is similar to above except it's only running the phenotype BMI and only the genes specified after the -gene_id flag will by run by COLOC.
- Link to google drive folder with test data
- Add the test folder to your local Coloc repo for the command to work
cd /homes/anovak9/Coloc
nohup python3 main_wrapper.py --gwas ~/Coloc/Test/GWAS_SS --eqtl ~/Coloc/Test/cis_eQTLs_AFA_WG_all_cis.txt.gz
--vcf ~/Coloc/Test/vcfs --snp_annot ~/Coloc/Test/snp_annot --gene_annot ~/Coloc/Test/annotation_all_aptamers_ENSG.txt
--geno ~/Coloc/Test/genotypes --frq ~/Coloc/Test/AFA_prot_hg38.frq --out ~/coloc_output --pop1 ASW --pop4 AFA --pop_size 183
--phenotypes BMI --chrs 1 &
These scripts use imputed transcript levels from the eQTL population to generate lists of significant SNPs that are within 1Mb of a predicted gene. The goal is to pull all significant cis-acting variants. The main wrapper runs script 1b for every chromosome:
- Bim files from script 2
- Directory of SNP annotation files
- Exact file path of gene annotation file
- Directory of genotype files
- Population
- out/LD_matrix/{pop}/{pop}_chr_{chrom}_{gene}_1Mb_of_gene.txt
This script takes in genotype dosage files in vcf file format and converts them to PLINK binary formats, .bed, .bim, and .fam. These data will be used to calculate LD for the significant SNPs pulled by script 1. The main wrapper runs script 2 for every chromosome.
- Directory of genotype dosage files
- out/LD_matrix/{pop}/{pop}_chr{chrom}_dose.bed
- out/LD_matrix/{pop}/{pop}_chr{chrom}_dose.bim
- out/LD_matrix/{pop}/{pop}_chr{chrom}_dose.fam
This script uses the significant SNP lists produced by script 1, along with the bfiles produced by script 2, to calculate LD matrices between every significant SNP around every predicted gene. The wrapper runs script 3 for every chromosome
- Directory of significant SNP lists
- Directory of bed/bim/fam files
- out/LD_matrix/{pop}/{pop}_1Mb_coords_LDMatrix/{pop}_chr_{chrom}_{gene}_1Mb_LD.ld.gz
- out/LD_matrix/{pop}/{pop}_1Mb_coords_LDMatrix/{pop}_chr_{chrom}_{gene}_1Mb_LD.snplist.gz
This script takes the GWAS summary statistics, QTL, and frq files input by the user and formats them properly in preparation of COLOC analysis.
- Directory of GWAS summary statistic files
- Exact file path of QTL txt file
- Exact file path of frq file
- out/eQTL/{pop}/eQTL_{pop}_{phenotype}.txt.gz
- out/GWAS_TOPMED/{pop}/GWAS_TOPMED_{pop}_{phenotype}.txt.gz
These scripts take the input arguments from the user and run COLOC analysis for each population and phenotype specified by the user using the outputs from scripts 3 and 4. A summary table of the COLOC reults are written to an outfile.
- out/eQTL/{pop}/eQTL_{pop}_{phenotype}.txt.gz
- out/GWAS_TOPMED/{pop}/GWAS_TOPMED_{pop}_{phenotype}.txt.gz
- out/LD_matrix/{pop}/{pop}_1Mb_coords_LDMatrix/{pop}_chr_{chrom}_{gene}_1Mb_LD.ld.gz
- out/LD_matrix/{pop}/{pop}_1Mb_coords_LDMatrix/{pop}_chr_{chrom}_{gene}_1Mb_LD.snplist.gz
- out/Coloc_output/{pop}_{phenotype}.txt