This workflow has been extensively expanded by Jean-Tristan Brandenburg
An association study is a complex analysis and each analysis has to consider
- the disease/phenotype being studied and its mode of inheritance
- population structure
- other covariates
For this reason it is difficult to build a high quality, generic pipeline to do an association study.
The purpose of this pipeline is to perform a very superficial initial analysis that can be used as one piece of information to guide a rigorous analysis. Of course, we would encourage users to build their own Nextflow script for their rigorous analysis, perhaps using our script as a start.
Our script, assoc takes as input PLINK files that have been through quality control and
- does a principal component analysis on the data, and produces pictures from that;
- performs a simple association test giving odds ratio and raw and adjusted p values
The pipeline is run: nextflow run assoc
By default a chi2 test for association is done. But you can do multiple different tests in one run by settintg the appropriate parameter to 1. Note at least one must be set to 1
The key options are:
input_dir
,output_dir
: where input and output goes to and comes from;input_pat
: the base of set of PLINK bed,bim and fam files (this should only match one);
data
: a tab-separated file that contains phenotype data for your particpants. The row is a header line, with one participant per line after that. The first two columns should FID IID followed by any phenotype values you want to use for testing or for covariates. It can contain other data too -- as long as the ones that you need are in this file.pheno
: a comma-separated list of phenotypes that you want to test. Each phenotype is tested separately. If you wish to do a transform to the phenottype, you can suffix the phenotype with /function where function is a numpy function. For example you might say--pheno bmi/np.log
which will apply the log function to the phenotype. Any numpy function can be used, typical uses being np.log and np.sqrt. We plan to support user provision of a user-given function.covariates
: a comma-separated list of phenotypes that you want to use
- build relatdness and GRM, you have different way to buil matrix of relatdness for boltlmm, gemma, fast gwa or fastlmm, regenie
- by default all snps are used and for boltlmm 950,000 are shuffled
- you can give a rs contains id or rs
file_rs_buildrelat
- you can used
sample_snps_rel
(default 0), will used plink to sample snps :sample_snps_rel_paramplkl
: parameter for plink to obtained independnat pairwise ([default : 100 20 0.1 --maf 0.01 ])snps_include_rel
: bed file (chro begin end rs) to include on plink command to defined independant snps [default : ""]snp_rel_param_plk
: [default --maf 0.01 --mac 10]
file_rs_buildrelat
: file with rs list (one by lines) to build genetics models (relatdness), for gemma-snps
, boltlmm--modelSnps
, regenie to build step 1
genetic_map_file
: genetic map used in boltlmmprint_pca
: by default pipeline compute and print pca (print_pca
=1), if you want avoid this step (print_pca
= 0)exclude_snps
option to exclude some snps active just for boltlmm (see--exclude
in boltlmm manual) : SNP ID must be first column (default none)
- optionnal input bgen ( for bolt, saige and fastGWA):
bgen
: bgen filebgen_sample
: list of samplelist_bgen
: file contains a list bgen, must be have at leastbgen_sample
- specificity :
- bolt-lmm : dosage bgen or impute2 in paralel with genotype plink file
- saige :
- vcf file are prioritise to bgen file (option bgen ignore for saibe)
- relatdness done using genotype of plink
- vcf priotise to bgen and bgen prioritise compared to plink genotype where association not done
- regenie :
- step 1 (reladness) done with plink
- step 2 : bgen prioritise on plink
- fastgwa :
- grm done using genotype of plink
- bgen priotise on plink
| Software | plink | vcf | bgen | impute 2 | allele effect | GxE | X |
| --- | --- | --- | --- | --- | --- |
| gemma | yes | no | no | no | allele1 | yes | not tested |
| plink | yes | no | no | no | A1 | yes | not tested |
| gcta/fastGWA | yes | no | yes | no | - | no | not tested |
| saige | yes | yes | yes | no | allele2 | no | yes |
| bolt-LMM | yes | no | yes | yes | ALLELE1 | no | not tested |
| fast-lmm | yes | no | no | no | - | no | not tested |
| regenie | yes | no | yes | - | - | yes | not tested |
| description | genotype | dosage | dosage | dosage | - | - | - |
| Option | --input_dir
/--input_pat
| --list_vcf
| --bgen
/--bgen_sample
| bolt_impute2filelist
/bolt_impute2fidiid
| - | --gxe
/ --plink_gxe
/ --gemma_gxe
/ --regenie_gxe
| no option
|
assoc
: should a chi2 test be used (0 or 1)fisher
: Fisher exact test (default 0)linear
: should linear regreession be used? (default 0)logistic
: should linear regression be used? (default 0)- plink gwas option :
adjust
: do we want to do explicit testing for Bonferroni correction et al that PLINK odesmperm
: do you want to test doing permutation testing. If so, how many tests? By default this is 1000.sexinfo_available
presence of sex in plink file 1, yes 2 : no [default : 0]
gemma
: should gemma be used? (default 0)- see manual
- relatdness :
gemma_mat_rel
: file contains in gemma format matrix relatdness used by gemma (format 1, see manual), matrix must be in same order than fam file. Avoid computation of relatdness by pipeline.
- ressource usage :
gemma_num_cores
: if gemma is used set this up to 8gemma_mem_req
: memory for gemmagemma_multi
: option run gemma by chromosome separately (is not a loco). to increase time computation and used more cpus, 0 no 1 yes [default : 0 (No)]
-
boltlmm
: should boltlmm be used? -
see manual
-
build relatdness model :
-
if SNPs is higher than 950,000, 950,000 SNPs are chosen randomly to build the model (see --modelSnps option in bolt)
-
covariates_type
/bolt_covariates_type
: for bolt need to define if covariate is binary (0) or continue (1), a comma-separated list as same order than covariates -
bolt_ld_score_file
: A table of reference LD scores for boltlmm is needed to calibrate the BOLT-LMM statistic (option in boltlmm --LDscoresFile),to choice a specific column in Ld file you can usebolt_ld_scores_col
option (by default : LDSCORE) if option is not provided --LDscoresUseChip used. -
bolt_use_missing_cov
: option to "missing indicator method", by default no activate (0), to activate (1) (--covarUseMissingIndic option in boltlmm), which adds indicator variables demarcating missing status as additional covariates. -
ressource usage
-
bolt_num_cores
if bolt is used set this up to 8 -
bolt_mem_req
memory required for boltlmm, (default : 6GB) -
dosage with Impute2 data in bolt :
bolt_impute2filelist
: list of impute2 files, each line contains :chronumber
file
, file must be in full patternbolt_impute2fidiid
: list of individual in same order thanbolt_impute2filelist
-
bgen : see bgen previously
if bolt-LMM output are empty, algoritm canno't converge can be due to :
- check .stdout and .stderr error
- low sample size with less than 5000 individuals (bolt-LMM is recommended for 5000 individuals`
example of error :
ERROR: Heritability estimate is close to 0; LMM may not correct confounding
Instead, use PC-corrected linear/logistic regression on unrelateds"
- documentation of bolt-lmm : When a heritability estimate reaches 0, then linear mixed model association tests (including BOLT-LMM and other methods) all degenerate to simple linear regression, hence the error message. This situation is dangerous because the “mixed model” will no longer correct for population stratification and relatedness.
“ERROR: Heritability estimate is close to 1; algorithm may not converge. Analysis may be unsuitable due to low sample size or case ascertainment.”
- documentation of bolt-lmm : This error most frequently arises when sample size is low, resulting in estimated heritability having a very large standard error (perhaps even greater than 1) such that the estimate could be anywhere in the range 0 to 1 and might hit one of the boundaries. BOLT-LMM is not recommended for analyses of smaller samples; in this situation, we recommend trying other software packages such as GEMMA or GCTA.
fastlmm
: should fastlmm be used?- see manual
fastlmm_num_cores
: if fastmll is used set this up to 8fastlmm_mem_req
: memory required for fasttlmm (default : 15GB)fastlmm_multi
: memory used by fastmll is very big and increase with snp number, option run fastlmm by chromosome, with relatedness matrix computed before with gemma (-gk 1)fastlmmc_bin
: should change binary for fastlmmc (default fastlmmc) and then for all the tests except gemma, boltlmm and fastlmm, do you want to adjust for multiple testing
fastGWA
: do an analyse with fastGWA (GCTA), specific option fastGWA manual- fastgwa (if 1 perfom fastGWA, otherwise no [default 0])
fastgwa_type
: [default : --fastGWA-mlm-exact]fastgwa_mem_req
: memory for fastgwa and grm [default: 10G]fastgwa_num_cores
: cpus for fastgw and grm [default: 5]covariates_type
: similar tobolt_covariates_type
, give for each covariable type of covariable qualitatif (0) or quantitatif (1), must be len of covariates, if nothing, consider all covariable just as quantitatif covariable [default ""]- grm :
gcta_grmfile
: index of file with grm withgcta_grmfile
.grm.id andgcta_grmfile
.grm.sp, if extension will not here, grm will build see below- to build grm :
grm_nbpart
: nb part to build grm [default : 100]gcta64_bin
: binary for gcta64 [default : gcta64]grm_cutoff
: cutoff value for grm matrix (using option --make-bK-sparse) [default : 100]grm_maf
: cutoff value for frequencies and build grm matrix [default : 0.01]gcta64_bin
: binary for gcta64 [default : gcta64]
saige
: perform association with saige, yes : 1, no : 0 [default, 0]- saige used a plink file for computed and fit model, and can use vcf or bgen file with dosage or genotype with vcf file.
grm_maf
: cutoff value for frequencies and build grm matrix [default : 0.01]saige_loco
: option for loco in saige 1: yes, other value no [default : 1]list_vcf
: file contains list of file vcf, one file by line [default : ""], optionalvcf_field
: vcf field used for association [default : DS] :- GT : used genotype
- DS : used dosage of imputation
saige_imputed_data
: if data need to be imputed, individual with [default : 1]saige_impute_method
: Imputation method for missing dosages. best_guess, mean or minor. ['best_guess']- bin :
- step 1
saige_bin_fitmodel
: binary R for saige and fit model : [default : step1_fitNULLGLMM.R] - step2
saige_bin_spatest
: binary R for saige and do association : [default : step2_SPAtests.R] - ressource :
saige_mem_req
: memory required for saige [default : 10GB]saige_num_cores
: memory required for saige [default : 10], just used for step1 of SAIGE- add other option :
saige_otheropt
: other option for saigesaige_otheropt_step1
: other option in command line of step 1saige_otheropt_step2
: other option in command line of step 2cut_geno
: missingness, [default : 0.9]
- more information relative to regenie
regenie_bin
: binary of regenie [default: regenie]regenie
: perform a regenie analyse yes : 1, no : 0 [default : 0]- maf :
- for step1 see :
snp_rel_param_plk
- bsize (size of the genotype blocks) :
regenie_bsize
: bsize used for step 1 and 2 ifregenie_bsize_step1
andregenie_bsize_step2
are 0 [default 100]regenie_bsize_step1
: bsize for step 1 [default 0]regenie_bsize_step2
: bsize for step 2 [default 0]regenie_otheropt_step1
: other option in command line of step 1regenie_otheropt_step2
: other option in command line of step 2regenie_loco
: option loco for bgen, yes : 1, no : 0, [default 1]regenie_num_cores
: number of cpus used [default : 6]regenie_mem_req
: number of cpus used [default : 20GB ]- you can performed a GxE
regenie_gxe
: performed a gxe, see--gxe
for vairablecovariates_type
: accept categorical and continuous : 0 continuous, 1 categorical comma between argument [""]- ** bug ** : regenie doesn't accept just one categarical covariates
gxe
: environmental variables to do gxe analysis withpheno
, must be coded in 1 and 2 for plinkgemma_gxe
: GxE interation with gemma [default : 0], seecovariates
to add covariates in gemma modelsregenie_gxe
: GxE interation with gemma [default : 0], seecovariates
to add covariates in gemma models- pipeline computed frequencies, N for each group with plink and add to files
plink_gxe
: GxE interation with plink (see option -gxe, in plink manual) [default : 0], no covariate could be provided.- pipeline computed frequencies, N for each group with plink and add to files, futhermore they add A1 and A2.
- furthermore BetaGxE and SeGxE computed by pipeline as SeGxE : sqrt(se1^2 + se2^2), BetaGxE : Z_GXE * SeGxE
##Other flags are:
thin
. You can set this to a floating point number in the range (0, 1] and then the PLINK data files are thinned leaving only that proportion of the SNPs. This allows pipeline to be tested with a small proportion of the data This is probably only needed for debugging purposes and usually this should not be be set.chrom
. Only do testing on this chromosome.
##Example
nextflow run assoc/assoc.nf --input_pat raw-GWA-data --chi2 1 --logistic 1 --adjust 1
analyses the files raw-GWA-data
bed, bim, fam files and performs a chi2 and logistic regression test, and also does multiple testing correction.
- Data and command line can be found h3agwas-examples
nextflow run h3abionet/h3agwas/assoc --input_dir data/imputed/ --input_pat imput_data \
--data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 \
--output_dir assoc --output assoc \
--gemma 1 --assoc 1 --sample_snps_rel 1 --linear 1 \
-profile slurmSingularity \
--bgen data/imputed/bgen/out.bgen --bgen_sample data/imputed/bgen/out.sample