Snakemake BSA-seq pipeline

A Snakemake pipeline to perform QTL mapping using Next Generation Sequencing Bulk Segregant Analysis (BSA-seq) to identify QTLs underlying a phenotype.

From Mansfeld et al. (2018) Plant Genome 11(2). doi: 10.3835/plantgenome2018.01.0006. PMID: 30025013.

The BSA-seq procedure is performed by establishing and phenotyping a segregating population and selecting individuals with high and low values for the trait of interest. DNA from these individuals is pooled into high and low bulks which are subject to sequencing and single nucleotide polymorphism (SNP) calling, thus mitigating a need to develop markers in advance. In bulks selected from F2 populations, SNPs detected in reads derived from regions not linked to the trait of interest should be present in ∼50% of the reads. However, SNPs in reads aligning to genomic regions closely linked to the trait should be over- or under-represented depending on the bulk. Thus, comparing relative allele depths, or SNP-indices (defined as the number of reads containing a SNP divided by the total sequencing depth at that SNP) between the bulks can allow quantitative trait loci (QTL) identification (Takagi et al., 2013).
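As a minimal sketch of the SNP-index defined above, the snippet below computes it for each bulk and takes the difference between bulks. The function names, the read counts, and the "high minus low" sign convention are assumptions made for illustration; they are not taken from the pipeline.

```python
def snp_index(alt_depth: int, total_depth: int) -> float:
    """SNP-index = reads carrying the SNP / total sequencing depth at that SNP."""
    if total_depth <= 0:
        raise ValueError("total depth must be positive")
    return alt_depth / total_depth


def delta_snp_index(high_alt: int, high_total: int,
                    low_alt: int, low_total: int) -> float:
    """Difference of SNP-indices between bulks (assumed convention: high - low)."""
    return snp_index(high_alt, high_total) - snp_index(low_alt, low_total)


# Hypothetical read counts, for illustration only.
# Unlinked SNP: about half of the reads carry the SNP in both bulks.
unlinked = delta_snp_index(21, 40, 19, 40)
# Linked SNP: over-represented in the high bulk, under-represented in the low bulk.
linked = delta_snp_index(36, 40, 6, 40)

print(f"unlinked SNP: {unlinked:+.2f}")
print(f"linked SNP:   {linked:+.2f}")
```

A difference close to zero is what an unlinked region should show, while a large absolute difference points to a region that may be linked to the trait.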

The output of the pipeline is a tab-delimited variant table, as produced by the GATK VariantsToTable function, containing the SNP information from all samples. This table is the input expected by the QTLseqr package.
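To give a feel for the shape of such a table, here is a rough sketch that reads a tiny in-memory example and splits the per-sample allele depth (AD) field into reference and alternate read counts. The sample names `high` and `low`, the `<sample>.AD` column naming, and the values are assumptions for illustration; the columns actually written by the pipeline may differ, and missing values are not handled here.

```python
import csv
import io

# Hypothetical two-row variant table; sample names and values are made up.
TABLE = (
    "CHROM\tPOS\tREF\tALT\thigh.AD\tlow.AD\n"
    "chr09\t1000\tA\tG\t4,36\t30,10\n"
    "chr09\t2000\tC\tT\t20,20\t19,21\n"
)


def parse_ad(field: str) -> tuple[int, int]:
    """Split an AD value such as '4,36' into (ref_depth, alt_depth)."""
    ref, alt = field.split(",")[:2]
    return int(ref), int(alt)


def read_allele_depths(handle, samples):
    """Return one dict per SNP with the (ref, alt) depths of each sample."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {"CHROM": record["CHROM"], "POS": int(record["POS"])}
        for sample in samples:
            row[sample] = parse_ad(record[f"{sample}.AD"])
        rows.append(row)
    return rows


rows = read_allele_depths(io.StringIO(TABLE), ["high", "low"])
for row in rows:
    print(row)
```

For a real file, the same function could be given an open file handle (for example from `gzip.open(path, "rt")` for a compressed table) instead of the in-memory string.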

1. Installation 🔨

1.1 Install Miniconda and mamba

Miniconda is a lightweight installation of the conda package manager.
Mamba is a faster re-implementation of conda. These commands should be run inside your favorite shell (e.g. bash).

To install conda, follow the Conda installation instructions.

Once conda is installed, you can get mamba easily in your default (base) conda environment:

conda install mamba -n base -c conda-forge --yes

mamba is a much quicker alternative to conda, and most commands are the same (replace "conda" with "mamba").

1.2 Clone the pipeline repository 🐱

From GitHub, copy the repo link: https://github.com/SilkeAllmannLab/snakemake_bsaseq.git

Then run git clone https://github.com/SilkeAllmannLab/snakemake_bsaseq.git on a cluster, e.g. crunchomics. You will now have a folder named "snakemake_bsaseq/", from which all following commands should be run.

1.3 Install the Snakemake pipeline dependencies

Change into the snakemake_bsaseq/ folder and run:

mamba env create -f environment.yaml

2. Test run 🧪

The test run reproduces the BSA-seq analysis that identified a QTL related to rice plant height, named qPH9, in the original study of Xin et al. (2022).

Reference: Xin W, Liu H, Yang L, Ma T, Wang J, Zheng H, Liu W, Zou D. BSA-Seq and Fine Linkage Mapping for the Identification of a Novel Locus (qPH9) for Mature Plant Height in Rice (Oryza sativa). Rice (N Y). 2022 May 20;15(1):26. doi: 10.1186/s12284-022-00576-2. PMID: 35596038; PMCID: PMC9123124.

2.1 Fastq test files

The test fastq files are available from Zenodo. The BSA-seq data for this study can be found in the National Center for Biotechnology Information Sequence Read Archive under the accession numbers SRR13306959 (low plant height) and SRR13306960 (high plant height), under project number PRJNA687818. A subset (10%) of SRR13306959 and SRR13306960 was made for test purposes.

Download the paired-end fastq files for both high height and low height individual pools and place them inside the config/fastq/ folder (create it if necessary).

2.2 Reference genome

The rice genome reference assembly used is Os-Nipponbare-Reference-IRGSP-1.0, which can be downloaded from the Rice Annotation Project Database.

The path to the genome should be set in the ref_genome parameter of the config.yaml file. By default, only chromosome 09 of the rice genome is used, since the qPH9 QTL is located on chromosome 9.
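If you start from a whole-genome FASTA and want a single-chromosome reference, a sketch along these lines could extract one record. The sequence name `chr09` and the toy sequences are assumptions; check the headers of your own FASTA file before relying on a name.

```python
import io


def extract_sequence(fasta_lines, wanted: str):
    """Yield the lines of the FASTA record whose name equals `wanted`.

    The record name is the first whitespace-delimited word after '>'.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = name == wanted
        if keep:
            yield line


# Toy genome for illustration; real headers and sequences will differ.
GENOME = (
    ">chr08 example\nACGT\n"
    ">chr09 example\nTTGA\nCCAT\n"
    ">chr10 example\nGGGG\n"
)

subset = "".join(extract_sequence(io.StringIO(GENOME), "chr09"))
print(subset, end="")
```

Because the function streams line by line, it never loads the whole genome into memory; the resulting file path is what would then go into ref_genome.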

2.3 Change default parameters if needed

  1. The pipeline parameters are visible in config/config.yaml and can be edited before the run is executed.
  2. Change the file paths to the fastq files of each bulk in config/samples.csv.

⚠️ In the samples.csv file, the columns have to be named sample, fq1 and fq2.
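A quick header check before launching the run can catch a misnamed column early. The sketch below is not part of the pipeline; the helper name, the sample names, and the file paths in the examples are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fq1", "fq2"]


def missing_columns(handle) -> list[str]:
    """Return the required column names absent from the CSV header.

    Names are compared exactly, so ' fq1' (leading space) counts as missing.
    """
    header = next(csv.reader(handle), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical file contents, for illustration only.
good = (
    "sample,fq1,fq2\n"
    "high,config/fastq/high_1.fastq.gz,config/fastq/high_2.fastq.gz\n"
    "low,config/fastq/low_1.fastq.gz,config/fastq/low_2.fastq.gz\n"
)
bad = "sample,fastq1,fastq2\nhigh,a.fastq.gz,b.fastq.gz\n"

print(missing_columns(io.StringIO(good)))
print(missing_columns(io.StringIO(bad)))
```

To check a real file, pass `open("config/samples.csv", newline="")` instead of the in-memory strings; an empty list means all three required columns are present.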

2.4 Run the pipeline

  1. Activate the required environment to have all dependencies accessible in your $PATH: conda activate bsaseq
  2. Execute the pipeline: snakemake -j 1 (specify N threads with -j N).

On a cluster managed with SLURM such as the UvA-FNWI crunchomics, if you specify 10 CPUs you can run with:
conda activate bsaseq && sbatch -J qtlseq --time=24:00:00 --cpus-per-task=10 --mem-per-cpu=4G snakemake -j 10

This will (1) activate the bsaseq conda environment with all the required software and (2) submit a SLURM job named "qtlseq" with 10 CPUs and 4 GB of RAM per CPU.
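To adapt the resources without retyping the whole line, the submission command can be assembled from a few values. This is only a sketch: the helper function is hypothetical, and it simply reproduces the sbatch options shown above while keeping the Snakemake -j value in step with the number of CPUs requested.

```python
import shlex


def sbatch_command(job_name: str, cpus: int, mem_per_cpu_gb: int, hours: int) -> list[str]:
    """Build the sbatch argument list, reusing `cpus` for snakemake -j."""
    return [
        "sbatch",
        "-J", job_name,
        f"--time={hours:02d}:00:00",
        f"--cpus-per-task={cpus}",
        f"--mem-per-cpu={mem_per_cpu_gb}G",
        "snakemake", "-j", str(cpus),
    ]


cpus, mem_per_cpu_gb = 10, 4
command = sbatch_command("qtlseq", cpus, mem_per_cpu_gb, hours=24)

# Total memory requested by the job: CPUs x memory per CPU.
total_mem_gb = cpus * mem_per_cpu_gb

print(shlex.join(command))
print(f"total memory requested: {total_mem_gb} GB")
```

With the values above, the job asks for 10 x 4 GB = 40 GB of RAM in total, which is worth comparing with the limits of the cluster before submitting.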

3. Graphs

These graphs display the order of tasks from beginning to end.

For each sample (including reference sample)

[Figure: DAG graph]

For one given sample

[Figure: DAG graph]

4. Downstream analysis with QTLseqR

An example variant table file called rice.variants.tsv.gz and an example R script for QTL-seq analysis with QTLseqr are available in the qtlseqr/ folder.

5. References 📖

QTLseqr software

  • GitHub link

  • Citation: Mansfeld BN, Grumet R. QTLseqr: An R Package for Bulk Segregant Analysis with Next-Generation Sequencing. Plant Genome. 2018 Jul;11(2). doi: 10.3835/plantgenome2018.01.0006. PMID: 30025013.

Snakemake

  • Snakemake Documentation

  • Citation: Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33. Link.

VCF format specifications