A bioinformatics package that includes workflows to generate BAM files and a VCF file from multiplexed GBS paired-end FASTQ files.
This tool uses the workflow management system Snakemake and lets you run the complete pipeline with just a few command-line calls, even on a computing cluster.
An Example Workflow:
- Generate GBS data from paired-end FASTQ files produced by Illumina sequencing technologies
- Split multiplexed, in-line barcoded libraries into separate FASTQ files (read 1 and read 2) for each sample
- Process GBS libraries made up of different species, or samples that require different reference genomes
- End-to-end processing with only two command-line calls to GridEngine: one for producing BAMs, one for producing VCFs
- Quality control: statistics produced for each sample and summarized in a MultiQC report
- Designed to run on high-performance clusters
- Clean data management: the pipeline removes all intermediate files and zips up the demultiplexed files once complete to save disk space on your system
- Custom iGenomics Riptide GBS library processing
- Ease of use: you only need to provide the tool with paths to a sample sheet, a barcode file, and a reference genome
ATTENTION: Already have demultiplexed FASTQ files? Try using the Demux Pipeline.
This quick-use guide assumes you have completed the installation. If not, proceed to the installation section first.
Once your environment is set up, you can start running the pipeline!
Start at the beginning with Generating BAMs, or if you have BAMs you want to process, move on to generating VCFs!
Take paired-end FASTQ files (R1 and R2) and produce sorted BAM files and BAM index files.
1. Create and save your sample sheet on the cluster; it is generally helpful to put it in the same directory as your FASTQ files. For the format, see the sample sheet section of the Supporting Files ReadMe.
2. Change directory to where the FASTQ files are located.
$ cd directory/to/samples.fastq.gz
3. Update the config/config.yaml file. In the terminal, enter the following command:
$ nano ~/gbs/GBS_snakemake_pipeline/config/config.yaml
Manually change the fields for your project. Note that the raw FASTQ files need to be named _R1.fastq.gz and _R2.fastq.gz (a hypothetical rename example follows these steps).
sample_read1: "/input/absolute/path/to/sample_R1.fastq.gz"
samplesheet: "/input/absolute/path/to/samplesheet.txt or samplesheet.csv"
barcodefile: "input/path/to/workflow/resources/barcodes/barcodes.txt"
Press Ctrl+O then Enter to save, and Ctrl+X to exit the nano editor.
4. Run the pipeline by submitting the process via shell script to the GridEngine queue in the terminal.
$ qsub ~/gbs/GBS_snakemake_pipeline/workflow/gbs.sh
Done!
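If your raw FASTQ files do not already follow the _R1.fastq.gz / _R2.fastq.gz naming convention mentioned in step 3, rename them before filling in the config. This is only an illustrative sketch; the file names below are hypothetical Illumina outputs, so substitute your own library names.
$ # rename hypothetical Illumina output files to the names the pipeline expects
$ mv mylibrary_S1_L001_R1_001.fastq.gz mylibrary_R1.fastq.gz
$ mv mylibrary_S1_L001_R2_001.fastq.gz mylibrary_R2.fastq.gz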
Notes:
- This will give you a job number if submitted properly.
- The job will produce Snakemake stats (which rule is running) in your current working directory, in a file named gbs.sh.e(jobnumber)
- You can check that the program is running by typing qstat -f. If it is not running, check the gbs.sh.e(jobnumber) file to see why.
- Runtime is ~1 to 2 days for a standard output from an Illumina HiSeq.
- Output of the tool will be in the directory where your FASTQ files are located.
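For example, a quick way to keep an eye on a submitted run (illustrative only; the job number shown is hypothetical):
$ qstat -f                 # confirm the gbs.sh job is queued or running
$ tail -f gbs.sh.e1234567  # follow the Snakemake rule progress written to the job log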
Take a list of BAM directories and produce a single VCF file, or multiple VCF files, from all the BAM files included in those directories.
1. Update the config/config_vcf.yaml file for the 4 fields. For the format, see the config files section of the Supporting Files ReadMe.
Note: if you want to produce a separate VCF for each chromosome in the species, put True in the split: field. This will speed up the process, as each VCF will be written in parallel. If your reference has scaffolds, it will also produce a separate VCF just for the scaffolds (all together).
$ nano ~/gbs/GBS_snakemake_pipeline/config/config_vcf.yaml
sample_directories:
    - "/absolute/path/to/some/sorted_bams"
    - "/absolute/path/to/some/more/sorted_bams"
    - "/absolute/path/to/even/more/sorted_bams"
# produce a single whole genome VCF (False) or split into 1 VCF for each chromosome (True) -- case sensitive
split: True
outfile: "/absolute/path/to/output/vcf/directory"
reference: "/absolute/path/to/reference_genome.fasta"
Press Ctrl+O then Enter to save, and Ctrl+X to exit the nano editor.
2. (optional) Create a directory for your VCF output (the same directory name as the output path in config_vcf.yaml) and cd into that directory so your Snakemake output log lives with the VCF file. The name of the VCF will be whatever you name the output folder!
$ mkdir vcf_output/ && cd vcf_output
3. Run the pipeline by submitting the process via shell script to the GridEngine queue in the terminal.
$ qsub ~/gbs/GBS_snakemake_pipeline/workflow/vcf.sh
Done!
(optional) 4. Process the VCF file. Check out the VCF_Analysis_tools ReadMe for a couple of tools to process and visualize your VCF:
- A filtering program detailed in the following ReadMe can be used to filter the data for a variety of parameters
- Determine the coverage of SNPs across the genome and graph it using the windowed_variance.py script
A minimal bcftools sketch for a first look at the output VCF follows the notes below.
Notes:
- This will give you a job number if submitted properly.
- You can check that the program is running by typing qstat. Look for vcf.sh in the list. If it is not running, check the vcf.sh.e(jobnumber) file to see why.
- Final outputs will be in the folder you designated in the config_vcf.yaml file
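Since bcftools is already part of the gbs environment, you can get a quick summary of the finished VCF before moving on to the dedicated analysis tools. This is only an illustrative sketch; vcf_output.vcf stands in for whatever your output file is actually named.
$ # summary counts (records, samples, SNPs) for the hypothetical output VCF
$ bcftools stats vcf_output.vcf | grep "number of"
$ # list the sample names included in the VCF
$ bcftools query -l vcf_output.vcf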
1. In your home directory on the cluster/local computer, create a directory called gbs for the pipeline to live in.
$ mkdir gbs
$ cd gbs/
2. Clone the repository into the gbs folder you created on the cluster. You will need git installed on your cluster account to accomplish this; it can be installed with conda install -c anaconda git.
$ git clone https://github.com/elderberry-smells/GBS_snakemake_pipeline.git
Press enter, and your local clone should be created with the following pathways for the pipeline:
~/gbs/GBS_snakemake_pipeline/snakefiles
~/gbs/GBS_snakemake_pipeline/config/
~/gbs/GBS_snakemake_pipeline/workflow/envs/
~/gbs/GBS_snakemake_pipeline/workflow/rules/
~/gbs/GBS_snakemake_pipeline/workflow/scripts/
~/gbs/GBS_snakemake_pipeline/workflow/resources/
3. Install the GBS snakemake environment on your computer. Create an environment for snakemake using the workflow/envs/gbs-spec-file.txt file from the repo you downloaded, then activate it.
$ conda create --name gbs --file ~/gbs/GBS_snakemake_pipeline/workflow/envs/gbs-spec-file.txt
$ conda activate gbs
When that is done running, you should see (gbs) on the left side of your command line.
4a. Install Novocraft 3.10 if you don't have it (https://www.novocraft.com/support/download/), and untar it with tar -xzf novo…tar.gz
4b. Allow Novosort to be a callable program within the gbs environment.
- Move the entire novocraft directory into the recently created gbs environment, and add that path to your bashrc profile:
$ mv ~/gbs/GBS_snakemake_pipeline/workflow/resources/novocraft ~/miniconda3/envs/gbs/bin
$ nano ~/.bashrc
- At the bottom of that file, add the following line:
export PATH="$HOME/miniconda3/envs/gbs/bin/novocraft:$PATH"
- Press Ctrl+O then Enter to save, and Ctrl+X to close.
- Make the novosort file executable by changing its permissions:
$ chmod 777 ~/miniconda3/envs/gbs/bin/novocraft/novosort
- Everything should be installed to run the pipeline now.
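Note that the PATH line added to ~/.bashrc only takes effect in a new shell, so reload it (or log out and back in) before testing. A minimal check, assuming the same paths as above:
$ source ~/.bashrc   # pick up the new novocraft PATH entry in the current shell
$ which novosort     # should print the path to the novosort binary inside the gbs environment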
All of these dependencies should now be installed and callable by typing their names in the terminal:
python=3.6
snakemake=4.0
trimmomatic=0.39
bwa=0.7.17
samtools=1.9
bcftools=1.8
Novosort=2.00.00
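A quick, purely illustrative way to confirm those tools are callable from the activated gbs environment (version flags vary slightly between the tools):
$ conda activate gbs
$ python --version            # expect Python 3.6.x
$ snakemake --version
$ trimmomatic -version
$ bwa 2>&1 | head -n 3        # bwa prints its version in the usage message
$ samtools --version | head -n 1
$ bcftools --version | head -n 1
$ which novosort              # confirms the Novocraft PATH entry from the installation step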
If you have data and a sample sheet to run, test whether the snakemake process is working.
Update the config.yaml file to direct the pipeline to the resources it needs.
$ nano ~/gbs/GBS_snakemake_pipeline/config/config.yaml
Invoke the snakemake process in dry-run mode to see if the outputs are being called correctly.
$ snakemake -s ~/gbs/GBS_snakemake_pipeline/Snakefile -np
The snakemake process should show you everything it intends to do with the samples listed in the sample sheet, and how many processes will run for each module/rule. For example:
Job counts:
count jobs
1 all
184 bwa_map
1 demultiplex
184 novosort
184 trimmomatic
1 fastqc
1 multiqc
1 zipup_demux
1 zipup_unmatched
557
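If the dry run lists the expected jobs with no errors, the pipeline is ready; submit the real run to GridEngine from the directory containing your FASTQ files, exactly as in the quick-use guide above.
$ cd directory/to/samples.fastq.gz
$ qsub ~/gbs/GBS_snakemake_pipeline/workflow/gbs.sh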
A list of user inputs to run the program can be found in the Supporting Files ReadMe.
This includes the config files, samplesheets, and barcode files.
Refer to the docs for formatting of the files.
Examples for samplesheets can be found in workflow/resources/example_samplesheet.csv
Brief documentation on each step in this pipeline can be found in the Features ReadMe
The protocol for library construction is also included in the resources folder
- Author: Brian James (brian.james4@canada.ca)
- Testing and Improvements: Dr. Jana Ebersbach (jana.ebersbach@canada.ca)
- Lead Scientist: Dr. Isobel Parkin (isobel.parkin@canada.ca)
© His Majesty the King in Right of Canada, represented by the Minister of Agriculture and Agri-Food Canada 2020