Much of this pipeline was inspired by https://github.com/snakemake-workflows and https://github.com/crazyhottomy. The fastq2json.py script was modified from the original by https://github.com/crazyhottomy, and the Snakefile and modularized rules were inspired by https://github.com/snakemake-workflows. All other files in rules and scripts are my own work. If you use this pipeline, please cite Manninm/MiKTMCSnakemakePipeline.
Most of the specifics of the pipeline can be handled in the config.yaml file. The Snakefile, rules and cluster.json SHOULD NOT BE EDITED BY HAND. If you absolutely need to edit cluster.json, I recommend https://jsoneditoronline.org/. Snakemake is very sensitive to syntax, and just saving a file in the wrong format can cause problems.
Download the pipeline from GitHub or transfer the pipeline from my home directory on the 76 server
tar -xvf MiKTMCSnakemakePipeline.tar.gz
mv -v MiKTMCSnakemakePipeline/* .
rm -r MiKTMCSnakemakePipeline/
Do a dry run to check outputs and rules
snakemake -npr -s Snakefile
Make DAG or Rulegraph
snakemake --forceall --rulegraph -s Snakefile | dot -Tpng > rulegraph.png
snakemake --forceall --rulegraph -s Snakefile | dot -Tpdf > rulegraph.pdf
snakemake --forceall --dag -s Snakefile | dot -Tpng > dag.png
snakemake --forceall --dag -s Snakefile | dot -Tpdf > dag.pdf
Run locally using 22 cores
snakemake -j 22 -s Snakefile
Run on Greatlakes and Slurm
FYI, every {cluster.<name>} value used in the snakemake command call must be defined somewhere in cluster.json, whether under the default heading or the rule heading. If --ntasks-per-node is used in the command call and only --cpus-per-task is in your default/rule heading, snakemake will complain that "Wildcards have no attribute..."
snakemake -j 999 --cluster-config cluster.json --cluster 'sbatch --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem-per-cpu {cluster.mem-per-cpu} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
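The rule above (every {cluster.<name>} in the command must exist in cluster.json) can be checked before submitting. The following is a minimal sketch, not part of the pipeline: the function names are made up for illustration, and it assumes the default heading in cluster.json is called `__default__`, which is the usual Snakemake convention but should be confirmed against your own file.

```python
import re


def cluster_placeholders(command):
    """Return the names used as {cluster.<name>} in an sbatch command template."""
    return set(re.findall(r"\{cluster\.([A-Za-z0-9_-]+)\}", command))


def missing_keys(command, cluster_config, default_heading="__default__"):
    """Map each heading in cluster_config to the placeholders it cannot resolve.

    A rule heading inherits everything under the default heading (assumed to be
    "__default__"), so a name is missing only if it is absent from both.
    An entry for the default heading itself means rules without their own
    heading would fail.
    """
    needed = cluster_placeholders(command)
    defaults = cluster_config.get(default_heading, {})
    report = {}
    for heading, settings in cluster_config.items():
        absent = needed - (set(defaults) | set(settings))
        if absent:
            report[heading] = sorted(absent)
    return report


if __name__ == "__main__":
    # Shortened command and made-up config values, for illustration only.
    sbatch = ("sbatch --job-name {cluster.job-name} "
              "--ntasks-per-node {cluster.ntasks-per-node} "
              "--cpus-per-task {threads}")
    example_config = {
        "__default__": {"job-name": "example"},
        "some_rule": {"ntasks-per-node": 1},
    }
    for heading, absent in missing_keys(sbatch, example_config).items():
        print(f"{heading}: missing {', '.join(absent)}")
```

To check a real file, load it with `json.load(open("cluster.json"))` and pass the full sbatch string from the command above. `{threads}` is ignored on purpose because Snakemake fills it from the rule, not from cluster.json.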
CAUTION: Four or more samples must be included, or the PCA scripts will break. The pipeline expects paired-end reads. To my knowledge, it will not accommodate single-end reads.
The pipeline expects a directory layout like the example below.
RNAseqTutorial/
├── Sample_70160
│ ├── 70160_ATTACTCG-TATAGCCT_S1_L001_R1_001.fastq.gz
│ └── 70160_ATTACTCG-TATAGCCT_S1_L001_R2_001.fastq.gz
├── Sample_70161
│ ├── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R1_001.fastq.gz
│ └── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R2_001.fastq.gz
├── Sample_70162
│ ├── 70162_CGCTCATT-ATAGAGGC_S3_L001_R1_001.fastq.gz
│ └── 70162_CGCTCATT-ATAGAGGC_S3_L001_R2_001.fastq.gz
├── Sample_70166
│ ├── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R1_001.fastq.gz
│ └── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R2_001.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
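The layout above can be sanity-checked before a run. This is a minimal sketch and not part of the pipeline: the function names are hypothetical, and it assumes read files carry `_R1_` / `_R2_` in their names as in the example tree.

```python
from pathlib import Path


def find_pairs(root):
    """Map each Sample_* directory name to its sorted R1 and R2 fastq file names."""
    pairs = {}
    for sample_dir in sorted(Path(root).glob("Sample_*")):
        if not sample_dir.is_dir():
            continue
        files = sorted(
            p.name for p in sample_dir.iterdir()
            if p.name.endswith((".fastq.gz", ".fq.gz"))
        )
        pairs[sample_dir.name] = {
            "R1": [f for f in files if "_R1_" in f],
            "R2": [f for f in files if "_R2_" in f],
        }
    return pairs


def layout_problems(pairs, min_samples=4):
    """Return human-readable problems: too few samples or unpaired reads."""
    problems = []
    if len(pairs) < min_samples:
        problems.append(
            f"only {len(pairs)} sample(s) found, need at least {min_samples}"
        )
    for sample, reads in pairs.items():
        if not reads["R1"] or len(reads["R1"]) != len(reads["R2"]):
            problems.append(f"{sample}: R1/R2 files are not paired")
    return problems


if __name__ == "__main__":
    # Run from the project directory (RNAseqTutorial/ in the example).
    for problem in layout_problems(find_pairs(".")):
        print(problem)
```

No output means at least four Sample_* directories were found and each has matching R1 and R2 files.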
The pipeline uses two types of annotation and feature calling for redundancy, in the event that one pipeline fails or gives 'wonky' results.
Upon initiating the snakemake file, the snakemake preamble will check fastq file extensions (our lab uses .fq.gz for brevity) and change any fastq.gz to fq.gz. The preamble will then generate a samples.json file using fastq2json.py. You should check samples.json and make sure it is correct, because the rest of the pipeline uses this file to create wildcards, which are the driving force behind snakemake.
If no group file (groups.txt) was provided, the preamble will generate one for you. This file is necessary to run ballgown as well as the PCA plots. It should also be checked for errors. If you provide your own groups.txt, it should be in the format below
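One way to check samples.json is to confirm that every .fq.gz file on disk is mentioned in it. The sketch below is not part of the pipeline and deliberately makes no assumption about how fastq2json.py structures the JSON: it collects every string value and compares file names only. The function names are hypothetical, and it assumes file names are unique across samples.

```python
import json
from pathlib import Path


def string_leaves(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from string_leaves(value)
    elif isinstance(node, list):
        for item in node:
            yield from string_leaves(item)


def unlisted_fastqs(samples, root):
    """Return .fq.gz file names under Sample_* that samples.json never mentions."""
    listed = {Path(s).name for s in string_leaves(samples)}
    on_disk = {p.name for p in Path(root).glob("Sample_*/*.fq.gz")}
    return sorted(on_disk - listed)


def check_samples_json(json_path="samples.json", root="."):
    """Load samples.json and print any fastq files it does not list."""
    with open(json_path) as handle:
        samples = json.load(handle)
    for name in unlisted_fastqs(samples, root):
        print(f"not in {json_path}: {name}")
```

Calling `check_samples_json()` from the project directory after the preamble has run prints nothing when every read file is accounted for.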
Directory Samples Disease Batch
Sample_70160/ Sample_70160 Sample Batch
Sample_70161/ Sample_70161 Sample Batch
Sample_70162/ Sample_70162 Sample Batch
Sample_70166/ Sample_70166 Sample Batch
The directory and sample names should correspond and be in the order in which they appear in the directory. The Disease and Batch columns can be used to designate phenotype data and any batches you may have. If you have varying 'Disease' types, you can then use this file for differential expression and use the Batch column to correct for batch effects. The PCA plotting scripts will plot Disease types in different colors, and different Batches with different shapes.
I have attempted to make this pipeline as streamlined and automatic as possible. It could incorporate differential expression, but I feel that the pipeline completes sufficient tasks for review before differential analysis. In the event that a cohort has Glom and Tub samples, it would be wise to run each separately in its own pipeline. Adding another child directory would be more difficult to code rules for. If there are any plots, QC tools or metrics that you use in your personal analysis, those can be integrated upon request.
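A hand-written groups.txt can be checked against the format described above. This is a minimal sketch, not part of the pipeline: the function name is hypothetical, and it assumes columns are separated by whitespace (tabs or spaces) with no spaces inside a value.

```python
EXPECTED_HEADER = ["Directory", "Samples", "Disease", "Batch"]


def groups_problems(text, min_samples=4):
    """Return a list of problems found in the contents of a groups.txt file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header should be: " + " ".join(EXPECTED_HEADER)]
    problems = []
    body = rows[1:]
    if len(body) < min_samples:
        problems.append(
            f"only {len(body)} sample row(s), need at least {min_samples}"
        )
    # Row 1 is the header, so sample rows are numbered from 2.
    for number, fields in enumerate(body, start=2):
        if len(fields) != len(EXPECTED_HEADER):
            problems.append(
                f"row {number}: expected 4 columns, found {len(fields)}"
            )
            continue
        directory, sample = fields[0], fields[1]
        # The Directory column carries a trailing slash in the example format.
        if directory.rstrip("/") != sample:
            problems.append(
                f"row {number}: directory {directory} does not match sample {sample}"
            )
    return problems


if __name__ == "__main__":
    example = (
        "Directory Samples Disease Batch\n"
        "Sample_70160/ Sample_70160 Sample Batch\n"
        "Sample_70161/ Sample_70161 Sample Batch\n"
        "Sample_70162/ Sample_70162 Sample Batch\n"
        "Sample_70166/ Sample_70166 Sample Batch\n"
    )
    print(groups_problems(example) or "groups.txt looks consistent")
```

For a real file, pass `open("groups.txt").read()` instead of the example string. The check does not verify row order against the directory listing.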