Snakefile, config and scripts to run an RNA-seq analysis

To use this pipeline, create a new repo using this one as a template (click the Use Template button).

Set up Environment

Next, install the required software. The easiest way is to install Miniconda and run Snakemake with the --use-conda option. To create a minimal environment with just enough to run Snakemake, use:

conda env create -f envs/snake.yaml

This will create a conda environment called snake. If you want a different name, change it in the YAML file before creating the environment. Then you can activate it with:

conda activate snake

The environment snake_no_conda should have everything needed to execute all the steps, but it has not been tested extensively, and on some systems it will not install at all.

Set up Config File and Metadata

You will need to create a metadata file for the samples.

The metadata file should be a tab-delimited file with one line per sample. It must include a column that gives the sample ID, and each ID must match a sample ID in the configuration file.
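As a quick sanity check before running anything, the sample IDs in the metadata can be compared with those in the configuration. The following is only a sketch: the metadata content, the Time column values, and the helper name check_sample_ids are made up for illustration, and a real check would read the file named by metadata_file instead of an inline string.

```python
import csv
import io

# Hypothetical metadata content (tab delimited). A real file would be read
# from the path given by metadata_file in the config.
META_TEXT = "Sample\tTime\nCon_1\t0\nT_120d_1\t120d\n"

# Sample IDs as they would appear under `samples:` in the config.
CONFIG_SAMPLES = ["Con_1", "T_120d_1"]


def check_sample_ids(meta_text, sample_column, config_samples):
    """Return (in config but not metadata, in metadata but not config)."""
    reader = csv.DictReader(io.StringIO(meta_text), delimiter="\t")
    if sample_column not in (reader.fieldnames or []):
        raise ValueError(f"column {sample_column!r} not found in metadata")
    meta_ids = {row[sample_column] for row in reader}
    config_ids = set(config_samples)
    return sorted(config_ids - meta_ids), sorted(meta_ids - config_ids)


missing_meta, missing_config = check_sample_ids(META_TEXT, "Sample", CONFIG_SAMPLES)
print("In config but not in metadata:", missing_meta)
print("In metadata but not in config:", missing_config)
```

Both lists being empty means every sample is described in both places.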

You will need to create a configuration file.

The configuration file format is:

samples:
  Con_1:
    F: Con_1_1.fq.gz
    R: Con_1_2.fq.gz
  T_120d_1:
    F: T_120d_1_1.fq.gz
    R: T_120d_1_2.fq.gz

reads_folder: reads
input_file_suffix: .fq.gz

reference_base: Data/hg38_rna
salmon_bootstraps: 30

# Note: organism_db also needs to be added to rnaseq_GO.yaml in envs
organism_db: org.Hs.eg.db
kegg_db: hsa

metadata_file: meta.txt
sample_column: Sample

# Differential Analysis
experiments:
  120d:
    prefix: 120d
    filter: Time == '120d' | Time == '0'
    PCA_Group: Time
    design: Time
    contrast: c('Time', '120d', '0')
    display_column: Sample

Item	Example	Comment
samples:	Con_1: F: Con_1_1.fq.gz R: Con_1_2.fq.gz	Sample ID, followed by the forward and reverse file names. If you have many files, the following shell command may help, assuming the files are named {sample_name}_R1.fastq.gz: for f in *R1*; do nm=${f%_R*}; echo "  ${nm}:"; echo "    F: $f"; echo "    R: ${f/R1/R2}"; done The spacing inside the echo strings matters, because it sets the indent levels. Avoid using an integer as a sample name, as this will likely cause issues.
reads_folder:	reads	The folder where the reads are stored
input_file_suffix:	.fq.gz	The suffix of the read files
reference_base:	Data/hg38_rna	Location and name of the reference transcripts. The actual filename should end with .fa. I generally get this from Ensembl, getting both CDS and ncRNA, then cat them into a single file.
salmon_bootstraps:	30	Number of bootstraps in the salmon quantification. This is mainly important if you are going to do transcript level differential expression (outside this pipeline) and don't want to re-run salmon for that.
organism_db:	org.Hs.eg.db	What organism is this? A list of prebuilt annotations is here. Additional resources are at AnnotationHub, or build your own using AnnotationForge. This organism also needs to be added to rnaseq_
kegg_db:	hsa	Three letter KEGG organism ID. Here is a list of available IDs.
metadata_file:	meta.txt	The metadata file. See above for a description.
sample_column:	Sample	The name of the column in the metadata file that corresponds to the sample ID.
experiments:	120d:	A list of every contrast to make on the data
Experiment Parameters
prefix:	120d	A repeat of the experiment name. Output folders will have this name.
filter:	Time == '120d' | Time == '0'	Select the samples you want using standard R logic. This will be executed inside a dplyr filter command. Any column in the metadata file can be used. If you do not want to do any filtering, simply use TRUE.
PCA_Group:	Time	A column from the metadata file, it will be used for color groups in the PCA plot.
design:	Time	Design of the experiment. See the DESeq2 documentation for details.
contrast:	c('Time', '120d', '0')	The contrast to be used; what do you want compared. See the DESeq2 documentation for details.
display_column:	Sample	A column from the metadata file that you want samples labeled with in figures.
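For the samples entry described in the table above, the block can also be generated from a list of file names. This is a minimal Python sketch in the spirit of the shell loop, not part of the pipeline: the function name samples_block, the _R1/_R2 tags, and the example file names are assumptions, and the two-space indent steps are a choice made here.

```python
def samples_block(filenames, forward_tag="_R1", reverse_tag="_R2"):
    """Build the text of a `samples:` config block from paired read file names."""
    names = set(filenames)
    lines = ["samples:"]
    for name in sorted(names):
        pos = name.rfind(forward_tag)
        if pos == -1:
            continue  # not a forward read file
        sample = name[:pos]
        # The mate has the same name with the forward tag swapped for the reverse tag.
        mate = sample + reverse_tag + name[pos + len(forward_tag):]
        if mate not in names:
            raise ValueError(f"no reverse read file found for {name}")
        lines.append(f"  {sample}:")
        lines.append(f"    F: {name}")
        lines.append(f"    R: {mate}")
    return "\n".join(lines)


# Hypothetical file names following the {sample_name}_R1.fastq.gz pattern.
files = [
    "Con_1_R1.fastq.gz",
    "Con_1_R2.fastq.gz",
    "T_120d_1_R1.fastq.gz",
    "T_120d_1_R2.fastq.gz",
]
print(samples_block(files))
```

In practice the list could come from os.listdir on the reads folder, and the printed text pasted into the configuration file.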

To Run the Analysis

conda activate snake

snakemake -n
nohup snakemake --cores 32 --use-conda -k &

conda deactivate

The first snakemake command is a dry run, which builds the DAG and determines which jobs need to be run. This is good practice and helps troubleshoot issues with the config file or filenames. The second snakemake command runs the pipeline. Use --cores to specify how many cores to use. --use-conda tells Snakemake to create a conda environment for each step. -k tells Snakemake to keep running independent jobs if some of the jobs fail. I prefer the ease of nohup, but use screen, tmux, or whatever you like; it should have no effect on how the jobs run.
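If you launch the pipeline from a script rather than by hand, the two commands can be assembled programmatically. The sketch below only builds and prints the argument lists, using the flags described above; the function name snakemake_commands and its parameters are hypothetical, and nothing is executed.

```python
import shlex


def snakemake_commands(cores, use_conda=True, keep_going=True):
    """Return (dry_run, full_run) argument lists for snakemake."""
    dry_run = ["snakemake", "-n"]
    full_run = ["snakemake", "--cores", str(cores)]
    if use_conda:
        full_run.append("--use-conda")  # one conda environment per step
    if keep_going:
        full_run.append("-k")  # keep running independent jobs after a failure
    return dry_run, full_run


dry_run, full_run = snakemake_commands(32)
print(shlex.join(dry_run))
print(shlex.join(full_run))
```

Each list could then be passed to subprocess.run with check=True, so the full run only starts if the dry run succeeds.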

The Results

Two folders will be created, one called QC and the other results.

QC

report.html is the MultiQC report on read and quantification quality. FastQC is a directory that contains the individual FastQC reports for each sample.

results

For each contrast, a separate folder is created, named with the experiment prefix. This folder contains a self-contained HTML report that gives an overview of the results and shows some figures. Each experiment folder also contains three subfolders: deseq, GO, and KEGG. These hold output tables and several figures.
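The layout described above can be checked after a run with a few lines of Python. This is a sketch under assumptions: it supposes that report.html and FastQC sit directly inside QC and that each experiment folder sits directly inside results, and the helper names are invented for illustration.

```python
from pathlib import Path


def expected_outputs(prefixes, root="."):
    """List the output paths described above for the given experiment prefixes."""
    root = Path(root)
    paths = [root / "QC" / "report.html", root / "QC" / "FastQC"]
    for prefix in prefixes:
        for sub in ("deseq", "GO", "KEGG"):
            paths.append(root / "results" / prefix / sub)
    return paths


def missing_outputs(prefixes, root="."):
    """Return the expected paths that do not exist yet."""
    return [p for p in expected_outputs(prefixes, root) if not p.exists()]


for path in expected_outputs(["120d"]):
    print(path)
```

Calling missing_outputs with the prefixes from the experiments section of the config gives a quick list of anything that was not produced.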


NielInfante/RNAseq_snakemake
