Bulk-RNA Seq Data Processing Pipeline using NextFlow / Docker

This is an automated workflow for processing and analyzing Bulk-RNA Seq data. It is implemented primarily in Bash, Python, and R and wrapped in a NextFlow workflow, with the goal of characterizing the gene landscape in the samples. The data processing steps are:

1. Quality Control - Generate FastQC and MultiQC reports
2. Genome Alignment - Map reads to the reference genome using STAR
   - 2_1. Mapping Metrics - Generate mapping statistics and quality reports
3. Post-Alignment Processing - Filter, deduplicate, and index aligned reads
   - 3_1. Quality Assessment - Generate Qualimap reports for aligned reads
4. Transcript Assembly and Quantification - Generate counts using StringTie
5. Raw Count Generation - Generate raw counts using HTSeq and Rsubread's featureCounts

Running the Bulk-RNA-Sequencing-Nextflow-Pipeline is straightforward, but a good understanding of Bash is recommended.

There are two ways to run the Bulk-RNA Seq pipeline: install the necessary packages manually on the system, or use a Docker container in which everything is pre-packaged. If you choose Docker, skip to the section Running the Tool in Docker.

Installation/Setup of Bulk-RNA Seq NextFlow Pipeline:

You can install Bulk-RNA Seq NextFlow Pipeline via git:

git clone https://github.com/utdal/Bulk-RNA-Seq-Nextflow-Pipeline.git

Before running the tool, edit the following files:

a) pipeline.config
b) rna_seq_samples.txt

Note:

Install the Nextflow, conda, R with Rsubread (featureCounts), FastQC, MultiQC, STAR, HTSeq, StringTie, samtools, sambamba, bedtools, and Qualimap packages.
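
For a manual installation, it can help to confirm that each tool is reachable before starting a run. The following is a minimal sketch, not part of the pipeline itself: the executable names (for example `htseq-count` for HTSeq and `Rscript` for R) are assumptions about a typical installation, and the function name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names for the packages listed above; adjust them
# if your installation exposes different command names.
REQUIRED_TOOLS = (
    "nextflow", "conda", "Rscript", "fastqc", "multiqc", "STAR",
    "htseq-count", "stringtie", "samtools", "sambamba", "bedtools",
    "qualimap",
)


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `find_missing_tools()` with no arguments returns the list of commands that still need to be installed; an empty list means every assumed executable was found. Rsubread is an R package rather than a command, so it is not covered by this check.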

Or simply use the docker container instead from docker-hub: docker pull

Download the reference genome hg38canon.fa and build the index by running: bowtie2-build hg38canon.fa /path/to/reference_genome/index/hg38

Running the Tool:

Here is an example of how to run the pipeline:

Edit all the parameters in the pipeline.config file except the clipping parameters.

Note: The clipping parameters can only be set after Step 2 (FastQC and MultiQC) has run, because they are chosen from the FastQC reports.

Command to run the Quality Control steps in Bulk-RNA Seq analysis:

nextflow run bulk_rna_seq_nextflow_pipeline.nf -c pipeline.config --run_fastqc true --run_rna_pipeline false

Update the clipping parameters in the pipeline.config file and run the rest of the pipeline:

nextflow run bulk_rna_seq_nextflow_pipeline.nf -c pipeline.config --run_fastqc false --run_rna_pipeline true

If a run fails, fix the issue and re-run the same command with -resume to continue from the last checkpoint. For the Quality Control steps:

nextflow run bulk_rna_seq_nextflow_pipeline.nf -c pipeline.config --run_fastqc true --run_rna_pipeline false -resume

For the rest of the pipeline:

nextflow run bulk_rna_seq_nextflow_pipeline.nf -c pipeline.config --run_fastqc false --run_rna_pipeline true -resume
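
The two-stage invocation above differs only in which flag is set to true and whether -resume is appended. As an illustration, the sketch below assembles the same command lines from a stage name. The stage labels `"qc"` and `"rna"` and the function name `build_nextflow_command` are hypothetical and exist only for this example; the flags themselves are the ones shown in the commands above.

```python
def build_nextflow_command(stage, config="pipeline.config", resume=False):
    """Build the argument list for one stage of the pipeline.

    `stage` is a hypothetical label: "qc" runs the FastQC/MultiQC stage,
    "rna" runs the rest of the pipeline.
    """
    if stage not in ("qc", "rna"):
        raise ValueError("stage must be 'qc' or 'rna'")
    command = [
        "nextflow", "run", "bulk_rna_seq_nextflow_pipeline.nf",
        "-c", config,
        "--run_fastqc", "true" if stage == "qc" else "false",
        "--run_rna_pipeline", "true" if stage == "rna" else "false",
    ]
    if resume:
        # -resume continues from the last successful checkpoint.
        command.append("-resume")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, or joined with spaces to print the command for review before running it.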

The results are stored in the directory set by params.config_directory = '/path/to/config' in the pipeline.config file.

Running the Tool in Docker:

Running Bulk-RNA Seq in Docker is straightforward. Here is an example of how to run the pipeline using Docker.

Check if docker is already installed:
```
docker --version
```
The following input and configuration files are required to run the tool:
- pipeline.config
- rna_seq_samples.txt
The following files are user-defined:
- gencode.v38.primary_assembly.annotation.gtf
- filter.bed
- blacklist.bed
- STAR_index
- fastq_files
Place all the necessary files in the data subdirectory of the config directory, i.e., /mnt/Working/BulkRNA-NextFlow-Pipeline/data, using a Docker volume mount.
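
Before starting the container, a quick check of the host directory that will be mounted can catch missing inputs early. This is a minimal sketch under stated assumptions: the user-defined entries are taken literally from the list above, whereas the names actually required are whatever your pipeline.config points to, and the function name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# Files the tool requires, as listed above.
REQUIRED_FILES = ["pipeline.config", "rna_seq_samples.txt"]

# User-defined inputs, using the example names from the list above;
# replace them with the names referenced in your pipeline.config.
USER_DEFINED = [
    "gencode.v38.primary_assembly.annotation.gtf",
    "filter.bed",
    "blacklist.bed",
    "STAR_index",
    "fastq_files",
]


def missing_inputs(data_dir, names=None):
    """Return the expected entries that do not exist under `data_dir`."""
    if names is None:
        names = REQUIRED_FILES + USER_DEFINED
    root = Path(data_dir)
    return [name for name in names if not (root / name).exists()]
```

For example, `missing_inputs("/path/to/mount")` lists whatever is still absent from the directory that will be mounted into the container.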

Note: The config directory in the Docker image is /mnt/Working/BulkRNA-NextFlow-Pipeline, and all data added via a Docker volume mount is accessible from the data directory (/mnt/Working/BulkRNA-NextFlow-Pipeline/data). Modify the pipeline.config file accordingly.
1. Paired-end fastq files in a fastq_files directory.
2. Bowtie2 genome index files in a directory (e.g., hg38).
3. Reference genome from NCBI in the refdata-gex-GRCh38-2020-A directory.
4. rna_seq_samples.txt containing sample names without paired-end information.
5. pipeline.config file containing paths to all the necessary files and the genome reference.
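
Since rna_seq_samples.txt holds sample names without paired-end information, the list can be derived from the FASTQ file names. The sketch below is an illustration only and rests on two assumptions that the pipeline may not share: that files are named like `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` (optionally with a numeric suffix such as `_001`), and that rna_seq_samples.txt holds one sample name per line. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed paired-end naming: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz,
# optionally with a numeric suffix such as _001 and with .fq or no .gz.
PAIRED_SUFFIX = re.compile(r"_R[12](_\d+)?\.(fastq|fq)(\.gz)?$")


def sample_names(filenames):
    """Strip the paired-end suffix and return unique, sorted sample names."""
    names = set()
    for filename in filenames:
        stripped, replaced = PAIRED_SUFFIX.subn("", filename)
        if replaced:  # skip files that do not match the assumed scheme
            names.add(stripped)
    return sorted(names)


def write_sample_sheet(fastq_dir, out_path):
    """Write one sample name per line (assumed format) and return the names."""
    names = sample_names(p.name for p in Path(fastq_dir).iterdir())
    Path(out_path).write_text("".join(f"{name}\n" for name in names))
    return names
```

With these assumptions, `write_sample_sheet("fastq_files", "rna_seq_samples.txt")` produces one line per sample. Compare the result with the rna_seq_samples.txt shipped in the repository before relying on it.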

Run the docker image by setting up a working directory and mounting a volume where the input and configuration files are located.

docker run -it -v /path/to/mount:/mnt/Working/Bulk_RNA_Seq_NextFlow_Pipeline/data -w /mnt/Working/Bulk_RNA_Seq_NextFlow_Pipeline utdpaincenter/bulk_rna_sequencing_nextflow_pipeline:latest /bin/bash

After entering the container, run the following commands:
Activate the working environment:
conda activate bulk_rna_seq
Run the nextflow pipeline:
nextflow run bulk_rna_seq_nextflow_pipeline.nf -c data/pipeline.config --run_fastqc true --run_rna_pipeline false
Modify the pipeline.config file with the clipping parameters after the FastQC step and run the rest of the pipeline:
nextflow run bulk_rna_seq_nextflow_pipeline.nf -c data/pipeline.config --run_fastqc false --run_rna_pipeline true
If the pipeline encounters errors, fix the issues and resume the process from the last checkpoint with:
nextflow run bulk_rna_seq_nextflow_pipeline.nf -c data/pipeline.config --run_fastqc false --run_rna_pipeline true -resume

Once the run is complete, all output files are copied back to the mounted volume, into a /outs directory under /path/to/mount.

Output:

The output files are stored in the config directory or on the volume mount path, depending on whether the pipeline is run locally or with Docker, respectively. The output should contain the following files and folders:

Pipeline Stages

| Step | Description |
| --- | --- |
| 1. Logs | Collection of logs |
| 2. FastQC and MultiQC | Quality check using FastQC and report using MultiQC |
| 3. Mapping and Map Metrics | Alignment of reads and generation of mapping statistics |
| 4. Filtering and Quality Metrics | Post-alignment filtering and quality assessment |
| 5. StringTie and Raw Counts | Counts using StringTie and FeatureCounts/HTSeq |

Credits and Acknowledgments

This Bulk-RNA Seq Data Processing Pipeline was developed with contributions from the following team members:

Authors:
- Dr. Tavares Ferreira, Diana
- Dr. Mazhar, Khadijah
- Inturi, Nikhil Nageshwar - inturinikhilnageshwar@gmail.com
