Copyright 2017-2022: Vedran Franke, Bora Uyar, Ricardo Wurmus, Altuna Akalin. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.
PiGX scRNAseq is an analysis pipeline for preprocessing and quality control for single cell RNA sequencing experiments. The inputs are read files from the sequencing experiment, and a configuration file which describes the experiment. It produces processed files for downstream analysis and interactive quality reports. The pipeline is designed to work with UMI based methods. It currently supports all methods which output paired adapter - read files. The pipeline was heavily influenced by the Dropseq pipeline from the McCaroll lab.
- Quality control reads using fastQC and multiQC
- Automatically determines the appropriate cell number
- Constructs the digital gene expression matrix
- Calculates per sample and per cell statistics
- Prepares a quality control report
- Normalizes data and does dimensionallity reduction
- bam files
- bigwig files
- UMI and read count matrices
- Quality control report
- SingleCellExperiment object with pre-calculated statistics and dimensionallity reductions
Figure 1: An overview of the PiGx scRNA-seq workflow
You can install this pipeline and all of its dependencies through GNU Guix:
guix install pigx-scrnaseq
You can also install it manually from source. You can find the latest release here. PiGx uses the GNU build system. Please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:
./configure \
--prefix=/some/where
make install
By default the configure
script expects tools to be in a directory
listed in the PATH
environment variable. If the tools are installed
in a location that is not on the PATH
you can tell the configure
script about them with variables. Run ./configure --help
for a list
of all variables and options.
You can prepare a suitable environment with Conda or with GNU Guix.
Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available:
guix shell
To run PiGx on your experimental data, first enter the necessary parameters in the spreadsheet file (see following section), and then from the terminal type. To run the pipeline, you will also need the appropriate genome sequence in fasta format, and the genome annotation in a gtf format.
$ pigx-scrnaseq [options] sample_sheet.csv -s settings.yaml
To see all available options type the --help
option
$ pigx-scrnaseq --help
usage: pigx-scrnaseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET]
[-n] [--graph GRAPH] [--force] [--reason] [--unlock]
samplesheet
PiGx scRNAseq Pipeline.
PiGx scRNAseq is a data processing pipeline for single cell RNAseq read data.
positional arguments:
samplesheet The sample sheet containing sample data in CSV format.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-s SETTINGS, --settings SETTINGS A YAML file for settings that deviate from the defaults.
-c CONFIGFILE, --configfile CONFIGFILE The config file used for calling the underlying snakemake process. By
default the file 'config.json' is dynamically created from the sample
sheet and the settings file.
--target TARGET Stop when the named target is completed instead of running the whole
pipeline. The default target is "final-report". Pass "--target=help"
to describe all available targets.
-n, --dry-run Only show what work would be performed. Do not actually run the
pipeline.
--graph GRAPH Output a graph in Graphviz dot format showing the relations between
rules of this pipeline. You must specify a graph file name such as
"graph.pdf".
--force Force the execution of rules, even though the outputs are considered
fresh.
--reason Print the reason why a rule is executed.
--unlock Recover after a snakemake crash.
This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.
The sample sheet is a tabular file describing the experiment. The table has the following columns:
name | barcode | reads | method | covariate1 | covariate2 |
---|
- name - name for the sample, which will be used to label the sample in all downstream analysis
- barcode - fastq file containing the adapter sequences
- reads - fastq file containing the sequenced reads
- location of these files is specified in
settings.yaml
- location of these files is specified in
- method - sequencing platform on which the experiment was performed (i.e. dropseq)
- covariates - variables which describe the samples. For example: replicate, time, hour post infection, tissue ...
Additional columns may be included which may be used as covariates in the differential expression analysis (sex, age, different treatments).
The settings file is a YAML file which specifies:
-
Locations:
- The locations of the reads (directory where
fastq
files are located) - The location of the output directory
- The location of the
fasta
file with the reference genome (must be prepared by the user) - The location of a
GTF
file with genome annotations
- The locations of the reads (directory where
-
Genome assembly name (i.e. mm10)
-
covariates
In order to get started, enter pigx-scrnaseq --init-settings my_settings.yaml
. This will create a file called my_settings.yaml
with the default structure. The file will look like this:
locations:
output-dir: out/
reads-dir: sample_data/reads/
tempdir:
covariates: 'covariate1, covariate2, ...'
annotation:
primary:
genome:
name: hg19
fasta: sample_data/genome.fa
gtf: sample_data/genome.gtf
execution:
submit-to-cluster: no
jobs: 6
nice: 19
Single cell expression analysis is data intensive, and requires substantial computing resources. The pipeline uses the STAR aligner for read mapping, so the memory requirements will scale with the size of the genome. Please look at the STAR manual for the concrete number about the memory requirements. For the human/mouse genome it requires ~ 40Gb of RAM. The pipeline produces temporary files which require a substantial amount of disk space. Please ensure that you have at least 30Gb of disk space per 100 milion sequenced reads. The location of the temporary directory can be controlled using the tempdir: variable in the settings.yaml. By default the tempdir is set to /tmp.
Important: please make sure that the temporary directory has adequate free space
The output directory structure should look like the following tree
|-- Annotation
| `-- genome_name (i.e. mm10)
| `-- STAR_INDEX
|-- Log
|-- Mapped
| |-- Sample_1
| | `-- genome_name
| |-- Sample_2
| | `-- genome_name
| |-- Sample_3
| | `-- genome_name
| `-- Sample_4
| `-- genome_name
Contains pre-processed fasta and gtf file, along with the STAR genome index. The genome fasta file is processed into a dict header. The gtf file has gene_names replaced with gene_id.
Important: We sincerely advise that you check that the gtf file corresponds to the same organism and genome version as the genome fasta files. The chromosome names have to completely correspond between the two files.
We encourage users to use both the genome annotation and the fasta file from the ENSEMBL database.
Contains execution logs for every step of the pipeline.
The Mapped folder contains per sample processed single cell samples. Additionally, it contains a loom file with merged expression values from all experiments, an RDS file with a saved SingleCellExperiment object, and a quality control report in the html format.
Analaysis results for each sample are done in a separate subdirectory under Mapped. Structure of analysis results:
|-- Sample1.fastq.bam
|-- Sample1_1_fastqc.html
|-- Sample1_1_fastqc.zip
|-- Sample1_2_fastqc.html
|-- Sample1_2_fastqc.zip
|-- genome_name
| |-- adapter_trimming_report.txt
| |-- Sample1_genome_name.bw
| |-- Sample1_genome_name.m.bw
| |-- Sample1_genome_name.p.bw
| |-- Sample1_genome_name_BAMTagHistogram.txt
| |-- Sample1_genome_name_DownstreamStatistics.txt
| |-- Sample1_genome_name_READS.Matrix.txt
| |-- Sample1_genome_name_ReadCutoff.png
| |-- Sample1_genome_name_ReadCutoff.yaml
| |-- Sample1_genome_name_ReadStatistics.txt
| |-- Sample1_genome_name_Summary.txt
| |-- Sample1_genome_name_UMI.Matrix.loom
| |-- Sample1_genome_name_UMI.Matrix.txt
| |-- polyA_trimming_report.txt
| |-- star.Log.final.out
| |-- star.Log.out
| |-- star.Log.progress.out
| |-- star.SJ.out.tab
| |-- star_gene_exon_tagged.bai
| |-- star_gene_exon_tagged.bam
| |-- unaligned_tagged_Cellular.bam_summary.txt
| `-- unaligned_tagged_Molecular.bam_summary.txt
-
Sample1.fastq.bam - contains merged barcode and sequence fq files
-
Sample1_genome_name.bw - bigWig file constructed from selected cells. Files with m/p.bw contain strand separated signal
-
Sample1_genome_name_BAMTagHistogram - Number of reads in coressponding to each cell barcode.
-
Sample1_genome_name_UMI.Matrix.txt/loom - UMI based digital expression matrix in txt and loom format
-
Sample1_genome_name_READS.Matrix.txt - Read count digital expression matrix
-
Sample1_genome_name_ReadCutoff.yaml - contains the UMI threshold for selecting high quality cells (obtained using dropbead). The corresponding .png file visualizes the UMI curve and the threshold.
-
star_gene_exon_tagged.bam - mapped and annotated reads. Each read is tagged by annotation based on it's mapping location.
-
Sample1_genome_name_ReadStatistics/Downstream.txt - quality control statistics used in the html report. They contain values such as number of reads in Exons/Introns.
The combined expression data are subsequently processed into a SingleCellExperiment object. SingleCellExperiment is a Bioconductor class for storing expression values, along with the cell, and gene data, and experimental meta data in a single container. It is constructed on top of hdf5 file based arrays (Pagès 2017), which enables exploration even on systems with limited memory capacity. During the object construction, the pipeline performs expression normalization, dimensionallity reduction, identification of significantly variable genes, assigns the cells to the steps of the cell cycle, and calculates the quality statistics. The SingleCellExperiment object contains all of the necessary data needed for further exploration. The object connects the pigx-pipeline with the Bioconductor single cell computing environment, and enables integration with state of the art statistical, and machine learning mehods (scran, zinbwave, netSmooth, iSEE.
The execution
section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster, or run locally, and the degree of parallelism. For a full list of possible parameters, see etc/settings.yaml
.
An example can be found in the tests
directory. The
sample_sheet.csv
file here specifies the following sample data:
The easiest way to install all of the dependencies is through the guix package management system. Firstly download and install guix to your computer. The guix.scm file in the root of the project directory contains the description recipe for installing all of the necessary tools. The following command will install all of the dependencies to the .guix-profile folder
guix environment -l guix.scm --root=`pwd`/.guix-profile
# sets up the directory
basepath='~/pigx=scrnaseq/development'
mkdir -p $basepath; cd $basepath
# downloads the repository
git clone https://github.com/BIMSBbioinfo/pigx_scrnaseq.git
cd pigx_scrnaseq; mkdir run
# uses guix to install all of the dependencies into a separate environment
guix environment -l guix.scm --root=`pwd`'./run/.guix-profile'
# sets the temporary directory - needed for storing large temporary files
export TMPDIR=~/Tmp
# installs the pipeline
./bootstrap.sh && ./configure --prefix=`pwd`/run && make install
# runs the pipeline on the test data
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
To prepare the environment for the development set the following variable:
export PIGX_UNINSTALLED=1
If this variable is not set pigx-scrnaseq will execute files in the ./run/bin
folder (pre-installed files),
and will not react to changes to scripts.
If you already have a pre-installed dependencies, then execute the following commands to setup your environment
# loads the guix environment.
guix package -p run/.guix-profile --search-path=prefix
export PIGX_UNINSTALLED=1
# runs the pipeline on the test data
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
guix package -p ./run/.guix-profile --search-path="prefix"
pigx-scrnaseq is the main driver script for the pipeline (user entry point).
It is constructed from pigx-runner.in
(from the pigx-common
repository) during the configuration step.
If you want to update the pigx-scrnaseq, change the copy of common/pigx-runner.in, and run the install the pipeline
step of the development installation, to update the changes.
Is the main SnakeMake script which constructs the execution graph and executes the pipeline.
Any changes to the snakefile.py
are observed directly upon execution.
Folder which contains all R and python scripts. These scripts are called used by the Snake_Dropseq.py
To make changes or add improvements to the pipeline, follow these steps:
-
create a new git branch
-
switch to the branch
-
make your updates
-
check whether the updates work by running the following code:
make install && ./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
-
run the tests with
make distcheck
-
check whether there were updates to master. If there were updates, run git pull -r. Again check whether the pipeline works
-
push the changes to the corresponding branch, and open a pull request.
2022