ddSeeker extracts cellular and molecular identifiers from single cell RNA sequencing experiments.
Input: R1 and R2 FASTQ files from a paired-end single cell sequencing experiment.
Output: one unmapped BAM (uBAM) file containing reads tagged with cell barcodes and Unique Molecular Identifiers (UMIs). Default tags are XC and XM for the cell barcode and UMI, XE for errors related to barcode identification, and XQ and Xq for the base quality of the cell barcode and UMI respectively (see footnote 1). Users can manually set different tags (see Additional options).
- LX = both linkers not aligned correctly
- L1 = linker 1 not aligned correctly
- L2 = linker 2 not aligned correctly
- I = indel in BC2
- D = deletion in Phase Block or BC1
- J = indel in BC3 or ACG trinucleotide
- K = indel in UMI or GAC trinucleotide
- B = one BC with more than 1 mismatch
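As a minimal sketch of how these codes might be used downstream, the snippet below keeps the descriptions in a lookup table and tallies a list of error codes. The dictionary and the `summarize_errors` helper are illustrative names and not part of ddSeeker; the example codes are made up.

```python
from collections import Counter

# Descriptions copied from the list above; the dictionary itself is illustrative.
ERROR_TAGS = {
    "LX": "both linkers not aligned correctly",
    "L1": "linker 1 not aligned correctly",
    "L2": "linker 2 not aligned correctly",
    "I": "indel in BC2",
    "D": "deletion in Phase Block or BC1",
    "J": "indel in BC3 or ACG trinucleotide",
    "K": "indel in UMI or GAC trinucleotide",
    "B": "one BC with more than 1 mismatch",
}


def summarize_errors(codes):
    """Count error codes and pair each count with its description."""
    counts = Counter(codes)
    return [
        (code, n, ERROR_TAGS.get(code, "unknown code"))
        for code, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Hypothetical XE values collected from a tagged file.
    example = ["LX", "B", "LX", "I"]
    for code, n, text in summarize_errors(example):
        print(f"{code}\t{n}\t{text}")
```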
- Increase the number of CPU cores (faster analysis) with -c/--cores.
- Manually set tags with --tag-bc, --tag-umi, --tag-bc-q, --tag-umi-q, --tag-error.
- Print an uncompressed SAM file to standard output (allowing direct feeding to other tools for filtering, sorting etc.) with -o/--output - (note the - sign).
- Generate two csv files reporting the number of reads per cell and the distribution of error tags by specifying the path with -s/--summary-prefix.
- Create plots from the csv summary files using make_graphs.R (see the make_graphs.R example below).
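Because -o/--output - writes plain SAM text, the stream can be inspected with a few lines of standard-library Python. The sketch below counts reads per cell barcode from SAM records. It assumes the default XC and XE tags, and it assumes that a record carrying an error tag should be skipped; neither the helper names nor that filtering rule come from ddSeeker itself.

```python
from collections import Counter


def parse_tags(sam_line):
    """Return the optional fields of one SAM record as {tag: value}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory; optional fields are TAG:TYPE:VALUE.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return tags


def reads_per_cell(lines, bc_tag="XC", error_tag="XE"):
    """Count reads per cell barcode, skipping records that carry an error tag."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):  # blank or header line
            continue
        tags = parse_tags(line)
        # Assumption: a read with an error tag has no usable barcode.
        if error_tag in tags or bc_tag not in tags:
            continue
        counts[tags[bc_tag]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records; the barcode and UMI values are made up.
    prefix = "read\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\t"
    demo = [
        "@HD\tVN:1.6\n",
        prefix + "XC:Z:AAACCCGGGTTTAAACCC\tXM:Z:ACGTACGT\n",
        prefix + "XE:Z:LX\n",
    ]
    for barcode, n in reads_per_cell(demo).most_common():
        print(f"{barcode}\t{n}")
```

To use it on real output, pass `sys.stdin` instead of the demo list and pipe ddSeeker.py -o - into the script.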
Clone the repository and add the folder to your PATH variable:
git clone https://github.com/cgplab/ddSeeker.git
export PATH=<path_to_ddSeeker>:$PATH
We suggest installing the Python packages with pip, which should already be installed if you are using Python 3 >= 3.4:
pip install biopython
pip install pysam
- ddSeeker with 20 cores:
ddSeeker.py --input sampleA_R1.fastq.gz sampleA_R2.fastq.gz --output sampleA_tagged.bam --cores 20
- Print to stdout and pipe to samtools for queryname sorting:
ddSeeker.py -i sampleA_R* -c 20 -o - | samtools sort -no sampleA_tagged_qsorted.bam
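When ddSeeker is called from a Python workflow rather than typed at the shell, the same command can be assembled as an argument list. This is only a sketch: `ddseeker_command` is a hypothetical helper, and it uses only the options named in this README (--input, --output, --cores, --summary-prefix).

```python
import shlex


def ddseeker_command(r1, r2, output, cores=1, summary_prefix=None):
    """Assemble a ddSeeker.py argument list from options shown in this README."""
    cmd = [
        "ddSeeker.py",
        "--input", r1, r2,
        "--output", output,
        "--cores", str(cores),
    ]
    if summary_prefix is not None:
        cmd += ["--summary-prefix", summary_prefix]
    return cmd


if __name__ == "__main__":
    cmd = ddseeker_command(
        "sampleA_R1.fastq.gz",
        "sampleA_R2.fastq.gz",
        "sampleA_tagged.bam",
        cores=20,
    )
    # Print the shell-quoted command; to execute it, one could call
    # subprocess.run(cmd, check=True) with ddSeeker.py on the PATH.
    print(shlex.join(cmd))
```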
An example dataset is available at https://github.com/cgplab/ddSeeker_example_dataset
Requires R >= 3.4 and the tidyverse package. Three plots are generated: a dot plot of the error distribution, the absolute count of reads per cell, and the cumulative distribution of reads per cell. By default the latter two include the whole set of barcodes in the csv file. To limit them to a smaller number of barcodes, pass that number as the second command-line argument (2000 in the example below).
mkdir summary_folder
ddSeeker.py -i sampleA_R* -c 20 -o sampleA_tagged.bam -s summary_folder/sampleA
make_graphs.R summary_folder/sampleA 2000
Several pipelines have been developed to perform single cell analysis. Below we describe the main steps required to integrate our tool with Drop-seq tools, scPipe and dropEst.
Since Drop-seq tools was our choice for our analyses, we provide a ready-to-use bash script. Simply run
ddSeeker_dropSeq_tools.sh [options] sampleA_R1.fastq.gz sampleA_R2.fastq.gz
to produce aligned tagged reads in BAM format.
A table of counts can be obtained using the DigitalExpression tool included in Drop-seq tools.
scPipe requires one FASTQ file with cell barcodes and UMIs stored in the header of each read record. To change the output of ddSeeker accordingly, use the option --pipeline scpipe.
ddSeeker.py -i <read1.fastq.gz> <read2.fastq.gz> -o <tagged_reads.fastq.gz> -c 20 --pipeline scpipe
In addition, set bc_len=18 and UMI_len=8 when calling the sc_exon_mapping() function.
dropEst can work with tagged BAM files. Simply make the BamTags match the ddSeeker tags by specifying them in the config.xml file:
<BamTags>
<cb>XC</cb>
<umi>XM</umi>
</BamTags>
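If ddSeeker was run with non-default tags, the fragment has to change accordingly. The sketch below generates it with Python's standard library. `bam_tags_fragment` is a hypothetical helper, only the cb and umi elements shown above are produced, and the CB/UB tag names are invented for the example.

```python
import xml.etree.ElementTree as ET


def bam_tags_fragment(bc_tag="XC", umi_tag="XM"):
    """Build the <BamTags> element for the given cell-barcode and UMI tags."""
    root = ET.Element("BamTags")
    ET.SubElement(root, "cb").text = bc_tag
    ET.SubElement(root, "umi").text = umi_tag
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical custom tags, e.g. chosen through --tag-bc and --tag-umi.
    print(bam_tags_fragment("CB", "UB"))
```

The returned string is a single-line fragment meant to be pasted into the existing config.xml, not a complete dropEst configuration.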
Romagnoli D*, Boccalini G*, Bonechi M, Biagioni C, Fassan P, Bertorelli R, De Sanctis V, Di Leo A, Migliaccio I, Malorni L, Benelli M. ddSeeker: a tool for processing Bio-Rad ddSEQ single cell RNA-seq data. BMC Genomics. 2018; 19:960
Link to the paper: ddSeeker: a tool for processing Bio-Rad ddSEQ single cell RNA-seq data
Footnotes
1. Known issue. The cell barcode is composed of three shorter blocks, and if any of the blocks is affected by an indel event it may be difficult to determine which base was inserted or deleted (although each block is corrected by comparing it to a list of available blocks). Therefore, the quality string of the cell barcode may differ in length from the cell barcode itself by one base; it still gives a general view of the quality of the barcode.