N characters in --soloAdapterSequence are not counted as mismatches, …

…allowing for multiple adapters (e.g. ddSeq). SJ.out.tab is sym-linked as features.tsv for Solo SJ output. Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3. Issue #936: Throw an error if an empty whitelist is provided to STARsolo.
alexdobin · Jun 16, 2020 · 502dd0c · 502dd0c
1 parent 3568cae
commit 502dd0c
Show file tree

Hide file tree

Showing 24 changed files with 3,519 additions and 3,420 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,13 +1,16 @@
 *.o
 Depend.list
 
+.project
+
 # Don't track intermediary files from building the manual
-doc/*.aux
-doc/*.fdb_latexmk
-doc/*.fls
-doc/*.log
-doc/*.out
-doc/*.toc
+extras/doc-latex/*.aux
+extras/doc-latex/*.fdb_latexmk
+extras/doc-latex/*.fls
+extras/doc-latex/*.log
+extras/doc-latex/*.out
+extras/doc-latex/*.gz
+extras/doc-latex/*.toc
 
 # Don't track the STAR binary once it has being built
 source/STAR

diff --git a/CHANGES.md b/CHANGES.md
@@ -2,9 +2,13 @@
 * Implemented --readFilesManifest option to input a list of input read files.
 * Change in STARsolo SJ output behavior: junctions are output even if reads do not match genes.
 * Fixed a bug with solo SJ output for large genomes.
+* N characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
+* SJ.out.tab is sym-linked as features.tsv for Solo SJ output.
+* Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3.
 * Issue #883: Patch for FreeBSD in SharedMemory and Makefile improvements.
 * Issue #902: Fixed seg-fault for STARsolo CB/UB SAM attributes output with --soloFeatures GeneFull --outSAMunmapped Within options.
 * Issue #934: Fixed a problem with annotated junctions that was casuing very rare seg-faults.
+* Issue #936: Throw an error if an empty whitelist is provided to STARsolo.
 
 STAR 2.7.4a 2020/06/01
 ======================

diff --git a/bin/Linux_x86_64/STAR b/bin/Linux_x86_64/STAR
diff --git a/bin/Linux_x86_64/STARlong b/bin/Linux_x86_64/STARlong
diff --git a/bin/Linux_x86_64_static/STAR b/bin/Linux_x86_64_static/STAR
diff --git a/bin/Linux_x86_64_static/STARlong b/bin/Linux_x86_64_static/STARlong
diff --git a/extras/doc-latex/STARmanual.tex b/extras/doc-latex/STARmanual.tex
@@ -34,7 +34,7 @@
 
 \newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}
 
-\title{STAR manual 2.7.4a}
+\title{STAR manual 2.7.5a}
 \author{Alexander Dobin\\
 dobin@cshl.edu}
 \maketitle
@@ -177,16 +177,42 @@ \subsection{Basic options.}
 For bzip2-compressed files, use
 \code{\opt{readFilesCommand} \optv{bunzip2 -c}}.
 
-Multiple samples can be mapped in one job. For single-end reads use a comma separated list (no spaces around commas), e.g.
-\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq}. For paired-end reads, use comma separated list for read1 /space/ comma separated list for read2, e.g.: \opt{readFilesIn} \optv{sample1read1.fq,sample2read1.fq,sample3read1.fq  sample1read2.fq,sample2read2.fq,sample3read2.fq}.
-
 \end{itemize}
 
 \subsection{Advanced options.}
 There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. 
 
+\subsubsection{Mapping multiple files in one run.}
+Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be used in \opt{outSAMattrRGline} command to keep track of reads from different files. For single-end reads use a comma separated list (no spaces around commas), e.g.:
+
+\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq} 
+
+For paired-end reads, use comma separated list 
+for read1, followed by space, followed by comma separated list for read2, e.g.:
+
+\opt{readFilesIn}~\optv{s1read1.fq,s2read1.fq,s3read1.fq s1read2.fq,s2read2.fq,s3read2.fq}
+
+For multiple read files, the corresponding read groups can be supplied with space/comma/space-separated list in \opt{outSAMattrRGline}, e.g.
+
+\opt{outSAMattrRGline} \optv{ID:sample1 , ID:sample2 , ID:sample3}
+
+Note that this list is separated by commas surrounded by spaces (unlike \opt{readFilesIn} list).
+
+Another option for mapping multiple reads files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}.
+The manifest file should contain 3 tab-separated columns, paired-end reads: 
+
+\ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line}
+
+For single-end reads, the 2nd column should contain the dash -: 
+
+\ofilen{read1-file-name $tab$ -               $tab$ read-group-line}
+
+Spaces, but not tabs are allowed in the file names.
+If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it.
+If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into SAM @RG header line.
+
 \subsubsection{Using annotations at the mapping stage.}
-Since 2.4.1a, the annotations can  be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or  without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory.
+Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or  without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory.
 
 \subsubsection{ENCODE options}
 An example of ENCODE standard options for long RNA-seq pipeline is given below:
@@ -285,7 +311,7 @@ \subsubsection{SAM attributes.}
 \item[]
 \optv{vG} : genomic coordiante of the variant overlapped by the read
 \item[]
-\optv{vW} : 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag
+\optv{vW} : WASP filtering tag, see detailed description in Section \ref{section:WASP}==. Requires --waspOutputMode SAMtag
 \item[]
 \optv{CR CY UR UY} : STARsolo: sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected
 \item[]
@@ -474,7 +500,7 @@ \section{Counting number of reads per gene.}
 
 \section{2-pass mapping.}
 
-For the most sensitive novel junction discovery,I would recommend running STAR in the 2-pass mode. It does not increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.
+For the most sensitive novel junction discovery, it is recommended to run STAR in the 2-pass mode. It does not significantly increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.
 
 \subsection{Multi-sample 2-pass mapping.}
 For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
@@ -518,7 +544,7 @@ \section{Detection of personal variants overlapping alignments.}
 SAM attribute vG outputs the genomic coordinate of the variant, allowing for identification of the variant.
 SAM attribute vA outputs which allele is detected in the read: $1$ or $2$ match one of the genotype alleles, $3$ - no match to genotype.
 
-\section{WASP filtering of allele specific alignments.}
+\section{WASP filtering of allele specific alignments.} \label{section:WASP}
 This is re-implementation of the original WASP algorithm by Bryce van de Geijn, Graham McVicker, Yoav Gilad and Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015) \url{https://www.nature.com/articles/nmeth.3582}.
 WASP filtering is activated with \opt{waspOutputMode} \optv{SAMtag}, which will add \optv{vW} tag to the SAM output:
 \optv{vW:i:1} means alignment passed WASP filtering, and all other values mean it did not pass:

diff --git a/extras/doc-latex/parametersDefault.tex b/extras/doc-latex/parametersDefault.tex
@@ -107,7 +107,7 @@
   \optLine{int{\textgreater}0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate{\textunderscore}length - 1)} 
 \optName{sjdbScore}
   \optValue{2}
-  \optLine{int: extra alignment score for alignmets that cross database junctions} 
+  \optLine{int: extra alignment score for alignments that cross database junctions} 
 \optName{sjdbInsertSave}
   \optValue{Basic}
   \optLine{string: which files to save when sjdb junctions are inserted on the fly at the mapping step} 
@@ -143,12 +143,12 @@
   \optLine{string(s): paths to files that contain input read1 (and, if needed,  read2)} 
 \optName{readFilesManifest}
   \optValue{-}
-  \optLine{string: file with the names of read files, formatted as:} 
-  \optLine{Paired-End reads: read1{\textunderscore}file{\textunderscore}name \tab read2{\textunderscore}file{\textunderscore}name \tab read{\textunderscore}group{\textunderscore}line.} 
-  \optLine{Single-End reads: read1{\textunderscore}file{\textunderscore}name \tab - \tab read{\textunderscore}group{\textunderscore}line.} 
+  \optLine{string: path to the "manifest" file with the names of read files. The manifest file should contain 3 tab-separated columns:} 
+  \optLine{paired-end reads: read1{\textunderscore}file{\textunderscore}name $tab$ read2{\textunderscore}file{\textunderscore}name $tab$ read{\textunderscore}group{\textunderscore}line.} 
+  \optLine{single-end reads: read1{\textunderscore}file{\textunderscore}name $tab$ -               $tab$ read{\textunderscore}group{\textunderscore}line.} 
   \optLine{Spaces, but not tabs are allowed in file names.} 
   \optLine{If read{\textunderscore}group{\textunderscore}line does not start with ID:, it can only contain one ID field, and ID: will be added to it.} 
-  \optLine{If read{\textunderscore}group{\textunderscore}line starts with ID:, it can contain several fields separated by \tab, and all fields will be be copied verbatim into SAM @RG header line.} 
+  \optLine{If read{\textunderscore}group{\textunderscore}line starts with ID:, it can contain several fields separated by $tab$, and all fields will be be copied verbatim into SAM @RG header line.} 
   \optLine{-} 
 \optName{readFilesPrefix}
   \optValue{-}
@@ -182,7 +182,7 @@
   \optLine{string(s): adapter sequences to clip from 3p of each mate.  If one value is given, it will be assumed the same for both mates.} 
 \optName{clip3pAdapterMMp}
   \optValue{0.1}
-  \optLine{double(s): max proportion of mismatches for 3p adpater clipping for each mate.  If one value is given, it will be assumed the same for both mates.} 
+  \optLine{double(s): max proportion of mismatches for 3p adapter clipping for each mate.  If one value is given, it will be assumed the same for both mates.} 
 \optName{clip3pAfterAdapterNbases}
   \optValue{0}
   \optLine{int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.} 
@@ -301,7 +301,7 @@
   \optLine{variation:} 
 \begin{optOptTable}
   \optOpt{vA}   \optOptLine{variant allele}
-  \optOpt{ha}   \optOptLine{haplotype (1/2) when mapping to the diplod genome. Requires genome generated with --genomeTransformType Diploid}
+  \optOpt{ha}   \optOptLine{haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid}
   \optOpt{vG}   \optOptLine{genomic coordinate of the variant overlapped by the read}
   \optOpt{vW}   \optOptLine{1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.}
 \end{optOptTable}
@@ -572,7 +572,7 @@
   \optLine{insertion extension penalty per base (in addition to scoreInsOpen)} 
 \optName{scoreStitchSJshift}
   \optValue{1}
-  \optLine{maximum score reduction while searching for SJ boundaries inthe stitching step} 
+  \optLine{maximum score reduction while searching for SJ boundaries in the stitching step} 
 \end{optTable}
 \optSection{Alignments and Seeding}\label{Alignments_and_Seeding}
 \begin{optTable}
@@ -584,7 +584,7 @@
   \optLine{real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads)} 
 \optName{seedSearchLmax}
   \optValue{0}
-  \optLine{int{\textgreater}=0: defines the maximum length of the seeds, if =0 max seed lengthis infinite} 
+  \optLine{int{\textgreater}=0: defines the maximum length of the seeds, if =0 seed length is not limited} 
 \optName{seedMultimapNmax}
   \optValue{10000}
   \optLine{int{\textgreater}0: only pieces that map fewer than this value are utilized in the stitching procedure} 
@@ -709,7 +709,7 @@
   \optOpt{Junctions}   \optOptLine{Chimeric.out.junction}
   \optOpt{SeparateSAMold}   \optOptLine{output old SAM into separate Chimeric.out.sam file}
   \optOpt{WithinBAM}   \optOptLine{output into main aligned BAM files (Aligned.*.bam)}
-  \optOpt{WithinBAM HardClip}   \optOptLine{(default) hard-clipping in the CIGAR for supplemental chimeric alignments (defaultif no 2nd word is present)}
+  \optOpt{WithinBAM HardClip}   \optOptLine{(default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)}
   \optOpt{WithinBAM SoftClip}   \optOptLine{soft-clipping in the CIGAR for supplemental chimeric alignments}
 \end{optOptTable}
 \optName{chimSegmentMin}
@@ -760,7 +760,7 @@
   \optLine{int: formatting type for the Chimeric.out.junction file} 
 \begin{optOptTable}
   \optOpt{0}   \optOptLine{no comment lines/headers}
-  \optOpt{1}   \optOptLine{comment lines at the end of the file: command line and Nreads: total, unique, multi}
+  \optOpt{1}   \optOptLine{comment lines at the end of the file: command line and Nreads: total, unique/multi-mapping}
 \end{optOptTable}
 \end{optTable}
 \optSection{Quantification of Annotations}\label{Quantification_of_Annotations}
@@ -807,7 +807,7 @@
 \begin{optTable}
 \optName{waspOutputMode}
   \optValue{None}
-  \optLine{string: WASP allele-specific output type. This is re-implemenation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad {\&} Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .} 
+  \optLine{string: WASP allele-specific output type. This is re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad {\&} Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .} 
 \begin{optOptTable}
   \optOpt{SAMtag}   \optOptLine{add WASP tags to the alignments that pass WASP filtering}
 \end{optOptTable}
@@ -821,11 +821,11 @@
   \optOpt{CB{\textunderscore}UMI{\textunderscore}Simple}   \optOptLine{(a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.}
   \optOpt{CB{\textunderscore}UMI{\textunderscore}Complex}   \optOptLine{one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.}
   \optOpt{CB{\textunderscore}samTagOut}   \optOptLine{output Cell Barcode as CR and/or CB SAm tag. No UMI counting. --readFilesIn cDNA{\textunderscore}read1 [cDNA{\textunderscore}read2 if paired-end] CellBarcode{\textunderscore}read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]}
-  \optOpt{SmartSeq}   \optOptLine{Smart-seq: each cell in a separtate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alginments deduplicated according to alignment start and end (after extending soft-clipped bases)}
+  \optOpt{SmartSeq}   \optOptLine{Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)}
 \end{optOptTable}
 \optName{soloCBwhitelist}
   \optValue{-}
-  \optLine{string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB{\textunderscore}UMI{\textunderscore}Complexone allows more than one whitelist file.} 
+  \optLine{string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB{\textunderscore}UMI{\textunderscore}Complex allows more than one whitelist file.} 
 \begin{optOptTable}
   \optOpt{None}   \optOptLine{no whitelist: all cell barcodes are allowed}
 \end{optOptTable}
@@ -922,8 +922,8 @@
   \optValue{CellRanger2.2 3000 0.99 10}
   \optLine{string(s):              cell filtering type and parameters} 
 \begin{optOptTable}
-  \optOpt{CellRanger2.2}   \optOptLine{simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count}
-  \optOpt{TopCells}   \optOptLine{only report top cells by UMI count, followed by the excat number of cells}
+  \optOpt{CellRanger2.2}   \optOptLine{simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count}
+  \optOpt{TopCells}   \optOptLine{only report top cells by UMI count, followed by the exact number of cells}
   \optOpt{None}   \optOptLine{do not output filtered cells}
 \end{optOptTable}
 \end{optTable}
diff --git a/source/Genome_genomeOutLoad.cpp b/source/Genome_genomeOutLoad.cpp
@@ -28,8 +28,8 @@ void Genome::genomeOutLoad(){//allocate and load *output* Genome
 
     G=new char[nGenome];
     ifstream GenomeIn;
-    uint64 genomeFileSize = OpenStream("Genome", GenomeIn, nGenome);
-    uint64 genomeReadBytesN=fstreamReadBig(GenomeIn,G,nGenome);
+    //uint64 genomeFileSize = OpenStream("Genome", GenomeIn, nGenome);
+    //uint64 genomeReadBytesN=fstreamReadBig(GenomeIn,G,nGenome);
     GenomeIn.close();
 
     //record required genome parameters in P

diff --git a/source/Genome_transformGenome.cpp b/source/Genome_transformGenome.cpp
@@ -82,7 +82,7 @@ void Genome::transformGenome(GTF *gtf)
         vcfStream.close();
     };
 
-    uint64 nGenome1, nG1allocNew;
+    uint64 nGenome1=0, nG1allocNew;
     char *Gnew=NULL, *G1new=NULL;
 
     if (pGe.transform.type==1) {//haploid: insert alternative alleles into genome sequence, create conversion-block file

diff --git a/source/Parameters.cpp b/source/Parameters.cpp
@@ -279,6 +279,7 @@ Parameters::Parameters() {//initalize parameters info
     parArray.push_back(new ParameterInfoVector <string>   (-1, -1, "soloCellFilter",&pSolo.cellFilter.type));
     parArray.push_back(new ParameterInfoVector <string>   (-1, -1, "soloUMIfiltering",&pSolo.umiFiltering.type));
     parArray.push_back(new ParameterInfoScalar <string>   (-1, -1, "soloClusterCBfile",&pSolo.clusterCBfile));
+    parArray.push_back(new ParameterInfoScalar <string>   (-1, -1, "soloOutFormatFeaturesGeneField3",&pSolo.outFormat.featuresGeneField3));
 
     parameterInputName.push_back("Default");
     parameterInputName.push_back("Command-Line-Initial");

diff --git a/source/ParametersSolo.cpp b/source/ParametersSolo.cpp
@@ -231,6 +231,11 @@ void ParametersSolo::initialize(Parameters *pPin)
                     pP->inOut->logMain << "WARNING: CB whitelist sequence contains non-ACGT base and is ignored: " << seq1 <<endl;
                 };
             };
+            if (cbWL.size()==0) {//empty whitelist
+            	exitWithError("EXITING because of FATAL ERROR: CB whitelist file " + soloCBwhitelist[0] + \
+            			       " is empty. \nSOLUTION: provide non-empty whitelist.\n" , \
+							   std::cerr, pP->inOut->logMain, EXIT_CODE_INPUT_FILES, *pP);
+            };
         };
 
         std::sort(cbWL.begin(),cbWL.end());//sort

diff --git a/source/ParametersSolo.h b/source/ParametersSolo.h
@@ -91,9 +91,15 @@ class ParametersSolo {
 
     //output
     vector<string> outFileNames;    
+    struct {
+    	string featuresGeneField3;
+    } outFormat;
+
     bool samAttrYes;//post-processed SAM attributes: error-corrected CB and UMI
     int32 samAttrFeature;//which feature to use for error correction
 
+
+
     //processing
     uint32 redistrReadsNfiles; //numer of files to resditributes reads into