Skip to content

Commit

Permalink
N characters in --soloAdapterSequence are not counted as mismatches, …
Browse files Browse the repository at this point in the history
…allowing for multiple adapters (e.g. ddSeq). SJ.out.tab is sym-linked as features.tsv for Solo SJ output. Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3. Issue #936: Throw an error if an empty whitelist is provided to STARsolo.
  • Loading branch information
alexdobin committed Jun 16, 2020
1 parent 3568cae commit 502dd0c
Show file tree
Hide file tree
Showing 24 changed files with 3,519 additions and 3,420 deletions.
15 changes: 9 additions & 6 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,13 +1,16 @@
*.o
Depend.list

.project

# Don't track intermediary files from building the manual
doc/*.aux
doc/*.fdb_latexmk
doc/*.fls
doc/*.log
doc/*.out
doc/*.toc
extras/doc-latex/*.aux
extras/doc-latex/*.fdb_latexmk
extras/doc-latex/*.fls
extras/doc-latex/*.log
extras/doc-latex/*.out
extras/doc-latex/*.gz
extras/doc-latex/*.toc

# Don't track the STAR binary once it has being built
source/STAR
Expand Down
4 changes: 4 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,13 @@
* Implemented --readFilesManifest option to input a list of input read files.
* Change in STARsolo SJ output behavior: junctions are output even if reads do not match genes.
* Fixed a bug with solo SJ output for large genomes.
* N characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
* SJ.out.tab is sym-linked as features.tsv for Solo SJ output.
* Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3.
* Issue #883: Patch for FreeBSD in SharedMemory and Makefile improvements.
* Issue #902: Fixed seg-fault for STARsolo CB/UB SAM attributes output with --soloFeatures GeneFull --outSAMunmapped Within options.
* Issue #934: Fixed a problem with annotated junctions that was casuing very rare seg-faults.
* Issue #936: Throw an error if an empty whitelist is provided to STARsolo.

STAR 2.7.4a 2020/06/01
======================
Expand Down
Binary file modified bin/Linux_x86_64/STAR
Binary file not shown.
Binary file modified bin/Linux_x86_64/STARlong
Binary file not shown.
Binary file modified bin/Linux_x86_64_static/STAR
Binary file not shown.
Binary file modified bin/Linux_x86_64_static/STARlong
Binary file not shown.
42 changes: 34 additions & 8 deletions extras/doc-latex/STARmanual.tex
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@

\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}

\title{STAR manual 2.7.4a}
\title{STAR manual 2.7.5a}
\author{Alexander Dobin\\
dobin@cshl.edu}
\maketitle
Expand Down Expand Up @@ -177,16 +177,42 @@ \subsection{Basic options.}
For bzip2-compressed files, use
\code{\opt{readFilesCommand} \optv{bunzip2 -c}}.

Multiple samples can be mapped in one job. For single-end reads use a comma separated list (no spaces around commas), e.g.
\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq}. For paired-end reads, use comma separated list for read1 /space/ comma separated list for read2, e.g.: \opt{readFilesIn} \optv{sample1read1.fq,sample2read1.fq,sample3read1.fq sample1read2.fq,sample2read2.fq,sample3read2.fq}.

\end{itemize}

\subsection{Advanced options.}
There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}.

\subsubsection{Mapping multiple files in one run.}
Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be used in \opt{outSAMattrRGline} command to keep track of reads from different files. For single-end reads use a comma separated list (no spaces around commas), e.g.:

\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq}

For paired-end reads, use comma separated list
for read1, followed by space, followed by comma separated list for read2, e.g.:

\opt{readFilesIn}~\optv{s1read1.fq,s2read1.fq,s3read1.fq s1read2.fq,s2read2.fq,s3read2.fq}

For multiple read files, the corresponding read groups can be supplied with space/comma/space-separated list in \opt{outSAMattrRGline}, e.g.

\opt{outSAMattrRGline} \optv{ID:sample1 , ID:sample2 , ID:sample3}

Note that this list is separated by commas surrounded by spaces (unlike \opt{readFilesIn} list).

Another option for mapping multiple reads files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}.
The manifest file should contain 3 tab-separated columns, paired-end reads:

\ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line}

For single-end reads, the 2nd column should contain the dash -:

\ofilen{read1-file-name $tab$ - $tab$ read-group-line}

Spaces, but not tabs are allowed in the file names.
If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it.
If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into SAM @RG header line.

\subsubsection{Using annotations at the mapping stage.}
Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory.
Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory.

\subsubsection{ENCODE options}
An example of ENCODE standard options for long RNA-seq pipeline is given below:
Expand Down Expand Up @@ -285,7 +311,7 @@ \subsubsection{SAM attributes.}
\item[]
\optv{vG} : genomic coordiante of the variant overlapped by the read
\item[]
\optv{vW} : 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag
\optv{vW} : WASP filtering tag, see detailed description in Section \ref{section:WASP}==. Requires --waspOutputMode SAMtag
\item[]
\optv{CR CY UR UY} : STARsolo: sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected
\item[]
Expand Down Expand Up @@ -474,7 +500,7 @@ \section{Counting number of reads per gene.}

\section{2-pass mapping.}

For the most sensitive novel junction discovery,I would recommend running STAR in the 2-pass mode. It does not increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.
For the most sensitive novel junction discovery, it is recommended to run STAR in the 2-pass mode. It does not significantly increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.

\subsection{Multi-sample 2-pass mapping.}
For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
Expand Down Expand Up @@ -518,7 +544,7 @@ \section{Detection of personal variants overlapping alignments.}
SAM attribute vG outputs the genomic coordinate of the variant, allowing for identification of the variant.
SAM attribute vA outputs which allele is detected in the read: $1$ or $2$ match one of the genotype alleles, $3$ - no match to genotype.

\section{WASP filtering of allele specific alignments.}
\section{WASP filtering of allele specific alignments.} \label{section:WASP}
This is re-implementation of the original WASP algorithm by Bryce van de Geijn, Graham McVicker, Yoav Gilad and Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015) \url{https://www.nature.com/articles/nmeth.3582}.
WASP filtering is activated with \opt{waspOutputMode} \optv{SAMtag}, which will add \optv{vW} tag to the SAM output:
\optv{vW:i:1} means alignment passed WASP filtering, and all other values mean it did not pass:
Expand Down
32 changes: 16 additions & 16 deletions extras/doc-latex/parametersDefault.tex
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@
\optLine{int{\textgreater}0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate{\textunderscore}length - 1)}
\optName{sjdbScore}
\optValue{2}
\optLine{int: extra alignment score for alignmets that cross database junctions}
\optLine{int: extra alignment score for alignments that cross database junctions}
\optName{sjdbInsertSave}
\optValue{Basic}
\optLine{string: which files to save when sjdb junctions are inserted on the fly at the mapping step}
Expand Down Expand Up @@ -143,12 +143,12 @@
\optLine{string(s): paths to files that contain input read1 (and, if needed, read2)}
\optName{readFilesManifest}
\optValue{-}
\optLine{string: file with the names of read files, formatted as:}
\optLine{Paired-End reads: read1{\textunderscore}file{\textunderscore}name \tab read2{\textunderscore}file{\textunderscore}name \tab read{\textunderscore}group{\textunderscore}line.}
\optLine{Single-End reads: read1{\textunderscore}file{\textunderscore}name \tab - \tab read{\textunderscore}group{\textunderscore}line.}
\optLine{string: path to the "manifest" file with the names of read files. The manifest file should contain 3 tab-separated columns:}
\optLine{paired-end reads: read1{\textunderscore}file{\textunderscore}name $tab$ read2{\textunderscore}file{\textunderscore}name $tab$ read{\textunderscore}group{\textunderscore}line.}
\optLine{single-end reads: read1{\textunderscore}file{\textunderscore}name $tab$ - $tab$ read{\textunderscore}group{\textunderscore}line.}
\optLine{Spaces, but not tabs are allowed in file names.}
\optLine{If read{\textunderscore}group{\textunderscore}line does not start with ID:, it can only contain one ID field, and ID: will be added to it.}
\optLine{If read{\textunderscore}group{\textunderscore}line starts with ID:, it can contain several fields separated by \tab, and all fields will be be copied verbatim into SAM @RG header line.}
\optLine{If read{\textunderscore}group{\textunderscore}line starts with ID:, it can contain several fields separated by $tab$, and all fields will be be copied verbatim into SAM @RG header line.}
\optLine{-}
\optName{readFilesPrefix}
\optValue{-}
Expand Down Expand Up @@ -182,7 +182,7 @@
\optLine{string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.}
\optName{clip3pAdapterMMp}
\optValue{0.1}
\optLine{double(s): max proportion of mismatches for 3p adpater clipping for each mate. If one value is given, it will be assumed the same for both mates.}
\optLine{double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates.}
\optName{clip3pAfterAdapterNbases}
\optValue{0}
\optLine{int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.}
Expand Down Expand Up @@ -301,7 +301,7 @@
\optLine{variation:}
\begin{optOptTable}
\optOpt{vA} \optOptLine{variant allele}
\optOpt{ha} \optOptLine{haplotype (1/2) when mapping to the diplod genome. Requires genome generated with --genomeTransformType Diploid}
\optOpt{ha} \optOptLine{haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid}
\optOpt{vG} \optOptLine{genomic coordinate of the variant overlapped by the read}
\optOpt{vW} \optOptLine{1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.}
\end{optOptTable}
Expand Down Expand Up @@ -572,7 +572,7 @@
\optLine{insertion extension penalty per base (in addition to scoreInsOpen)}
\optName{scoreStitchSJshift}
\optValue{1}
\optLine{maximum score reduction while searching for SJ boundaries inthe stitching step}
\optLine{maximum score reduction while searching for SJ boundaries in the stitching step}
\end{optTable}
\optSection{Alignments and Seeding}\label{Alignments_and_Seeding}
\begin{optTable}
Expand All @@ -584,7 +584,7 @@
\optLine{real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads)}
\optName{seedSearchLmax}
\optValue{0}
\optLine{int{\textgreater}=0: defines the maximum length of the seeds, if =0 max seed lengthis infinite}
\optLine{int{\textgreater}=0: defines the maximum length of the seeds, if =0 seed length is not limited}
\optName{seedMultimapNmax}
\optValue{10000}
\optLine{int{\textgreater}0: only pieces that map fewer than this value are utilized in the stitching procedure}
Expand Down Expand Up @@ -709,7 +709,7 @@
\optOpt{Junctions} \optOptLine{Chimeric.out.junction}
\optOpt{SeparateSAMold} \optOptLine{output old SAM into separate Chimeric.out.sam file}
\optOpt{WithinBAM} \optOptLine{output into main aligned BAM files (Aligned.*.bam)}
\optOpt{WithinBAM HardClip} \optOptLine{(default) hard-clipping in the CIGAR for supplemental chimeric alignments (defaultif no 2nd word is present)}
\optOpt{WithinBAM HardClip} \optOptLine{(default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)}
\optOpt{WithinBAM SoftClip} \optOptLine{soft-clipping in the CIGAR for supplemental chimeric alignments}
\end{optOptTable}
\optName{chimSegmentMin}
Expand Down Expand Up @@ -760,7 +760,7 @@
\optLine{int: formatting type for the Chimeric.out.junction file}
\begin{optOptTable}
\optOpt{0} \optOptLine{no comment lines/headers}
\optOpt{1} \optOptLine{comment lines at the end of the file: command line and Nreads: total, unique, multi}
\optOpt{1} \optOptLine{comment lines at the end of the file: command line and Nreads: total, unique/multi-mapping}
\end{optOptTable}
\end{optTable}
\optSection{Quantification of Annotations}\label{Quantification_of_Annotations}
Expand Down Expand Up @@ -807,7 +807,7 @@
\begin{optTable}
\optName{waspOutputMode}
\optValue{None}
\optLine{string: WASP allele-specific output type. This is re-implemenation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad {\&} Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .}
\optLine{string: WASP allele-specific output type. This is re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad {\&} Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .}
\begin{optOptTable}
\optOpt{SAMtag} \optOptLine{add WASP tags to the alignments that pass WASP filtering}
\end{optOptTable}
Expand All @@ -821,11 +821,11 @@
\optOpt{CB{\textunderscore}UMI{\textunderscore}Simple} \optOptLine{(a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.}
\optOpt{CB{\textunderscore}UMI{\textunderscore}Complex} \optOptLine{one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.}
\optOpt{CB{\textunderscore}samTagOut} \optOptLine{output Cell Barcode as CR and/or CB SAm tag. No UMI counting. --readFilesIn cDNA{\textunderscore}read1 [cDNA{\textunderscore}read2 if paired-end] CellBarcode{\textunderscore}read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]}
\optOpt{SmartSeq} \optOptLine{Smart-seq: each cell in a separtate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alginments deduplicated according to alignment start and end (after extending soft-clipped bases)}
\optOpt{SmartSeq} \optOptLine{Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)}
\end{optOptTable}
\optName{soloCBwhitelist}
\optValue{-}
\optLine{string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB{\textunderscore}UMI{\textunderscore}Complexone allows more than one whitelist file.}
\optLine{string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB{\textunderscore}UMI{\textunderscore}Complex allows more than one whitelist file.}
\begin{optOptTable}
\optOpt{None} \optOptLine{no whitelist: all cell barcodes are allowed}
\end{optOptTable}
Expand Down Expand Up @@ -922,8 +922,8 @@
\optValue{CellRanger2.2 3000 0.99 10}
\optLine{string(s): cell filtering type and parameters}
\begin{optOptTable}
\optOpt{CellRanger2.2} \optOptLine{simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count}
\optOpt{TopCells} \optOptLine{only report top cells by UMI count, followed by the excat number of cells}
\optOpt{CellRanger2.2} \optOptLine{simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count}
\optOpt{TopCells} \optOptLine{only report top cells by UMI count, followed by the exact number of cells}
\optOpt{None} \optOptLine{do not output filtered cells}
\end{optOptTable}
\end{optTable}
4 changes: 2 additions & 2 deletions source/Genome_genomeOutLoad.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ void Genome::genomeOutLoad(){//allocate and load *output* Genome

G=new char[nGenome];
ifstream GenomeIn;
uint64 genomeFileSize = OpenStream("Genome", GenomeIn, nGenome);
uint64 genomeReadBytesN=fstreamReadBig(GenomeIn,G,nGenome);
//uint64 genomeFileSize = OpenStream("Genome", GenomeIn, nGenome);
//uint64 genomeReadBytesN=fstreamReadBig(GenomeIn,G,nGenome);
GenomeIn.close();

//record required genome parameters in P
Expand Down
2 changes: 1 addition & 1 deletion source/Genome_transformGenome.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ void Genome::transformGenome(GTF *gtf)
vcfStream.close();
};

uint64 nGenome1, nG1allocNew;
uint64 nGenome1=0, nG1allocNew;
char *Gnew=NULL, *G1new=NULL;

if (pGe.transform.type==1) {//haploid: insert alternative alleles into genome sequence, create conversion-block file
Expand Down
1 change: 1 addition & 0 deletions source/Parameters.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -279,6 +279,7 @@ Parameters::Parameters() {//initalize parameters info
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "soloCellFilter",&pSolo.cellFilter.type));
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "soloUMIfiltering",&pSolo.umiFiltering.type));
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "soloClusterCBfile",&pSolo.clusterCBfile));
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "soloOutFormatFeaturesGeneField3",&pSolo.outFormat.featuresGeneField3));

parameterInputName.push_back("Default");
parameterInputName.push_back("Command-Line-Initial");
Expand Down
5 changes: 5 additions & 0 deletions source/ParametersSolo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,11 @@ void ParametersSolo::initialize(Parameters *pPin)
pP->inOut->logMain << "WARNING: CB whitelist sequence contains non-ACGT base and is ignored: " << seq1 <<endl;
};
};
if (cbWL.size()==0) {//empty whitelist
exitWithError("EXITING because of FATAL ERROR: CB whitelist file " + soloCBwhitelist[0] + \
" is empty. \nSOLUTION: provide non-empty whitelist.\n" , \
std::cerr, pP->inOut->logMain, EXIT_CODE_INPUT_FILES, *pP);
};
};

std::sort(cbWL.begin(),cbWL.end());//sort
Expand Down
6 changes: 6 additions & 0 deletions source/ParametersSolo.h
Original file line number Diff line number Diff line change
Expand Up @@ -91,9 +91,15 @@ class ParametersSolo {

//output
vector<string> outFileNames;
struct {
string featuresGeneField3;
} outFormat;

bool samAttrYes;//post-processed SAM attributes: error-corrected CB and UMI
int32 samAttrFeature;//which feature to use for error correction



//processing
uint32 redistrReadsNfiles; //numer of files to resditributes reads into

Expand Down
Loading

0 comments on commit 502dd0c

Please sign in to comment.