Update to syntax & using STAR v2.7.10a

cellgeni · Aug 9, 2022 · 31ceff5
1 parent 8f5bda9
commit 31ceff5
Showing 8 changed files with 184 additions and 94 deletions.
diff --git a/README.md b/README.md
@@ -1,14 +1,14 @@
-# Synchronized processing of bulk and scRNA-seq
+# Wrapper scripts for using `STARsolo` with various types of single cell RNA-seq
 
-These are the scripts used for CellGenIT for synchronized processing of scRNA-seq and bulk RNA-seq. Both use [STAR](https://github.com/alexdobin/STAR) aligner to align reads to the reference genome. 
+These are the scripts used for CellGenIT for uniform processing of scRNA-seq - both 10X and quite a few other types (see below for supported platforms). Additionally, bulk RNA-seq can be processed using a 10X reference in a matched way - this should improve scRNA-seq-to-bulk mapping. All listed methods use the [STAR](https://github.com/alexdobin/STAR) aligner to align reads to the reference genome. 
 
 ## Software installation
 
 ### STAR and RSEM versions
 
-`STAR` of version 2.7.9a or above is recommended. The newest update includes the ability to correctly process multi-mapping reads, and adds many important options and bug fixes. 
+`STAR` of version 2.7.9a or above is recommended (2.7.10a is the latest and greatest, as of August'22). The newest update includes the ability to correctly process multi-mapping reads, and adds many important options and bug fixes. 
 
-In order to use settings that closely mimic those of `Cell Ranger` v4 or above (see explanations below, particularly `--clipAdapterType CellRanger4` option), `STAR` needs to be re-compiled from source with `make STAR CXXFLAGS_SIMD="-msse4.2"` (see [this issue](https://github.com/alexdobin/STAR/issues/1218) for more info). If you get the "Illegal instruction" error, that's what you need to do. 
+In order to use settings that closely mimic those of `Cell Ranger` v4 or above (see explanations below, particularly `--clipAdapterType CellRanger4` option), `STAR` needs to be re-compiled from source with `make STAR CXXFLAGS_SIMD="-msse4.2"` (see [this issue](https://github.com/alexdobin/STAR/issues/1218) for more info). If you get the _Illegal instruction_ error, that's what you need to do. 
 
 There's also Martin Prete's awesome `icpc`-compiled version of `STAR` that's being tested right now - stay tuned for the updates. 
 
@@ -38,7 +38,7 @@ All **CellGenIT** pre-made `STAR` references are located in `/nfs/cellgeni/STAR/
 
 ## Processing scRNA-seq with STARsolo
 
-### Reprodicing `Cell Ranger` v4 and above (but much faster)
+### 10X: reproducing `Cell Ranger` v4 and above (but much faster)
 
 Full scripts with the latest settings are available in `/scripts` (there are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'). The scripts contain *many* options that frequently change, some of which will be explained below. In general, commands are tuned in such a way that the results will be very close to those of `Cell Ranger` v4 and above. 
 
@@ -52,18 +52,19 @@ Below are the explanations for some of the options (note that 5' experiments **a
 
 | 10X VERSION | BC | UMILEN | STR |
 |:-:|:-:|:-:|:-:|
-| 3' v1 | 737K-april-2014_rc.txt |10 | Forward |
-| 3' v2 | 737K-august-2016.txt |10 | Forward |
-| 3' v3, v3.1 | 3M-february-2018.txt |12 | Forward |
-| 5' v1.1, v2 | 737K-august-2016.txt |10 | Reverse |
+| 3' v1 | 737K-april-2014_rc.txt | 10 | Forward |
+| 3' v2 | 737K-august-2016.txt | 10 | Forward |
+| 3' v3, v3.1 | 3M-february-2018.txt | 12 | Forward |
+| 5' v1.1, v2 | 737K-august-2016.txt | 10 | Reverse |
 | 5' v3 | 737K-august-2016.txt | 12 | Reverse |
+| multiome | 737K-arc-v1.txt | 12 | Forward |
 
 </div>
 
   - `--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30` are options that define UMI collapsing, barcode collapsing, and read clipping algorithms that are closest to ones used by `Cell Ranger`; 
   - `--soloCellFilter EmptyDrops_CR` specifies the cell filtering algorithm used in [EmptyDrops](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html), which is the default algorithm in later versions of `Cell Ranger`; 
   - `--soloFeatures Gene GeneFull Velocyto` output conventional (exon-only) UMI counts, as well as exon+intron UMI counts (analog of `Cell Ranger` premrna option), as well as matrices preprocessed for `Velocyto`; 
-  - `--soloMultiMappers Unique EM` is to count multimappers; 
+  - `--soloMultiMappers Unique EM` is to count multimappers (on by default in v3.0+ of these scripts; does not influence the main output, but creates an additional matrix in `/raw` subdir of `Gene` and `GeneFull`); 
   - `--readFilesCommand zcat` is used if your input fastq files are gzipped;
   - options grouped as `$SORTEDBAM` should be used if you need a genomic bam file; otherwise, use `$NOBAM`.  
 
@@ -76,14 +77,14 @@ STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_
      --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
      --soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
      --soloFeatures Gene GeneFull Velocyto \
-     --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
+     --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx --soloMultiMappers EM
 ```
 
-### Using STARsolo for Smart-seq/Smart-seq2
+### Using STARsolo for SMART-seq/SMART-seq2
 
 For plate-based methods that don't use UMIs (such as [SMART-Seq and SMART-Seq2](https://teichlab.github.io/scg_lib_structs/methods_html/SMART-seq_family.html)), `STARsolo` can be used as well. Fastq files for these methods usually come as separate, paired-end files; all of these should be listed in a *manifest* file - plain text, tab-separated file containing three columns per line: 1) full path to R1; 2) full path to R2; 3) cell name or ID. 
 
-Example of a script used to process Smart-seq2 data can be found in `/scripts/starsolo_ss2.sh`. Actual `STAR` command being run:
+An example of a script used to process Smart-seq2 data can be found in `/scripts/starsolo_ss2.sh`. Key parameters that could be adjusted are `--outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3`; the higher they are, the less permissive the alignment. Lower values can help you "rescue" a large proportion of reads with high adapter content (see below for adapter trimming). Actual `STAR` command being run:
 
 ```bash
 STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand zcat $SORTEDBAM \
@@ -93,24 +94,39 @@ STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand
      --soloFeatures Gene GeneFull --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
 ```
 
-Sometimes, reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.  
+Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.  
 
 ### Counting the multimapping reads
 
 The default approach used by `Cell Ranger` (and `STARsolo` scripts above) is to discard all reads that map to multiple genomic locations with equal mapping quality. This approach creates a bias in gene expression estimation. Pseudoalignment-based methods correctly quantify multimapping reads, but generate false counts due to pseudo-alignment errors. These issues are described in good detail [here](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1). 
 
-If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM`. This will generate an extra matrix in the /raw output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`). 
+If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`). 
+
+As of `STAR` v2.7.10a, multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing. 
+
+### Running STARsolo on other scRNA-seq platforms 
+
+STARsolo is very flexible and can be used with almost any scRNA-seq method, provided you know the library structure - i.e. where cell barcodes, UMIs, and biological parts of the read are located in the sequencing fragment or reads. A great source of information about scRNA-seq library structures is [this page](https://teichlab.github.io/scg_lib_structs/).
+
+Currently, our scripts directory provides dedicated scripts for
+  - Drop-seq;
+  - inDrops; 
+  - STRT-seq. 
+
+Please contact `CellGenIT` if you need to process an unusual dataset. 
 
 ## Quick evaluation of multiple STARsolo runs
 
-If you've used these scripts to process multiple 10x samples, you can get a quick look at the results by copying `solo_QC.sh` script from this repo to the directory with `STARsolo` output folders, and running
+If you've used these scripts to process multiple 10X samples, you can get a quick look at the results by copying `solo_QC.sh` script from this repo to the directory with `STARsolo` output folders, and running
 
 ```bash
-./solo_QC.sh <output_tag> | column -t 
+./solo_QC.sh | column -t 
 ```
 
 The **output_tag** argument could be any common part of the folder name - e.g. if samples are called SRR124444 .. SRR124534, you can use **SRR124** or **SRR**.
 
+The script is designed for 10X or other droplet-based methods; the output will make a lot less sense for SMART-seq2. 
+
 ## Processing bulk RNA-seq with STAR/RSEM
 
 `RSEM` reference files need to be prepared from genome fasta and GTF using the following command: 

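The README text above notes that the multimapper matrices contain non-integer values and suggests rounding with `awk`. A minimal sketch of that step, assuming a three-column Matrix Market coordinate file with non-negative values; the function name `round_mtx` and the path in the usage line are placeholders, not part of these scripts:

```bash
#!/bin/bash
# Hypothetical helper (not in the repo): round the value column of a
# Matrix Market coordinate file to the nearest integer.
round_mtx() {
  awk '
    /^%/      { print; next }                 # header/comment lines pass through
    !seen_dim { print; seen_dim = 1; next }   # first data line holds: rows cols nnz
              { printf "%s %s %d\n", $1, $2, int($3 + 0.5) }  # assumes counts >= 0
  ' "$1"
}

# Usage (placeholder path):
# round_mtx path/to/multimapper_matrix.mtx > rounded.mtx
```

Entries that round to zero are kept as explicit zeros, so the entry count in the size line stays valid. The header line is left untouched; whether a downstream reader cares about its declared value type is something to check for your tool.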
diff --git a/scripts/bbduk.sh b/scripts/bbduk.sh
@@ -0,0 +1,6 @@
+#!/bin/bash 
+
+TAG=$1
+ADAPTERS=/nfs/users/nfs_a/ap41/bbmap/resources/adapters.fa
+
+bbduk.sh in1=${TAG}_1.fastq.gz in2=${TAG}_2.fastq.gz out1=$TAG.bbduk.R1.fastq out2=$TAG.bbduk.R2.fastq ref=$ADAPTERS trimpolya=10 ktrim=r k=23 mink=11 hdist=1 tpe tbo &> $TAG.bbduk.log
diff --git a/scripts/solo_QC.sh b/scripts/solo_QC.sh
@@ -1,6 +1,6 @@
 #!/bin/bash 
 
-echo -e "Sample\tRd_all\tRd_in_cells\tFrc_in_cells\tUMI_in_cells\tCells\tMed_nFeature\tGood_BC\tall_u+m\tall_u\texon_u+m\texon_u\tfull_u+m\tfull_u"
+echo -e "Sample\tRd_all\tRd_in_cells\tFrc_in_cells\tUMI_in_cells\tCells\tMed_nFeature\tGood_BC\tStrand\tall_u+m\tall_u\texon_u+m\texon_u\tfull_u+m\tfull_u"
 
 for i in *
 do
@@ -10,15 +10,20 @@ do
     B=`grep "Reads With Valid Barcodes," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
     G1=`grep "Reads Mapped to Genome: Unique+Multiple," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
     G2=`grep "Reads Mapped to Genome: Unique," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
-    T1=`grep "Reads Mapped to Gene: Unique+Multipe Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
-    T2=`grep "Reads Mapped to Gene: Unique Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
-    F1=`grep "Reads Mapped to GeneFull: Unique+Multipe GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
+    E1=`grep "Reads Mapped to Gene: Unique+Multip.*e Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
+    E2=`grep "Reads Mapped to Gene: Unique Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
+    F1=`grep "Reads Mapped to GeneFull: Unique+Multip.*e GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
     F2=`grep "Reads Mapped to GeneFull: Unique GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
     C=`grep "Estimated Number of Cells," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
     R2=`grep "Unique Reads in Cells Mapped to GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
     CF=`echo $R1 | awk -v v=$R2 '{printf "%.3f\n",v/$1}'`
     R3=`grep "UMIs in Cells," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
     GC=`grep "Median GeneFull per Cell," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
-    echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$G1\t$G2\t$T1\t$T2\t$F1\t$F2"
+    ST=`grep "^soloStrand" $i/Log.out | grep RE-DEFINED | awk '{print $2}'`
+    if [[ $ST == "" ]]
+    then
+      ST="Undef"
+    fi
+    echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$ST\t$G1\t$G2\t$E1\t$E2\t$F1\t$F2"
   fi
 done
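
The relaxed `Multip.*e` pattern in the `E1` and `F1` lines above looks intended to match both the misspelled "Multipe" label and the corrected "Multiple" label in `Summary.csv`; which `STAR` versions emit which spelling is not stated in this commit. A minimal sketch of that lookup, using a hypothetical helper name (`get_metric`) and made-up sample values:

```bash
#!/bin/bash
# Hypothetical helper mirroring the grep | awk idiom used in solo_QC.sh:
# $1 = label pattern, $2 = path to a Summary.csv file.
get_metric() {
  grep "$1" "$2" | awk -F "," '{print $2}'
}

# Two one-line stand-ins for Summary.csv, one per spelling (values are invented).
old=$(mktemp); new=$(mktemp)
echo "Reads Mapped to Gene: Unique+Multipe Gene,0.80"  > "$old"
echo "Reads Mapped to Gene: Unique+Multiple Gene,0.85" > "$new"

# In basic grep syntax "+" is a literal character and ".*" matches zero or more characters.
PATTERN="Reads Mapped to Gene: Unique+Multip.*e Gene,"
get_metric "$PATTERN" "$old"   # prints 0.80
get_metric "$PATTERN" "$new"   # prints 0.85
rm -f "$old" "$new"
```

The same pattern does not match the `Unique Gene,` line, so `E2` and `F2` are unaffected.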