More updates and fixes - STAR v2.7.10a overhaul

cellgeni · Aug 9, 2022 · commit 4b44539
1 parent 31ceff5

Showing 6 changed files with 55 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -38,9 +38,9 @@ All **CellGenIT** pre-made `STAR` references are located in `/nfs/cellgeni/STAR/
 
 ## Processing scRNA-seq with STARsolo
 
-### 10X: reprodicing `Cell Ranger` v4 and above (but much faster)
+### 10X: reproducing `Cell Ranger` v4 and above (but much faster)
 
-Full scripts with the latest settings are available in `/scripts` (there are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'. The scripts contain *many* options that frequently change; some of which will be explained below. In general, commands are tuned in such way that the results with be very close to those of `Cell Ranger` v4 and above. 
+Full scripts with the latest settings are available in `/scripts` (there are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'. The scripts contain *many* options that frequently change; some of which will be explained below. In general, commands are tuned in such way that the results will be very close to those of `Cell Ranger` v4 and above. 
 
 Before running, barcode whitelists need to be downloaded [from here](https://github.com/10XGenomics/cellranger/tree/master/lib/python/cellranger/barcodes). 
 
@@ -94,15 +94,15 @@ STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand
      --soloFeatures Gene GeneFull --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
 ```
 
-Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.  
+Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `/scripts/bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification (this is the preferred option).  
 
 ### Counting the multimapping reads
 
 Default approach used by `Cell Ranger` (and `STARsolo` scripts above) is to discard all reads that map to multiple genomic locations with equal mapping quality. This approach creates a bias in gene expression estimation. Pseudocount-based methods correctly quantify multimapping reads, but generate false counts due to pseudo-alignment errors. These issues are described in good detail [here](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1). 
 
-If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`). 
+If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (or simply `--soloMultiMappers EM`; this is on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`). 
 
-As of `STAR` v2.7.10a, multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing. 
+As of `STAR` v2.7.10a, **multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing**. 
 
 ### Running STARsolo on other scRNA-seq platforms 
 

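Note on the README hunk above: it suggests rounding the non-integer multimapper matrix with `awk` but does not show how. A minimal sketch follows, assuming the matrix is a plain-text (uncompressed) Matrix Market coordinate file with `%` comment lines, one dimensions line, and `row column value` entries. The function name `round_mtx` and the file names in the usage note are hypothetical and not part of this commit.

```shell
#!/bin/bash

## round_mtx (hypothetical helper): round the value column of a Matrix Market
## coordinate file to the nearest integer and print the result to stdout.
## Assumes non-negative values, so int(x + 0.5) rounds half up.
round_mtx() {
  awk '
    /^%/       { print; next }                  # header and comment lines pass through unchanged
    !seen_dims { seen_dims = 1; print; next }   # first non-comment line holds the matrix dimensions
               { printf "%d %d %d\n", $1, $2, int($3 + 0.5) }  # row, column, rounded count
  ' "$1"
}
```

Usage would look like `round_mtx multimapper_matrix.mtx > multimapper_matrix.rounded.mtx` (both file names are placeholders). Entries that round to zero are kept as explicit zeros so that the entry count on the dimensions line stays valid; the header line is also left untouched, so a downstream reader that cares about the declared field type may need that adjusted.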
diff --git a/scripts/reorg_bulk_dir.sh b/scripts/reorg_bulk_dir.sh
@@ -3,9 +3,17 @@
 ## change sample tag for reuse!
 ## reorganize directories for bulk RNA-seq STAR+rsem processing 
 
+TAG=$1
+
+if [[ $TAG == "" ]]
+then
+  >&2 echo "Usage: ./reorg_bulk_dir.sh <sample_pattern>"
+  exit 1
+fi
+
 mkdir counts rsem_gene rsem_tran
 
-for i in PR*
+for i in $TAG*
 do
   echo "Processing sample $i - moving files.."
   cp $i/*genes.results rsem_gene

diff --git a/scripts/star_rsem_bulk.sh b/scripts/star_rsem_bulk.sh
@@ -1,9 +1,16 @@
 #!/bin/bash
 
 TAG=$1
+if [[ $TAG == "" ]]
+then
+  >&2 echo "Usage: ./star_rsem_bulk.sh <sample_tag>"
+  >&2 echo "(make sure you set the correct SREF, RREF, FQDIR, and PAIRED/STRAND/CPUS variables)"
+  exit 1
+fi
+
 SREF=/nfs/cellgeni/STAR/human/2020A/index
 RREF=/nfs/cellgeni/STAR/human/2020A/2020A_rsem
-FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-bulkseq/tic-1333/fastqs
+FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-bulkseq/tic-XXX/fastqs
 PAIRED="--paired-end"
 STRAND="--forward-prob 0"
 CPUS=16

diff --git a/scripts/starsolo_10x_auto.sh b/scripts/starsolo_10x_auto.sh
@@ -157,7 +157,7 @@ then
   STRAND=Reverse
 fi
 
-## finally, if paired-end experiment turned out to be 3', process it as single-end: 
+## finally, if paired-end experiment turned out to be 3' (yes, they do exist!), process it as single-end: 
 if [[ $STRAND == "Forward" && $PAIRED == "True" ]]
 then
   PAIRED=False

diff --git a/scripts/starsolo_indrops.sh b/scripts/starsolo_indrops.sh
@@ -16,6 +16,9 @@ CPUS=16                                                                ## typica
 REF=/nfs/cellgeni/STAR/human/2020A/index                               ## choose the appropriate reference 
 WL=/nfs/cellgeni/STAR/whitelists                                       ## directory with all barcode whitelists
 FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-starsolo/tic-XXX/fastqs  ## directory with your fastq files - can be in subdirs, just make sure tag is unique and greppable (e.g. no Sample1 and Sample 10). 
+ADAPTER=GAGTGATTGCTTGTGACGCCTT                                         ## these could be GAGTGATTGCTTGTGACGCCTT or GAGTGATTGCTTGTGACGCCAA, as far as I've seen 
+BC1=$WL/inDrops_Ambrose2_bc1.txt
+BC2=$WL/inDrops_Ambrose2_bc2.txt
 
 ## choose one of the two otions, depending on whether you need a BAM file 
 #BAM="--outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 2 --limitBAMsortRAM 120000000000 --outMultimapperOrder Random --runRNGseed 1 --outSAMattributes NH HI AS nM CB UB GX GN"
@@ -29,12 +32,8 @@ then
   exit 1
 fi
 
-
 mkdir $TAG && cd $TAG
 
-BC1=$WL/inDrops_Ambrose2_bc1.txt
-BC2=$WL/inDrops_Ambrose2_bc2.txt
-
 R1=""
 R2=""
 if [[ `find $FQDIR/* | grep $TAG | grep "_1\.fastq"` != "" ]]
@@ -61,8 +60,9 @@ then
   GZIP="--readFilesCommand zcat"
 fi
 
+## increased soloAdapterMismatchesNmax to 3, as per discussions in STAR issues
 STAR  --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX $GZIP $BAM \
-     --soloType CB_UMI_Complex --soloCBwhitelist $BC1 $BC2 --soloAdapterSequence GAGTGATTGCTTGTGACGCCTT  \
+     --soloType CB_UMI_Complex --soloCBwhitelist $BC1 $BC2 --soloAdapterSequence $ADAPTER  \
      --soloAdapterMismatchesNmax 3 --soloCBmatchWLtype 1MM --soloCBposition 0_0_2_-1 3_1_3_8 --soloUMIposition 3_9_3_14 \
      --soloFeatures Gene GeneFull --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx
 

diff --git a/scripts/starsolo_strt.sh b/scripts/starsolo_strt.sh
@@ -17,6 +17,8 @@ CPUS=16                                                                ## typica
 REF=/nfs/cellgeni/STAR/human/2020A/index                               ## choose the appropriate reference 
 WL=/nfs/cellgeni/STAR/whitelists                                       ## directory with all barcode whitelists
 FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-starsolo/tic-XXX/fastqs  ## directory with your fastq files - can be in subdirs, just make sure tag is unique and greppable (e.g. no Sample1 and Sample 10). 
+CBLEN=8
+UMILEN=8
 
 ## choose one of the two otions, depending on whether you need a BAM file 
 #BAM="--outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 2 --limitBAMsortRAM 120000000000 --outMultimapperOrder Random --runRNGseed 1 --outSAMattributes NH HI AS nM CB UB GX GN"
@@ -34,22 +36,38 @@ BC=$WL/96_barcodes.list
 
 mkdir $TAG && cd $TAG
 ## for multiple fastq files; change grep options according to your fastq file format 
-R1=`find $FQDIR/* | grep $TAG | grep "_f1.fastq.gz" | sort | tr '\n' ',' | sed "s/,$//g"`
-R2=`find $FQDIR/* | grep $TAG | grep "_r2.fastq.gz" | sort | tr '\n' ',' | sed "s/,$//g"`
-
-if [[ $R1 == "" || $R2 == "" ]]
+R1=""
+R2=""
+if [[ `find $FQDIR/* | grep $TAG | grep "_1\.fastq"` != "" ]]
 then
-  >&2 echo "No appropriate R1 or R2 read files was found for sample tag $TAG! Make sure you have set the correct FQDIR."
-  >&2 echo "Usage: ./starsolo_strt.sh <sample_tag>"
+  R1=`find $FQDIR/* | grep $TAG | grep "_1\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
+  R2=`find $FQDIR/* | grep $TAG | grep "_2\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
+elif [[ `find $FQDIR/* | grep $TAG | grep "R1\.fastq"` != "" ]]
+then
+  R1=`find $FQDIR/* | grep $TAG | grep "R1\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
+  R2=`find $FQDIR/* | grep $TAG | grep "R2\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
+elif [[ `find $FQDIR/* | grep $TAG | grep "_R1_.*\.fastq"` != "" ]]
+then
+  R1=`find $FQDIR/* | grep $TAG | grep "_R1_" | sort | tr '\n' ',' | sed "s/,$//g"`
+  R2=`find $FQDIR/* | grep $TAG | grep "_R2_" | sort | tr '\n' ',' | sed "s/,$//g"`
+else
+  >&2 echo "ERROR: No appropriate fastq files were found! Please check file formatting, and check if you have set the right FQDIR."
   exit 1
 fi
 
+GZIP=""
+## see if the original fastq files are archived: 
+if [[ `find $FQDIR/* | grep $TAG | grep "\.gz$"` != "" ]]
+then
+  GZIP="--readFilesCommand zcat"
+fi
+
 ## note the switched R2 and R1 compared to 10x! R1 is biological read in STRT-seq
-STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R1 $R2 --runDirPerm All_RWX --readFilesCommand zcat $NOBAM \
-     --soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloCBlen 8 --soloUMIstart 9 --soloUMIlen $UMILEN --soloStrand $STR \
+STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX $GZIP $BAM \
+     --soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloCBlen $CBLEN --soloUMIstart $((CBLEN+1)) --soloUMIlen $UMILEN --soloStrand $STRAND \
      --soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
      --soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
-     --soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx
+     --soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers EM
 
 ## gzip all outputs
 cd output