More updates and fixes - STAR v2.7.10a overhaul
apredeus committed Aug 9, 2022
1 parent 31ceff5 commit 4b44539
Showing 6 changed files with 55 additions and 22 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -38,9 +38,9 @@ All **CellGenIT** pre-made `STAR` references are located in `/nfs/cellgeni/STAR/

## Processing scRNA-seq with STARsolo

### 10X: reprodicing `Cell Ranger` v4 and above (but much faster)
### 10X: reproducing `Cell Ranger` v4 and above (but much faster)

Full scripts with the latest settings are available in `/scripts` (there are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'. The scripts contain *many* options that frequently change; some of which will be explained below. In general, commands are tuned in such way that the results with be very close to those of `Cell Ranger` v4 and above.
Full scripts with the latest settings are available in `/scripts` (there are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'. The scripts contain *many* options that frequently change; some of which will be explained below. In general, commands are tuned in such way that the results will be very close to those of `Cell Ranger` v4 and above.

Before running, barcode whitelists need to be downloaded [from here](https://github.com/10XGenomics/cellranger/tree/master/lib/python/cellranger/barcodes).

@@ -94,15 +94,15 @@ STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand
--soloFeatures Gene GeneFull --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
```

Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.
Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `/scripts/bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification (this is the preferred option).

### Counting the multimapping reads

Default approach used by `Cell Ranger` (and `STARsolo` scripts above) is to discard all reads that map to multiple genomic locations with equal mapping quality. This approach creates a bias in gene expression estimation. Pseudocount-based methods correctly quantify multimapping reads, but generate false counts due to pseudo-alignment errors. These issues are described in good detail [here](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1).

If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`).
If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (or simply `--soloMultiMappers EM`; this is on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`).
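
As a rough sketch of the rounding step, the snippet below rounds the value column of a Matrix Market file to the nearest integer with `awk`. The function name `round_mtx` and the file names in the usage line are illustrative and not part of these scripts. It assumes non-negative values in the three-column coordinate format, leaves the header lines untouched (so a `real` type declaration stays as is), and keeps entries that round to zero so that the entry count in the dimension line remains valid.

```shell
#!/bin/bash

## round_mtx (hypothetical helper): round the 3rd column of a Matrix Market
## coordinate file to the nearest integer; assumes non-negative values
round_mtx() {
  awk '
    /^%/       { print; next }                 # header/comment lines: copy as is
    !dims_seen { print; dims_seen = 1; next }  # first data line holds the dimensions
               { printf "%s %s %d\n", $1, $2, int($3 + 0.5) }  # round half up
  ' "$1"
}
```

Usage, with placeholder file names: `round_mtx multimapper_matrix.mtx > multimapper_matrix.rounded.mtx`. If the outputs were already gzipped, decompress first (e.g. with `zcat`) and pass the resulting file.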

As of `STAR` v2.7.10a, multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing.
As of `STAR` v2.7.10a, **multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing**.

### Running STARsolo on other scRNA-seq platforms

10 changes: 9 additions & 1 deletion scripts/reorg_bulk_dir.sh
@@ -3,9 +3,17 @@
## change sample tag for reuse!
## reorganize directories for bulk RNA-seq STAR+rsem processing

TAG=$1

if [[ $TAG == "" ]]
then
>&2 echo "Usage: ./reorg_bulk_dir.sh <sample_pattern>"
exit 1
fi

mkdir counts rsem_gene rsem_tran

for i in PR*
for i in $TAG*
do
echo "Processing sample $i - moving files.."
cp $i/*genes.results rsem_gene
9 changes: 8 additions & 1 deletion scripts/star_rsem_bulk.sh
@@ -1,9 +1,16 @@
#!/bin/bash

TAG=$1
if [[ $TAG == "" ]]
then
>&2 echo "Usage: ./star_rsem_bulk.sh <sample_tag>"
>&2 echo "(make sure you set the correct SREF, RREF, FQDIR, and PAIRED/STRAND/CPUS variables)"
exit 1
fi

SREF=/nfs/cellgeni/STAR/human/2020A/index
RREF=/nfs/cellgeni/STAR/human/2020A/2020A_rsem
FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-bulkseq/tic-1333/fastqs
FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-bulkseq/tic-XXX/fastqs
PAIRED="--paired-end"
STRAND="--forward-prob 0"
CPUS=16
2 changes: 1 addition & 1 deletion scripts/starsolo_10x_auto.sh
@@ -157,7 +157,7 @@ then
STRAND=Reverse
fi

## finally, if paired-end experiment turned out to be 3', process it as single-end:
## finally, if paired-end experiment turned out to be 3' (yes, they do exist!), process it as single-end:
if [[ $STRAND == "Forward" && $PAIRED == "True" ]]
then
PAIRED=False
10 changes: 5 additions & 5 deletions scripts/starsolo_indrops.sh
@@ -16,6 +16,9 @@ CPUS=16 ## typica
REF=/nfs/cellgeni/STAR/human/2020A/index ## choose the appropriate reference
WL=/nfs/cellgeni/STAR/whitelists ## directory with all barcode whitelists
FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-starsolo/tic-XXX/fastqs ## directory with your fastq files - can be in subdirs, just make sure tag is unique and greppable (e.g. no Sample1 and Sample 10).
ADAPTER=GAGTGATTGCTTGTGACGCCTT ## these could be GAGTGATTGCTTGTGACGCCTT or GAGTGATTGCTTGTGACGCCAA, as far as I've seen
BC1=$WL/inDrops_Ambrose2_bc1.txt
BC2=$WL/inDrops_Ambrose2_bc2.txt

## choose one of the two otions, depending on whether you need a BAM file
#BAM="--outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 2 --limitBAMsortRAM 120000000000 --outMultimapperOrder Random --runRNGseed 1 --outSAMattributes NH HI AS nM CB UB GX GN"
@@ -29,12 +32,8 @@ then
exit 1
fi


mkdir $TAG && cd $TAG

BC1=$WL/inDrops_Ambrose2_bc1.txt
BC2=$WL/inDrops_Ambrose2_bc2.txt

R1=""
R2=""
if [[ `find $FQDIR/* | grep $TAG | grep "_1\.fastq"` != "" ]]
@@ -61,8 +60,9 @@ then
GZIP="--readFilesCommand zcat"
fi

## increased soloAdapterMismatchesNmax to 3, as per discussions in STAR issues
STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX $GZIP $BAM \
--soloType CB_UMI_Complex --soloCBwhitelist $BC1 $BC2 --soloAdapterSequence GAGTGATTGCTTGTGACGCCTT \
--soloType CB_UMI_Complex --soloCBwhitelist $BC1 $BC2 --soloAdapterSequence $ADAPTER \
--soloAdapterMismatchesNmax 3 --soloCBmatchWLtype 1MM --soloCBposition 0_0_2_-1 3_1_3_8 --soloUMIposition 3_9_3_14 \
--soloFeatures Gene GeneFull --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx

36 changes: 27 additions & 9 deletions scripts/starsolo_strt.sh
@@ -17,6 +17,8 @@ CPUS=16 ## typica
REF=/nfs/cellgeni/STAR/human/2020A/index ## choose the appropriate reference
WL=/nfs/cellgeni/STAR/whitelists ## directory with all barcode whitelists
FQDIR=/lustre/scratch117/cellgen/cellgeni/TIC-starsolo/tic-XXX/fastqs ## directory with your fastq files - can be in subdirs, just make sure tag is unique and greppable (e.g. no Sample1 and Sample 10).
CBLEN=8
UMILEN=8

## choose one of the two otions, depending on whether you need a BAM file
#BAM="--outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 2 --limitBAMsortRAM 120000000000 --outMultimapperOrder Random --runRNGseed 1 --outSAMattributes NH HI AS nM CB UB GX GN"
@@ -34,22 +36,38 @@ BC=$WL/96_barcodes.list

mkdir $TAG && cd $TAG
## for multiple fastq files; change grep options according to your fastq file format
R1=`find $FQDIR/* | grep $TAG | grep "_f1.fastq.gz" | sort | tr '\n' ',' | sed "s/,$//g"`
R2=`find $FQDIR/* | grep $TAG | grep "_r2.fastq.gz" | sort | tr '\n' ',' | sed "s/,$//g"`

if [[ $R1 == "" || $R2 == "" ]]
R1=""
R2=""
if [[ `find $FQDIR/* | grep $TAG | grep "_1\.fastq"` != "" ]]
then
>&2 echo "No appropriate R1 or R2 read files was found for sample tag $TAG! Make sure you have set the correct FQDIR."
>&2 echo "Usage: ./starsolo_strt.sh <sample_tag>"
R1=`find $FQDIR/* | grep $TAG | grep "_1\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
R2=`find $FQDIR/* | grep $TAG | grep "_2\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
elif [[ `find $FQDIR/* | grep $TAG | grep "R1\.fastq"` != "" ]]
then
R1=`find $FQDIR/* | grep $TAG | grep "R1\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
R2=`find $FQDIR/* | grep $TAG | grep "R2\.fastq" | sort | tr '\n' ',' | sed "s/,$//g"`
elif [[ `find $FQDIR/* | grep $TAG | grep "_R1_.*\.fastq"` != "" ]]
then
R1=`find $FQDIR/* | grep $TAG | grep "_R1_" | sort | tr '\n' ',' | sed "s/,$//g"`
R2=`find $FQDIR/* | grep $TAG | grep "_R2_" | sort | tr '\n' ',' | sed "s/,$//g"`
else
>&2 echo "ERROR: No appropriate fastq files were found! Please check file formatting, and check if you have set the right FQDIR."
exit 1
fi

GZIP=""
## see if the original fastq files are archived:
if [[ `find $FQDIR/* | grep $TAG | grep "\.gz$"` != "" ]]
then
GZIP="--readFilesCommand zcat"
fi

## note the switched R2 and R1 compared to 10x! R1 is biological read in STRT-seq
STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R1 $R2 --runDirPerm All_RWX --readFilesCommand zcat $NOBAM \
--soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloCBlen 8 --soloUMIstart 9 --soloUMIlen $UMILEN --soloStrand $STR \
STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX $GZIP $BAM \
--soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloCBlen $CBLEN --soloUMIstart $((CBLEN+1)) --soloUMIlen $UMILEN --soloStrand $STRAND \
--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
--soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
--soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx
--soloFeatures Gene GeneFull Velocyto --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers EM

## gzip all outputs
cd output