Update to syntax & using STAR v2.7.10a
apredeus committed Aug 9, 2022
1 parent 8f5bda9 commit 31ceff5
Showing 8 changed files with 184 additions and 94 deletions.
50 changes: 33 additions & 17 deletions README.md
@@ -1,14 +1,14 @@
# Synchronized processing of bulk and scRNA-seq
# Wrapper scripts for using `STARsolo` with various types of single cell RNA-seq

These are the scripts used for CellGenIT for synchronized processing of scRNA-seq and bulk RNA-seq. Both use [STAR](https://github.com/alexdobin/STAR) aligner to align reads to the reference genome.
These are the scripts used for CellGenIT for uniform processing of scRNA-seq - both 10X and quite a few other types (see below for supported platforms). Additionally, bulk RNA-seq could be processed using a 10X reference in a matched way - this should improve scRNA-seq-to-bulk mapping. All listed methods use [STAR](https://github.com/alexdobin/STAR) aligner to align reads to the reference genome.

## Software installation

### STAR and RSEM versions

`STAR` of version 2.7.9a or above is recommended. The newest update includes the ability to correctly process multi-mapping reads, and adds many important options and bug fixes.
`STAR` of version 2.7.9a or above is recommended (2.7.10a is the latest and greatest, as of August'22). The newest update includes the ability to correctly process multi-mapping reads, and adds many important options and bug fixes.

In order to use settings that closely mimic those of `Cell Ranger` v4 or above (see explanations below, particularly `--clipAdapterType CellRanger4` option), `STAR` needs to be re-compiled from source with `make STAR CXXFLAGS_SIMD="-msse4.2"` (see [this issue](https://github.com/alexdobin/STAR/issues/1218) for more info). If you get the "Illegal instruction" error, that's what you need to do.
In order to use settings that closely mimic those of `Cell Ranger` v4 or above (see explanations below, particularly `--clipAdapterType CellRanger4` option), `STAR` needs to be re-compiled from source with `make STAR CXXFLAGS_SIMD="-msse4.2"` (see [this issue](https://github.com/alexdobin/STAR/issues/1218) for more info). If you get the _Illegal instruction_ error, that's what you need to do.

There's also Martin Prete's awesome `icpc`-compiled version of `STAR` that's being tested right now - stay tuned for the updates.

@@ -38,7 +38,7 @@ All **CellGenIT** pre-made `STAR` references are located in `/nfs/cellgeni/STAR/

## Processing scRNA-seq with STARsolo

### Reproducing `Cell Ranger` v4 and above (but much faster)
### 10X: reproducing `Cell Ranger` v4 and above (but much faster)

Full scripts with the latest settings are available in `/scripts`. There are several scripts according to 10x chemistry version; e.g. `starsolo_3p_v3.sh` should be used for v3 of 3' 10x, while `starsolo_5p_v2.sh` should be used for v2 of 5'. The scripts contain *many* options that frequently change, some of which are explained below. In general, the commands are tuned so that the results will be very close to those of `Cell Ranger` v4 and above.

@@ -52,18 +52,19 @@ Below are the explanations for some of the options (note that 5' experiments **a

| 10X VERSION | BC | UMILEN | STR |
|:-:|:-:|:-:|:-:|
| 3' v1 | 737K-april-2014_rc.txt |10 | Forward |
| 3' v2 | 737K-august-2016.txt |10 | Forward |
| 3' v3, v3.1 | 3M-february-2018.txt |12 | Forward |
| 5' v1.1, v2 | 737K-august-2016.txt |10 | Reverse |
| 3' v1 | 737K-april-2014_rc.txt | 10 | Forward |
| 3' v2 | 737K-august-2016.txt | 10 | Forward |
| 3' v3, v3.1 | 3M-february-2018.txt | 12 | Forward |
| 5' v1.1, v2 | 737K-august-2016.txt | 10 | Reverse |
| 5' v3 | 737K-august-2016.txt | 12 | Reverse |
| multiome | 737K-arc-v1.txt | 12 | Forward |

</div>

- `--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30` are options that define UMI collapsing, barcode collapsing, and read clipping algorithms that are closest to ones used by `Cell Ranger`;
- `--soloCellFilter EmptyDrops_CR` specifies the cell filtering algorithm used in [EmptyDrops](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html), which is the default algorithm in later versions of `Cell Ranger`;
- `--soloFeatures Gene GeneFull Velocyto` outputs conventional (exon-only) UMI counts, exon+intron UMI counts (analog of the `Cell Ranger` premrna option), and matrices preprocessed for `Velocyto`;
- `--soloMultiMappers Unique EM` is to count multimappers;
- `--soloMultiMappers Unique EM` is to count multimappers (on by default in v3.0+ of these scripts; does not influence the main output, but creates an additional matrix in `/raw` subdir of `Gene` and `GeneFull`);
- `--readFilesCommand zcat` is used if your input fastq files are gzipped;
- options grouped as `$SORTEDBAM` should be used if you need a genomic bam file; otherwise, use `$NOBAM`.
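
The chemistry table above can be expressed as a small lookup, which may be handy when one wrapper has to handle several 10X chemistries. This is only a sketch: the function name `set_10x_params` and the chemistry labels (`3p_v3`, `5p_v2`, and so on) are hypothetical and are not taken from the repository scripts. The values themselves come straight from the table, and `--soloCBwhitelist`, `--soloUMIlen` and `--soloStrand` are the `STAR` options they would normally feed.

```bash
# Hypothetical helper (not part of the repository scripts): map a 10X chemistry
# label to the whitelist, UMI length and strand listed in the table above.
set_10x_params() {
  case "$1" in
    3p_v1)          BC=737K-april-2014_rc.txt; UMILEN=10; STR=Forward ;;
    3p_v2)          BC=737K-august-2016.txt;   UMILEN=10; STR=Forward ;;
    3p_v3|3p_v3.1)  BC=3M-february-2018.txt;   UMILEN=12; STR=Forward ;;
    5p_v1.1|5p_v2)  BC=737K-august-2016.txt;   UMILEN=10; STR=Reverse ;;
    5p_v3)          BC=737K-august-2016.txt;   UMILEN=12; STR=Reverse ;;
    multiome)       BC=737K-arc-v1.txt;        UMILEN=12; STR=Forward ;;
    *) echo "Unknown 10X chemistry: $1" >&2; return 1 ;;
  esac
}

# Example: print the barcode/UMI/strand options for 3' v3 data.
# Only the whitelist file name is printed; prepend your own whitelist directory.
set_10x_params 3p_v3 && echo "--soloCBwhitelist $BC --soloUMIlen $UMILEN --soloStrand $STR"
```

Keeping the mapping in one place means a new chemistry only needs one extra `case` branch.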

@@ -76,14 +77,14 @@ STAR --runThreadN $CPUS --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR \
--soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
--soloFeatures Gene GeneFull Velocyto \
--soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
--soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx --soloMultiMappers EM
```

### Using STARsolo for Smart-seq/Smart-seq2
### Using STARsolo for SMART-seq/SMART-seq2

For plate-based methods that don't use UMIs (such as [SMART-Seq and SMART-Seq2](https://teichlab.github.io/scg_lib_structs/methods_html/SMART-seq_family.html)), `STARsolo` can be used as well. Fastq files for these methods usually come as separate, paired-end files; all of these should be listed in a *manifest* file - plain text, tab-separated file containing three columns per line: 1) full path to R1; 2) full path to R2; 3) cell name or ID.
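
A manifest in the three-column format described above could be generated along these lines. This is a minimal sketch under stated assumptions: the function name `make_manifest` is hypothetical, and it assumes that paired files are named `<cell>_1.fastq.gz` and `<cell>_2.fastq.gz` and sit in one directory. Adjust the suffixes to your own naming scheme.

```bash
# Hypothetical helper: print a tab-separated manifest with
# 1) full path to R1; 2) full path to R2; 3) cell name or ID.
# Assumes files are named <cell>_1.fastq.gz / <cell>_2.fastq.gz.
make_manifest() {
  local fqdir r1 r2 cell
  fqdir=$(cd "$1" && pwd) || return 1     # resolve to an absolute path
  for r1 in "$fqdir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue              # glob matched nothing
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "WARNING: no R2 found for $r1, skipping" >&2
      continue
    fi
    cell=$(basename "$r1" _1.fastq.gz)
    printf '%s\t%s\t%s\n' "$r1" "$r2" "$cell"
  done
}
```

For example, `make_manifest /path/to/fastqs > manifest.tsv` would write one line per cell that has both mates present.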

Example of a script used to process Smart-seq2 data can be found in `/scripts/starsolo_ss2.sh`. Actual `STAR` command being run:
Example of a script used to process Smart-seq2 data can be found in `/scripts/starsolo_ss2.sh`. Key parameters that could be adjusted are `--outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3`; the higher they are, the less permissive is the alignment. Lower values can help you "rescue" a large proportion of reads with high adapter content (see below for adapter trimming). Actual `STAR` command being run:

```bash
STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand zcat $SORTEDBAM \
@@ -93,24 +94,39 @@ STAR --runThreadN $CPUS --genomeDir $REF --runDirPerm All_RWX --readFilesCommand
--soloFeatures Gene GeneFull --soloOutFileNames output/ genes.tsv barcodes.tsv matrix.mtx
```

Sometimes, reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.
Often, SMART-seq2 reads can benefit from trimming adapters, which can be turned on using `--clip3pAdapterSeq <3' adapter sequence>` option. Alternatively, `bbduk.sh` can be used to trim adapters from reads prior to the alignment and quantification.

### Counting the multimapping reads

The default approach used by `Cell Ranger` (and the `STARsolo` scripts above) is to discard all reads that map to multiple genomic locations with equal mapping quality. This approach creates a bias in gene expression estimation. Pseudoalignment-based methods correctly quantify multimapping reads, but generate false counts due to pseudo-alignment errors. These issues are described in good detail [here](https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1).

If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM`. This will generate an extra matrix in the /raw output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`).
If you would like to process multimappers, add the following options: `--soloMultiMappers Uniform EM` (on by default in v3.0+ of these scripts). This will generate an extra matrix in the `/raw` output folders. There will be non-integer numbers in the matrix because of split reads. If the downstream processing requires integers, you can round with a tool of your liking (e.g. `awk`).
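
The rounding step mentioned above might look like the following `awk` sketch. The function name `round_mtx` is hypothetical. It assumes the extra matrix is a plain (uncompressed) Matrix Market coordinate file: `%` comment lines, then one dimension line, then `row column value` entries. Values are rounded half-up. Entries below 0.5 become explicit zeros rather than being dropped, so the entry count in the dimension line stays valid.

```bash
# Hypothetical helper: round the values of a Matrix Market coordinate file
# to the nearest integer (half-up); header and dimension lines are unchanged.
round_mtx() {
  awk '
    /^%/       { print; next }                 # comment/header lines pass through
    !seen_dims { seen_dims = 1; print; next }  # first non-comment line: dimensions
               { printf "%s %s %d\n", $1, $2, int($3 + 0.5) }  # counts are non-negative
  ' "$1"
}
```

Usage would be `round_mtx input.mtx > rounded.mtx`, with the input being whichever multimapper matrix was written to the `/raw` folder.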

As of `STAR` v2.7.10a, multimapper counting still does not work for SMART-seq2 or bulk RNA-seq processing.

### Running STARsolo on other scRNA-seq platforms

STARsolo is very flexible and can be used with almost any scRNA-seq method, provided you know the library structure - i.e. where cell barcodes, UMIs, and biological parts of the read are located in the sequencing fragment or reads. A great source of information about scRNA-seq library structures is [this page](https://teichlab.github.io/scg_lib_structs/).

Currently, our scripts directory provides dedicated scripts for
- Drop-seq;
- inDrops;
- STRT-seq.

Please contact `CellGenIT` if you need to process an unusual dataset.

## Quick evaluation of multiple STARsolo runs

If you've used these scripts to process multiple 10x samples, you can get a quick look at the results by copying `solo_QC.sh` script from this repo to the directory with `STARsolo` output folders, and running
If you've used these scripts to process multiple 10X samples, you can get a quick look at the results by copying `solo_QC.sh` script from this repo to the directory with `STARsolo` output folders, and running

```bash
./solo_QC.sh <output_tag> | column -t
./solo_QC.sh | column -t
```

The **output_tag** argument could be any common part of the folder name - e.g. if samples are called SRR124444 .. SRR124534, you can use **SRR124** or **SRR**.

The script is designed for 10X or other droplet-based methods; the output will make a lot less sense for SMART-seq2.

## Processing bulk RNA-seq with STAR/RSEM

`RSEM` reference files need to be prepared from genome fasta and GTF using the following command:
6 changes: 6 additions & 0 deletions scripts/bbduk.sh
@@ -0,0 +1,6 @@
#!/bin/bash

TAG=$1
ADAPTERS=/nfs/users/nfs_a/ap41/bbmap/resources/adapters.fa

bbduk.sh in1=${TAG}_1.fastq.gz in2=${TAG}_2.fastq.gz out1=$TAG.bbduk.R1.fastq out2=$TAG.bbduk.R2.fastq ref=$ADAPTERS trimpolya=10 ktrim=r k=23 mink=11 hdist=1 tpe tbo &> $TAG.bbduk.log
15 changes: 10 additions & 5 deletions scripts/solo_QC.sh
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/bin/bash

echo -e "Sample\tRd_all\tRd_in_cells\tFrc_in_cells\tUMI_in_cells\tCells\tMed_nFeature\tGood_BC\tall_u+m\tall_u\texon_u+m\texon_u\tfull_u+m\tfull_u"
echo -e "Sample\tRd_all\tRd_in_cells\tFrc_in_cells\tUMI_in_cells\tCells\tMed_nFeature\tGood_BC\tStrand\tall_u+m\tall_u\texon_u+m\texon_u\tfull_u+m\tfull_u"

for i in *
do
@@ -10,15 +10,20 @@ do
B=`grep "Reads With Valid Barcodes," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
G1=`grep "Reads Mapped to Genome: Unique+Multiple," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
G2=`grep "Reads Mapped to Genome: Unique," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
T1=`grep "Reads Mapped to Gene: Unique+Multipe Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
T2=`grep "Reads Mapped to Gene: Unique Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
F1=`grep "Reads Mapped to GeneFull: Unique+Multipe GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
E1=`grep "Reads Mapped to Gene: Unique+Multip.*e Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
E2=`grep "Reads Mapped to Gene: Unique Gene," $i/output/Gene/Summary.csv | awk -F "," '{print $2}'`
F1=`grep "Reads Mapped to GeneFull: Unique+Multip.*e GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
F2=`grep "Reads Mapped to GeneFull: Unique GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
C=`grep "Estimated Number of Cells," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
R2=`grep "Unique Reads in Cells Mapped to GeneFull," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
CF=`echo $R1 | awk -v v=$R2 '{printf "%.3f\n",v/$1}'`
R3=`grep "UMIs in Cells," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
GC=`grep "Median GeneFull per Cell," $i/output/GeneFull/Summary.csv | awk -F "," '{print $2}'`
echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$G1\t$G2\t$T1\t$T2\t$F1\t$F2"
ST=`grep "^soloStrand" $i/Log.out | grep RE-DEFINED | awk '{print $2}'`
if [[ $ST == "" ]]
then
ST="Undef"
fi
echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$ST\t$G1\t$G2\t$E1\t$E2\t$F1\t$F2"
fi
done