Merge pull request #526 from drpatelh/salmon

Replace featureCounts with Salmon quant
nf-core · Dec 11, 2020 · 69d024f
2 parents 863f68b + f3d7667
commit 69d024f
Showing 28 changed files with 575 additions and 546 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -34,8 +34,8 @@ jobs:
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test,docker
 
-  star:
-    name: Test STAR with workflow parameters
+  star_salmon:
+    name: Test STAR Salmon with workflow parameters
     if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq') }}
     runs-on: ubuntu-latest
     env:
@@ -49,6 +49,7 @@ jobs:
           - '--skip_trimming'
           - '--gtf false'
           - '--star_index false'
+          - '--transcript_fasta false'
           - '--min_mapped_reads 90'
           - '--with_umi'
           - '--with_umi --skip_trimming'
@@ -63,7 +64,7 @@ jobs:
 
       - name: Run pipeline with STAR and various parameters
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --aligner star ${{ matrix.parameters }}
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --aligner star_salmon ${{ matrix.parameters }}
 
   star_rsem:
     name: Test STAR RSEM with workflow parameters

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,15 +5,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## v2.1dev - [date]
 
-* Fix tximport ingestion data to use lengthScaled and Scaled parameters: [#499][https://github.com/nf-core/rnaseq/issues/499]
+### Major enhancements
+
+* The aligned BAM files generated by `--aligner star` will now be quantified using Salmon instead of featureCounts. As a result, the name of this option has now been changed to `--aligner star_salmon` and this will now be the default route through the pipeline. This decision was made primarily because of the limitations of featureCounts to appropriately quantify gene expression data. Please see [Zhao et al., 2015](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910#pone-0141910-t001) and [Soneson et al., 2015](https://f1000research.com/articles/4-1521/v1).
+* For similar reasons, quantification will not be performed if using `--aligner hisat2` due to the lack of an appropriate option to calculate accurate expression estimates from HISAT2 derived genomic alignments. However, you can use this route if you have a preference for the alignment, QC and other types of downstream analysis compatible with the output of HISAT2.
 
 ### Enhancements & fixes
 
-* Updated pipeline template to nf-core/tools `1.12`
+* Updated pipeline template to nf-core/tools `1.12.1`
+* [[#498](https://github.com/nf-core/rnaseq/issues/498)] - Significantly different versions of STAR in star_rsem (2.7.6a) and star (2.6.1d)
+* [[#499](https://github.com/nf-core/rnaseq/issues/499)] - Use of salmon counts for DESeq2
 * [[#500](https://github.com/nf-core/rnaseq/issues/500), [#509](https://github.com/nf-core/rnaseq/issues/509)] - Error with AWS batch params
 * [[#511](https://github.com/nf-core/rnaseq/issues/511)] - rsem/star index fails with large genome
+* [[#515](https://github.com/nf-core/rnaseq/issues/515)] - Add decoy-aware indexing for salmon
 * [[#516](https://github.com/nf-core/rnaseq/issues/516)] - Unexpected error [InvocationTargetException]
 
+### Parameters
+
+| Old parameter                | New parameter               |
+|------------------------------|-----------------------------|
+| `--fc_extra_attributes`      | `--gtf_extra_attributes`    |
+| `--fc_group_features`        | `--gtf_group_features`      |
+| `--fc_count_type`            | `--gtf_count_type`          |
+| `--fc_group_features_type`   | `--gtf_group_features_type` |
+| `--skip_featurecounts`       | `-`                         |
+
+> **NB:** Parameter has been __updated__ if both old and new parameter information is present.  
+> **NB:** Parameter has been __added__ if just the new parameter information is present.  
+> **NB:** Parameter has been __removed__ if parameter information isn't present.  
+
 ## [[2.0](https://github.com/nf-core/rnaseq/releases/tag/2.0)] - 2020-11-12
 
 ### Major enhancements
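
Reviewer note: the parameter renames in the CHANGELOG table above could be applied to an existing command line roughly as follows. This is only a sketch, not part of the pipeline: `migrate_args` is a hypothetical helper, and it assumes `--skip_featurecounts` is a bare flag that takes no value.

```python
# Renames taken from the "Parameters" table in the CHANGELOG diff.
# None marks a parameter that was removed without a replacement.
RENAMED = {
    "--fc_extra_attributes": "--gtf_extra_attributes",
    "--fc_group_features": "--gtf_group_features",
    "--fc_count_type": "--gtf_count_type",
    "--fc_group_features_type": "--gtf_group_features_type",
    "--skip_featurecounts": None,  # assumption: bare flag, no value follows
}


def migrate_args(args):
    """Return a copy of args with old parameter names replaced (hypothetical helper)."""
    migrated = []
    for arg in args:
        if migrated and migrated[-1] == "--aligner" and arg == "star":
            # '--aligner star' was renamed to '--aligner star_salmon'.
            migrated.append("star_salmon")
        elif arg in RENAMED:
            # Renamed parameters are swapped; removed ones are dropped.
            if RENAMED[arg] is not None:
                migrated.append(RENAMED[arg])
        else:
            migrated.append(arg)
    return migrated


old = ["--aligner", "star", "--fc_count_type", "exon", "--skip_featurecounts"]
print(migrate_args(old))
```

Values that follow a renamed parameter (such as `exon` here) pass through unchanged, because only the parameter names are rewritten.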

diff --git a/README.md b/README.md
@@ -31,9 +31,9 @@ On release, automated continuous integration tests run the pipeline on a [full-s
 5. Adapter and quality trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
 6. Removal of ribosomal RNA ([`SortMeRNA`](https://github.com/biocore/sortmerna))
 7. Choice of multiple alignment and quantification routes:
-    1. [`STAR`](https://github.com/alexdobin/STAR) -> [`featureCounts`](http://bioinf.wehi.edu.au/featureCounts/)
+    1. [`STAR`](https://github.com/alexdobin/STAR) -> [`Salmon`](https://combine-lab.github.io/salmon/)
     2. [`STAR`](https://github.com/alexdobin/STAR) -> [`RSEM`](https://github.com/deweylab/RSEM)
-    3. [`HiSAT2`](https://ccb.jhu.edu/software/hisat2/index.shtml) -> [`featureCounts`](http://bioinf.wehi.edu.au/featureCounts/)
+    3. [`HiSAT2`](https://ccb.jhu.edu/software/hisat2/index.shtml) -> **NO QUANTIFICATION**
 8. Sort and index alignments ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
 9. UMI-based deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
 10. Duplicate read marking ([`picard MarkDuplicates`](https://broadinstitute.github.io/picard/))
@@ -48,6 +48,9 @@ On release, automated continuous integration tests run the pipeline on a [full-s
 14. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); *optional*)
 15. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
+> **NB:** Quantification isn't performed if using `--aligner hisat2` due to the lack of an appropriate option to calculate accurate expression estimates from HISAT2 derived genomic alignments. However, you can use this route if you have a preference for the alignment, QC and other types of downstream analysis compatible with the output of HISAT2.  
+> **NB:** The `--aligner star_rsem` option will require STAR indices built from version 2.7.6a or later. However, in order to support legacy usage of genomes hosted on AWS iGenomes the `--aligner star_salmon` option requires indices built with STAR 2.6.1d or earlier. Please refer to this [issue](https://github.com/nf-core/rnaseq/issues/498) for further details.
+
 ## Quick Start
 
 1. Install [`nextflow`](https://nf-co.re/usage/installation)

diff --git a/assets/multiqc_config.yaml b/assets/multiqc_config.yaml
@@ -18,16 +18,15 @@ run_modules:
     - preseq
     - rseqc
     - qualimap
-    - featureCounts
 
 # Order of modules
 top_modules:
     - 'fail_mapped_samples'
     - 'fail_strand_check'
-    - 'featurecounts_deseq2_pca'
-    - 'featurecounts_deseq2_clustering'
-    - 'rsem_deseq2_pca'
-    - 'rsem_deseq2_clustering'
+    - 'star_rsem_deseq2_pca'
+    - 'star_rsem_deseq2_clustering'
+    - 'star_salmon_deseq2_pca'
+    - 'star_salmon_deseq2_clustering'
     - 'salmon_deseq2_pca'
     - 'salmon_deseq2_clustering'
     - 'biotype_counts'
@@ -91,9 +90,6 @@ sp:
     preseq:
         fn: '*.ccurve.txt'
 
-    featurecounts:
-        fn: '*.summary'
-
     samtools/stats:
         fn: '*.stats'
     samtools/flagstat:

diff --git a/bin/fasta2gtf.py b/bin/fasta2gtf.py
@@ -39,20 +39,20 @@ def fasta2gtf(fasta, output):
     # GTF output lines
     lines = []
     attributes = \
-        'gene_id "{name_sanitized}"; gene_name "{name_sanitized}";transcript_id "{name_sanitized}"; gene_biotype "{name_sanitized}"; gene_type "{name_sanitized}"\n'
+        'exon_id "{name}.1"; exon_number "1"; gene_biotype "transgene"; gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
     line_template = \
-        "{name_sanitized}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
+        "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
 
     for ff in fiter:
         name, seq = ff
         # Use first ID as separated by spaces as the "sequence name"
         # (equivalent to "chromosome" in other cases)
         seqname = name.split()[0]
         # Remove all spaces
-        name_sanitized = seqname.replace(' ', '_')
+        name = seqname.replace(' ', '_')
         length = len(seq)
         line = line_template.format(
-            name_sanitized=name_sanitized, length=length)
+            name=name, length=length)
         lines.append(line)
 
     with open(output, 'w') as f:
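
Reviewer note: to see what the new templates in `bin/fasta2gtf.py` produce, the following minimal sketch formats one line using the strings from the `+` lines above. The function name `gtf_line` and the `GFP` record are illustrative assumptions and are not taken from the script.

```python
# Attribute and line templates copied from the "+" lines of the diff.
ATTRIBUTES = (
    'exon_id "{name}.1"; exon_number "1"; gene_biotype "transgene"; '
    'gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; '
    'transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
)
LINE_TEMPLATE = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + ATTRIBUTES


def gtf_line(header, seq):
    """Build one GTF exon line for a FASTA record (hypothetical helper)."""
    # As in the script: the first whitespace-separated token is the sequence name.
    name = header.split()[0].replace(" ", "_")
    return LINE_TEMPLATE.format(name=name, length=len(seq))


# Hypothetical record, for illustration only.
line = gtf_line("GFP green fluorescent protein", "ATGGTGAGCAAGGGC")
print(line, end="")
```

With these templates the sequence name stays in the first column, while the gene and transcript identifiers gain a `_gene` suffix and `gene_biotype` is fixed to `transgene`.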

diff --git a/conf/modules.config b/conf/modules.config
@@ -64,8 +64,20 @@ params {
         }
         'star_align' {
             args          = "--quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand zcat --runRNGseed 0"
+            publish_dir   = "${params.aligner}"
             publish_files = ['out':'log', 'tab':'log']
         }
+        'star_salmon_quant' {
+            args          = "--seqBias --useVBOpt --gcBias"
+            publish_dir   = "${params.aligner}"
+        }
+        'star_salmon_tximport' {
+            publish_dir   = "${params.aligner}"
+            publish_by_id = true
+        }
+        'star_salmon_merge_counts' {
+            publish_dir   = "${params.aligner}"
+        }
         'hisat2_build' {
             publish_dir   = "genome/index"
         }
@@ -97,6 +109,9 @@ params {
         'salmon_quant' {
             args          = "--validateMappings --seqBias --useVBOpt --gcBias"
         }
+        'salmon_tximport' {
+            publish_by_id = true
+        }
         'salmon_merge_counts' {
             publish_dir   = "${params.pseudo_aligner}"
         }
@@ -129,15 +144,8 @@ params {
             args          = "-B -C"
             publish_dir   = "${params.aligner}/featurecounts"
         }
-        'featurecounts_merge_counts' {
-            publish_dir   = "${params.aligner}"
-        }
-        'subread_featurecounts_biotype' {
-            args          = "-B -C"
-            publish_dir   = "${params.aligner}/featurecounts/biotype"
-        }
         'multiqc_custom_biotype' {
-            publish_dir   = "${params.aligner}/featurecounts/biotype"
+            publish_dir   = "${params.aligner}/featurecounts"
         }
         'bedtools_genomecov' {
             publish_files = false

diff --git a/docs/images/mqc_featurecounts_assignment.png b/docs/images/mqc_featurecounts_assignment.png