Merge pull request #526 from drpatelh/salmon
Replace featureCounts with Salmon quant
drpatelh authored Dec 11, 2020
2 parents 863f68b + f3d7667 commit 69d024f
Showing 28 changed files with 575 additions and 546 deletions.
7 changes: 4 additions & 3 deletions .github/workflows/ci.yml
@@ -34,8 +34,8 @@ jobs:
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker
star:
name: Test STAR with workflow parameters
star_salmon:
name: Test STAR Salmon with workflow parameters
if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq') }}
runs-on: ubuntu-latest
env:
@@ -49,6 +49,7 @@ jobs:
- '--skip_trimming'
- '--gtf false'
- '--star_index false'
- '--transcript_fasta false'
- '--min_mapped_reads 90'
- '--with_umi'
- '--with_umi --skip_trimming'
@@ -63,7 +64,7 @@ jobs:
- name: Run pipeline with STAR and various parameters
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --aligner star ${{ matrix.parameters }}
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --aligner star_salmon ${{ matrix.parameters }}
star_rsem:
name: Test STAR RSEM with workflow parameters
24 changes: 22 additions & 2 deletions CHANGELOG.md
@@ -5,15 +5,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## v2.1dev - [date]

* Fix tximport ingestion data to use lengthScaled and Scaled parameters: [#499][https://github.com/nf-core/rnaseq/issues/499]
### Major enhancements

* The aligned BAM files generated by `--aligner star` will now be quantified using Salmon instead of featureCounts. As a result, the name of this option has now been changed to `--aligner star_salmon` and this will now be the default route through the pipeline. This decision was made primarily because of the limitations of featureCounts to appropriately quantify gene expression data. Please see [Zhao et al., 2015](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141910#pone-0141910-t001) and [Soneson et al., 2015](https://f1000research.com/articles/4-1521/v1).
* For similar reasons, quantification will not be performed if using `--aligner hisat2` due to the lack of an appropriate option to calculate accurate expression estimates from HISAT2 derived genomic alignments. However, you can use this route if you have a preference for the alignment, QC and other types of downstream analysis compatible with the output of HISAT2.

### Enhancements & fixes

* Updated pipeline template to nf-core/tools `1.12`
* Updated pipeline template to nf-core/tools `1.12.1`
* [[#498](https://github.com/nf-core/rnaseq/issues/498)] - Significantly different versions of STAR in star_rsem (2.7.6a) and star (2.6.1d)
* [[#499](https://github.com/nf-core/rnaseq/issues/499)] - Use of salmon counts for DESeq2
* [[#500](https://github.com/nf-core/rnaseq/issues/500), [#509](https://github.com/nf-core/rnaseq/issues/509)] - Error with AWS batch params
* [[#511](https://github.com/nf-core/rnaseq/issues/511)] - rsem/star index fails with large genome
* [[#515](https://github.com/nf-core/rnaseq/issues/515)] - Add decoy-aware indexing for salmon
* [[#516](https://github.com/nf-core/rnaseq/issues/516)] - Unexpected error [InvocationTargetException]

### Parameters

| Old parameter | New parameter |
|------------------------------|-----------------------------|
| `--fc_extra_attributes` | `--gtf_extra_attributes` |
| `--fc_group_features` | `--gtf_group_features` |
| `--fc_count_type` | `--gtf_count_type` |
| `--fc_group_features_type` | `--gtf_group_features_type` |
| `--skip_featurecounts` | `-` |

> **NB:** Parameter has been __updated__ if both old and new parameter information is present.
> **NB:** Parameter has been __added__ if just the new parameter information is present.
> **NB:** Parameter has been __removed__ if parameter information isn't present.
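
The renames in the table above, together with the `--aligner star` to `--aligner star_salmon` change, can be sketched as a small argument migration. This is an illustrative sketch only: `migrate_args`, `RENAMED` and `REMOVED` are hypothetical names that are not part of the pipeline, and the sketch assumes `--skip_featurecounts` was passed as a bare flag with no value.

```python
# Hypothetical helper, NOT part of nf-core/rnaseq: rewrites a list of
# command-line tokens according to the parameter table in this changelog.
RENAMED = {
    "--fc_extra_attributes": "--gtf_extra_attributes",
    "--fc_group_features": "--gtf_group_features",
    "--fc_count_type": "--gtf_count_type",
    "--fc_group_features_type": "--gtf_group_features_type",
}
# Assumption: the removed parameter appears as a bare flag without a value.
REMOVED = {"--skip_featurecounts"}


def migrate_args(args):
    """Return (new_args, dropped) for a list of command-line tokens."""
    new_args, dropped = [], []
    prev = None
    for token in args:
        if token in REMOVED:
            # Removed parameters are reported back instead of being passed on.
            dropped.append(token)
        elif prev == "--aligner" and token == "star":
            # The STAR route was renamed when Salmon replaced featureCounts.
            new_args.append("star_salmon")
        else:
            # Renamed parameters are swapped; everything else is unchanged.
            new_args.append(RENAMED.get(token, token))
        prev = token
    return new_args, dropped


old = ["--aligner", "star", "--fc_count_type", "exon", "--skip_featurecounts"]
new_args, dropped = migrate_args(old)
print(" ".join(new_args))
print("dropped:", dropped)
```

Reporting dropped flags separately, rather than silently discarding them, makes it easier to notice that a run which previously skipped featureCounts needs to be reviewed by hand.
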
## [[2.0](https://github.com/nf-core/rnaseq/releases/tag/2.0)] - 2020-11-12

### Major enhancements
7 changes: 5 additions & 2 deletions README.md
@@ -31,9 +31,9 @@ On release, automated continuous integration tests run the pipeline on a [full-s
5. Adapter and quality trimming ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
6. Removal of ribosomal RNA ([`SortMeRNA`](https://github.com/biocore/sortmerna))
7. Choice of multiple alignment and quantification routes:
1. [`STAR`](https://github.com/alexdobin/STAR) -> [`featureCounts`](http://bioinf.wehi.edu.au/featureCounts/)
1. [`STAR`](https://github.com/alexdobin/STAR) -> [`Salmon`](https://combine-lab.github.io/salmon/)
2. [`STAR`](https://github.com/alexdobin/STAR) -> [`RSEM`](https://github.com/deweylab/RSEM)
3. [`HiSAT2`](https://ccb.jhu.edu/software/hisat2/index.shtml) -> [`featureCounts`](http://bioinf.wehi.edu.au/featureCounts/)
3. [`HiSAT2`](https://ccb.jhu.edu/software/hisat2/index.shtml) -> **NO QUANTIFICATION**
8. Sort and index alignments ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
9. UMI-based deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
10. Duplicate read marking ([`picard MarkDuplicates`](https://broadinstitute.github.io/picard/))
@@ -48,6 +48,9 @@ On release, automated continuous integration tests run the pipeline on a [full-s
14. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); *optional*)
15. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

> **NB:** Quantification isn't performed if using `--aligner hisat2` due to the lack of an appropriate option to calculate accurate expression estimates from HISAT2 derived genomic alignments. However, you can use this route if you have a preference for the alignment, QC and other types of downstream analysis compatible with the output of HISAT2.
> **NB:** The `--aligner star_rsem` option will require STAR indices built from version 2.7.6a or later. However, in order to support legacy usage of genomes hosted on AWS iGenomes the `--aligner star_salmon` option requires indices built with STAR 2.6.1d or earlier. Please refer to this [issue](https://github.com/nf-core/rnaseq/issues/498) for further details.
## Quick Start

1. Install [`nextflow`](https://nf-co.re/usage/installation)
12 changes: 4 additions & 8 deletions assets/multiqc_config.yaml
@@ -18,16 +18,15 @@ run_modules:
- preseq
- rseqc
- qualimap
- featureCounts

# Order of modules
top_modules:
- 'fail_mapped_samples'
- 'fail_strand_check'
- 'featurecounts_deseq2_pca'
- 'featurecounts_deseq2_clustering'
- 'rsem_deseq2_pca'
- 'rsem_deseq2_clustering'
- 'star_rsem_deseq2_pca'
- 'star_rsem_deseq2_clustering'
- 'star_salmon_deseq2_pca'
- 'star_salmon_deseq2_clustering'
- 'salmon_deseq2_pca'
- 'salmon_deseq2_clustering'
- 'biotype_counts'
@@ -91,9 +90,6 @@ sp:
preseq:
fn: '*.ccurve.txt'

featurecounts:
fn: '*.summary'

samtools/stats:
fn: '*.stats'
samtools/flagstat:
8 changes: 4 additions & 4 deletions bin/fasta2gtf.py
@@ -39,20 +39,20 @@ def fasta2gtf(fasta, output):
# GTF output lines
lines = []
attributes = \
'gene_id "{name_sanitized}"; gene_name "{name_sanitized}";transcript_id "{name_sanitized}"; gene_biotype "{name_sanitized}"; gene_type "{name_sanitized}"\n'
'exon_id "{name}.1"; exon_number "1"; gene_biotype "transgene"; gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
line_template = \
"{name_sanitized}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes
"{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + attributes

for ff in fiter:
name, seq = ff
# Use first ID as separated by spaces as the "sequence name"
# (equivalent to "chromosome" in other cases)
seqname = name.split()[0]
# Remove all spaces
name_sanitized = seqname.replace(' ', '_')
name = seqname.replace(' ', '_')
length = len(seq)
line = line_template.format(
name_sanitized=name_sanitized, length=length)
name=name, length=length)
lines.append(line)

with open(output, 'w') as f:
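
To see what the revised template in `bin/fasta2gtf.py` produces, the following standalone sketch formats one GTF line using the attribute string and line template from the added lines of this hunk. The function name `gtf_line` and the example FASTA header and sequence are made up for illustration; the real script reads records from a FASTA file and writes the lines to an output file.

```python
# Attribute string and line template copied from the added lines in this hunk.
ATTRIBUTES = (
    'exon_id "{name}.1"; exon_number "1"; gene_biotype "transgene"; '
    'gene_id "{name}_gene"; gene_name "{name}_gene"; gene_source "custom"; '
    'transcript_id "{name}_gene"; transcript_name "{name}_gene";\n'
)
LINE_TEMPLATE = "{name}\ttransgene\texon\t1\t{length}\t.\t+\t.\t" + ATTRIBUTES


def gtf_line(header, seq):
    """Format a single-exon GTF record spanning the whole sequence.

    Hypothetical wrapper for illustration; it mirrors the loop body of
    fasta2gtf() but is not a function in the pipeline.
    """
    # Use the first whitespace-separated token of the header as the name.
    name = header.split()[0].replace(" ", "_")
    return LINE_TEMPLATE.format(name=name, length=len(seq))


# Made-up record: a short sequence with a descriptive FASTA header.
line = gtf_line("GFP example transgene", "ATGGTGAGCAAGGGC")
fields = line.rstrip("\n").split("\t")
print(fields[0], fields[3], fields[4])
print(fields[8])
```

With this template the sequence name in the first column stays as the bare FASTA identifier, while `gene_id`, `gene_name`, `transcript_id` and `transcript_name` all carry a `_gene` suffix and `gene_biotype` is fixed to `transgene`.
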
24 changes: 16 additions & 8 deletions conf/modules.config
@@ -64,8 +64,20 @@ params {
}
'star_align' {
args = "--quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand zcat --runRNGseed 0"
publish_dir = "${params.aligner}"
publish_files = ['out':'log', 'tab':'log']
}
'star_salmon_quant' {
args = "--seqBias --useVBOpt --gcBias"
publish_dir = "${params.aligner}"
}
'star_salmon_tximport' {
publish_dir = "${params.aligner}"
publish_by_id = true
}
'star_salmon_merge_counts' {
publish_dir = "${params.aligner}"
}
'hisat2_build' {
publish_dir = "genome/index"
}
@@ -97,6 +109,9 @@ params {
'salmon_quant' {
args = "--validateMappings --seqBias --useVBOpt --gcBias"
}
'salmon_tximport' {
publish_by_id = true
}
'salmon_merge_counts' {
publish_dir = "${params.pseudo_aligner}"
}
@@ -129,15 +144,8 @@ params {
args = "-B -C"
publish_dir = "${params.aligner}/featurecounts"
}
'featurecounts_merge_counts' {
publish_dir = "${params.aligner}"
}
'subread_featurecounts_biotype' {
args = "-B -C"
publish_dir = "${params.aligner}/featurecounts/biotype"
}
'multiqc_custom_biotype' {
publish_dir = "${params.aligner}/featurecounts/biotype"
publish_dir = "${params.aligner}/featurecounts"
}
'bedtools_genomecov' {
publish_files = false
Binary file removed docs/images/mqc_featurecounts_assignment.png
Binary file not shown.