Close outstanding issues and amend salmon merge #236

Merged 2 commits on Jun 18, 2019
2 changes: 1 addition & 1 deletion .github/CONTRIBUTING.md
@@ -32,7 +32,7 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t
There are typically two types of tests that run:

### Lint Tests
The nf-core has a [set of guidelines](http://nf-co.re/guidelines) which all pipelines must adhere to.
The nf-core has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -4,11 +4,17 @@

### Pipeline updates

* Removed `genebody_coverage` process [#195](https://github.com/nf-core/rnaseq/issues/195)
* Implemented Pearsons correlation instead of euclidean distance [#146](https://github.com/nf-core/rnaseq/issues/146)
* Add `--stringTieIgnoreGTF` parameter [#206](https://github.com/nf-core/rnaseq/issues/206)
* Resolved link to guidelines is broken [#203](https://github.com/nf-core/rnaseq/issues/203)
* Removed unnecessary `stringtie` channels for `MultiQC`
* Added tximport to merge salmon output
* Added Salmon as a supplementary method to STAR and HiSAT2
* Added `--psuedo_aligner`, `--transcript_fasta` and `--salmon_index` parameters
* Add `Citation` and `Quick Start` section to `README.md`
* Integrate changes in `nf-core/tools v1.6` template
* Closed missing multiqc_plots in dev branch output [#200](https://github.com/nf-core/rnaseq/issues/200)
* Integrate changes in `nf-core/tools v1.6` template which resolved [#90](https://github.com/nf-core/rnaseq/issues/90)
* Add tximport and summarizedexperiment dependency [#171](https://github.com/nf-core/rnaseq/issues/171)
* Change all boolean parameters from snake_case to camelCase and vice versa for value parameters
* Appointed changes because of missing output of the multiqc_plots folder [#200](https://github.com/nf-core/rnaseq/issues/200)
4 changes: 2 additions & 2 deletions assets/heatmap_header.txt
@@ -2,10 +2,10 @@
# section_name: 'edgeR: Sample Similarity'
# description: "is generated from normalised gene counts through
# <a href='https://bioconductor.org/packages/release/bioc/html/edgeR.html' target='_blank'>edgeR</a>.
# Euclidean distances between log<sub>2</sub> normalised CPM values are then calculated and clustered."
# Pearson's correlation between log<sub>2</sub> normalised CPM values are then calculated and clustered."
# plot_type: 'heatmap'
# anchor: 'ngi_rnaseq-sample_similarity'
# pconfig:
# title: 'edgeR: Euclidean distances'
# title: 'edgeR: Pearsons correlation'
# xlab: True
# reverseColors: True
19 changes: 9 additions & 10 deletions bin/edgeR_heatmap_MDS.r
@@ -67,25 +67,24 @@ write.csv(MDSxy, 'edgeR_MDS_Aplot_coordinates_mqc.csv', quote=FALSE, append=TRUE
# Get the log counts per million values
logcpm <- cpm(dataNorm, prior.count=2, log=TRUE)

# Calculate the euclidean distances between samples
dists = dist(t(logcpm))

# Calculate the Pearsons correlation between samples
# Plot a heatmap of correlations
pdf('log2CPM_sample_distances_heatmap.pdf')
hmap <- heatmap.2(as.matrix(dists),
main="Sample Correlations", key.title="Distance", trace="none",
pdf('log2CPM_sample_correlation_heatmap.pdf')
hmap <- heatmap.2(as.matrix(cor(logcpm, method="pearson")),
key.title="Pearsons Correlation", trace="none",
dendrogram="row", margin=c(9, 9)
)
dev.off()

# Write correlation values to file
write.csv(hmap$carpet, 'log2CPM_sample_correlation_mqc.csv', quote=FALSE, append=TRUE)

# Plot the heatmap dendrogram
pdf('log2CPM_sample_distances_dendrogram.pdf')
plot(hmap$rowDendrogram, main="Sample Dendrogram")
hmap <- heatmap.2(as.matrix(dist(t(logcpm))))
plot(hmap$rowDendrogram, main="Sample Euclidean Distance Clustering")
dev.off()

# Write clustered distance values to file
write.csv(hmap$carpet, 'log2CPM_sample_distances_mqc.csv', quote=FALSE, append=TRUE)

file.create("corr.done")

# Printing sessioninfo to standard out
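
Reviewer note: the switch from `dist(t(logcpm))` to `cor(logcpm, method="pearson")` changes what the heatmap measures. The following is a minimal sketch of that difference, written in Python rather than R, using made-up log2 CPM values and hypothetical helper names (`pearson`, `euclidean`, `pairwise`); it is not the pipeline's code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length lists of values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(columns, metric):
    """Apply `metric` to every pair of samples, returning a nested dict."""
    names = list(columns)
    return {a: {b: metric(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical log2 CPM values: one list per sample, one entry per gene.
logcpm = {
    "sample_A": [1.0, 2.0, 3.0, 4.0],
    "sample_B": [2.0, 4.0, 6.0, 8.0],  # same profile as A, scaled by 2
    "sample_C": [4.0, 3.0, 2.0, 1.0],  # reversed profile
}

corr = pairwise(logcpm, pearson)    # analogous to cor(logcpm, method="pearson")
dist = pairwise(logcpm, euclidean)  # analogous to dist(t(logcpm))
```

In this toy data `sample_A` and `sample_B` are perfectly correlated, yet `sample_A` is closer to the anti-correlated `sample_C` in Euclidean terms. This is why the correlation heatmap and the distance dendrogram can group samples differently.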
10 changes: 5 additions & 5 deletions bin/tximport.r
@@ -55,14 +55,14 @@ if (!is.null(tx2gene)){

if(exists("gse")){
saveRDS(gse, file = "gse.rds")
write.csv(assays(se)[["abundance"]], "merged_salmon_gene_tpm.csv")
write.csv(assays(se)[["counts"]], "merged_salmon_gene_reads.csv")
write.csv(assays(se)[["abundance"]], "salmon_merged_gene_tpm.csv")
write.csv(assays(se)[["counts"]], "salmon_merged_gene_counts.csv")
}

saveRDS(se, file = "se.rds")
write.csv(assays(se)[["abundance"]], "merged_salmon_tx_tpm.csv")
write.csv(assays(se)[["counts"]], "merged_salmon_tx_reads.csv")
write.csv(assays(se)[["abundance"]], "salmon_merged_transcript_tpm.csv")
write.csv(assays(se)[["counts"]], "salmon_merged_transcript_counts.csv")

# Print sessioninfo to standard out
citation("tximeta")
sessionInfo()
sessionInfo()
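
Reviewer note: the renamed Salmon outputs may break downstream scripts that still look for the old file names. As a rough sketch, the mapping below collects the four renames from this diff; the constant `RENAMED_SALMON_OUTPUTS` and the helper `new_output_name` are hypothetical names for illustration and are not part of the pipeline.

```python
# Old file name -> new file name, as written by bin/tximport.r in this diff.
RENAMED_SALMON_OUTPUTS = {
    "merged_salmon_gene_tpm.csv": "salmon_merged_gene_tpm.csv",
    "merged_salmon_gene_reads.csv": "salmon_merged_gene_counts.csv",
    "merged_salmon_tx_tpm.csv": "salmon_merged_transcript_tpm.csv",
    "merged_salmon_tx_reads.csv": "salmon_merged_transcript_counts.csv",
}


def new_output_name(old_name):
    """Return the post-merge file name, or the input unchanged if it was not renamed."""
    return RENAMED_SALMON_OUTPUTS.get(old_name, old_name)
```

A downstream script could pass each expected file name through `new_output_name` before opening it, so files that were not renamed (such as `tx2gene.csv`) are left alone.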
6 changes: 5 additions & 1 deletion conf/base.config
@@ -51,7 +51,11 @@ process {
cpus = { check_max( 8, 'cpus' ) }
memory = { check_max( 16.GB * task.attempt, 'memory' ) }
}
withName: 'multiqc|get_software_versions' {
withName: 'get_software_versions' {
memory = { check_max( 2.GB * task.attempt, 'memory' ) }
cache = false
}
withName: 'multiqc' {
memory = { check_max( 2.GB * task.attempt, 'memory' ) }
cache = false
}
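
Reviewer note: this hunk splits the combined `'multiqc|get_software_versions'` selector into two blocks with identical settings, so the resource request for each process stays `2.GB * task.attempt`, capped by `check_max`. The Python sketch below illustrates that scaling under the assumption that `check_max` returns the smaller of the requested value and a configured ceiling; the function names and the 128 GB ceiling are illustrative assumptions, not values taken from this diff.

```python
def check_max(requested_gb, max_gb):
    """Assumed behaviour: never hand out more than the configured ceiling."""
    return min(requested_gb, max_gb)


def memory_for_attempt(attempt, base_gb=2, max_gb=128):
    """Memory (in GB) requested on a given attempt, mirroring 2.GB * task.attempt."""
    return check_max(base_gb * attempt, max_gb)


# Requests for the first three attempts with the assumed 128 GB ceiling.
requests = [memory_for_attempt(n) for n in (1, 2, 3)]
```

With a lower ceiling, for example `memory_for_attempt(3, max_gb=4)`, the request stops growing once the cap is reached.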
Binary file modified docs/images/heatmap.png
Binary file removed docs/images/rseqc_gene_body_coverage_plot.png
Binary file not shown.
45 changes: 13 additions & 32 deletions docs/output.md
@@ -16,7 +16,6 @@ and processes data using the following steps:
* [RPKM saturation](#rpkm-saturation)
* [Read duplication](#read-duplication)
* [Inner distance](#inner-distance)
* [Gene body coverage](#gene-body-coverage)
* [Read distribution](#read-distribution)
* [Junction annotation](#junction-annotation)
* [Qualimap](#qualimap) - RNA quality control metrics
@@ -207,24 +206,6 @@ This plot will not be generated for single-end data. Very short inner distances

RSeQC documentation: [inner_distance.py](http://rseqc.sourceforge.net/#inner-distance-py)

### Gene body coverage
**NB:** In nfcore/rnaseq we subsample this to 1 Million reads. This speeds up this task significantly and has no to little effect on the results.

**Output:**

* `Sample_rseqc.geneBodyCoverage.curves.pdf`
* `Sample_rseqc.geneBodyCoverage.r`
* `Sample_rseqc.geneBodyCoverage.txt`

This script calculates the reads coverage across gene bodies. This makes it easy to identify 3' or 5' skew in libraries. A skew towards increased 3' coverage can happen in degraded samples prepared with poly-A selection.

A typical set of libraries with little or no bias will look as follows:

![Gene body coverage](images/rseqc_gene_body_coverage_plot.png)

RSeQC documentation: [gene\_body_coverage.py](http://rseqc.sourceforge.net/#genebody-coverage-py)


### Read distribution
**Output: `Sample_read_distribution.txt`**

@@ -327,13 +308,13 @@ We also use featureCounts to count overlaps with different classes of features.

**Output directory: `results/salmon`**

* `merged_salmon_tx_tpm.csv`
* `salmon_merged_transcript_tpm.csv`
* TPM counts for the different transcripts.
* `merged_salmon_gene_tpm.csv`
* `salmon_merged_gene_tpm.csv`
* TPM counts for the different genes.
* `merged_salmon_tx_reads.csv`
* `salmon_merged_transcript_counts.csv`
* estimated counts for the different transcripts.
* `merged_salmon_gene_reads.csv`
* `salmon_merged_gene_counts.csv`
* estimated counts for the different genes.
* `tx2gene.csv`
* CSV file with transcript and genes (`params.fc_group_features`) and extra name (`params.fc_extra_attributes`) in each column.
@@ -402,7 +383,7 @@ StringTie outputs FPKM metrics for genes and transcripts as well as the transcri
* This `.gtf` file contains the transcripts that are fully covered by reads.

## Sample Correlation
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for R used for RNA-seq data analysis. The script included in the pipeline uses edgeR to normalise read counts and create a heatmap / dendrogram showing pairwise euclidean distance (sample similarity). It also creates a 2D MDS scatter plot showing sample grouping. These help to show sample similarity and can reveal batch effects and sample groupings.
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for R used for RNA-seq data analysis. The script included in the pipeline uses edgeR to normalise read counts and create a heatmap showing Pearsons correlation and a dendrogram showing pairwise Euclidean distances between the samples in the experiment. It also creates a 2D MDS scatter plot showing sample grouping. These help to show sample similarity and can reveal batch effects and sample groupings.

**Heatmap:**

@@ -415,17 +396,17 @@ StringTie outputs FPKM metrics for genes and transcripts as well as the transcri
**Output directory: `results/sample_correlation`**

* `edgeR_MDS_plot.pdf`
* MDS scatter plot, showing sample similarity
* `edgeR_MDS_distance_matrix.txt`
* MDS scatter plot showing sample similarity
* `edgeR_MDS_distance_matrix.csv`
* Distance matrix containing raw data from MDS analysis
* `edgeR_MDS_plot_coordinates.txt`
* `edgeR_MDS_Aplot_coordinates_mqc.csv`
* Scatter plot coordinates from MDS plot, used for MultiQC report
* `log2CPM_sample_distances_dendrogram.pdf`
* Dendrogram plot showing the euclidian distance between your samples
* `log2CPM_sample_distances_heatmap.pdf`
* Heatmap plot showing the euclidian distance between your samples
* `log2CPM_sample_distances.txt`
* Raw data used for heatmap and dendrogram plots.
* Dendrogram showing the Euclidean distance between your samples
* `log2CPM_sample_correlation_heatmap.pdf`
* Heatmap showing the Pearsons correlation between your samples
* `log2CPM_sample_correlation_mqc.csv`
* Raw data from Pearsons correlation heatmap, used for MultiQC report

## MultiQC
[MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available within the report data directory.
3 changes: 0 additions & 3 deletions docs/usage.md
@@ -60,8 +60,6 @@
* [Stand-alone scripts](#stand-alone-scripts)
<!-- TOC END -->



## Introduction
Nextflow handles job submissions on SLURM or other environments, and supervises running the jobs. Thus the Nextflow process must run until the pipeline is finished. We recommend running the process in the background through `screen` / `tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.

@@ -315,7 +313,6 @@ The following options make this easy:
* `--skipFastQC` - Skip FastQC
* `--skipRseQC` - Skip RSeQC
* `--skipQualimap` - Skip Qualimap
* `--skipGenebodyCoverage` - Skip calculating the genebody coverage
* `--skipPreseq` - Skip Preseq
* `--skipDupRadar` - Skip dupRadar (and Picard MarkDuplicates)
* `--skipEdgeR` - Skip edgeR MDS plot and heatmap