nf-core · drpatelh · Jun 18, 2019
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -32,7 +32,7 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t
 There are typically two types of tests that run:
 
 ### Lint Tests
-The nf-core has a [set of guidelines](http://nf-co.re/guidelines) which all pipelines must adhere to.
+The nf-core has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
 To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.
 
 If any failures or warnings are encountered, please follow the listed URL for more documentation.

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,11 +4,17 @@
 
 ### Pipeline updates
 
+* Removed `genebody_coverage` process [#195](https://github.com/nf-core/rnaseq/issues/195)
+* Implemented Pearson's correlation instead of Euclidean distance [#146](https://github.com/nf-core/rnaseq/issues/146)
+* Add `--stringTieIgnoreGTF` parameter [#206](https://github.com/nf-core/rnaseq/issues/206)
+* Fixed broken link to guidelines [#203](https://github.com/nf-core/rnaseq/issues/203)
+* Removed unnecessary `stringtie` channels for `MultiQC`
 * Added tximport to merge salmon output
 * Added Salmon as an supplementary method to STAR and HiSAT2
 * Added `--psuedo_aligner`, `--transcript_fasta` and `--salmon_index` parameters
 * Add `Citation` and `Quick Start` section to `README.md`
-* Integrate changes in `nf-core/tools v1.6` template
+* Fixed missing `multiqc_plots` folder in dev branch output [#200](https://github.com/nf-core/rnaseq/issues/200)
+* Integrate changes in `nf-core/tools v1.6` template, which resolved [#90](https://github.com/nf-core/rnaseq/issues/90)
 * Add tximport and summarizedexperiment dependency [#171](https://github.com/nf-core/rnaseq/issues/171)
 * Change all boolean parameters from snake_case to camelCase and vice versa for value parameters
 * Appointed changes because of missing output of the multiqc_plots folder [#200](https://github.com/nf-core/rnaseq/issues/200)

diff --git a/assets/heatmap_header.txt b/assets/heatmap_header.txt
@@ -2,10 +2,10 @@
 # section_name: 'edgeR: Sample Similarity'
 # description: "is generated from normalised gene counts through
 #        <a href='https://bioconductor.org/packages/release/bioc/html/edgeR.html' target='_blank'>edgeR</a>.
-#        Euclidean distances between log<sub>2</sub> normalised CPM values are then calculated and clustered."
+#        Pearson's correlations between log<sub>2</sub> normalised CPM values are then calculated and clustered."
 # plot_type: 'heatmap'
 # anchor: 'ngi_rnaseq-sample_similarity'
 # pconfig:
-#     title: 'edgeR: Euclidean distances'
+#     title: 'edgeR: Pearson correlation'
 #     xlab: True
 #     reverseColors: True
diff --git a/bin/edgeR_heatmap_MDS.r b/bin/edgeR_heatmap_MDS.r
@@ -67,25 +67,24 @@ write.csv(MDSxy, 'edgeR_MDS_Aplot_coordinates_mqc.csv', quote=FALSE, append=TRUE
 # Get the log counts per million values
 logcpm <- cpm(dataNorm, prior.count=2, log=TRUE)
 
-# Calculate the euclidean distances between samples
-dists = dist(t(logcpm))
-
+# Calculate Pearson's correlation between samples
 # Plot a heatmap of correlations
-pdf('log2CPM_sample_distances_heatmap.pdf')
-hmap <- heatmap.2(as.matrix(dists),
-  main="Sample Correlations", key.title="Distance", trace="none",
+pdf('log2CPM_sample_correlation_heatmap.pdf')
+hmap <- heatmap.2(as.matrix(cor(logcpm, method="pearson")),
+  key.title="Pearson's Correlation", trace="none",
   dendrogram="row", margin=c(9, 9)
 )
 dev.off()
 
+# Write correlation values to file
+write.csv(hmap$carpet, 'log2CPM_sample_correlation_mqc.csv', quote=FALSE)
+
 # Plot the heatmap dendrogram
 pdf('log2CPM_sample_distances_dendrogram.pdf')
-plot(hmap$rowDendrogram, main="Sample Dendrogram")
+hmap <- heatmap.2(as.matrix(dist(t(logcpm))))
+plot(hmap$rowDendrogram, main="Sample Euclidean Distance Clustering")
 dev.off()
 
-# Write clustered distance values to file
-write.csv(hmap$carpet, 'log2CPM_sample_distances_mqc.csv', quote=FALSE, append=TRUE)
-
 file.create("corr.done")
 
 # Printing sessioninfo to standard out
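
The switch from Euclidean distance to Pearson's correlation for the heatmap changes what "similar" means. The sketch below uses Python only for illustration, since the pipeline script itself is R. The helper names and the toy values are hypothetical and not taken from the pipeline. It shows that two samples whose log-scale profiles differ by a constant offset are perfectly correlated yet still a nonzero distance apart.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy log-scale expression values (made up for illustration):
# sample_b is sample_a shifted by a constant offset of 1.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.0, 3.0, 4.0, 5.0]

print(pearson(sample_a, sample_b))    # same profile shape, so correlation is 1
print(euclidean(sample_a, sample_b))  # offset still contributes to distance
```

This is why the two views can complement each other: the correlation heatmap compares profile shape, while the dendrogram built from `dist(t(logcpm))` still reflects absolute differences.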

diff --git a/bin/tximport.r b/bin/tximport.r
@@ -55,14 +55,14 @@ if (!is.null(tx2gene)){
 
 if(exists("gse")){
   saveRDS(gse, file = "gse.rds")
-  write.csv(assays(se)[["abundance"]], "merged_salmon_gene_tpm.csv")
-  write.csv(assays(se)[["counts"]], "merged_salmon_gene_reads.csv")
+  write.csv(assays(gse)[["abundance"]], "salmon_merged_gene_tpm.csv")
+  write.csv(assays(gse)[["counts"]], "salmon_merged_gene_counts.csv")
 }
 
 saveRDS(se, file = "se.rds")
-write.csv(assays(se)[["abundance"]], "merged_salmon_tx_tpm.csv")
-write.csv(assays(se)[["counts"]], "merged_salmon_tx_reads.csv")
+write.csv(assays(se)[["abundance"]], "salmon_merged_transcript_tpm.csv")
+write.csv(assays(se)[["counts"]], "salmon_merged_transcript_counts.csv")
 
 # Print sessioninfo to standard out
 citation("tximeta")
-sessionInfo()
+sessionInfo()
diff --git a/conf/base.config b/conf/base.config
@@ -51,7 +51,11 @@ process {
     cpus = { check_max( 8, 'cpus' ) }
     memory = { check_max( 16.GB * task.attempt, 'memory' ) }
   }
-  withName: 'multiqc|get_software_versions' {
+  withName: 'get_software_versions' {
+    memory = { check_max( 2.GB * task.attempt, 'memory' ) }
+    cache = false
+  }
+  withName: 'multiqc' {
     memory = { check_max( 2.GB * task.attempt, 'memory' ) }
     cache = false
   }
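
The `memory` directives above scale the request with `task.attempt`. The sketch below shows that arithmetic in Python, assuming `check_max` simply caps a request at a configured ceiling. The pipeline's actual helper is defined elsewhere and is not shown in this diff. The function names and the `max_gb` default are hypothetical; only the 2 GB base comes from the config.

```python
def check_max_memory(requested_gb, max_gb):
    """Assumed behaviour: never request more than the configured ceiling."""
    return min(requested_gb, max_gb)


def memory_for_attempt(attempt, base_gb=2, max_gb=128):
    """Memory request (in GB) for a given retry attempt.

    base_gb=2 mirrors `2.GB * task.attempt` in the config above;
    max_gb=128 is an illustrative ceiling, not a pipeline default.
    """
    return check_max_memory(base_gb * attempt, max_gb)


# Requests for attempts 1, 2 and 3 with the illustrative ceiling.
for attempt in (1, 2, 3):
    print(attempt, memory_for_attempt(attempt))
```

Splitting the selector into two `withName` blocks leaves this behaviour unchanged for both processes, but allows either one to be tuned independently later.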

diff --git a/docs/images/heatmap.png b/docs/images/heatmap.png
diff --git a/docs/images/rseqc_gene_body_coverage_plot.png b/docs/images/rseqc_gene_body_coverage_plot.png
diff --git a/docs/output.md b/docs/output.md
@@ -16,7 +16,6 @@ and processes data using the following steps:
   * [RPKM saturation](#rpkm-saturation)
   * [Read duplication](#read-duplication)
   * [Inner distance](#inner-distance)
-  * [Gene body coverage](#gene-body-coverage)
   * [Read distribution](#read-distribution)
   * [Junction annotation](#junction-annotation)
 * [Qualimap](#qualimap) - RNA quality control metrics
@@ -207,24 +206,6 @@ This plot will not be generated for single-end data. Very short inner distances
 
 RSeQC documentation: [inner_distance.py](http://rseqc.sourceforge.net/#inner-distance-py)
 
-### Gene body coverage
-**NB:** In nfcore/rnaseq we subsample this to 1 Million reads. This speeds up this task significantly and has no to little effect on the results.
-
-**Output:**
-
-* `Sample_rseqc.geneBodyCoverage.curves.pdf`
-* `Sample_rseqc.geneBodyCoverage.r`
-* `Sample_rseqc.geneBodyCoverage.txt`
-
-This script calculates the reads coverage across gene bodies. This makes it easy to identify 3' or 5' skew in libraries. A skew towards increased 3' coverage can happen in degraded samples prepared with poly-A selection.
-
-A typical set of libraries with little or no bias will look as follows:
-
-![Gene body coverage](images/rseqc_gene_body_coverage_plot.png)
-
-RSeQC documentation: [gene\_body_coverage.py](http://rseqc.sourceforge.net/#genebody-coverage-py)
-
-
 ### Read distribution
 **Output: `Sample_read_distribution.txt`**
 
@@ -327,13 +308,13 @@ We also use featureCounts to count overlaps with different classes of features.
 
 **Output directory: `results/salmon`**
 
-* `merged_salmon_tx_tpm.csv`
+* `salmon_merged_transcript_tpm.csv`
   * TPM counts for the different transcripts.
-* `merged_salmon_gene_tpm.csv`
+* `salmon_merged_gene_tpm.csv`
   * TPM counts for the different genes.
-* `merged_salmon_tx_reads.csv`
+* `salmon_merged_transcript_counts.csv`
   * estimated counts for the different transcripts.
-* `merged_salmon_gene_reads.csv`
+* `salmon_merged_gene_counts.csv`
   * estimated counts for the different genes.
 * `tx2gene.csv`
   * CSV file with transcript and genes (`params.fc_group_features`) and extra name (`params.fc_extra_attributes`) in each column.
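
Downstream scripts that still reference the old Salmon output names will need updating. The sketch below is a hypothetical lookup helper in Python; the helper name is an assumption, and the filename pairs are copied from the renames in this diff.

```python
# Old output name -> new output name, as renamed in this change.
RENAMED_OUTPUTS = {
    "merged_salmon_tx_tpm.csv": "salmon_merged_transcript_tpm.csv",
    "merged_salmon_gene_tpm.csv": "salmon_merged_gene_tpm.csv",
    "merged_salmon_tx_reads.csv": "salmon_merged_transcript_counts.csv",
    "merged_salmon_gene_reads.csv": "salmon_merged_gene_counts.csv",
}


def current_name(filename):
    """Return the new name for a renamed output, or the name unchanged."""
    return RENAMED_OUTPUTS.get(filename, filename)


print(current_name("merged_salmon_tx_reads.csv"))
print(current_name("tx2gene.csv"))
```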
@@ -402,7 +383,7 @@ StringTie outputs FPKM metrics for genes and transcripts as well as the transcri
   * This `.gtf` file contains the transcripts that are fully covered by reads.
 
 ## Sample Correlation
-[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for R used for RNA-seq data analysis. The script included in the pipeline uses edgeR to normalise read counts and create a heatmap / dendrogram showing pairwise euclidean distance (sample similarity). It also creates a 2D MDS scatter plot showing sample grouping. These help to show sample similarity and can reveal batch effects and sample groupings.
+[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for R used for RNA-seq data analysis. The script included in the pipeline uses edgeR to normalise read counts and create a heatmap showing Pearson's correlation and a dendrogram showing pairwise Euclidean distances between the samples in the experiment. It also creates a 2D MDS scatter plot showing sample grouping. These help to show sample similarity and can reveal batch effects and sample groupings.
 
 **Heatmap:**
 
@@ -415,17 +396,17 @@ StringTie outputs FPKM metrics for genes and transcripts as well as the transcri
 **Output directory: `results/sample_correlation`**
 
 * `edgeR_MDS_plot.pdf`
-  * MDS scatter plot, showing sample similarity
-* `edgeR_MDS_distance_matrix.txt`
+  * MDS scatter plot showing sample similarity
+* `edgeR_MDS_distance_matrix.csv`
   * Distance matrix containing raw data from MDS analysis
-* `edgeR_MDS_plot_coordinates.txt`
+* `edgeR_MDS_Aplot_coordinates_mqc.csv`
   * Scatter plot coordinates from MDS plot, used for MultiQC report
 * `log2CPM_sample_distances_dendrogram.pdf`
-  * Dendrogram plot showing the euclidian distance between your samples
-* `log2CPM_sample_distances_heatmap.pdf`
-  * Heatmap plot showing the euclidian distance between your samples
-* `log2CPM_sample_distances.txt`
-  * Raw data used for heatmap and dendrogram plots.
+  * Dendrogram showing the Euclidean distance between your samples
+* `log2CPM_sample_correlation_heatmap.pdf`
+  * Heatmap showing the Pearson's correlation between your samples
+* `log2CPM_sample_correlation_mqc.csv`
+  * Raw data from the Pearson's correlation heatmap, used for MultiQC report
 
 ## MultiQC
 [MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory.

diff --git a/docs/usage.md b/docs/usage.md
@@ -60,8 +60,6 @@
 * [Stand-alone scripts](#stand-alone-scripts)
 <!-- TOC END -->
 
-
-
 ## Introduction
 Nextflow handles job submissions on SLURM or other environments, and supervises running the jobs. Thus the Nextflow process must run until the pipeline is finished. We recommend that you put the process running in the background through `screen` / `tmux` or similar tool. Alternatively you can run nextflow within a cluster job submitted your job scheduler.
 
@@ -315,7 +313,6 @@ The following options make this easy:
 * `--skipFastQC` -            Skip FastQC
 * `--skipRseQC` -             Skip RSeQC
 * `--skipQualimap` -          Skip Qualimap
-* `--skipGenebodyCoverage` -  Skip calculating the genebody coverage
 * `--skipPreseq` -            Skip Preseq
 * `--skipDupRadar` -          Skip dupRadar (and Picard MarkDuplicates)
 * `--skipEdgeR` -             Skip edgeR MDS plot and heatmap