This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The pipeline is built using Nextflow. See main README.md
for a condensed overview of the steps in the pipeline, and the bioinformatics tools used at each step.
See Illumina website for more information regarding the MNase-seq protocol, and for an extensive list of publications.
The directories listed below will be created in the output directory after the pipeline has finished. All paths are relative to the top-level results directory.
The initial QC and alignments are performed at the library-level e.g. if the same library has been sequenced more than once to increase sequencing depth. This has the advantage of being able to assess each library individually, and the ability to process multiple libraries from the same sample in parallel.
-
Raw read QC
Documentation:
FastQCDescription:
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%A/C/G/T). You get information about adapter contamination and other overrepresented sequences.Output directories:
fastqc/
FastQC*.html
files for read 1 (and read2 if paired-end) before adapter trimming.fastqc/zips/
FastQC*.zip
files for read 1 (and read2 if paired-end) before adapter trimming.
-
Adapter trimming
Documentation:
Trim Galore!Description:
Trim Galore! is a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files. By default, Trim Galore! will automatically detect and trim the appropriate adapter sequence. Seeusage.md
for more details about the trimming options.Output directories:
trim_galore/
If--save_trimmed
is specified FastQ files after adapter trimming will be placed in this directory.trim_galore/logs/
*.log
files generated by Trim Galore!.trim_galore/fastqc/
FastQC*.html
files for read 1 (and read2 if paired-end) after adapter trimming.trim_galore/fastqc/zips/
FastQC*.zip
files for read 1 (and read2 if paired-end) after adapter trimming.
-
Alignment
Description:
Adapter-trimmed reads are mapped to the reference assembly using BWA. A genome index is required to run BWA so if this is not provided explicitly using the--bwa_index
parameter then it will be created automatically from the genome fasta input. The index creation process can take a while for larger genomes so it is possible to use the--save_reference
parameter to save the indices for future pipeline runs, reducing processing times.File names in the resulting directory (i.e.
bwa/library/
) will have the '.Lb.
' (Library) suffix.Output directories:
bwa/library/
The files resulting from the alignment of individual libraries are not saved by default so this directory will not be present in your results. You can override this behaviour with the use of the--save_align_intermeds
flag in which case it will contain the coordinate sorted alignment files in*.bam
format.bwa/library/samtools_stats/
SAMtools*.flagstat
,*.idxstats
and*.stats
files generated from the alignment files.
The library-level alignments associated with the same sample are merged and subsequently used for the downstream analyses.
-
Alignment merging, duplicate marking, filtering and QC
Documentation:
picard, SAMtools, BEDTools, BAMTools, Pysam, PreseqDescription:
Picard MergeSamFiles and MarkDuplicates are used in combination to merge the alignments, and for the marking of duplicates, respectively. If you only have one library for any given replicate then the merging step isnt carried out because the library-level and merged library-level BAM files will be exactly the same.Read duplicate marking is carried out using the Picard MarkDuplicates command. Duplicate reads are generally removed from the aligned reads to mitigate for fragments in the library that may have been sequenced more than once due to PCR biases. There is an option to keep duplicate reads with the
--keep_dups
parameter but its generally recommended to remove them to avoid the wrong interpretation of the results. A similar option has been provided to keep reads that are multi-mapped ---keep_multi_map
. Other steps have been incorporated into the pipeline to filter the resulting alignments - seemain README.md
for a more comprehensive listing, and the tools used at each step.A selection of alignment-based QC metrics generated by Picard CollectMultipleMetrics and MarkDuplicates will be included in the MultiQC report.
The Preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples. The dashed line shows a perfectly complex library where total reads = unique reads.
Note that these are predictive numbers only, not absolute. The MultiQC plot can sometimes give extreme sequencing depth on the X axis - click and drag from the left side of the plot to zoom in on more realistic numbers.
File names in the resulting directory (i.e.
bwa/mergedLibrary/
) will have the '.mLb.
' suffix to denote merged Libraries.Output directories:
bwa/mergedLibrary/
Merged library-level, coordinate sorted*.bam
files after the marking of duplicates, and filtering based on various criteria. The file suffix for the final filtered files will be*.mLb.clN.*
. If you specify the--save_align_intermeds
parameter then two additional sets of files will be present. These represent the unfiltered alignments with duplicates marked (*.mLb.mkD.*
), and in the case of paired-end datasets the filtered alignments before the removal of orphan read pairs (*.mLb.flT.*
).bwa/mergedLibrary/samtools_stats/
SAMtools*.flagstat
,*.idxstats
and*.stats
files generated from the alignment files.bwa/mergedLibrary/picard_metrics/
Alignment QC files from picard CollectMultipleMetrics and the metrics file from MarkDuplicates:*_metrics
and*.metrics.txt
, respectively. These are generated both before (*.mLb.mkD.*
) and after (*.mLb.clN.*
) read filtering.bwa/mergedLibrary/picard_metrics/pdf/
Alignment QC plot files in*.pdf
format from picard CollectMultipleMetrics.bwa/mergedLibrary/preseq/
Preseq expected future yield file (*.ccurve.txt
).
-
Normalised bigWig files
Documentation:
BEDTools, bedGraphToBigWigDescription:
The bigWig format is in an indexed binary format useful for displaying dense, continuous data in Genome Browsers such as the UCSC and IGV. This mitigates the need to load the much larger BAM files for data visualisation purposes which will be slower and result in memory issues. The coverage values represented in the bigWig file can also be normalised in order to be able to compare the coverage across multiple samples - this is not possible with BAM files. The bigWig format is also supported by various bioinformatics software for downstream processing such as meta-profile plotting.Output directories:
bwa/mergedLibrary/bigwig/
Normalised*.bigWig
files scaled to 1 million mapped reads.bwa/mergedLibrary/bigwig/scale/
Per-sample scale factors used to normalise*.bigWig
files.
-
Coverage QC
Documentation:
deepToolsDescription:
The output of deepTools plotFingerprint is a useful QC in order to see the relative enrichment of the samples in the experiment on a genome-wide basis (see plotFingerprint docs).The results from deepTools plotProfile gives you a quick visualisation for the genome-wide enrichment of your samples at the TSS, and across the gene body. During the downstream analysis, you may want to refine the features/genes used to generate these plots in order to see a more specific condition-related effect.
Output directories:
bwa/mergedLibrary/deepTools/plotFingerprint/
- Output files:
*.plotFingerprint.pdf
,*.plotFingerprint.qcmetrics.txt
,*.plotFingerprint.raw.txt
- Output files:
bwa/mergedLibrary/deepTools/plotProfile/
- Output files:
*.computeMatrix.mat.gz
,*.computeMatrix.vals.mat.gz
,*.plotProfile.pdf
,*.plotProfile.tab
.
- Output files:
-
Nucleosome profiling
Documentation:
DANPOS2Description:
The DANPOS2dpos
tool is used by the pipeline in order to analyse the fuzziness and occupancy at each nucleosome. The tool generates a smoothedbigWig
file normalised to 1 million reads that can also be used downstream by tools such asdeepTools
to plot average occupancy profiles across regions of interest.Output directories:
bwa/mergedLibrary/danpos/
- Normalised and smoothed
*.bigWig
files scaled to 1 million reads. .xls
file containing nucleosome position information.- BED files containing nucleosome positions (
*.bed
) and summit information (*.summit.bed
).
- Normalised and smoothed
The alignments associated with all of the replicates from the same experimental condition can also be merged. This can be useful to increase the coverage for nucleosome positioning analysis. The analysis steps and directory structure for bwa/mergedLibrary/
and bwa/mergedReplicate/
are almost identical.
File names in the resulting directory (i.e. bwa/mergedReplicate/
) will have the '.mRp.
' suffix to denote merged Replicates.
You can skip this portion of the analysis by specifying the --skip_merge_replicates
parameter.
-
Present QC for the raw read, alignment and genome-wide coverage
Documentation:
MultiQCDescription:
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available within the report data directory.Results generated by MultiQC collate pipeline QC from FastQC, TrimGalore, samtools flagstat, samtools idxstats, samtools stats, picard CollectMultipleMetrics, picard MarkDuplicates, Preseq, deepTools plotProfile and deepTools plotFingerprint.
The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Output directories:
multiqc/
multiqc_report.html
- a standalone HTML file that can be viewed in your web browser.multiqc_data/
- directory containing parsed statistics from the different tools used in the pipeline.multiqc_plots/
- directory containing static images from the report in various formats.
-
Create IGV session file
Documentation:
IGVDescription:
An IGV session file will be created at the end of the pipeline containing the normalised bigWig tracks (both smoothed and unsmoothed), nucleosome positions and associated summits. This avoids having to load all of the data individually into IGV for visualisation.The genome fasta file required for the IGV session will be the same as the one that was provided to the pipeline. This will be copied into
reference_genome/
to overcome any loading issues. If you prefer to use another path or an in-built genome provided by IGV just change thegenome
entry in the second-line of the session file.The file paths in the IGV session file will only work if the results are kept in the same place on your storage. If the results are moved or for example, if you prefer to load the data over the web then just replace the file paths with others that are more appropriate.
Once installed, open IGV, go to
File > Open Session
and select theigv_session.xml
file for loading.NB: If you are not using an in-built genome provided by IGV you will need to load the annotation yourself e.g. in .gtf and/or .bed format.
Output directories:
igv/
igv_session.xml
file.igv_files.txt
file containing a listing of the files used to create the IGV session, and their allocated colours.
-
Reference genome files
Documentation:
BWA, BEDTools, SAMtoolsDescription: Reference genome-specific files can be useful to keep for the downstream processing of the results.
Output directories:
reference_genome/
A number of genome-specific files are generated by the pipeline in order to aid in the filtering of the data, and because they are required by standard tools such as BEDTools. These can be found in this directory along with the genome fasta file which is required by IGV. If using a genome from AWS iGenomes and if it exists aREADME.txt
file containing information about the annotation version will also be saved in this directory.reference_genome/BWAIndex/
If the--save_reference
parameter is provided then the alignment indices generated by the pipeline will be saved in this directory. This can be quite a time-consuming process so it permits their reuse for future runs of the pipeline or for other purposes.
-
Pipeline information
Documentation:
Nextflow!Description:
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to trouble-shoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.Output directories:
pipeline_info/
- Reports generated by the pipeline -
pipeline_report.html
,pipeline_report.txt
andsoftware_versions.csv
. - Reports generated by Nextflow -
execution_report.html
,execution_timeline.html
,execution_trace.txt
andpipeline_dag.svg
. - Reformatted design files used as input to the pipeline -
design_reads.csv
.
- Reports generated by the pipeline -
Documentation/
Documentation for interpretation of results in HTML format -results_description.html
.