See below for an example config file with an explanation of each option, followed by a description of common use cases.
# config/config.yaml
# A list of datasets
# Every dataset is defined by a name, a path to a sample sheet, trimmomatic, star and bamCoverage options.
# All these options are mandatory.
sequencing_libraries:
  - name: GSE13073
    sample_sheet: sample-sheet.csv
    trimmomatic: >-
      "ILLUMINACLIP:TruSeq3-PE.fa:1:0:15:2
      SLIDINGWINDOW:20:22
      MAXINFO:20:0.6
      LEADING:22
      TRAILING:20
      MINLEN:75"
    star: >-
      "--seedSearchStartLmax 30
      --outFilterMismatchNoverReadLmax 0.04
      --winAnchorMultimapNmax 40"
    bamCoverage: "--binSize 50 --normalizeUsing None"
  # - name: ...
  #   sample_sheet: ...
  #   trimmomatic: ...
  #   star: ...
  #   bamCoverage: ...
# Disable all functionalities related to TE analysis
disable_TE_analysis: false
# Disable tRNA analysis
disable_tRNA_analysis: false
globals:
  # path to reads folder
  # NB: ./GSE13073 is expected to exist
  reads_folder: .
  # path to results folder
  results_folder: results/
  # path to qc
  qc_folder: results/qc
  # path to log
  log_folder: results/log
  # path to references
  references_folder: results/references
  # temp folder
  tmp_folder: /tmp
  # path to analysis
  analysis_folder: results/analysis
# genome information
genome:
  # genome label
  label: mm10
  # annotation type
  # can be ensembl, mgi, gencode
  annotation_type: ensembl
  # URL or path to genome sequence
  fasta_url: <Genome fasta URL>
  # URL or path to genome annotation file
  gtf_url: <Genome annotation URL>
  # URL to gtRNAdb zip file
  gtrnadb_url: <GtRNADb bundle URL>
# Differential expression analysis parameters
deseq2:
  # working directory
  working_directory: ../../..
  # DESeq2 test name, can be Wald or LRT
  test: Wald
  # name of the column from sample sheet with experimental variable
  variable: genotype
  # base level from variable column
  reference_level: wt
The pipeline resolves reads paths from two pieces of information:
- `reads_folder` in the `globals` section
- The name of a library in the `sequencing_libraries` list of objects
In the example configuration above, `reads_folder` is `.` and `sequencing_libraries[0].name` is `GSE13073`. Together these resolve to `./GSE13073`. This folder must exist before the pipeline starts, because it is where the pipeline looks for input files.
Another example:
sequencing_libraries:
  - name: first-batch
    sample_sheet: sample-sheet-first-batch.csv
    # [...]
  - name: second-batch
    sample_sheet: sample-sheet-second-batch.csv
    # [...]
globals:
  reads_folder: reads
  # [...]
In this scenario, 3t-seq will look for the `reads` folder and, inside it, for two folders named `first-batch` and `second-batch`, that is `reads/first-batch` and `reads/second-batch`. If either of the two is not found, the pipeline will fail.
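The lookup described above can be sketched in a few lines of Python. This only illustrates the resolution rule and is not the pipeline's actual code: the function name `resolve_library_folders` and the plain dictionary standing in for the parsed config are assumptions made for the example.

```python
from pathlib import Path


def resolve_library_folders(config: dict) -> dict:
    """Map each library name to <reads_folder>/<name>; fail if a folder is missing.

    Hypothetical helper illustrating the rule above, not 3t-seq's own code.
    """
    reads_folder = Path(config["globals"]["reads_folder"])
    folders = {}
    for library in config["sequencing_libraries"]:
        folder = reads_folder / library["name"]
        if not folder.is_dir():
            raise FileNotFoundError(f"Expected reads folder not found: {folder}")
        folders[library["name"]] = folder
    return folders


# Dictionary mirroring the second example configuration (assumed parsed form)
example_config = {
    "sequencing_libraries": [{"name": "first-batch"}, {"name": "second-batch"}],
    "globals": {"reads_folder": "reads"},
}
```

Calling `resolve_library_folders(example_config)` from the directory that contains `reads/` would return the two library folders, and would raise as soon as one of them is missing.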
Reads files need to have one of the following extensions: `fq`, `fq.gz`, `fastq`, `fastq.gz`. For a given sequencing library, the pipeline expects all files to have the same extension.
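As a rough illustration of the extension rule, the check below accepts a list of file names for one library and returns the extension they share. The function names are hypothetical, and how the pipeline itself performs this check is not specified here.

```python
# Extensions listed above, without the leading dot
ALLOWED_EXTENSIONS = ("fq", "fq.gz", "fastq", "fastq.gz")


def reads_extension(filename: str):
    """Return the allowed extension a reads file name ends with, or None."""
    for ext in ALLOWED_EXTENSIONS:
        if filename.endswith("." + ext):
            return ext
    return None


def library_extension(filenames) -> str:
    """Return the single extension shared by all reads files of one library."""
    extensions = {reads_extension(name) for name in filenames}
    if None in extensions:
        raise ValueError("Found a file with an unsupported extension")
    if len(extensions) != 1:
        raise ValueError(f"Expected exactly one extension, found: {sorted(extensions)}")
    return extensions.pop()
```

A library mixing `sample_1.fq.gz` with `sample_2.fastq.gz` would be rejected by this sketch, which mirrors the expectation that one library uses one extension.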
For paired-end reads, the two mates should have one of the following identifiers before the extension: `(_1, _2)`, `(_R1, _R2)`, `(_1_sequence, _2_sequence)`.
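The naming convention for mates can be illustrated with a small parser that splits a file name into a sample name and a mate number. This is a sketch under the assumption that the identifier sits directly before the extension; `split_mate` is a hypothetical name, not a function of the pipeline.

```python
# Mate identifier pairs listed above, checked in this order
MATE_SUFFIXES = (("_1_sequence", "_2_sequence"), ("_R1", "_R2"), ("_1", "_2"))


def split_mate(filename: str, extension: str):
    """Return (sample, mate_number) for a paired-end reads file, or None.

    Hypothetical helper: `extension` is the library extension without the dot.
    """
    suffix = "." + extension
    if not filename.endswith(suffix):
        return None
    stem = filename[: -len(suffix)]
    for first, second in MATE_SUFFIXES:
        if stem.endswith(first):
            return stem[: -len(first)], 1
        if stem.endswith(second):
            return stem[: -len(second)], 2
    return None
```

For example, `split_mate("liver_R2.fastq.gz", "fastq.gz")` yields the sample name `liver` and mate number 2, so two files of the same sample can be grouped into a pair.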
The `references_folder` can be outside of `results_folder`. For instance:
globals:
  # [...]
  # path to results folder
  results_folder: results/
  # path to references
  references_folder: /path/to/references
This allows users to host their own reference files locally and set the `genome` information accordingly:
genome:
  # [...]
  # This will evaluate to /path/to/references/custom-mm10.fa.gz
  fasta_url: custom-mm10.fa.gz
  # This will evaluate to /path/to/references/custom-mm10.gtf.gz
  gtf_url: custom-mm10.gtf.gz
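One way to picture how a `fasta_url` or `gtf_url` value is interpreted is the sketch below. It assumes that values with an `http`, `https` or `ftp` scheme are treated as remote and everything else as a path relative to `references_folder`; that distinction, and the name `resolve_reference`, are assumptions for illustration rather than the pipeline's documented behaviour.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_reference(value: str, references_folder: str) -> str:
    """Return remote URLs unchanged; resolve other values against references_folder.

    Hypothetical helper; the set of schemes treated as remote is an assumption.
    """
    if urlparse(value).scheme in ("http", "https", "ftp"):
        return value
    # Relative file names are looked up inside the references folder
    return str(PurePosixPath(references_folder) / value)
```

With the configuration above, `resolve_reference("custom-mm10.fa.gz", "/path/to/references")` gives `/path/to/references/custom-mm10.fa.gz`, matching the comment in the example.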