improve input.md #183

Merged 13 commits on Apr 17, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -56,6 +56,7 @@ Piellorieppe is one of the main massif in the Sarek National Park.
- [#152](https://github.com/nf-core/sarek/pull/152), [#158](https://github.com/nf-core/sarek/pull/158), [#164](https://github.com/nf-core/sarek/pull/164), [#174](https://github.com/nf-core/sarek/pull/174) - Update docs
- [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
- [#180](https://github.com/nf-core/sarek/pull/180) - Improve minimal setting
- [#183](https://github.com/nf-core/sarek/pull/183) - Update input.md documentation

### Fixed - [2.6dev]

84 changes: 56 additions & 28 deletions docs/input.md
@@ -1,41 +1,47 @@
# Input Documentation

## Information about the TSV files
## General information about the TSV files

Input files for Sarek can be specified using a TSV file given to the `--input` command.
The TSV file is a Tab Separated Value file with columns:
There are different kinds of TSV files that can be used as input, depending on the input files available (fastq, uBAM, BAM...).
For all possible TSV files, described in the next sections, here is an explanation of what the columns refer to:

- `subject sex status sample lane fastq1 fastq2` for step `mapping` with paired-end FASTQs
- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs
- `subject sex status sample bam bai recaltable` for step `recalibrate` with BAMs
- `subject sex status sample bam bai` for step `variantcalling` with BAMs

The content of these columns is quite straight-forward:

- `subject` designate the subject, it should be the ID of the Patient, and it must design only one patient
- `subject` designates the subject, it should be the ID of the patient, and it must be unique for each patient, but one patient can have multiple samples (e.g. normal and tumor)
- `sex` are the sex chromosomes of the Patient, (XX or XY)
- `status` is the status of the Patient, (0 for Normal or 1 for Tumor)
- `sample` designate the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must design only one sample
- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample
- `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
- `fastq1` is the path to the first pair of the fastq file
- `fastq2` is the path to the second pair of the fastq file
- `bam` is the bam file
- `bai` is the bam index file
- `recaltable` is the recalibration table
- `bam` is the path to the bam file
- `bai` is the path to the bam index file
- `recaltable` is the path to the recalibration table

It is recommended to add the absolute path of the files, but relative path should work also.
Note, the delimiter is the tab (`\t`) character:
Note, the delimiter is the tab (`\t`) character.
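The column layouts and value rules described above can be checked before launching a run. The following is a minimal sketch and not part of Sarek: `STEP_COLUMNS`, its layout labels (`mapping_fastq`, `mapping_ubam`, ...) and `check_tsv_line` are hypothetical names, and only the rules stated in this section are encoded (tab delimiter, `sex` as XX or XY, `status` as 0 or 1, non-empty `lane`).

```python
# Hypothetical helper, not part of Sarek: checks one TSV line against the
# column layouts described in this section.
STEP_COLUMNS = {
    "mapping_fastq": ["subject", "sex", "status", "sample", "lane", "fastq1", "fastq2"],
    "mapping_ubam": ["subject", "sex", "status", "sample", "lane", "bam"],
    "recalibrate": ["subject", "sex", "status", "sample", "bam", "bai", "recaltable"],
    "variantcalling": ["subject", "sex", "status", "sample", "bam", "bai"],
}


def check_tsv_line(line, layout):
    """Return a list of problems found in one TSV line (empty if none)."""
    columns = STEP_COLUMNS[layout]
    fields = line.rstrip("\n").split("\t")  # the delimiter is the tab character
    if len(fields) != len(columns):
        return [f"expected {len(columns)} columns, found {len(fields)}"]
    row = dict(zip(columns, fields))
    problems = []
    if row["sex"] not in ("XX", "XY"):
        problems.append(f"sex must be XX or XY, found {row['sex']!r}")
    if row["status"] not in ("0", "1"):
        problems.append(f"status must be 0 or 1, found {row['status']!r}")
    if "lane" in row and not row["lane"]:
        problems.append("lane must contain at least one character")
    return problems


if __name__ == "__main__":
    good = "G15511\tXX\t1\tD0ENMT\tD0ENM_2\tpathToFiles/D0ENMAC_2.bam"
    print(check_tsv_line(good, "mapping_ubam"))  # prints []
```

The sketch does not check that the listed files exist, since relative paths depend on where the pipeline is launched.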

All examples are given for a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair.
If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline variant calling results only.

Sarek will output results is a different directory for each sample.
Sarek will output results in a different directory for each sample.
If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
Multiple TSV files can be specified if the path is enclosed in quotes.

Somatic variant calling output will be in a specific directory for each normal/tumor pair.

## Example TSV file for a normal/tumor pair with FASTQ files (step mapping)
## Starting from the mapping step

When starting from the mapping step (`--step mapping`), the first step of Sarek, the input can have three different forms:

- A TSV file containing the sample metadata and the path to the fastq files.
- The path to a directory containing the fastq files.
- A TSV file containing the sample metadata and the path to the unmapped BAM (uBAM) files.

### Providing a TSV file with the path to FASTQ files

The TSV file to start with the step mapping with paired-end FASTQs should contain the columns:

`subject sex status sample lane fastq1 fastq2`

In this sample for the normal case there are 3 read groups, and 2 for the tumor.

@@ -47,17 +53,17 @@ G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMACXX111207.1_1.fastq.
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.gz pathToFiles/D0ENMACXX111207.2_2.fastq.gz
```

## Path to a FASTQ directory for a single normal sample (step mapping)
### Providing the path to a FASTQ directory

Input files for Sarek can be specified using the path to a FASTQ directory given to the `--input` command only with the `mapping` step.

```bash
nextflow run nf-core/sarek --input pathToDirectory ...
```

### Input FASTQ file name best practices
#### Input FASTQ file name best practices

The input folder, containing the FASTQ files for one individual (ID) should be organized into one sub-folder for every sample.
The input folder, containing the FASTQ files for one subject (ID) should be organized into one sub-folder for every sample.
All fastq files for that sample should be collected here.

```text
@@ -96,7 +102,11 @@ Read group information will be parsed from fastq file names according to this:
- `PU` = sample
- `RGLB` = lib

## Example TSV file for a normal/tumor pair with uBAM files (step mapping)
### Providing a TSV file with the paths to uBAM files

The TSV (Tab Separated Values) file for starting the mapping from uBAM files should contain the columns:

- `subject sex status sample lane bam`

In this sample for the normal case there are 3 read groups, and 2 for the tumor.

@@ -108,7 +118,12 @@ G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
```
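Each row of a mapping TSV describes one lane of one sample, and the `lane` value must be unique within its sample. As a rough sketch (the function name `lanes_per_sample` is hypothetical and this is not how Sarek itself does it), the rows can be grouped by sample to count lanes and catch a repeated lane name:

```python
from collections import defaultdict


def lanes_per_sample(lines):
    """Map each sample ID to its lane names; raise if a lane is repeated."""
    lanes = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Column order for the mapping step: subject sex status sample lane ...
        sample, lane = fields[3], fields[4]
        if lane in lanes[sample]:
            raise ValueError(f"lane {lane!r} is repeated in sample {sample!r}")
        lanes[sample].append(lane)
    return dict(lanes)


if __name__ == "__main__":
    rows = [
        "G15511\tXX\t1\tD0ENMT\tD0ENM_1\tpathToFiles/D0ENMAC_1.bam",
        "G15511\tXX\t1\tD0ENMT\tD0ENM_2\tpathToFiles/D0ENMAC_2.bam",
    ]
    print(lanes_per_sample(rows))  # prints {'D0ENMT': ['D0ENM_1', 'D0ENM_2']}
```

The same lane name may appear in two different samples without error, since uniqueness is only required within a sample.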

## Example TSV file for a normal/tumor pair with non recalibrated BAM files (step recalibrate)
## Starting from the BAM recalibration step

To start from the recalibration step (`--step recalibrate`), a TSV file for a normal/tumor pair needs to be given as input containing the paths to the non recalibrated but already mapped BAM files.
The TSV needs to contain the following columns:

- `subject sex status sample bam bai recaltable`

The same way, if you have non recalibrated BAMs, their indexes and their recalibration tables, you should use a structure like:

@@ -117,18 +132,31 @@ G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
```

## Example TSV file for a normal/tumor pair with recalibrated BAM files (step variantcalling)
When starting Sarek from the mapping step, a TSV file is generated automatically after the `MarkDuplicates` process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files. Setting the step `--step recalibrate` will automatically take this file as input.

Additionally, individual TSV files for each sample (`duplicateMarked_[SAMPLE].tsv`) can be found in the same directory.

## Starting from the variant calling step

The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
A TSV file for a normal/tumor pair with recalibrated BAM files and their indexes can be provided to start Sarek from the variant calling step (`--step variantcalling`).
The TSV file should contain the columns:

- `subject sex status sample bam bai`

Here is an example for two samples from the same subject:

```text
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.recal.bam pathToFiles/G15511.C09DFN.md.recal.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.recal.bam pathToFiles/G15511.D0ENMT.md.recal.bai
```

When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files. Setting the step `--step variantcalling` will automatically take this file as input.

Additionally, individual TSV files for each sample (`recalibrated_[SAMPLE].tsv`) can be found in the same directory.
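The locations of the automatically generated TSV files follow a simple pattern, which can be sketched as a small lookup. This is an illustration only: `RESTART_TSV`, `restart_tsv` and the `outdir` parameter are hypothetical names, and the file names are taken from the paths quoted in this document (`duplicateMarked.tsv` for `recalibrate`, `recalibrated.tsv` for `variantcalling`).

```python
from pathlib import PurePosixPath

# File name stems as quoted in this document, keyed by the step they restart.
RESTART_TSV = {
    "recalibrate": "duplicateMarked",
    "variantcalling": "recalibrated",
}


def restart_tsv(step, sample=None, outdir="results"):
    """Build the path of the generated TSV for a step, optionally per sample."""
    stem = RESTART_TSV[step]
    name = f"{stem}_{sample}.tsv" if sample else f"{stem}.tsv"
    return PurePosixPath(outdir) / "Preprocessing" / "TSV" / name


if __name__ == "__main__":
    print(restart_tsv("variantcalling"))
    # prints results/Preprocessing/TSV/recalibrated.tsv
    print(restart_tsv("recalibrate", sample="D0ENMT"))
    # prints results/Preprocessing/TSV/duplicateMarked_D0ENMT.tsv
```

The `outdir` default of `results` mirrors the paths quoted above; a run with a different output directory would need a different value.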

## VCF files for annotation

Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the `annotate` step.
Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the annotation step (`--step annotate`).
As Sarek will use `bgzip` and `tabix` to compress and index the annotated VCF files, it expects VCF files to be sorted.
Multiple VCF files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes.
For example: