improve input.md #183

Merged on Apr 17, 2020 (13 commits)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -56,6 +56,7 @@ Piellorieppe is one of the main massif in the Sarek National Park.
 - [#152](https://github.com/nf-core/sarek/pull/152), [#158](https://github.com/nf-core/sarek/pull/158), [#164](https://github.com/nf-core/sarek/pull/164), [#174](https://github.com/nf-core/sarek/pull/174) - Update docs
 - [#164](https://github.com/nf-core/sarek/pull/164) - Update `gatk4-spark` from `4.1.4.1` to `4.1.6.0`
 - [#180](https://github.com/nf-core/sarek/pull/180) - Improve minimal setting
+- [#183](https://github.com/nf-core/sarek/pull/183) - Update input.md documentation
 
 ### Fixed - [2.6dev]
 
22 changes: 13 additions & 9 deletions docs/input.md
@@ -6,30 +6,30 @@ Input files for Sarek can be specified using a TSV file given to the `--input` c
 The TSV file is a Tab Separated Value file with columns:
 
 - `subject sex status sample lane fastq1 fastq2` for step `mapping` with paired-end FASTQs
-- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs
-- `subject sex status sample bam bai recaltable` for step `recalibrate` with BAMs
+- `subject sex status sample lane bam` for step `mapping` with unmapped BAMs (uBAMs)
+- `subject sex status sample bam bai recaltable` for step `recalibrate` with mapped BAMs
 - `subject sex status sample bam bai` for step `variantcalling` with BAMs
 
 The content of these columns is quite straight-forward:
 
-- `subject` designate the subject, it should be the ID of the Patient, and it must design only one patient
+- `subject` designates the subject, it should be the ID of the Patient, and it must be unique for each patient
 - `sex` are the sex chromosomes of the Patient, (XX or XY)
-- `status` is the status of the Patient, (0 for Normal or 1 for Tumor)
-- `sample` designate the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must design only one sample
-- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample
+- `status` is the status of the measured sample, (0 for Normal or 1 for Tumor)
+- `sample` designates the Sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each patient, i.e. a tumor and a relapse), it must be unique for each sample
+- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one letter
 - `fastq1` is the path to the first pair of the fastq file
 - `fastq2` is the path to the second pair of the fastq file
 - `bam` is the bam file
 - `bai` is the bam index file
 - `recaltable` is the recalibration table
 
 It is recommended to add the absolute path of the files, but relative path should work also.
-Note, the delimiter is the tab (`\t`) character:
+Note, the delimiter is the tab (`\t`) character.
 
 All examples are given for a normal/tumor pair.
-If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair.
+If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair, producing the germline variant calling results only.
 
-Sarek will output results is a different directory for each sample.
+Sarek will output results in a different directory for each sample.
 If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples.
 Multiple TSV files can be specified if the path is enclosed in quotes.
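
The column rules in this hunk could be checked before launching a run with something like the sketch below. It is only an illustration and is not part of this PR or of Sarek: the names `LAYOUTS`, `check_sarek_tsv` and `has_tumor` are hypothetical, and the checks are limited to what the documentation text states (column layout per step, `sex` being XX or XY, `status` being 0 or 1, and `lane` being unique within a sample and containing at least one letter).

```python
import csv
import io

# Column layouts per step, copied from the documentation text in this diff.
LAYOUTS = {
    "mapping": [
        ["subject", "sex", "status", "sample", "lane", "fastq1", "fastq2"],
        ["subject", "sex", "status", "sample", "lane", "bam"],
    ],
    "recalibrate": [
        ["subject", "sex", "status", "sample", "bam", "bai", "recaltable"],
    ],
    "variantcalling": [
        ["subject", "sex", "status", "sample", "bam", "bai"],
    ],
}


def check_sarek_tsv(text, step):
    """Hypothetical helper: parse tab-separated text, return (rows, errors)."""
    rows, errors = [], []
    seen_lanes = set()
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for number, fields in enumerate(reader, start=1):
        if not fields:
            continue  # ignore blank lines
        # Pick the layout for this step whose column count matches the row.
        layout = next(
            (cols for cols in LAYOUTS[step] if len(cols) == len(fields)), None
        )
        if layout is None:
            errors.append(f"line {number}: unexpected column count {len(fields)}")
            continue
        row = dict(zip(layout, fields))
        if row["sex"] not in ("XX", "XY"):
            errors.append(f"line {number}: sex must be XX or XY")
        if row["status"] not in ("0", "1"):
            errors.append(f"line {number}: status must be 0 (Normal) or 1 (Tumor)")
        if "lane" in row:
            if not any(char.isalpha() for char in row["lane"]):
                errors.append(f"line {number}: lane must contain at least one letter")
            key = (row["sample"], row["lane"])
            if key in seen_lanes:
                errors.append(f"line {number}: lane repeated within sample")
            seen_lanes.add(key)
        rows.append(row)
    return rows, errors


def has_tumor(rows):
    """True if any row is marked as Tumor (status 1)."""
    return any(row["status"] == "1" for row in rows)
```

Calling `check_sarek_tsv(open(path).read(), "mapping")` would return the parsed rows and a list of problems. According to the text above, a file for which `has_tumor(rows)` is false would be processed as a normal sample, producing germline results only.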

@@ -117,6 +117,8 @@ G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G
 G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
 ```
 
+When starting Sarek from the mapping step, a TSV file is generated automatically after the MarkDuplicates process. This TSV file is stored under `results/Preprocessing/TSV/duplicateMarked.tsv` and can be used to restart Sarek from the non-recalibrated BAM files, giving it as `--input` and setting the step `--step recalibrate`.
+
 ## Example TSV file for a normal/tumor pair with recalibrated BAM files (step variantcalling)
 
 The same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
@@ -126,6 +128,8 @@ G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.recal.bam pathToF
 G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.recal.bam pathToFiles/G15511.D0ENMT.md.recal.bai
 ```
 
+When starting Sarek from the mapping or recalibrate steps, a TSV file is generated automatically after the recalibration processes. This TSV file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and can be used to restart Sarek from the recalibrated BAM files, giving it as `--input` and setting the step `--step variantcalling`.
+
 ## VCF files for annotation
 
 Input files for Sarek can be specified using the path to a VCF directory given to the `--input` command only with the `annotate` step.