From ccf1fa29ff6510fba6cbc6617496426ededeb677 Mon Sep 17 00:00:00 2001 From: William Rowell Date: Thu, 14 Nov 2024 14:55:56 -0800 Subject: [PATCH] Removed wiki submodule and moved docs to ./docs. Updated script to prep docs. --- README.md | 22 +- docs/backend-aws-healthomics.md | 1 + docs/backend-azure.md | 27 ++ docs/backend-gcp.md | 31 +++ docs/backend-hpc.md | 52 ++++ docs/backends.md | 3 + docs/bam_stats.md | 14 + docs/deepvariant.md | 15 + docs/family.md | 259 ++++++++++++++++++ docs/gpu.md | 17 ++ docs/pharmcat.md | 10 + docs/ref_map.md | 39 +++ docs/singleton.md | 220 +++++++++++++++ docs/tertiary.md | 66 +++++ docs/tertiary_map.md | 20 ++ docs/tools_containers.md | 29 ++ ...dme_and_adjust_links.sh => update_docs.sh} | 4 +- wiki | 1 - 18 files changed, 817 insertions(+), 13 deletions(-) create mode 100644 docs/backend-aws-healthomics.md create mode 100644 docs/backend-azure.md create mode 100644 docs/backend-gcp.md create mode 100644 docs/backend-hpc.md create mode 100644 docs/backends.md create mode 100644 docs/bam_stats.md create mode 100644 docs/deepvariant.md create mode 100644 docs/family.md create mode 100644 docs/gpu.md create mode 100644 docs/pharmcat.md create mode 100644 docs/ref_map.md create mode 100644 docs/singleton.md create mode 100644 docs/tertiary.md create mode 100644 docs/tertiary_map.md create mode 100644 docs/tools_containers.md rename scripts/{create_readme_and_adjust_links.sh => update_docs.sh} (50%) delete mode 160000 wiki diff --git a/README.md b/README.md index a9c395c7..75304050 100644 --- a/README.md +++ b/README.md @@ -24,18 +24,18 @@ Both workflows are designed to analyze human PacBio whole genome sequencing (WGS This is an actively developed workflow with multiple versioned releases, and we make use of git submodules for common tasks that are shared by multiple workflows. There are two ways to ensure you are using a supported release of the workflow and ensure that the submodules are correctly initialized: -1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.2): +1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.3): ```bash - wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-singleton.zip - wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-family.zip + wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-singleton.zip + wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-family.zip ``` 2) Clone the repository and initialize the submodules: ```bash git clone \ - --depth 1 --branch v2.0.2 \ + --depth 1 --branch v2.0.3 \ --recursive \ https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.git ``` @@ -63,10 +63,10 @@ The workflow can be run on Azure, AWS, GCP, or HPC. 
Your choice of backend will
For backend-specific configuration, see the relevant documentation:

-- [Azure](./wiki/backend-azure)
-- [AWS](./wiki/backend-aws-healthomics)
-- [GCP](./wiki/backend-gcp)
-- [HPC](./wiki/backend-hpc)
+- [Azure](./docs/backend-azure)
+- [AWS](./docs/backend-aws-healthomics)
+- [GCP](./docs/backend-gcp)
+- [HPC](./docs/backend-hpc)

### Configuring a workflow engine and container runtime

Because workflow dependencies are containerized, a container runtime is required.

See the backend-specific documentation for details on setting up an engine.

-| Engine | [Azure](./wiki/backend-azure) | [AWS](./wiki/backend-aws-healthomics) | [GCP](./wiki/backend-gcp) | [HPC](./wiki/backend-hpc) |
+| Engine | [Azure](./docs/backend-azure) | [AWS](./docs/backend-aws-healthomics) | [GCP](./docs/backend-gcp) | [HPC](./docs/backend-hpc) |
| :- | :- | :- | :- | :- |
| [**miniwdl**](https://github.com/chanzuckerberg/miniwdl#scaling-up) | _Unsupported_ | Supported via [AWS HealthOmics](https://aws.amazon.com/healthomics/) | _Unsupported_ | (SLURM only) Supported via the [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm) plugin |
| [**Cromwell**](https://cromwell.readthedocs.io/en/stable/backends/Backends/) | Supported via [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) | _Unsupported_ | Supported via Google's [Pipelines API](https://cromwell.readthedocs.io/en/stable/backends/Google/) | Supported - [Configuration varies depending on HPC infrastructure](https://cromwell.readthedocs.io/en/stable/tutorials/HPCIntro/) |
@@ -118,7 +118,7 @@ If Cromwell is running in server mode, the workflow can be submitted using cURL.
This section describes the inputs required for a run of the workflow. Typically, only the sample-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided [for each backend](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends).

-Workflow inputs for each entrypoint are described in [singleton](./wiki/singleton) and [family](./wiki/family) documentation.
+Workflow inputs for each entrypoint are described in [singleton](./docs/singleton) and [family](./docs/family) documentation.

At a high level, we have two types of input files:
@@ -136,7 +136,7 @@ Docker images definitions used by this workflow can be found in [the wdl-dockerf
The Docker image used by a particular step of the workflow can be identified by looking at the `docker` key in the `runtime` block for the given task. Images can be referenced in the following table by looking for the name after the final `/` character and before the `@sha256:...`. For example, the image referred to here is "pb_wdl_base":

> ~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87

-Tool versions and Docker images used in these workflows can be found in the [tools and containers](./wiki/tools_containers) documentation.
+Tool versions and Docker images used in these workflows can be found in the [tools and containers](./docs/tools_containers) documentation.
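+
+As a quick check, the full set of image references can be listed from the repository root (a sketch, assuming a POSIX shell and a recursive clone so that `workflows/` includes the submodule tasks):
+
+```bash
+# Collect the unique container image references from all task runtime blocks.
+grep -rho 'docker: "[^"]*"' workflows/ | sort -u
+```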
---
diff --git a/docs/backend-aws-healthomics.md b/docs/backend-aws-healthomics.md
new file mode 100644
index 00000000..70644910
--- /dev/null
+++ b/docs/backend-aws-healthomics.md
@@ -0,0 +1 @@
+# TBD
diff --git a/docs/backend-azure.md b/docs/backend-azure.md
new file mode 100644
index 00000000..14f08bb7
--- /dev/null
+++ b/docs/backend-azure.md
@@ -0,0 +1,27 @@
+# Configuring Cromwell on Azure
+
+Workflows can be run in Azure by setting up [Cromwell on Azure (CoA)](https://github.com/microsoft/CromwellOnAzure). Documentation on deploying and configuring an instance of CoA can be found [here](https://github.com/microsoft/CromwellOnAzure/wiki/Deploy-your-instance-of-Cromwell-on-Azure).
+
+## Requirements
+
+- [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) version 3.2+; version 4.0+ is recommended
+
+## Configuring and running the workflow
+
+### Filling out workflow inputs
+
+Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json).
+
+See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.
+
+### Running via Cromwell on Azure
+
+Instructions for running a workflow from Cromwell on Azure are described in [the Cromwell on Azure documentation](https://github.com/microsoft/CromwellOnAzure/wiki/Running-Workflows).
+
+## Reference data hosted in Azure
+
+To use Azure reference data, add the following line to your `containers-to-mount` file in your Cromwell on Azure installation ([more info here](https://github.com/microsoft/CromwellOnAzure/blob/develop/docs/troubleshooting-guide.md#use-input-data-files-from-an-existing-azure-storage-account-that-my-lab-or-team-is-currently-using)):
+
+`https://datasetpbrarediseases.blob.core.windows.net/dataset?si=public&spr=https&sv=2021-06-08&sr=c&sig=o6OkcqWWlGcGOOr8I8gCA%2BJwlpA%2FYsRz0DMB8CCtCJk%3D`
+
+The [Azure input file template](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json) has paths to the reference files in this blob storage prefilled.
diff --git a/docs/backend-gcp.md b/docs/backend-gcp.md
new file mode 100644
index 00000000..543d84de
--- /dev/null
+++ b/docs/backend-gcp.md
@@ -0,0 +1,31 @@
+# Configuring Cromwell on GCP
+
+[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used to set up the resources needed to run the workflow.
+
+## Configuring and running the workflow
+
+### Filling out workflow inputs
+
+Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/gcp/singleton.gcp.inputs.json).
+
+See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.
+
+#### Determining available zones
+
+To determine available zones in GCP, run the following; available zones within a region can be found in the first column of the output:
+
+```bash
+gcloud compute zones list | grep <region>
+```
+
+For example, the zones in region `us-central1` are `"us-central1-a us-central1-b us-central1-c us-central1-f"`.
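+
+To capture that list in the space-separated form used by the `zones` input, a sketch like the following can help (region `us-central1` assumed):
+
+```bash
+# Join the zone names of a region into one space-separated string.
+gcloud compute zones list --filter="region:us-central1" --format="value(name)" | tr '\n' ' '
+```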
+
+## Running the workflow via Google's genomics Pipelines API
+
+[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used as an example for how to run the workflow.
+
+## Reference data hosted in GCP
+
+GCP reference data is hosted in the `us-west1` region in the bucket `gs://pacbio-wdl`. This bucket is requester-pays, meaning that users will need to [provide a billing project in their Cromwell configuration](https://cromwell.readthedocs.io/en/stable/filesystems/GoogleCloudStorage/) in order to use files located in this bucket.
+
+To avoid egress charges, Cromwell should be set up to spin up compute resources in the same region in which the data is located. If possible, add cohort data to the same region as the reference dataset, or consider mirroring this dataset in the region where your data is located. See [Google's information about data storage and egress charges](https://cloud.google.com/storage/pricing) for more information.
diff --git a/docs/backend-hpc.md b/docs/backend-hpc.md
new file mode 100644
index 00000000..3b950796
--- /dev/null
+++ b/docs/backend-hpc.md
@@ -0,0 +1,52 @@
+# Installing and configuring for HPC backends
+
+Either `miniwdl` or `Cromwell` can be used to run workflows on the HPC.
+
+## Installing and configuring `miniwdl`
+
+### Requirements
+
+- [`miniwdl`](https://github.com/chanzuckerberg/miniwdl) >= 1.9.0
+- [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm)
+
+### Configuration
+
+An [example miniwdl.cfg file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/miniwdl.cfg) is provided; place it at `~/.config/miniwdl.cfg` and edit it to match your SLURM configuration. This allows running workflows using a basic SLURM setup.
+
+## Installing and configuring `Cromwell`
+
+Cromwell supports a number of different HPC backends; see [Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/backends/HPC/) for more information on configuring each of the backends. Cromwell can be used in a standalone "run" mode, or in "server" mode to allow multiple users to submit workflows. Below, we provide commands for running Cromwell in "run" mode.
+
+## Running the workflow
+
+### Filling out workflow inputs
+
+Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/singleton.hpc.inputs.json). Once you have downloaded the reference data bundle, ensure that you have replaced the placeholder paths in the input template file with the local path to the reference datasets on your HPC.
+
+See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.
+
+#### Running via miniwdl
+
+```bash
+miniwdl run workflows/singleton.wdl --input <inputs.json>
+```
+
+#### Running via Cromwell
+
+```bash
+cromwell run workflows/singleton.wdl --inputs <inputs.json>
+```
+
+## Reference data bundle
+
+Reference data is hosted on Zenodo at [10.5281/zenodo.14027047](https://zenodo.org/records/14027047). Download the reference data bundle and extract it to a location on your HPC, then update the input template file with the path to the reference data.
+
+```bash
+## download the reference data bundle
+wget https://zenodo.org/record/14027047/files/hifi-wdl-resources-v2.0.0.tar
+
+## extract the reference data bundle
+tar -xvf hifi-wdl-resources-v2.0.0.tar
+```
diff --git a/docs/backends.md b/docs/backends.md
new file mode 100644
index 00000000..54c2ef2f
--- /dev/null
+++ b/docs/backends.md
@@ -0,0 +1,3 @@
+- [hpc](./backend-hpc)
+- [azure](./backend-azure)
+- [gcp](./backend-gcp)
\ No newline at end of file
diff --git a/docs/bam_stats.md b/docs/bam_stats.md
new file mode 100644
index 00000000..acbaba30
--- /dev/null
+++ b/docs/bam_stats.md
@@ -0,0 +1,14 @@
+# bam_stats outputs
+
+```wdl
+{sample}.{movie}.read_length_and_quality.tsv.gz - per read length and quality metrics
+```
+
+## `{sample}.{movie}.read_length_and_quality.tsv.gz` - per read length and quality metrics
+
+Base metrics are extracted for each read from the uBAM and stored in these 4 columns:
+
+- movie
+- read name
+- read length: length of query sequence
+- read quality: transformation of `rq` tag into Phred (log) space, e.g., `rq:f:0.99` (99% accuracy, 1 error in 100 bases) is Phred 20 ($-10 \times \log_{10}(1 - 0.99)$); this value is capped at Phred 60 for `rq:f:1.0`
diff --git a/docs/deepvariant.md b/docs/deepvariant.md
new file mode 100644
index 00000000..2b8a9d3c
--- /dev/null
+++ b/docs/deepvariant.md
@@ -0,0 +1,15 @@
+# DeepVariant subworkflow
+
+```mermaid
+flowchart TD
+  aBAM[/"HiFi aBAM"/] --> make_examples["DeepVariant make_examples"]
+  make_examples --> gpu{"gpu?"}
+  gpu -- yes --> call_variants_gpu["DeepVariant call_variants_gpu"]
+  gpu -- no --> call_variants_cpu["DeepVariant call_variants_cpu"]
+  call_variants_gpu --> postprocess_variants["DeepVariant postprocess_variants"]
+  call_variants_cpu --> postprocess_variants
+  postprocess_variants --> vcf[/"small variant VCF"/]
+  postprocess_variants --> gvcf[/"small variant gVCF"/]
+```
+
+This subworkflow runs the three steps of DeepVariant individually to make the best use of resources. If a GPU is available and `gpu==true`, the `call_variants` step will run on 1 GPU and 8 CPU threads; otherwise, it will run on 64 CPU threads. The `make_examples` and `postprocess_variants` steps will always run on the CPU.
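+
+A minimal sketch of the inputs that select this GPU path (the `humanwgs_singleton` key prefix is an assumption; match it to the entrypoint workflow name in your WDL, and pick a `gpuType` supported by your backend):
+
+```bash
+# Hypothetical inputs fragment enabling GPU-backed call_variants.
+cat > gpu.inputs.json <<'EOF'
+{
+  "humanwgs_singleton.gpu": true,
+  "humanwgs_singleton.gpuType": "nvidia-tesla-t4"
+}
+EOF
+```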
diff --git a/docs/family.md b/docs/family.md new file mode 100644 index 00000000..7bc5b1d6 --- /dev/null +++ b/docs/family.md @@ -0,0 +1,259 @@ +# family.wdl inputs and outputs + +- [family.wdl inputs and outputs](#familywdl-inputs-and-outputs) + - [DAG (simplified)](#dag-simplified) + - [Inputs](#inputs) + - [Family Struct](#family-struct) + - [Sample Struct](#sample-struct) + - [Outputs](#outputs) + - [Alignments, Coverage, and QC](#alignments-coverage-and-qc) + - [Small Variants (\<50 bp)](#small-variants-50-bp) + - [Structural Variants (≥50 bp)](#structural-variants-50-bp) + - [Copy Number Variants (≥100 kb)](#copy-number-variants-100-kb) + - [Tandem Repeat Genotyping](#tandem-repeat-genotyping) + - [Variant Phasing](#variant-phasing) + - [Variant Calling in Dark Regions](#variant-calling-in-dark-regions) + - [5mCpG Methylation Calling](#5mcpg-methylation-calling) + - [PGx Typing](#pgx-typing) + - [Tertiary Analysis](#tertiary-analysis) + +## DAG (simplified) + +```mermaid +--- +title: family.wdl +--- +flowchart TD + subgraph "`**Upstream of Phasing (per-sample)**`" + subgraph "per-movie" + ubam[/"HiFi uBAM"/] --> pbmm2_align["pbmm2 align"] + pbmm2_align --> pbsv_discover["PBSV discover"] + end + pbmm2_align --> merge_read_stats["merge read statistics"] + pbmm2_align --> samtools_merge["samtools merge"] + samtools_merge --> mosdepth["mosdepth"] + samtools_merge --> paraphase["Paraphase"] + samtools_merge --> hificnv["HiFiCNV"] + samtools_merge --> trgt["TRGT"] + samtools_merge --> trgt_dropouts["TR coverage dropouts"] + samtools_merge --> deepvariant["DeepVariant"] + end + subgraph "`**Joint Calling**`" + deepvariant --> glnexus["GLnexus (joint-call small variants)"] + pbsv_discover --> pbsv_call["PBSV call"] + glnexus --> split_glnexus["split small variant vcf by sample"] + pbsv_call --> split_pbsv["split SV vcf by sample"] + end + subgraph "`**Phasing and Downstream (per-sample)**`" + split_glnexus --> hiphase["HiPhase"] + trgt --> hiphase + split_pbsv --> hiphase + hiphase --> bcftools_roh["bcftools roh"] + hiphase --> bcftools_stats["bcftools stats\n(small variants)"] + hiphase --> sv_stats["SV stats"] + hiphase --> cpg_pileup["5mCpG pileup"] + hiphase --> starphase["StarPhase"] + hiphase --> pharmcat["PharmCat"] + starphase --> pharmcat + end + subgraph " " + hiphase --> merge_small_variants["bcftools merge small variants"] + hiphase --> merge_svs["bcftools merge SV"] + hiphase --> trgt_merge["trgt merge"] + end + subgraph "`**Tertiary Analysis**`" + merge_small_variants --> slivar_small_variants["slivar small variants"] + merge_svs --> svpack["svpack filter and annotate"] + svpack --> slivar_svpack["slivar svpack tsv"] + end +``` + +## Inputs + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| [Family](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/humanwgs_structs.wdl#L15) | family | Family struct describing samples, relationships, and unaligned BAM paths | [below](#family-struct) | +| File | [ref_map_file](./ref_map) | TSV containing reference genome file paths; must match backend | | +| String? | phenotypes | Comma-delimited list of HPO terms. | [Human Phenotype Ontology (HPO) phenotypes](https://hpo.jax.org/app/) associated with the cohort.

If omitted, tertiary analysis will be skipped. | +| File? | [tertiary_map_file](./tertiary_map) | TSV containing tertiary analysis file paths and thresholds; must match backend | `AF`/`AC`/`nhomalt` thresholds can be modified, but this will affect performance.

If omitted, tertiary analysis will be skipped. | +| Int? | glnexus_mem_gb | Override GLnexus memory; optional | | +| Int? | pbsv_call_mem_gb | Override PBSV call memory; optional | | +| Boolean | gpu | Use GPU when possible

Default: `false` | [GPU support](./gpu#gpu-support) | +| String | backend | Backend where the workflow will be executed

`["GCP", "Azure", "AWS-HealthOmics", "HPC"]` | |
+| String? | zones | Zones where compute will take place; required if backend is set to 'AWS-HealthOmics' or 'GCP'. | [Determining available zones in GCP](./backend-gcp#determining-available-zones) |
+| String? | gpuType | GPU type to use; required if gpu is set to `true` for cloud backends; must match backend | [Available GPU types](./gpu#gpu-types) |
+| String? | container_registry | Container registry where workflow images are hosted.<br>

Default: `"quay.io/pacbio"` | If omitted, [PacBio's public Quay.io registry](https://quay.io/organization/pacbio) will be used.

Custom container_registry must be set if backend is set to 'AWS-HealthOmics'. | +| Boolean | preemptible | Where possible, run tasks preemptibly

`[true, false]`

Default: `true` | If set to `true`, run tasks preemptibly where possible. If set to `false`, on-demand VMs will be used for every task. Ignored if backend is set to HPC. | + +### Family Struct + +The `Family` struct contains the samples for the family. The struct has the following fields: + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| String | family_id | Unique identifier for the family | Alphanumeric characters, periods, dashes, and underscores are allowed. | +| Array\[[Sample](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/humanwgs_structs.wdl#L3)\] | samples | Sample struct with sample specific data and metadata. | [below](#sample-struct) | + +### Sample Struct + +The `Sample` struct contains sample specific data and metadata. The struct has the following fields: + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| String | sample_id | Unique identifier for the sample | Alphanumeric characters, periods, dashes, and underscores are allowed. | +| String? | sex | Sample sex
`["MALE", "FEMALE", null]` | Used by HiFiCNV and TRGT for genotyping. Allosome karyotype will default to XX unless sex is specified as `"MALE"`. Used for tertiary analysis X-linked inheritance filtering. | +| Boolean | affected | Affected status | If set to `true`, sample is described as being affected by all HPO terms in `phenotypes`.
If set to `false`, sample is described as not being affected by all HPO terms in `phenotypes`. | +| Array\[File\] | hifi_reads | Array of paths to HiFi reads in unaligned BAM format. | | +| String? | father_id | sample_id of father (optional) | | +| String? | mother_id | sample_id of mother (optional) | | + +## Outputs + +### Alignments, Coverage, and QC + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| String | workflow_name | Workflow name | | +| String | workflow_version | Workflow version | | +| Array\[String\] | sample_ids | Sample IDs | | +| File | stats_file | Table of summary statistics | | +| Array\[File\] | bam_stats | BAM stats | Per-read length and read-quality | +| Array\[File\] | read_length_plot | Read length plot | | +| Array\[File\] | read_quality_plot | Read quality plot | | +| Array\[File\] | merged_haplotagged_bam | Merged, haplotagged alignments | Includes unmapped reads | +| Array\[File\] | merged_haplotagged_bam_index | | | +| Array\[File\] | mosdepth_summary | Summary of aligned read depth. | | +| Array\[File\] | mosdepth_region_bed | Median aligned read depth by 500bp windows. | | +| Array\[File\] | mosdepth_region_bed_index | | | +| Array\[File\] | mosdepth_depth_distribution_plot | | | +| Array\[File\] | mapq_distribution_plot | Distribution of mapping quality per alignment | | +| Array\[File\] | mg_distribution_plot | Distribution of gap-compressed identity score per alignment | | +| Array\[String\] | stat_num_reads | Number of reads | | +| Array\[String\] | stat_read_length_mean | Mean read length | | +| Array\[String\] | stat_read_length_median | Median read length | | +| Array\[String\] | stat_read_quality_mean | Mean read quality | | +| Array\[String\] | stat_read_quality_median | Median read quality | | +| Array\[String\] | stat_mapped_read_count | Count of reads mapped to reference | | +| Array\[String\] | stat_mapped_percent | Percent of reads mapped to reference | | +| Array\[String\] | inferred_sex | Inferred sex | Sex is inferred based on relative depth of chrY alignments. | +| Array\[String\] | stat_mean_depth | Mean depth | | + +### Small Variants (<50 bp) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | phased_small_variant_vcf | Phased small variant VCF | | +| Array\[File\] | phased_small_variant_vcf_index | | | +| Array\[File\] | small_variant_gvcf | Small variant GVCF | Can be used for joint-calling. | +| Array\[File\] | small_variant_gvcf_index | | | +| Array\[File\] | small_variant_stats | Small variant stats | Generated by `bcftools stats`. | +| Array\[String\] | stat_small_variant_SNV_count | SNV count | (PASS variants) | +| Array\[String\] | stat_small_variant_INDEL_count | INDEL count | (PASS variants) | +| Array\[String\] | stat_small_variant_TSTV_ratio | Ts/Tv ratio | (PASS variants) | +| Array\[String\] | stat_small_variant_HETHOM_ratio | Het/Hom ratio | (PASS variants) | +| Array\[File\] | snv_distribution_plot | Distribution of SNVs by REF, ALT | | +| Array\[File\] | indel_distribution_plot | Distribution of indels by size | | +| File? | joint_small_variants_vcf | Joint-called small variant VCF | | +| File? 
| joint_small_variants_vcf_index | | | + +### Structural Variants (≥50 bp) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | phased_sv_vcf | Phased structural variant VCF | | +| Array\[File\] | phased_sv_vcf_index | Index for phased structural variant VCF | | +| Array\[String\] | stat_sv_DUP_count | Structural variant DUP count | (PASS variants) | +| Array\[String\] | stat_sv_DEL_count | Structural variant DEL count | (PASS variants) | +| Array\[String\] | stat_sv_INS_count | Structural variant INS count | (PASS variants) | +| Array\[String\] | stat_sv_INV_count | Structural variant INV count | (PASS variants) | +| Array\[String\] | stat_sv_BND_count | Structural variant BND count | (PASS variants) | +| Array\[File\] | bcftools_roh_out | ROH calling | `bcftools roh` | +| Array\[File\] | bcftools_roh_bed | Generated from above, without filtering | | +| File? | joint_sv_vcf | Joint-called structural variant VCF | | +| File? | joint_sv_vcf_index | | | + +### Copy Number Variants (≥100 kb) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | cnv_vcf | CNV VCF | | +| Array\[File\] | cnv_vcf_index | Index for CNV VCF | | +| Array\[File\] | cnv_copynum_bedgraph | CNV copy number BEDGraph | | +| Array\[File\] | cnv_depth_bw | CNV depth BigWig | | +| Array\[File\] | cnv_maf_bw | CNV MAF BigWig | | +| Array\[String\] | stat_cnv_DUP_count | Count of DUP events | (for PASS variants) | +| Array\[String\] | stat_cnv_DEL_count | Count of DEL events | (PASS variants) | +| Array\[String\] | stat_cnv_DUP_sum | Sum of DUP bp | (PASS variants) | +| Array\[String\] | stat_cnv_DEL_sum | Sum of DEL bp | (PASS variants) | + +### Tandem Repeat Genotyping + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | phased_trgt_vcf | Phased TRGT VCF | | +| Array\[File\] | phased_trgt_vcf_index | | | +| Array\[File\] | trgt_spanning_reads | TRGT spanning reads | | +| Array\[File\] | trgt_spanning_reads_index | | | +| Array\[String\] | stat_trgt_genotyped_count | Count of genotyped sites | | +| Array\[String\] | stat_trgt_uncalled_count | Count of ungenotyped sites | | + +### Variant Phasing + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | phase_stats | Phasing stats | | +| Array\[File\] | phase_blocks | Phase blocks | | +| Array\[File\] | phase_haplotags | Per-read haplotag assignment | | +| Array\[String\] | stat_phased_basepairs | Count of bp within phase blocks | | +| Array\[String\] | stat_phase_block_ng50 | Phase block NG50 | | + +### Variant Calling in Dark Regions + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | paraphase_output_json | Paraphase output JSON | | +| Array\[File\] | paraphase_realigned_bam | Paraphase realigned BAM | | +| Array\[File\] | paraphase_realigned_bam_index | | | +| Array\[File?\] | paraphase_vcfs | Paraphase VCFs | Compressed as `.tar.gz` | + +### 5mCpG Methylation Calling + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | cpg_hap1_bed | CpG hap1 BED | | +| Array\[File\] | cpg_hap1_bed_index | | | +| Array\[File\] | cpg_hap2_bed | CpG hap2 BED | | +| Array\[File\] | cpg_hap2_bed_index | | | +| Array\[File\] | cpg_combined_bed | CpG combined BED | | +| Array\[File\] | cpg_combined_bed_index | | | +| Array\[File\] | cpg_hap1_bw | CpG hap1 BigWig | | +| Array\[File\] | cpg_hap2_bw | CpG hap2 BigWig | | +| 
Array\[File\] | cpg_combined_bw | CpG combined BigWig | | +| Array\[String\] | stat_cpg_hap1_count | Hap1 CpG count | | +| Array\[String\] | stat_cpg_hap2_count | Hap2 CpG count | | +| Array\[String\] | stat_cpg_combined_count | Combined CpG count | | + +### PGx Typing + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Array\[File\] | pbstarphase_json | PBstarPhase JSON | Haplotype calls for PGx loci | +| Array\[File\] | pharmcat_match_json | PharmCAT match JSON | | +| Array\[File\] | pharmcat_phenotype_json | PharmCAT phenotype JSON | | +| Array\[File\] | pharmcat_report_html | PharmCAT report HTML | | +| Array\[File\] | pharmcat_report_json | PharmCAT report JSON | | + +### Tertiary Analysis + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File? | pedigree | Pedigree file in PLINK PED [format](https://zzz.bwh.harvard.edu/plink/data.shtml#ped) | | +| File? | small_variant_filtered_vcf | Filtered, annotated small variant VCF | | +| File? | small_variant_filtered_vcf_index | | | +| File? | small_variant_filtered_tsv | Filtered, annotated small variant calls | | +| File? | small_variant_compound_het_vcf | Filtered, annotated compound heterozygous small variant VCF | | +| File? | small_variant_compound_het_vcf_index | | | +| File? | small_variant_compound_het_tsv | Filtered, annotated compound heterozygous small variant calls | | +| File? | sv_filtered_vcf | Filtered, annotated structural variant VCF | | +| File? | sv_filtered_vcf_index | | | +| File? | sv_filtered_tsv | Filtered, annotated structural variant TSV | | diff --git a/docs/gpu.md b/docs/gpu.md new file mode 100644 index 00000000..4633e411 --- /dev/null +++ b/docs/gpu.md @@ -0,0 +1,17 @@ +# GPU support + +Starting in workflow version 2.0.0, we have added support for running workflows on GPU-enabled nodes. The first task to take advantage of this is the [`deepvariant_call_variants` task](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/workflows/wdl-common/wdl/workflows/deepvariant/deepvariant.wdl) in the DeepVariant workflow, which can use 1 GPU. To run the DeepVariant workflow on a GPU-enabled node, you will need to provide some additional configuration in your inputs JSON file. + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| Boolean | gpu | Use GPUs. | default = `false` | +| String | gpuType | Type of GPU/Accelerator to use. | This will depend on your backend configuration. | + +## GPU Types + +| Backend | GPU Type | Notes | +| ------- | -------- | ----- | +| AWS-HealthOmics | `["nvidia-tesla-a10g", "nvidia-tesla-t4", "nvidia-tesla-t4-a10g"]` | [GPU availability varies by zone.](https://aws.amazon.com/ec2/instance-types) | +| Azure | | GPU support not yet implemented, but monitoring microsoft/ga4gh-tes#717. | +| GCP | `["nvidia-tesla-t4", "nvidia-tesla-v100"]` | [GPU availability varies by zone.](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones) | +| HPC | | This will depend on HPC and miniwdl or Cromwell configuration. 
Reach out to [support@pacb.com](mailto:support@pacb.com?subject=WDL%20Workflows%20-%20GPU%20Support) | diff --git a/docs/pharmcat.md b/docs/pharmcat.md new file mode 100644 index 00000000..bcd11775 --- /dev/null +++ b/docs/pharmcat.md @@ -0,0 +1,10 @@ +# PharmCat subworkflow + +```mermaid +flowchart TD + phased_vcf[/"phased small variant VCF"/] --> preprocess["pharmcat preprocess"] + aBAM[/"haplotagged BAM"/] --> filter["filter preprocessed VCF"] + preprocess --> filter + filter --> pharmcat["PharmCat"] + pharmcat --> outputs[/"PharmCat outputs"/] +``` diff --git a/docs/ref_map.md b/docs/ref_map.md new file mode 100644 index 00000000..0d6c5e93 --- /dev/null +++ b/docs/ref_map.md @@ -0,0 +1,39 @@ +# Reference Map File Specification + +| Type | Key | Description | Notes | +| ---- | --- | ----------- | ----- | +| String | name | Short name for reference | Alphanumeric characters, underscores, and dashes only. Will be used in file names. | +| File | fasta | Reference genome FASTA | | +| File | fasta_index | Reference genome FASTA index | | +| File | pbsv_splits | Regions for pbsv parallelization | [below](#pbsv_splits) | +| File | pbsv_tandem_repeat_bed | Tandem Repeat BED used by PBSV to normalize SVs within TRs | [link](https://github.com/PacificBiosciences/pbsv/tree/master/annotations) | +| File | trgt_tandem_repeat_bed | Tandem Repeat catalog (BED) for TRGT genotyping | [link](https://github.com/PacificBiosciences/trgt/blob/main/docs/repeat_files.md) | +| File | hificnv_exclude_bed | Regions to be excluded by HIFICNV in gzipped BED format | [link](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md) | +| File | hificnv_exclude_bed_index | BED index | [link](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md) | +| File | hificnv_expected_bed_male | Expected allosome copy number BED for XY samples | [link](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md) | +| File | hificnv_expected_bed_female | Expected allosome copy number BED for XX samples | [link](https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/aux_data.md) | +| File | pharmcat_positions_vcf | PharmCAT positions VCF | | +| File | pharmcat_positions_vcf_index | PharmCAT positions VCF index | | + +## pbsv_splits + +The `pbsv_splits` file is a JSON array of arrays of strings. Each inner array contains one or more chromosome names such that each inner array is of roughly equal size in base pairs. The inner arrays are processed in parallel. For example: + +```json +[ + ... + [ + "chr10", + "chr11" + ], + [ + "chr12", + "chr13" + ], + [ + "chr14", + "chr15" + ], + ... 
+] +``` diff --git a/docs/singleton.md b/docs/singleton.md new file mode 100644 index 00000000..9e627e45 --- /dev/null +++ b/docs/singleton.md @@ -0,0 +1,220 @@ +# singleton.wdl inputs and outputs + +- [singleton.wdl inputs and outputs](#singletonwdl-inputs-and-outputs) + - [DAG (simplified)](#dag-simplified) + - [Inputs](#inputs) + - [Outputs](#outputs) + - [Alignments, Coverage, and QC](#alignments-coverage-and-qc) + - [Small Variants (\<50 bp)](#small-variants-50-bp) + - [Structural Variants (≥50 bp)](#structural-variants-50-bp) + - [Copy Number Variants (≥100 kb)](#copy-number-variants-100-kb) + - [Tandem Repeat Genotyping](#tandem-repeat-genotyping) + - [Variant Phasing](#variant-phasing) + - [Variant Calling in Dark Regions](#variant-calling-in-dark-regions) + - [5mCpG Methylation Calling](#5mcpg-methylation-calling) + - [PGx Typing](#pgx-typing) + - [Tertiary Analysis](#tertiary-analysis) + +## DAG (simplified) + +```mermaid +--- +title: singleton.wdl +--- +flowchart TD + subgraph "`**Upstream of Phasing**`" + subgraph "per-movie" + ubam[/"HiFi uBAM"/] --> pbmm2_align["pbmm2 align"] + pbmm2_align --> pbsv_discover["PBSV discover"] + end + pbmm2_align --> merge_read_stats["merge read statistics"] + pbmm2_align --> samtools_merge["samtools merge"] + samtools_merge --> mosdepth["mosdepth"] + samtools_merge --> paraphase["Paraphase"] + samtools_merge --> hificnv["HiFiCNV"] + samtools_merge --> trgt["TRGT"] + samtools_merge --> trgt_dropouts["TR coverage dropouts"] + samtools_merge --> deepvariant["DeepVariant"] + pbsv_discover --> pbsv_call["PBSV call"] + end + subgraph "`**Phasing and Downstream**`" + deepvariant --> hiphase["HiPhase"] + trgt --> hiphase + pbsv_call --> hiphase + hiphase --> bcftools_roh["bcftools roh"] + hiphase --> bcftools_stats["bcftools stats\n(small variants)"] + hiphase --> sv_stats["SV stats"] + hiphase --> cpg_pileup["5mCpG pileup"] + hiphase --> starphase["StarPhase"] + hiphase --> pharmcat["PharmCat"] + starphase --> pharmcat + end + subgraph "`**Tertiary Analysis**`" + hiphase --> slivar_small_variants["slivar small variants"] + hiphase --> svpack["svpack filter and annotate"] + svpack --> slivar_svpack["slivar svpack tsv"] + end +``` + +## Inputs + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| String | sample_id | Unique identifier for the sample | Alphanumeric characters, periods, dashes, and underscores are allowed. | +| String? | sex | Sample sex
`["MALE", "FEMALE"]` | Used by HiFiCNV and TRGT for genotyping. Allosome karyotype will default to XX unless sex is specified as `"MALE"`. | +| Array\[File\] | hifi_reads | Array of paths to HiFi reads in unaligned BAM format. | | +| File | [ref_map_file](./ref_map) | TSV containing reference genome file paths; must match backend | | +| String? | phenotypes | Comma-delimited list of HPO terms. | [Human Phenotype Ontology (HPO) phenotypes](https://hpo.jax.org/app/) associated with the cohort.

If omitted, tertiary analysis will be skipped. | +| File? | [tertiary_map_file](./tertiary_map) | TSV containing tertiary analysis file paths and thresholds; must match backend | `AF`/`AC`/`nhomalt` thresholds can be modified, but this will affect performance.

If omitted, tertiary analysis will be skipped. | +| Boolean | gpu | Use GPU when possible

Default: `false` | [GPU support](./gpu#gpu-support) | +| String | backend | Backend where the workflow will be executed

`["GCP", "Azure", "AWS-HealthOmics", "HPC"]` | |
+| String? | zones | Zones where compute will take place; required if backend is set to 'AWS-HealthOmics' or 'GCP'. | [Determining available zones in GCP](./backend-gcp#determining-available-zones) |
+| String? | gpuType | GPU type to use; required if gpu is set to `true` for cloud backends; must match backend | [Available GPU types](./gpu#gpu-types) |
+| String? | container_registry | Container registry where workflow images are hosted.<br>

Default: `"quay.io/pacbio"` | If omitted, [PacBio's public Quay.io registry](https://quay.io/organization/pacbio) will be used.

Custom container_registry must be set if backend is set to 'AWS-HealthOmics'. | +| Boolean | preemptible | Where possible, run tasks preemptibly

`[true, false]`

Default: `true` | If set to `true`, run tasks preemptibly where possible. If set to `false`, on-demand VMs will be used for every task. Ignored if backend is set to HPC. | + +## Outputs + +### Alignments, Coverage, and QC + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| String | workflow_name | Workflow name | | +| String | workflow_version | Workflow version | | +| File | stats_file | Table of summary statistics | | +| File | bam_stats | BAM stats | Per-read length and read-quality | +| File | read_length_plot | Read length plot | | +| File | read_quality_plot | Read quality plot | | +| File | merged_haplotagged_bam | Merged, haplotagged alignments | Includes unmapped reads | +| File | merged_haplotagged_bam_index | | | +| File | mosdepth_summary | Summary of aligned read depth. | | +| File | mosdepth_region_bed | Median aligned read depth by 500bp windows. | | +| File | mosdepth_region_bed_index | | | +| File | mosdepth_depth_distribution_plot | | | +| File | mapq_distribution_plot | Distribution of mapping quality per alignment | | +| File | mg_distribution_plot | Distribution of gap-compressed identity score per alignment | | +| String | stat_num_reads | Number of reads | | +| String | stat_read_length_mean | Mean read length | | +| String | stat_read_length_median | Median read length | | +| String | stat_read_quality_mean | Mean read quality | | +| String | stat_read_quality_median | Median read quality | | +| String | stat_mapped_read_count | Count of reads mapped to reference | | +| String | stat_mapped_percent | Percent of reads mapped to reference | | +| String | inferred_sex | Inferred sex | Sex is inferred based on relative depth of chrY alignments. | +| String | stat_mean_depth | Mean depth | | + +### Small Variants (<50 bp) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | phased_small_variant_vcf | Phased small variant VCF | | +| File | phased_small_variant_vcf_index | | | +| File | small_variant_gvcf | Small variant GVCF | Can be used for joint-calling. | +| File | small_variant_gvcf_index | | | +| File | small_variant_stats | Small variant stats | Generated by `bcftools stats`. 
| +| String | stat_small_variant_SNV_count | SNV count | (PASS variants) | +| String | stat_small_variant_INDEL_count | INDEL count | (PASS variants) | +| String | stat_small_variant_TSTV_ratio | Ts/Tv ratio | (PASS variants) | +| String | stat_small_variant_HETHOM_ratio | Het/Hom ratio | (PASS variants) | +| File | snv_distribution_plot | Distribution of SNVs by REF, ALT | | +| File | indel_distribution_plot | Distribution of indels by size | | + +### Structural Variants (≥50 bp) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | phased_sv_vcf | Phased structural variant VCF | | +| File | phased_sv_vcf_index | Index for phased structural variant VCF | | +| String | stat_sv_DUP_count | Structural variant DUP count | (PASS variants) | +| String | stat_sv_DEL_count | Structural variant DEL count | (PASS variants) | +| String | stat_sv_INS_count | Structural variant INS count | (PASS variants) | +| String | stat_sv_INV_count | Structural variant INV count | (PASS variants) | +| String | stat_sv_BND_count | Structural variant BND count | (PASS variants) | +| File | bcftools_roh_out | ROH calling | `bcftools roh` | +| File | bcftools_roh_bed | Generated from above, without filtering | | + +### Copy Number Variants (≥100 kb) + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | cnv_vcf | CNV VCF | | +| File | cnv_vcf_index | Index for CNV VCF | | +| File | cnv_copynum_bedgraph | CNV copy number BEDGraph | | +| File | cnv_depth_bw | CNV depth BigWig | | +| File | cnv_maf_bw | CNV MAF BigWig | | +| String | stat_cnv_DUP_count | Count of DUP events | (for PASS variants) | +| String | stat_cnv_DEL_count | Count of DEL events | (PASS variants) | +| String | stat_cnv_DUP_sum | Sum of DUP bp | (PASS variants) | +| String | stat_cnv_DEL_sum | Sum of DEL bp | (PASS variants) | + +### Tandem Repeat Genotyping + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | phased_trgt_vcf | Phased TRGT VCF | | +| File | phased_trgt_vcf_index | | | +| File | trgt_spanning_reads | TRGT spanning reads | | +| File | trgt_spanning_reads_index | | | +| String | stat_trgt_genotyped_count | Count of genotyped sites | | +| String | stat_trgt_uncalled_count | Count of ungenotyped sites | | + +### Variant Phasing + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | phase_stats | Phasing stats | | +| File | phase_blocks | Phase blocks | | +| File | phase_haplotags | Per-read haplotag assignment | | +| String | stat_phased_basepairs | Count of bp within phase blocks | | +| String | stat_phase_block_ng50 | Phase block NG50 | | + +### Variant Calling in Dark Regions + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | paraphase_output_json | Paraphase output JSON | | +| File | paraphase_realigned_bam | Paraphase realigned BAM | | +| File | paraphase_realigned_bam_index | | | +| File? 
| paraphase_vcfs | Paraphase VCFs | Compressed as `.tar.gz` | + +### 5mCpG Methylation Calling + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | cpg_hap1_bed | CpG hap1 BED | | +| File | cpg_hap1_bed_index | | | +| File | cpg_hap2_bed | CpG hap2 BED | | +| File | cpg_hap2_bed_index | | | +| File | cpg_combined_bed | CpG combined BED | | +| File | cpg_combined_bed_index | | | +| File | cpg_hap1_bw | CpG hap1 BigWig | | +| File | cpg_hap2_bw | CpG hap2 BigWig | | +| File | cpg_combined_bw | CpG combined BigWig | | +| String | stat_cpg_hap1_count | Hap1 CpG count | | +| String | stat_cpg_hap2_count | Hap2 CpG count | | +| String | stat_cpg_combined_count | Combined CpG count | | + +### PGx Typing + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File | pbstarphase_json | PBstarPhase JSON | Haplotype calls for PGx loci | +| File | pharmcat_match_json | PharmCAT match JSON | | +| File | pharmcat_phenotype_json | PharmCAT phenotype JSON | | +| File | pharmcat_report_html | PharmCAT report HTML | | +| File | pharmcat_report_json | PharmCAT report JSON | | + +### Tertiary Analysis + +| Type | Name | Description | Notes | +| ---- | ---- | ----------- | ----- | +| File? | pedigree | Pedigree file in PLINK PED [format](https://zzz.bwh.harvard.edu/plink/data.shtml#ped) | | +| File? | small_variant_filtered_vcf | Filtered, annotated small variant VCF | | +| File? | small_variant_filtered_vcf_index | | | +| File? | small_variant_filtered_tsv | Filtered, annotated small variant calls | | +| File? | small_variant_compound_het_vcf | Filtered, annotated compound heterozygous small variant VCF | | +| File? | small_variant_compound_het_vcf_index | | | +| File? | small_variant_compound_het_tsv | Filtered, annotated compound heterozygous small variant calls | | +| File? | sv_filtered_vcf | Filtered, annotated structural variant VCF | | +| File? | sv_filtered_vcf_index | | | +| File? | sv_filtered_tsv | Filtered, annotated structural variant TSV | | diff --git a/docs/tertiary.md b/docs/tertiary.md new file mode 100644 index 00000000..cd7aac93 --- /dev/null +++ b/docs/tertiary.md @@ -0,0 +1,66 @@ +# tertiary.wdl analysis workflow + +This is a simple, opinionated subworkflow for tertiary analysis in rare disease research. It starts with small variants and structural variants in VCF format, filters to remove variants that are common in the population, annotates with functional impact, and then prioritizes based on the predicted impact on the gene and the gene's relevance to the phenotype. It has been designed for ~30x WGS HiFi for the proband and ~10-30x WGS HiFi for parents (optional). + +## Inputs + +- Small variants and structural variants are provided to this workflow in VCF format. If multiple family members have been sequenced, they are provided as a single joint-called VCF per variant type per family. If only the proband has been sequenced, the VCFs are provided for the proband only. +- We generate a pedigree describing sample relationships and phenotype status, based on the input provided to the entrypoint workflow. In the case of a singleton, the pedigree is a single row. +- Using the comma-delimited list of HPO terms provided to the entrypoint workflow, we generate a Phenotype Rank (Phrank) lookup table, a simple two column lookup table mapping gene symbols to Phrank score. Phrank scores are positive real numbers (or null) such that higher scores indicate a gene is more likely to be relevant to the phenotypes. 
The Phrank lookup is used to prioritize variants based on the predicted impact on the gene and the gene's relevance to the phenotype. Phrank scores are not normalized, and providing more phenotypes for a sample will result in a higher maximum Phrank score.
+- Reference data is provided by the [`ref_map_file`](./ref_map) input. This workflow is currently only compatible with the GRCh38 human reference.
+- Population data, other supplemental data, and allele thresholds are provided by the [`tertiary_map_file`](./tertiary_map) input. We provide a version of this file that uses population data from [gnomAD v4.1](https://gnomad.broadinstitute.org/news/2024-05-gnomad-v4-1-updates/) and [CoLoRSdb](https://colorsdb.org) v1.1.0 [10.5281/zenodo.13145123](https://zenodo.org/records/13145123). We provide the ability to tweak the allele thresholds, but the default values are recommended, as increasing these will result in much higher resource usage.
+
+## Process
+
+### Small variants
+
+We use [`slivar`](https://github.com/brentp/slivar) and [`bcftools csq`](https://samtools.github.io/bcftools/howtos/csq-calling.html) to filter and annotate small variants, and to identify potential compound heterozygous ("comphet") candidate pairs. Slivar uses variant annotations stored in "gnotate" databases. We use the following steps (n.b., some steps are performed within the same command):
+
+1. Ignore variants with a non-passing `FILTER` value.
+2. Ignore variants that are present at > 3% (`slivar_max_af`) in any of the population datasets.
+3. Ignore variants with more than 4 homozygous alternate ("homalt") calls (`slivar_max_nhomalt`) in any of the population datasets. For the purposes of this tool, we count hemizygous ("hemialt") calls on the X chromosome as homalt.
+4. To be tagged as a potential "dominant" variant, the site must be high quality[^1] in all relevant samples, present as homref in all unaffected samples, present as homalt or hetalt in all affected samples, and have allele count < 4 (`slivar_max_ac`) in the population datasets.
+5. To be tagged as a potential "recessive" variant, the site must be high quality[^1] in all relevant samples, present as homalt or hemi in all affected samples, and present as homref or hetalt in all unaffected samples.
+6. To be tagged in comphet analysis, the site must have GQ > 5 (`slivar_min_gq`) and be present as hetalt in all affected samples.
+7. All remaining "tagged" variants are annotated with predicted impact using the Ensembl GFF3 gene set and `bcftools csq` (a sketch of this call appears below). This annotated VCF is provided for downstream analysis.
+8. All variants considered for comphet analysis with high potential impacts[^2] are considered in pairs. If a pair of variants is shown to be _in cis_ according to HiPhase phasing, it is rejected. The passing pairs are stored in a second VCF for downstream analysis.
+
+We use [`slivar tsv`](https://github.com/brentp/slivar/wiki/tsv:-creating-a-spreadsheet-from-a-filtered-VCF) to produce TSVs from the VCFs generated above. These TSVs have many of the relevant fields from the VCF, as well as:
+
+- ClinVar annotations for the gene
+- gnomAD [loss-of-function tolerance metrics](https://gnomad.broadinstitute.org/downloads#v2-lof-curation-results)
+- Phrank scores for the gene
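+
+To make step 7 concrete, the annotation call has roughly this shape (a sketch; the file names are hypothetical placeholders, not the workflow's actual paths):
+
+```bash
+# Annotate filtered small variants with predicted impact from an Ensembl GFF3.
+bcftools csq \
+  --fasta-ref GRCh38.fasta \
+  --gff-annot ensembl.GRCh38.gff3.gz \
+  --output-type z --output small_variants.csq.vcf.gz \
+  small_variants.filtered.vcf.gz
+```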
+### Structural variants
+
+We use [`svpack`](https://github.com/PacificBiosciences/svpack) to filter and annotate SVs, with the following steps.
+
+1. Remove variants with a non-passing `FILTER` value.
+2. Remove variants < 50 bp.
+3. Remove variants that match any existing variant in gnomAD v4.1 or CoLoRSdb. In this case, "match" means that the variant is the same type, the difference in position is <= 100 bp, and the difference in size is <= 100 bp.
+4. Annotate `INFO/BCSQ` with predicted impact using the Ensembl GFF3 gene set.
+5. Annotate `INFO/homalt` and `INFO/hetalt` with the names of samples in this cohort that have the variant in homozygous or heterozygous form, respectively.
+
+We use [`slivar tsv`](https://github.com/brentp/slivar/wiki/tsv:-creating-a-spreadsheet-from-a-filtered-VCF) to produce a TSV of structural variants that impact genes in affected samples. This TSV has many of the relevant fields from the VCF, as well as:
+
+- ClinVar annotations for the gene
+- gnomAD [loss-of-function tolerance metrics](https://gnomad.broadinstitute.org/downloads#v2-lof-curation-results)
+- Phrank scores for the gene
+
+[^1]: High quality is defined as:
+    GQ >= 20 (GQ >= 10 for males on chrX)
+    DP >= 6
+    0.2 <= hetalt AB <= 0.8
+    homref AB < 0.02
+    homalt AB > 0.98
+
+[^2]: For more description of considered impacts, see [`slivar` documentation](https://github.com/brentp/slivar/wiki/compound-heterozygotes). We alter the default "skip" list to:
+    non_coding_transcript
+    intron
+    non_coding
+    upstream_gene
+    downstream_gene
+    non_coding_transcript_exon
+    NMD_transcript
+    5_prime_UTR
+    3_prime_UTR
diff --git a/docs/tertiary_map.md b/docs/tertiary_map.md
new file mode 100644
index 00000000..1dd57bb2
--- /dev/null
+++ b/docs/tertiary_map.md
@@ -0,0 +1,20 @@
+# Tertiary Map File Specification
+
+| Type | Key | Description | Notes |
+| ---- | --- | ----------- | ----- |
+| File | slivar_js | slivar functions | [link](https://raw.githubusercontent.com/brentp/slivar/91a40d582805d6607fa8a76a8fce15fd2e4be3b8/js/slivar-functions.js) |
+| File | ensembl_gff | [Ensembl](https://useast.ensembl.org/index.html) GFF3 reference annotation | |
+| File | lof_lookup | Path to table of loss-of-function scores per gene | |
+| File | clinvar_lookup | Path to table of ClinVar annotations per gene | |
+| File | slivar_gnotate_files | Comma-delimited array of population dataset allele frequencies in [`slivar gnotate`](https://github.com/brentp/slivar/wiki/gnotate) format | |
+| String | slivar_gnotate_prefixes | Comma-delimited array of prefixes to `_af`, `_nhomalt`, and `_ac` in `slivar_gnotate_files` | |
+| String (Float) [^1] | slivar_max_af | Maximum allele frequency within population for small variants | |
+| String (Int) [^2] | slivar_max_nhomalt | Maximum number of homozygous alternate alleles within population for small variants | |
+| String (Int) [^2] | slivar_max_ac | Maximum allele count within population for small variants | |
+| String (Int) [^2] | slivar_min_gq | Minimum genotype quality for small variants to be considered for compound heterozygous pairs | |
+| String | svpack_pop_vcfs | Comma-delimited array of structural variant population VCF paths | |
+| String | svpack_pop_vcf_indices | Comma-delimited array of structural variant population VCF index paths | |
+
+[^1]: Technically this value is interpreted as String by WDL, but slivar expects a Float, e.g., `0.03`.
+
+[^2]: Technically these values are interpreted as String by WDL, but slivar expects an Int.
diff --git a/docs/tools_containers.md b/docs/tools_containers.md
new file mode 100644
index 00000000..2859cf05
--- /dev/null
+++ b/docs/tools_containers.md
@@ -0,0 +1,29 @@
+# Tool versions and Containers
+
+Containers are used to package tools and their dependencies. This ensures that the tools are reproducible and can be run on any system that supports the container runtime. Our containers are built using [Docker](https://www.docker.com/) and are compatible with any container runtime that supports the OCI Image Specification, such as [Singularity](https://sylabs.io/singularity/) or [Podman](https://podman.io/).
+
+Most of our containers are built on the `pb_wdl_base` container, which includes common bioinformatics tools and libraries. We tag our containers with a version number and build count, but the containers are referenced within the WDL files by their sha256 digests for reproducibility and better compatibility with Cromwell and miniwdl call caching.
+
+Our Dockerfiles can be inspected on GitHub, and the containers can be pulled from our [Quay.io organization](https://quay.io/pacbio).
+
+We directly use the `deepvariant`, `deepvariant-gpu`, `pharmcat`, and `glnexus` containers from their respective authors, although we have mirrored some for better compatibility with Cromwell call caching.
+
+| Container | Major tool versions | Dockerfile | Image |
+| --------: | ------------------- | :---: | :---: |
+| pb_wdl_base |<br>
  • htslib 1.20
  • bcftools 1.20
  • samtools 1.20
  • bedtools 2.31.0
  • python3.9
  • numpy 1.24.4<br>
  • pandas 2.0.3
  • matplotlib 3.7.5
  • seaborn 0.13.2
  • pysam 0.22.1
  • vcfpy 0.13.8
  • biopython 1.83
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/6b13cc246dd44e41903d17a660bb5432cdd18dbe/docker/pb_wdl_base) | [sha256:4b889a1f21a6a7fecf18820613cf610103966a93218de772caba126ab70a8e87](https://quay.io/pacbio/pb_wdl_base/manifest/pb_wdl_base@sha256:4b889a1f21a6a7fecf18820613cf610103966a93218de772caba126ab70a8e87) | +| pbmm2 |
  • pbmm2 1.16.0
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/44df87558e18ce9d3b65f3ede9c7ba1513669ccb/docker/pbmm2) | [pbmm2@sha256:24218cb5cbc68d1fd64db14a9dc38263d3d931c74aca872c998d12ef43020ef0](https://quay.io/pacbio/pbmm2/manifest/sha256:24218cb5cbc68d1fd64db14a9dc38263d3d931c74aca872c998d12ef43020ef0) | +| mosdepth |
  • mosdepth 0.3.9
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/fa84fbf582738c05c750e667ff43d11552ad4183/docker/mosdepth) | [mosdepth@sha256:63f7a5d1a4a17b71e66d755d3301a951e50f6b63777d34dab3ee9e182fd7acb1](https://quay.io/pacbio/mosdepth/manifest/sha256:63f7a5d1a4a17b71e66d755d3301a951e50f6b63777d34dab3ee9e182fd7acb1) | +| pbsv |
  • pbsv 2.10.0
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/e82dddf32b042e985a5d66d0ebe25ca57058e61c/docker/pbsv) | [pbsv@sha256:3a8529853c1e214809dcdaacac0079de70d0c037b41b43bb8ba7c3fc5f783e26](https://quay.io/pacbio/pbsv/manifest/sha256:3a8529853c1e214809dcdaacac0079de70d0c037b41b43bb8ba7c3fc5f783e26) | +| trgt |
  • trgt 1.2.0
  • `/opt/scripts/check_trgt_coverage.py` 0.1.0
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/ed658e93fc51229f20415e0784dc242a8e4ef66a/docker/trgt) | [trgt@sha256:0284ff5756f8d47d9d81b515b8b1a6c81fac862ae5a7b4fe89f65235c3e5e0c9](https://quay.io/pacbio/trgt/manifest/sha256:0284ff5756f8d47d9d81b515b8b1a6c81fac862ae5a7b4fe89f65235c3e5e0c9) | +| hiphase |
  • hiphase 1.4.5
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/1051d12818e165a2145526e0b58f0ed0d0dc023a/docker/hiphase) | [hiphase@sha256:47fe7d42aea6b1b2e6d3c7401bc35a184464c3f647473d0525c00f3c968b40ad](https://quay.io/pacbio/hiphase/manifest/sha256:47fe7d42aea6b1b2e6d3c7401bc35a184464c3f647473d0525c00f3c968b40ad) | +| hificnv |
  • hificnv 1.0.1
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/a58f8b44cf8fd09c39c90e07076dbb418188084d/docker/hificnv) | [hificnv@sha256:c4764a70c8c2028edb1cdb4352997269947c5076ddd1aeaeef6c5076c630304d](https://quay.io/pacbio/hificnv/manifest/sha256:c4764a70c8c2028edb1cdb4352997269947c5076ddd1aeaeef6c5076c630304d) | +| paraphase |
  • paraphase 3.1.1
  • minimap2 2.28<br>
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/6b13cc246dd44e41903d17a660bb5432cdd18dbe/docker/paraphase) | [paraphase@sha256:a114ac5b9a682d7dc0fdf25c92cfb36f80c07ab4f1fb76b2e58092521b123a4d](https://quay.io/pacbio/paraphase/manifest/sha256:a114ac5b9a682d7dc0fdf25c92cfb36f80c07ab4f1fb76b2e58092521b123a4d) | +| pbstarphase |
  • pbstarphase 1.0.0
  • Database 20240826
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/813c7dc3143b91c34754d768c3e27a46355bb3e5/docker/pbstarphase) | [pbstarphase@sha256:6954d6f7e462c9cec7aaf7ebb66efaf13d448239aab76a3c947c1dfe24859686](https://quay.io/pacbio/pbstarphase/manifest/sha256:6954d6f7e462c9cec7aaf7ebb66efaf13d448239aab76a3c947c1dfe24859686) | +| pb-cpg-tools |
  • pb-cpg-tools 2.3.2
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/6b13cc246dd44e41903d17a660bb5432cdd18dbe/docker/pb-cpg-tools) | [pb-cpg-tools@sha256:d6e63fe3f6855cfe60f573de1ca85fab27f4a68e24a7f5691a7a805a22af292d](https://quay.io/pacbio/pb-cpg-tools/manifest/sha256:d6e63fe3f6855cfe60f573de1ca85fab27f4a68e24a7f5691a7a805a22af292d) | +| wgs_tertiary |
  • `/opt/scripts/calculate_phrank.py` 2.0.0
  • `/opt/scripts/json2ped.py` 0.5.0
Last built 2021-09-17:
  • ensembl -> HGNC
  • ensembl -> HPO
  • HGNC -> inheritance
  • HPO DAG
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/fd70e2872bd3c6bb705faff5bc68374116d7d62f/docker/wgs_tertiary) | [wgs_tertiary@sha256:410597030e0c85cf16eb27a877d260e7e2824747f5e8b05566a1aaa729d71136](https://quay.io/pacbio/wgs_tertiary/manifest/sha256:410597030e0c85cf16eb27a877d260e7e2824747f5e8b05566a1aaa729d71136) | +| slivar |
  • slivar 0.3.1
  • `/opt/scripts/add_comphet_phase.py` 0.1.0
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/5e1094fd6755203b4971fdac6dcb951bbc098bed/docker/slivar) | [slivar@sha256:35be557730d3ac9e883f1c2010fb24ac02631922f9b4948b0608d3e643a46e8b](https://quay.io/pacbio/slivar/manifest/sha256:35be557730d3ac9e883f1c2010fb24ac02631922f9b4948b0608d3e643a46e8b) | +| svpack |
  • svpack 54b54db
| [Dockerfile](https://github.com/PacificBiosciences/wdl-dockerfiles/tree/6fc750b0c65b4a5c1eb65791eab9eed89864d858/docker/svpack) | [svpack@sha256:628e9851e425ed8044a907d33de04043d1ef02d4d2b2667cf2e9a389bb011eba](https://quay.io/pacbio/svpack/manifest/sha256:628e9851e425ed8044a907d33de04043d1ef02d4d2b2667cf2e9a389bb011eba) | +| deepvariant |
  • DeepVariant 1.6.1
| | [deepvariant:1.6.1](https://hub.docker.com/layers/google/deepvariant/1.6.1/images/sha256-ccab95548e6c3ec28c75232987f31209ff1392027d67732435ce1ba3d0b55c68) | +| deepvariant-gpu |
  • DeepVariant 1.6.1
| | [deepvariant:1.6.1-gpu](https://hub.docker.com/layers/google/deepvariant/1.6.1-gpu/images/sha256-7929c55106d3739daa18d52802913c43af4ca2879db29656056f59005d1d46cb) | +| pharmcat |
  • PharmCat 2.15.4
| | [pharmcat:2.15.4](https://hub.docker.com/layers/pgkb/pharmcat/2.15.4/images/sha256-5b58ae959b4cd85986546c2d67e3596f33097dedc40dfe57dd845b6e78781eb6) | +| glnexus |
  • GLnexus 1.4.3
| | [glnexus:1.4.3](https://quay.io/pacbio/glnexus/manifest/sha256:ce6fecf59dddc6089a8100b31c29c1e6ed50a0cf123da9f2bc589ee4b0c69c8e) | diff --git a/scripts/create_readme_and_adjust_links.sh b/scripts/update_docs.sh similarity index 50% rename from scripts/create_readme_and_adjust_links.sh rename to scripts/update_docs.sh index bd6117b9..2f93c8d3 100644 --- a/scripts/create_readme_and_adjust_links.sh +++ b/scripts/update_docs.sh @@ -3,4 +3,6 @@ set -e # create a README.md file from the Home.md file in the wiki directory # with correct relative links -sed 's!(\./!(./wiki/!g;s!\.\./\.\.!../../..!g;' ./wiki/Home.md > README.md +sed 's!(\./!(./docs/!g;s!\.\./\.\.!../../..!g;' "$1"/Home.md > README.md +cp "$1"/*.md docs/ +rm docs/Home.md docs/_Sidebar.md \ No newline at end of file diff --git a/wiki b/wiki deleted file mode 160000 index e8281ed6..00000000 --- a/wiki +++ /dev/null @@ -1 +0,0 @@ -Subproject commit e8281ed6931a4a7fe8a4b6121261d97d665a0076
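
For reference, the renamed script takes the path to a local checkout of the wiki as its only argument. A usage sketch (the `.wiki.git` clone URL follows GitHub's standard convention for wiki repositories):

```bash
# Clone the wiki, then regenerate README.md and docs/ from its pages.
git clone https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.wiki.git /tmp/wgs-wiki
./scripts/update_docs.sh /tmp/wgs-wiki
```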