Merge pull request #155 from PacificBiosciences/develop-v2
v2.0.3
williamrowell authored Nov 14, 2024
2 parents 0aba3fb + c970f6b commit 0c9571c
Showing 22 changed files with 821 additions and 19 deletions.
4 changes: 1 addition & 3 deletions .gitignore
@@ -1,9 +1,7 @@
inputs.test_data*.json
.wdltest*

dependencies.zip
hifi-human-wgs-wdl-singleton.zip
hifi-human-wgs-wdl-family.zip
*.zip

Makefile
.env
22 changes: 11 additions & 11 deletions README.md
@@ -24,18 +24,18 @@ Both workflows are designed to analyze human PacBio whole genome sequencing (WGS

This is an actively developed workflow with multiple versioned releases, and we use git submodules for common tasks shared by multiple workflows. There are two ways to ensure you are using a supported release of the workflow with correctly initialized submodules:

1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.2):
1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.3):

```bash
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-family.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-family.zip
```

2) Clone the repository and initialize the submodules:

```bash
git clone \
--depth 1 --branch v2.0.2 \
--depth 1 --branch v2.0.3 \
--recursive \
https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.git
```
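
If the repository was cloned without `--recursive`, the submodules can still be initialized afterwards; a minimal sketch:

```bash
# fetch and check out the git submodules that provide the shared tasks
git submodule update --init --recursive
```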
@@ -63,10 +63,10 @@ The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will

For backend-specific configuration, see the relevant documentation:

- [Azure](./wiki/backend-azure)
- [AWS](./wiki/backend-aws-healthomics)
- [GCP](./wiki/backend-gcp)
- [HPC](./wiki/backend-hpc)
- [Azure](./docs/backend-azure)
- [AWS](./docs/backend-aws-healthomics)
- [GCP](./docs/backend-gcp)
- [HPC](./docs/backend-hpc)

### Configuring a workflow engine and container runtime

@@ -76,7 +76,7 @@ Because workflow dependencies are containerized, a container runtime is required

See the backend-specific documentation for details on setting up an engine.

| Engine | [Azure](./wiki/backend-azure) | [AWS](./wiki/backend-aws-healthomics) | [GCP](./wiki/backend-gcp) | [HPC](./wiki/backend-hpc) |
| Engine | [Azure](./docs/backend-azure) | [AWS](./docs/backend-aws-healthomics) | [GCP](./docs/backend-gcp) | [HPC](./docs/backend-hpc) |
| :- | :- | :- | :- | :- |
| [**miniwdl**](https://github.com/chanzuckerberg/miniwdl#scaling-up) | _Unsupported_ | Supported via [AWS HealthOmics](https://aws.amazon.com/healthomics/) | _Unsupported_ | (SLURM only) Supported via the [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm) plugin |
| [**Cromwell**](https://cromwell.readthedocs.io/en/stable/backends/Backends/) | Supported via [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) | _Unsupported_ | Supported via Google's [Pipelines API](https://cromwell.readthedocs.io/en/stable/backends/Google/) | Supported - [Configuration varies depending on HPC infrastructure](https://cromwell.readthedocs.io/en/stable/tutorials/HPCIntro/) |
@@ -118,7 +118,7 @@ If Cromwell is running in server mode, the workflow can be submitted using cURL.
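
A minimal sketch of such a submission, assuming a Cromwell server listening on `localhost:8000`, an inputs file named `singleton.inputs.json`, and a `dependencies.zip` built from this repository (names and port are illustrative):

```bash
# submit the singleton workflow to a running Cromwell server via its REST API
curl -X POST "http://localhost:8000/api/workflows/v1" \
  -F workflowSource=@workflows/singleton.wdl \
  -F workflowInputs=@singleton.inputs.json \
  -F workflowDependencies=@dependencies.zip
```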

This section describes the inputs required for a run of the workflow. Typically, only the sample-specific sections need to be filled out by the user for each run. Input templates with reference file locations filled out are provided [for each backend](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends).

Workflow inputs for each entrypoint are described in [singleton](./wiki/singleton) and [family](./wiki/family) documentation.
Workflow inputs for each entrypoint are described in [singleton](./docs/singleton) and [family](./docs/family) documentation.

At a high level, we have two types of inputs files:

@@ -136,7 +136,7 @@ Docker image definitions used by this workflow can be found in [the wdl-dockerf
The Docker image used by a particular step of the workflow can be identified by looking at the `docker` key in the `runtime` block for the given task. Images can be referenced in the following table by looking for the name after the final `/` character and before the `@sha256:...`. For example, the image referred to here is "pb_wdl_base":
> ~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87
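
One rough way to list every image referenced by the workflow is to grep the task `runtime` blocks; a sketch, assuming the WDL sources live under `workflows/` and use the `docker` key as described above:

```bash
# list unique container image references across all WDL tasks
grep -rho 'docker: *"[^"]*"' workflows/ | sort -u
```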
Tool versions and Docker images used in these workflows can be found in the [tools and containers](./wiki/tools_containers) documentation.
Tool versions and Docker images used in these workflows can be found in the [tools and containers](./docs/tools_containers) documentation.

---

1 change: 1 addition & 0 deletions docs/backend-aws-healthomics.md
@@ -0,0 +1 @@
# TBD
27 changes: 27 additions & 0 deletions docs/backend-azure.md
@@ -0,0 +1,27 @@
# Configuring Cromwell on Azure

Workflows can be run in Azure by setting up [Cromwell on Azure (CoA)](https://github.com/microsoft/CromwellOnAzure). Documentation on deploying and configuring an instance of CoA can be found [here](https://github.com/microsoft/CromwellOnAzure/wiki/Deploy-your-instance-of-Cromwell-on-Azure).

## Requirements

- [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) version 3.2+; version 4.0+ is recommended

## Configuring and running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json).

See [the inputs section of the singleton documentation](./singleton#inputs) for more information on the structure of the inputs.json file.

### Running via Cromwell on Azure

Instructions for running a workflow from Cromwell on Azure are described in [the Cromwell on Azure documentation](https://github.com/microsoft/CromwellOnAzure/wiki/Running-Workflows).

## Reference data hosted in Azure

To use Azure reference data, add the following line to your `containers-to-mount` file in your Cromwell on Azure installation ([more info here](https://github.com/microsoft/CromwellOnAzure/blob/develop/docs/troubleshooting-guide.md#use-input-data-files-from-an-existing-azure-storage-account-that-my-lab-or-team-is-currently-using)):

`https://datasetpbrarediseases.blob.core.windows.net/dataset?si=public&spr=https&sv=2021-06-08&sr=c&sig=o6OkcqWWlGcGOOr8I8gCA%2BJwlpA%2FYsRz0DMB8CCtCJk%3D`

The [Azure input file template](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json) has paths to the reference files in this blob storage prefilled.
31 changes: 31 additions & 0 deletions docs/backend-gcp.md
@@ -0,0 +1,31 @@
# Configuring Cromwell on GCP

[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used to set up the resources needed to run the workflow.

## Configuring and running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/gcp/singleton.gcp.inputs.json).

See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.

#### Determining available zones

To determine available zones in GCP, run the following; available zones within a region can be found in the first column of the output:

```bash
gcloud compute zones list | grep <region>
```

For example, the zones in region `us-central1` are `us-central1-a`, `us-central1-b`, `us-central1-c`, and `us-central1-f`.

## Running the workflow via Google's genomics Pipelines API

[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used as an example for how to run the workflow.

## Reference data hosted in GCP

GCP reference data is hosted in the `us-west1` region in the bucket `gs://pacbio-wdl`. This bucket is requester-pays, meaning that users will need to [provide a billing project in their Cromwell configuration](https://cromwell.readthedocs.io/en/stable/filesystems/GoogleCloudStorage/) in order to use files located in this bucket.
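
To confirm that requester-pays access and billing are set up correctly, the bucket can be listed while supplying a billing project; a sketch, assuming `gsutil` is installed and `<project-id>` is your billing project:

```bash
# list the requester-pays reference bucket, billing the request to <project-id>
gsutil -u <project-id> ls gs://pacbio-wdl/
```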

To avoid egress charges, Cromwell should be set up to spin up compute resources in the same region in which the data is located. If possible, add cohort data to the same region as the reference dataset, or consider mirroring this dataset in the region where your data is located. See [Google's documentation on data storage and egress charges](https://cloud.google.com/storage/pricing) for more information.
52 changes: 52 additions & 0 deletions docs/backend-hpc.md
@@ -0,0 +1,52 @@
# Installing and configuring for HPC backends

Either `miniwdl` or `Cromwell` can be used to run workflows on the HPC.

## Installing and configuring `miniwdl`

### Requirements

- [`miniwdl`](https://github.com/chanzuckerberg/miniwdl) >= 1.9.0
- [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm)

### Configuration

An [example miniwdl.cfg file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/miniwdl.cfg) is provided in this repository. Place it at `~/.config/miniwdl.cfg` and edit it to match your SLURM configuration; this allows running workflows with a basic SLURM setup.
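
A minimal sketch of putting the example config in place, assuming you are working from a clone of this repository:

```bash
# copy the example config into place, then edit partition/account settings to match your cluster
mkdir -p ~/.config
cp backends/hpc/miniwdl.cfg ~/.config/miniwdl.cfg
```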

## Installing and configuring `Cromwell`

Cromwell supports a number of different HPC backends; see [Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/backends/HPC/) for more information on configuring each of them. Cromwell can be used in a standalone "run" mode, or in "server" mode to allow multiple users to submit workflows. The example commands below use "run" mode.

## Running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/singleton.hpc.inputs.json). Once you have downloaded the reference data bundle, ensure that you have replaced the `<local_path_prefix>` in the input template file with the local path to the reference datasets on your HPC.
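
A sketch of filling in the placeholder, assuming the reference bundle was extracted to `/path/to/hifi-wdl-resources-v2.0.0` (adjust the path to your system):

```bash
# substitute the local reference data location into the HPC input template
sed -i 's|<local_path_prefix>|/path/to/hifi-wdl-resources-v2.0.0|g' backends/hpc/singleton.hpc.inputs.json
```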

See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.

### Running via miniwdl

```bash
miniwdl run workflows/singleton.wdl --input <inputs_json_file>
```

### Running via Cromwell

```bash
cromwell run workflows/singleton.wdl --inputs <inputs_json_file>
```
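
If Cromwell is installed as a jar rather than a wrapper script, the equivalent invocation typically looks like the sketch below (the jar filename will vary by installation):

```bash
java -jar cromwell.jar run workflows/singleton.wdl --inputs <inputs_json_file>
```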

## Reference data bundle

[<img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14027047.svg" alt="10.5281/zenodo.14027047">](https://zenodo.org/records/14027047)

Reference data is hosted on Zenodo at [10.5281/zenodo.14027047](https://zenodo.org/record/14027047). Download the reference data bundle and extract it to a location on your HPC, then update the input template file with the path to the reference data.

```bash
## download the reference data bundle
wget https://zenodo.org/record/14027047/files/hifi-wdl-resources-v2.0.0.tar

## extract the reference data bundle
tar -xvf hifi-wdl-resources-v2.0.0.tar
```
3 changes: 3 additions & 0 deletions docs/backends.md
@@ -0,0 +1,3 @@
- [hpc](./backend-hpc)
- [azure](./backend-azure)
- [gcp](./backend-gcp)
14 changes: 14 additions & 0 deletions docs/bam_stats.md
@@ -0,0 +1,14 @@
# bam_stats outputs

```wdl
{sample}.{movie}.read_length_and_quality.tsv.gz - per-read length and quality metrics
```

## `{sample}.{movie}.read_length_and_quality.tsv.gz` - per-read length and quality metrics

Base metrics are extracted for each read from the uBAM and stored in these 4 columns:

- movie
- read name
- read length: length of query sequence
- read quality: transformation of the `rq` tag into Phred (log) space, e.g., `rq:f:0.99` (99% accuracy, 1 error in 100 bases) is Phred 20 ($-10 \times \log_{10}(1 - 0.99)$); this value is capped at Phred 60 for `rq:f:1.0` (general form shown below)
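
Stated generally (with the cap noted above applied when `rq` is 1.0):

$$\text{Phred}(rq) = -10 \times \log_{10}(1 - rq)$$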
15 changes: 15 additions & 0 deletions docs/deepvariant.md
@@ -0,0 +1,15 @@
# DeepVariant subworkflow

```mermaid
flowchart TD
aBAM[/"HiFi aBAM"/] --> make_examples["DeepVariant make_examples"]
make_examples --> gpu{"gpu?"}
gpu -- yes --> call_variants_gpu["DeepVariant call_variants_gpu"]
gpu -- no --> call_variants_cpu["DeepVariant call_variants_cpu"]
call_variants_gpu --> postprocess_variants["DeepVariant postprocess_variants"]
call_variants_cpu --> postprocess_variants
postprocess_variants --> vcf[/"small variant VCF"/]
postprocess_variants --> gvcf[/"small variant gVCF"/]
```

This subworkflow runs the three steps of DeepVariant individually to make the best use of resources. If a GPU is available and `gpu==true`, the `call_variants` step will run on 1 GPU and 8 CPU threads; otherwise, it will run on 64 CPU threads. The `make_examples` and `postprocess_variants` steps always run on the CPU.