Merge pull request #155 from PacificBiosciences/develop-v2
v2.0.3
williamrowell authored Nov 14, 2024
2 parents 0aba3fb + c970f6b commit 0c9571c
Showing 22 changed files with 821 additions and 19 deletions.
4 changes: 1 addition & 3 deletions .gitignore
@@ -1,9 +1,7 @@
inputs.test_data*.json
.wdltest*

dependencies.zip
hifi-human-wgs-wdl-singleton.zip
hifi-human-wgs-wdl-family.zip
*.zip

Makefile
.env
22 changes: 11 additions & 11 deletions README.md
@@ -24,18 +24,18 @@ Both workflows are designed to analyze human PacBio whole genome sequencing (WGS

This is an actively developed workflow with multiple versioned releases, and we use git submodules for common tasks shared by multiple workflows. There are two ways to ensure you are using a supported release of the workflow with correctly initialized submodules:

1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.2):
1) Download the release zips directly from a [supported release](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/tag/v2.0.3):

```bash
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.2/hifi-human-wgs-family.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.3/hifi-human-wgs-family.zip
```

2) Clone the repository and initialize the submodules:

```bash
git clone \
--depth 1 --branch v2.0.2 \
--depth 1 --branch v2.0.3 \
--recursive \
https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.git
```
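
If the repository was cloned without `--recursive`, the submodules can still be initialized afterwards; a minimal sketch:

```bash
# fetch and check out the git submodules that provide the shared tasks
git submodule update --init --recursive
```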
@@ -63,10 +63,10 @@ The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will

For backend-specific configuration, see the relevant documentation:

- [Azure](./wiki/backend-azure)
- [AWS](./wiki/backend-aws-healthomics)
- [GCP](./wiki/backend-gcp)
- [HPC](./wiki/backend-hpc)
- [Azure](./docs/backend-azure)
- [AWS](./docs/backend-aws-healthomics)
- [GCP](./docs/backend-gcp)
- [HPC](./docs/backend-hpc)

### Configuring a workflow engine and container runtime

@@ -76,7 +76,7 @@ Because workflow dependencies are containerized, a container runtime is required

See the backend-specific documentation for details on setting up an engine.

| Engine | [Azure](./wiki/backend-azure) | [AWS](./wiki/backend-aws-healthomics) | [GCP](./wiki/backend-gcp) | [HPC](./wiki/backend-hpc) |
| Engine | [Azure](./docs/backend-azure) | [AWS](./docs/backend-aws-healthomics) | [GCP](./docs/backend-gcp) | [HPC](./docs/backend-hpc) |
| :- | :- | :- | :- | :- |
| [**miniwdl**](https://github.com/chanzuckerberg/miniwdl#scaling-up) | _Unsupported_ | Supported via [AWS HealthOmics](https://aws.amazon.com/healthomics/) | _Unsupported_ | (SLURM only) Supported via the [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm) plugin |
| [**Cromwell**](https://cromwell.readthedocs.io/en/stable/backends/Backends/) | Supported via [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) | _Unsupported_ | Supported via Google's [Pipelines API](https://cromwell.readthedocs.io/en/stable/backends/Google/) | Supported - [Configuration varies depending on HPC infrastructure](https://cromwell.readthedocs.io/en/stable/tutorials/HPCIntro/) |
@@ -118,7 +118,7 @@ If Cromwell is running in server mode, the workflow can be submitted using cURL.
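
A minimal sketch of such a submission, assuming a Cromwell server listening on `localhost:8000`, an inputs file named `singleton.inputs.json`, and a `dependencies.zip` built from this repository (names and port are illustrative):

```bash
# submit the singleton workflow to a running Cromwell server via its REST API
curl -X POST "http://localhost:8000/api/workflows/v1" \
  -F workflowSource=@workflows/singleton.wdl \
  -F workflowInputs=@singleton.inputs.json \
  -F workflowDependencies=@dependencies.zip
```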

This section describes the inputs required for a run of the workflow. Typically, only the sample-specific sections need to be filled out by the user for each run. Input templates with reference file locations filled out are provided [for each backend](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends).

Workflow inputs for each entrypoint are described in [singleton](./wiki/singleton) and [family](./wiki/family) documentation.
Workflow inputs for each entrypoint are described in [singleton](./docs/singleton) and [family](./docs/family) documentation.

At a high level, we have two types of inputs files:

@@ -136,7 +136,7 @@ Docker image definitions used by this workflow can be found in [the wdl-dockerf
The Docker image used by a particular step of the workflow can be identified by looking at the `docker` key in the `runtime` block for the given task. Images can be referenced in the following table by looking for the name after the final `/` character and before the `@sha256:...`. For example, the image referred to here is "pb_wdl_base":
> ~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87
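
One rough way to list every image referenced by the workflow is to grep the task `runtime` blocks; a sketch, assuming the WDL sources live under `workflows/` and use the `docker` key as described above:

```bash
# list unique container image references across all WDL tasks
grep -rho 'docker: *"[^"]*"' workflows/ | sort -u
```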
Tool versions and Docker images used in these workflows can be found in the [tools and containers](./wiki/tools_containers) documentation.
Tool versions and Docker images used in these workflows can be found in the [tools and containers](./docs/tools_containers) documentation.

---

1 change: 1 addition & 0 deletions docs/backend-aws-healthomics.md
@@ -0,0 +1 @@
# TBD
27 changes: 27 additions & 0 deletions docs/backend-azure.md
@@ -0,0 +1,27 @@
# Configuring Cromwell on Azure

Workflows can be run in Azure by setting up [Cromwell on Azure (CoA)](https://github.com/microsoft/CromwellOnAzure). Documentation on deploying and configuring an instance of CoA can be found [here](https://github.com/microsoft/CromwellOnAzure/wiki/Deploy-your-instance-of-Cromwell-on-Azure).

## Requirements

- [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure) version 3.2+; version 4.0+ is recommended

## Configuring and running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json).

See [the inputs section of the singleton documentation](./singleton#inputs) for more information on the structure of the inputs.json file.

### Running via Cromwell on Azure

Instructions for running a workflow from Cromwell on Azure are described in [the Cromwell on Azure documentation](https://github.com/microsoft/CromwellOnAzure/wiki/Running-Workflows).

## Reference data hosted in Azure

To use Azure reference data, add the following line to your `containers-to-mount` file in your Cromwell on Azure installation ([more info here](https://github.com/microsoft/CromwellOnAzure/blob/develop/docs/troubleshooting-guide.md#use-input-data-files-from-an-existing-azure-storage-account-that-my-lab-or-team-is-currently-using)):

`https://datasetpbrarediseases.blob.core.windows.net/dataset?si=public&spr=https&sv=2021-06-08&sr=c&sig=o6OkcqWWlGcGOOr8I8gCA%2BJwlpA%2FYsRz0DMB8CCtCJk%3D`

The [Azure input file template](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/azure/singleton.azure.inputs.json) has paths to the reference files in this blob storage prefilled.
31 changes: 31 additions & 0 deletions docs/backend-gcp.md
@@ -0,0 +1,31 @@
# Configuring Cromwell on GCP

[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used to set up the resources needed to run the workflow.

## Configuring and running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/gcp/singleton.gcp.inputs.json).

See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.

#### Determining available zones

To determine available zones in GCP, run the following; available zones within a region can be found in the first column of the output:

```bash
gcloud compute zones list | grep <region>
```

For example, the zones in region `us-central1` are `us-central1-a`, `us-central1-b`, `us-central1-c`, and `us-central1-f`.

## Running the workflow via Google's genomics Pipelines API

[Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/tutorials/PipelinesApi101/) on getting started with Google's genomics Pipelines API can be used as an example for how to run the workflow.

## Reference data hosted in GCP

GCP reference data is hosted in the `us-west1` region in the bucket `gs://pacbio-wdl`. This bucket is requester-pays, meaning that users will need to [provide a billing project in their Cromwell configuration](https://cromwell.readthedocs.io/en/stable/filesystems/GoogleCloudStorage/) in order to use files located in this bucket.
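
To confirm that requester-pays access and billing are set up correctly, the bucket can be listed while supplying a billing project; a sketch, assuming `gsutil` is installed and `<project-id>` is your billing project:

```bash
# list the requester-pays reference bucket, billing the request to <project-id>
gsutil -u <project-id> ls gs://pacbio-wdl/
```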

To avoid egress charges, Cromwell should be set up to spin up compute resources in the same region in which the data is located. If possible, add cohort data to the same region as the reference dataset, or consider mirroring this dataset in the region where your data is located. See [Google's documentation on data storage and egress charges](https://cloud.google.com/storage/pricing) for more information.
52 changes: 52 additions & 0 deletions docs/backend-hpc.md
@@ -0,0 +1,52 @@
# Installing and configuring for HPC backends

Either `miniwdl` or `Cromwell` can be used to run workflows on the HPC.

## Installing and configuring `miniwdl`

### Requirements

- [`miniwdl`](https://github.com/chanzuckerberg/miniwdl) >= 1.9.0
- [`miniwdl-slurm`](https://github.com/miniwdl-ext/miniwdl-slurm)

### Configuration

An [example miniwdl.cfg file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/miniwdl.cfg) is provided in this repository. Place it at `~/.config/miniwdl.cfg` and edit it to match your SLURM configuration; this allows running workflows with a basic SLURM setup.
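
A minimal sketch of putting the example config in place, assuming you are working from a clone of this repository:

```bash
# copy the example config into place, then edit partition/account settings to match your cluster
mkdir -p ~/.config
cp backends/hpc/miniwdl.cfg ~/.config/miniwdl.cfg
```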

## Installing and configuring `Cromwell`

Cromwell supports a number of different HPC backends; see [Cromwell's documentation](https://cromwell.readthedocs.io/en/stable/backends/HPC/) for more information on configuring each of them. Cromwell can be used in a standalone "run" mode, or in "server" mode to allow multiple users to submit workflows. The example commands below use "run" mode.

## Running the workflow

### Filling out workflow inputs

Fill out any information missing in [the inputs file](https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/blob/main/backends/hpc/singleton.hpc.inputs.json). Once you have downloaded the reference data bundle, ensure that you have replaced the `<local_path_prefix>` in the input template file with the local path to the reference datasets on your HPC.
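
A sketch of filling in the placeholder, assuming the reference bundle was extracted to `/path/to/hifi-wdl-resources-v2.0.0` (adjust the path to your system):

```bash
# substitute the local reference data location into the HPC input template
sed -i 's|<local_path_prefix>|/path/to/hifi-wdl-resources-v2.0.0|g' backends/hpc/singleton.hpc.inputs.json
```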

See [the inputs section of the singleton README](./singleton#inputs) for more information on the structure of the inputs.json file.

### Running via miniwdl

```bash
miniwdl run workflows/singleton.wdl --input <inputs_json_file>
```

### Running via Cromwell

```bash
cromwell run workflows/singleton.wdl --inputs <inputs_json_file>
```
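
If Cromwell is installed as a jar rather than a wrapper script, the equivalent invocation typically looks like the sketch below (the jar filename will vary by installation):

```bash
java -jar cromwell.jar run workflows/singleton.wdl --inputs <inputs_json_file>
```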

## Reference data bundle

[<img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14027047.svg" alt="10.5281/zenodo.14027047">](https://zenodo.org/records/14027047)

Reference data is hosted on Zenodo at [10.5281/zenodo.14027047](https://zenodo.org/record/14027047). Download the reference data bundle and extract it to a location on your HPC, then update the input template file with the path to the reference data.

```bash
## download the reference data bundle
wget https://zenodo.org/record/14027047/files/hifi-wdl-resources-v2.0.0.tar

## extract the reference data bundle
tar -xvf hifi-wdl-resources-v2.0.0.tar
```
3 changes: 3 additions & 0 deletions docs/backends.md
@@ -0,0 +1,3 @@
- [hpc](./backend-hpc)
- [azure](./backend-azure)
- [gcp](./backend-gcp)
14 changes: 14 additions & 0 deletions docs/bam_stats.md
@@ -0,0 +1,14 @@
# bam_stats outputs

```wdl
{sample}.{movie}.read_length_and_quality.tsv.gz - per-read length and quality metrics
```

## `{sample}.{movie}.read_length_and_quality.tsv.gz` - per-read length and quality metrics

Base metrics are extracted for each read from the uBAM and stored in these 4 columns:

- movie
- read name
- read length: length of query sequence
- read quality: transformation of the `rq` tag into Phred (log) space, e.g., `rq:f:0.99` (99% accuracy, 1 error in 100 bases) is Phred 20 ($-10 \times \log_{10}(1 - 0.99)$); this value is capped at Phred 60 for `rq:f:1.0` (general form shown below)
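
Stated generally (with the cap noted above applied when `rq` is 1.0):

$$\text{Phred}(rq) = -10 \times \log_{10}(1 - rq)$$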
15 changes: 15 additions & 0 deletions docs/deepvariant.md
@@ -0,0 +1,15 @@
# DeepVariant subworkflow

```mermaid
flowchart TD
aBAM[/"HiFi aBAM"/] --> make_examples["DeepVariant make_examples"]
make_examples --> gpu{"gpu?"}
gpu -- yes --> call_variants_gpu["DeepVariant call_variants_gpu"]
gpu -- no --> call_variants_cpu["DeepVariant call_variants_cpu"]
call_variants_gpu --> postprocess_variants["DeepVariant postprocess_variants"]
call_variants_cpu --> postprocess_variants
postprocess_variants --> vcf[/"small variant VCF"/]
postprocess_variants --> gvcf[/"small variant gVCF"/]
```

This subworkflow runs the three steps of DeepVariant individually to make the best use of resources. If a GPU is available and `gpu==true`, the `call_variants` step will run on 1 GPU and 8 CPU threads; otherwise, it will run on 64 CPU threads. The `make_examples` and `postprocess_variants` steps always run on the CPU.