Running WDL workflows locally
Start with our scientific manuscript presenting CZ ID’s bioinformatics pipeline for metagenomic classification of high-throughput sequencing datasets.
The pipeline receives FASTQ or FASTA inputs and processes them through numerous tools and reference databases. To enable external reproduction,
- reference databases are available for download from Amazon S3
- tools are packaged in a Docker image
- the analysis steps are wrapped in Workflow Description Language (WDL)
We’ll walk through each of these, first with small reference databases (representing viral sequences only), then the full-scale metagenomic versions.
Prepare to run the workflow by setting up miniwdl, a local WDL runner. Follow its Getting Started guide to begin. Briefly, this will entail (i) installing miniwdl using pip3 or conda, (ii) installing Docker and configuring it so that the Unix user can control it without sudo, and (iii) trying miniwdl run_self_test. We will also need git available.
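Before going further, it can help to confirm that the three tools are on the PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and is not part of miniwdl or czid-workflows.

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
# (Hypothetical helper for illustration only.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Prints nothing when miniwdl, docker and git are all installed.
missing_tools miniwdl docker git
```

Any name printed here still needs to be installed. This only checks that the commands exist; miniwdl run_self_test remains the real end-to-end check, including whether Docker can be controlled without sudo.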
Recommended: enable miniwdl download caching. Multiple invocations of the workflow will be much more efficient if we activate miniwdl’s download cache feature, so that reference databases are downloaded from S3 on first use only. To activate the cache, set and export environment variables:
export MINIWDL__DOWNLOAD_CACHE__PUT=true
export MINIWDL__DOWNLOAD_CACHE__GET=true
export MINIWDL__DOWNLOAD_CACHE__DIR=/mnt/miniwdl_download_cache
Here the last variable names a suitable local storage location for large cached files. The same configuration can also be set in a cfg file instead of transient environment variables.
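As a sketch of the cfg-file alternative, the same three settings might be written as follows. This assumes that miniwdl maps each MINIWDL__SECTION__KEY variable to a key under [section], and that it reads the file named by the MINIWDL_CFG environment variable; check the miniwdl configuration documentation before relying on either. The file location CFG_FILE is a hypothetical choice for this example.

```shell
# Write the download-cache settings to a cfg file (sketch).
# CFG_FILE is a hypothetical location chosen for this example.
CFG_FILE="${CFG_FILE:-$PWD/miniwdl.cfg}"
cat > "$CFG_FILE" <<'EOF'
[download_cache]
put = true
get = true
dir = /mnt/miniwdl_download_cache
EOF

# Point miniwdl at the file for this shell session (assumed mechanism).
export MINIWDL_CFG="$CFG_FILE"
```

Unlike the exported variables, the file persists across shell sessions, which is convenient for long-lived instances.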
Change into some working directory (on the spacious scratch volume, if planning full-scale operations). Clone the chanzuckerberg/czid-workflows git repository containing the WDL and Docker source files.
git clone https://github.com/chanzuckerberg/czid-workflows.git
Test runs of the workflow require several gigabytes of scratch space. If running on an EC2 instance, you may want to expand your root EBS volume or mount instance storage, as well as relocate the Docker image storage directory. When running miniwdl, use the --dir DIR option to use the given DIR as scratch space; for example, if your instance storage is mounted on /mnt (Ubuntu default), use miniwdl run --dir /mnt to run there. If you plan to run the czid-workflows test suite, set TMPDIR to your directory using export TMPDIR=/mnt.
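To see whether a candidate scratch location has enough room, a small check along these lines may be useful. It is a sketch only: free_gib and SCRATCH are hypothetical names, and the function relies on the POSIX df -P layout, in which the fourth column is the available space.

```shell
# free_gib DIR: print the whole GiB available on the filesystem holding DIR.
# (Hypothetical helper; relies on `df -Pk`, where column 4 is available KiB.)
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Substitute your scratch mount here, e.g. SCRATCH=/mnt
SCRATCH="${SCRATCH:-$PWD}"
echo "Free space under $SCRATCH: $(free_gib "$SCRATCH") GiB"
```

The same function can be pointed at the directory later passed to --dir or exported as TMPDIR.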
We can either pull the Docker image from a GitHub Packages registry, or build it locally from the Dockerfile (which downloads resources from numerous web locations).
To pull the existing image, see the Packages page for czid-workflows. Browse to the image for the workflow you need to run, and note the current image tag. At the command line, first docker login to the GitHub Packages registry using a personal access token, then docker pull the current tag. (You can use any GitHub account to log in, but you must log in.)
Building the image from the Dockerfile takes several minutes, but doesn’t require logging in anywhere. To build it,
./scripts/docker-build.sh workflows/consensus-genome -t czid-consensus-genome
or
./scripts/docker-build.sh workflows/short-read-mngs -t czid-short-read-mngs
and note the local tag (czid-consensus-genome or czid-short-read-mngs).
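First, let’s run a workflow on small test inputs. The first command below runs consensus-genome on a test FASTQ from the repository; the second runs short-read-mngs on small, synthetic FASTQ reads using small reference databases (containing only viral sequences).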
miniwdl run workflows/consensus-genome/run.wdl technology=ONT sample=test docker_image_id=<<TAG>> \
fastqs_0=workflows/consensus-genome/test/Ct20K.fastq.gz ref_accession_id=MN908947.3 \
--input workflows/consensus-genome/test/local_test.yml --verbose
miniwdl run workflows/short-read-mngs/local_driver.wdl \
docker_image_id=<<TAG>> \
fastqs_0=workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R1.fastq.gz \
fastqs_1=workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R2.fastq.gz \
-i workflows/short-read-mngs/test/local_test_viral.yml --verbose
Breaking this down,
- docker_image_id= should be set to the Docker image tag you noted above (either the GitHub Packages tag, or czid-short-read-mngs if you built the image locally)
- short-read-mngs/local_driver.wdl is the top-level WDL for the metagenomics sequencing workflow
- The pair of FASTQ files are small, synthetic read sets included for benchmarking
- local_test_viral.yml supplies boilerplate workflow inputs, such as the S3 paths for the viral reference databases
The first attempt will take some time to download the reference databases (about 6 GiB total). Thereafter, if miniwdl download caching is enabled as suggested above, running this or other small samples should take just a few minutes.
When the run completes, miniwdl prints a large JSON structure with all the outputs and output file paths in the created run directory. The miniwdl documentation has more information about the run directory’s organization.
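If the JSON has scrolled away, the run directory can be located again from the shell. The sketch below assumes that miniwdl maintains a _LAST symlink, next to the timestamp-named run directories, pointing at the most recent run; treat that as an assumption to verify against the miniwdl documentation. The function name latest_run_dir is hypothetical.

```shell
# latest_run_dir [BASE]: print the physical path of the newest miniwdl run
# directory under BASE (default: current directory), via the _LAST symlink.
# (Hypothetical helper; assumes miniwdl maintains BASE/_LAST.)
latest_run_dir() {
  base="${1:-.}"
  if [ ! -d "$base/_LAST" ]; then
    echo "no _LAST run directory under $base" >&2
    return 1
  fi
  (cd "$base/_LAST" && pwd -P)
}

# List the output folders of the most recent run, if there is one.
if run_dir="$(latest_run_dir 2>/dev/null)"; then
  ls "$run_dir/out"
fi
```

If --dir was used, pass that directory as BASE, for example latest_run_dir /mnt.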
The aggregated metrics for every taxon identified by the CZ ID short-read-mngs workflow can be found in the output file out/postprocess.refined_taxon_count_out_assembly_refined_taxon_counts_with_dcr_json/refined_taxon_counts_with_dcr.json
Each result entry from the CZ ID sample report is recorded as a single entry in the .json file, as shown below:
{"tax_id": "37124",
"tax_level": 1,
"genus_taxid": "11019",
"family_taxid": "11018",
"count": 1394,
"nonunique_count": 1394,
"unique_count": 1394,
"dcr": 1.0,
"percent_identity": 96.56100000000093,
"alignment_length": 11601.0,
"e_value": -307.6526555685972,
"count_type": "NT"}
To interrogate read-level taxonomic hits for the NT and NR databases independently, the following two files may be used:
out/postprocess.refined_gsnap_out_assembly_gsnap_hitsummary2_tab/gsnap.hitsummary2.tab
out/postprocess.refined_rapsearch2_out_assembly_rapsearch2_hitsummary2_tab/rapsearch2.hitsummary2.tab
The hitsummary2.tab format is detailed here.
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000344/1 1 37124 NC_004162.2 37124 11019 11018 NODE_1_length_11589_cov_8.183024 NC_004162.2 37124 11019 11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000344/2 1 37124 NC_004162.2 37124 11019 11018 NODE_1_length_11589_cov_8.183024 NC_004162.2 37124 11019 11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000838/1 1 37124 NC_004162.2 37124 11019 11018 NODE_1_length_11589_cov_8.183024 NC_004162.2 37124 11019 11018
MG049915.1_0__benchmark_lineage_0_37124_11019_11018__s0000000838/2 1 37124 NC_004162.2 37124 11019 11018 NODE_1_length_11589_cov_8.183024 NC_004162.2 37124 11019 11018
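For a quick look at which taxa dominate a hit summary, the rows can be tallied with awk. This is a sketch under one assumption: that the third whitespace-separated column holds the taxid assigned to the read, as it appears to in the sample rows above. The names count_hits_by_taxid, RUN_DIR and HITS are hypothetical.

```shell
# count_hits_by_taxid FILE: tally rows per value of column 3, most frequent first.
# (Hypothetical helper; column 3 is assumed to hold the assigned taxid.)
count_hits_by_taxid() {
  awk '{ n[$3]++ } END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" |
    sort -k2,2nr
}

# Set RUN_DIR to the miniwdl run directory before using this.
RUN_DIR="${RUN_DIR:-.}"
HITS="$RUN_DIR/out/postprocess.refined_gsnap_out_assembly_gsnap_hitsummary2_tab/gsnap.hitsummary2.tab"
if [ -f "$HITS" ]; then
  count_hits_by_taxid "$HITS"
fi
```

Applied to just the four sample rows above, this would print a single line: 37124, a tab, and the count 4. Each mate of a read pair has its own row, so the tally counts individual reads.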
To run the workflow on the full metagenomics databases used by CZ ID, we recommend starting with an Amazon EC2 r5d.24xlarge or i3.8xlarge instance in us-west-2. Such powerful instance types are needed to store the full-size databases on local disks for speedy random access, up to 3.5 TiB during a large run. (To keep this tutorial self-contained, we run everything on one big compute node; in a production system like CZ ID, one would distribute the WDL tasks & databases on several fit-for-purpose nodes.)
Launch the instance using an Ubuntu base image and install packages python3-pip docker git-core mdadm. If, as on EC2 instances, the available scratch disk space is divided among several virtual devices, then perform steps like the following (as root) to stripe them in a RAID0 array, creating one large volume:
NVME_DISKS=(/dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS?????????????????)
mdadm --create /dev/md0 --force --auto=yes --level=0 --chunk=256 \
--raid-devices=${#NVME_DISKS[@]} "${NVME_DISKS[@]}"
mkfs.xfs /dev/md0
mount /dev/md0 /mnt
chown -R ubuntu /mnt
Check df -h to verify that /mnt has ≥ 3.5T space. Next, reconfigure Docker so that containers operate on the scratch volume, and so that the default ubuntu user can control it:
echo '{"data-root": "/mnt/docker"}' >> /etc/docker/daemon.json
service docker restart
usermod -aG docker ubuntu
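Note that appending with >> produces invalid JSON if /etc/docker/daemon.json already has content. On a fresh instance the file is usually absent, but a more defensive variant could look like the sketch below. The function name write_docker_data_root is hypothetical, and the function only creates the file; it does not try to merge settings.

```shell
# write_docker_data_root FILE DIR: create FILE with a "data-root" setting,
# refusing to touch a FILE that already has content.
# (Hypothetical helper; run as root against /etc/docker/daemon.json.)
write_docker_data_root() {
  file="$1"
  data_root="$2"
  if [ -s "$file" ]; then
    echo "$file already has content; merge \"data-root\" by hand" >&2
    return 1
  fi
  printf '{"data-root": "%s"}\n' "$data_root" > "$file"
}

# Example (as root): write_docker_data_root /etc/docker/daemon.json /mnt/docker
```

After writing the file, restart Docker as above so that the new data root takes effect.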
Follow the steps above to set up miniwdl and its download cache. The download cache is practically required for the full metagenomics databases, as some databases are reused by different workflow steps. Since the full runs take some hours, you may also wish to set up byobu and/or mosh to avoid losing work to SSH timeouts.
Change into a working directory under /mnt and, as above, clone czid-workflows and pull or build the Docker image. Then launch the same small synthetic FASTQ pair on the full databases:
miniwdl run workflows/short-read-mngs/local_driver.wdl \
docker_image_id=<<TAG>> \
fastqs_0=workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R1.fastq.gz \
fastqs_1=workflows/short-read-mngs/test/norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v6__R2.fastq.gz \
-i workflows/short-read-mngs/test/local_test.yml --verbose
Historical note: before being packaged in WDL (2020), the short-read-mngs pipeline was orchestrated within a custom framework, idseq-dag. CZ ID is currently in a transition phase in which the high-level pipeline is expressed in WDL, while the logic for individual steps largely resides in idseq-dag modules that the WDL tasks simply invoke. The idseq-dag portions may recede slowly over time, as new and revised steps are implemented first as WDL tasks.