diff --git a/README.md b/README.md index 047c04358..025a62a14 100644 --- a/README.md +++ b/README.md @@ -1,32 +1,68 @@ [![Build Status](https://travis-ci.com/broadinstitute/viral-pipelines.svg?branch=master)](https://travis-ci.com/broadinstitute/viral-pipelines) [![Documentation Status](https://readthedocs.org/projects/viral-pipelines/badge/?version=latest)](http://viral-pipelines.readthedocs.io/en/latest/?badge=latest) -viral-pipelines -=============== +# viral-pipelines A set of scripts and tools for the analysis of viral NGS data. Workflows are written in [WDL](https://github.com/openwdl/wdl) format. This is a portable workflow language that allows for easy execution on a wide variety of platforms: - - on individual machines (using miniWDL or Cromwell to execute) - - on commercial cloud platforms like GCP, AWS, or Azure (using Cromwell or CromwellOnAzure) - - on institutional HPC systems (using Cromwell) - - on commercial platform as a service vendors (like DNAnexus) - - on academic cloud platforms (like Terra) + - on individual machines (using [miniWDL](https://github.com/chanzuckerberg/miniwdl) or [Cromwell](https://github.com/broadinstitute/cromwell) to execute) + - on commercial cloud platforms like GCP, AWS, or Azure (using [Cromwell](https://github.com/broadinstitute/cromwell) or [CromwellOnAzure](https://github.com/microsoft/CromwellOnAzure)) + - on institutional HPC systems (using [Cromwell](https://github.com/broadinstitute/cromwell)) + - on commercial platform-as-a-service vendors (like [DNAnexus](https://dnanexus.com/)) + - on academic cloud platforms (like [Terra](https://app.terra.bio/)) -Currently, all workflows are regularly deployed to a GCS bucket: [gs://viral-ngs-wdl](https://console.cloud.google.com/storage/browser/viral-ngs-wdl?forceOnBucketsSortingFiltering=false&organizationId=548622027621&project=gcid-viral-seq). 
+ +## Obtaining the latest WDL workflows + +Workflows from this repository are continuously deployed to [Dockstore](https://dev.dockstore.net/organizations/BroadInstitute/collections/pgs), a GA4GH Tool Repository Service. They can then be easily imported to any bioinformatic compute platform that utilizes the TRS API and understands WDL (this includes Terra, DNAnexus, DNAstack, etc.). + +Flattened workflows are also continuously deployed to a GCS bucket: [gs://viral-ngs-wdl](https://console.cloud.google.com/storage/browser/viral-ngs-wdl?forceOnBucketsSortingFiltering=false&organizationId=548622027621&project=gcid-viral-seq) and can be downloaded for local use. Workflows are also available in the [Terra featured workspace](https://app.terra.bio/#workspaces/pathogen-genomic-surveillance/COVID-19). Workflows are continuously deployed to a [DNAnexus CI project](https://platform.dnanexus.com/projects/F8PQ6380xf5bK0Qk0YPjB17P). -Continuous deploy to [Dockstore](https://dockstore.org/) is pending. -Basic execution ---------------- +## Basic execution + +The easiest way to get started is on a single, Docker-capable machine (your laptop, shared workstation, or virtual machine) using [miniWDL](https://github.com/chanzuckerberg/miniwdl). MiniWDL can be installed either via `pip` or `conda` (via conda-forge). After confirming that it works (`miniwdl run_self_test`), you can use [miniwdl run](https://github.com/chanzuckerberg/miniwdl#miniwdl-run) to invoke WDL workflows from this repository. 
+ +For example, to list the inputs for the assemble_refbased workflow: + +``` +miniwdl run https://storage.googleapis.com/viral-ngs-wdl/quay.io/broadinstitute/viral-pipelines/2.0.21.3/assemble_refbased.wdl +``` + +This will emit: +``` +missing required inputs for assemble_refbased: reads_unmapped_bams, reference_fasta + +required inputs: + Array[File]+ reads_unmapped_bams + File reference_fasta + +optional inputs: + + +outputs: + +``` + +To then execute this workflow on your local machine, invoke it like this: +``` +miniwdl run \ + https://storage.googleapis.com/viral-ngs-wdl/quay.io/broadinstitute/viral-pipelines/2.0.21.3/assemble_refbased.wdl \ + reads_unmapped_bams=PatientA_library1.bam \ + reads_unmapped_bams=PatientA_library2.bam \ + reference_fasta=/refs/NC_045512.2.fasta \ + trim_coords_bed=/refs/NC_045512.2-artic_primers-3.bed \ + sample_name=PatientA +``` + +In the above example, reads from two sequencing runs are aligned and merged together before consensus calling. The optional bed file provided turns on primer trimming at the given coordinates. -The easiest way to get started is on a single, Docker-capable machine (your laptop, shared workstation, or virtual machine) using [miniWDL](https://github.com/chanzuckerberg/miniwdl). MiniWDL can be installed via `pip` or `conda` (via conda-forge). After confirming that it works (`miniwdl run_self_test`, you can use [miniwdl run](https://github.com/chanzuckerberg/miniwdl#miniwdl-run) to invoke WDL workflows from this repository. For example: `miniwdl run https://storage.googleapis.com/viral-ngs-wdl/quay.io/broadinstitute/viral-pipelines/2.0.20.3/assemble_refbased.wdl` will execute the reference-based assembly pipeline, when provided with the appropriate inputs. -Available workflows ------------------- +## Available workflows The workflows provided here are more fully documented at our [ReadTheDocs](https://viral-pipelines.readthedocs.io/) page. 
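Since `reads_unmapped_bams` is an `Array[File]+` input, the same `key=value` pair is simply repeated on the command line, once per file. A minimal shell sketch of assembling such a command (the BAM names are the placeholder files from the example above, and `echo` is used so nothing actually executes):

```shell
# Build repeated reads_unmapped_bams=... pairs for miniwdl's CLI, one per
# input BAM. File names are placeholders; 'echo' prints the command rather
# than running it, so no Docker or network access is needed here.
WDL_URL="https://storage.googleapis.com/viral-ngs-wdl/quay.io/broadinstitute/viral-pipelines/2.0.21.3/assemble_refbased.wdl"

args=""
for bam in PatientA_library1.bam PatientA_library2.bam; do
  args="$args reads_unmapped_bams=$bam"
done

echo miniwdl run "$WDL_URL" $args \
  reference_fasta=/refs/NC_045512.2.fasta \
  sample_name=PatientA
```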
diff --git a/docs/index.rst b/docs/index.rst index d0375ec3b..9defff307 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -16,4 +16,5 @@ Contents description pipes-wdl + ncbi_submission workflows diff --git a/docs/ncbi_submission.rst b/docs/ncbi_submission.rst new file mode 100644 index 000000000..418c8d63b --- /dev/null +++ b/docs/ncbi_submission.rst @@ -0,0 +1,108 @@ +Submitting viral sequences to NCBI +================================== + +Register your BioProject +------------------------ +*If you want to add samples to an existing BioProject, skip to Step 2.* + +1. Go to: https://submit.ncbi.nlm.nih.gov and login (new users - create new login). +#. Go to the Submissions tab and select BioProject - click on New Submission. +#. Follow the onscreen instructions and then click submit - you will receive a BioProject ID (``PRJNA###``) via email almost immediately. + + +Register your BioSamples +------------------------ + +1. Go to: https://submit.ncbi.nlm.nih.gov and login. +#. Go to the Submissions tab and select BioSample - click on New Submission. +#. Follow instructions, selecting "batch submission type" where applicable. +#. The metadata template to use is likely: "Pathogen affecting public health". +#. Follow template instructions (careful about date formatting) and submit as a .txt file. +#. You will receive BioSample IDs (``SAMN####``) via email (often 1-2 days later). + + +Set up an NCBI author template +------------------------------ +*If different author lists are used for different sets of samples, create a new .sbt file for each list.* + +1. Go to: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ +#. Fill out the form including all authors and submitter information (if unpublished, the reference title can be just a general description of the project). +#. At the end of the form, include the BioProject number from Step 1 but NOT the BioSample number. +#. Click "create template", which will download an .sbt file to your computer. +#. 
Save the file as "authors.sbt" or similar. If you have multiple author files, give each file a different name and prep your submissions as separate batches, one for each authors.sbt file. + + +Set up the BioSample map file +----------------------------- + +1. Set up an Excel spreadsheet in exactly the format below: + + ========= ============= + sample BioSample + sample1-1 SAMNxxxxxxxxx + sample2-1 SAMNxxxxxxxxx + ========= ============= + +2. The BioSample is the BioSample number (e.g., ``SAMNxxxxxxxxx``) given to you by NCBI. +3. The sample name should match the FASTA header (not necessarily the file name). + a. Make sure your FASTA headers include segment numbers (e.g., IRF001-1) -- viral-ngs will fail otherwise! + b. If submitting a segmented virus (e.g., Lassa virus), each line should be a different segment; see the example below (which assumes sample2 is a 2-segmented virus). + c. For samples with multiple segments, the BioSample number should be the same for all segments. + + ========= ============= + sample BioSample + sample1-1 SAMN04488486 + sample2-1 SAMN04488657 + sample2-2 SAMN04488657 + sample3-1 SAMN04489002 + ========= ============= + +4. Save the file as a tab-delimited text file (e.g. "biosample-map.txt"). +5. If preparing the file on a Mac computer in Microsoft Excel (which saves tab files in a 20th-century era OS9 format), ensure that tabs and newlines are entered correctly by opening the file (via the command line) in an editor such as Nano and unchecking the [Mac-format] option (in Nano: edit the file, save the file, then click OPTION-M). You can also opt to create this file directly in a text editor, ensuring there is exactly one tab character between columns (i.e., ``sample``, a tab, then ``BioSample`` in the first row). Command line converters such as ``mac2unix`` also work. + + +Set up the metadata file (aka Source Modifier Table) +---------------------------------------------------- +1. Set up an Excel spreadsheet in exactly the format below: + a. 
This example shows sample2 as a 2-segmented virus. + b. All data should be on the same line (there are 9 columns). Here they are shown as separate tables simply for space reasons. + c. The "Sequence_ID" should match the "sample" field in the BioSample map (see Step 4). Note that this should match the FASTA header. + d. Shown are some of the fields we typically use in NCBI submissions, but fields can be added or removed to suit your sample needs. Other fields we often include are: "isolation_source" (e.g., serum), "collected_by" (e.g., Redeemer's University), and "genotype". Here is the full list of fields accepted by NCBI: https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html. + e. The database cross-reference (db_xref) field number can be obtained by navigating to https://www.ncbi.nlm.nih.gov/taxonomy, searching for the organism of interest, and copying the "Taxonomy ID" number from the webpage. + + =========== =============== ======= ============================================= ===================== ========== ============ ============ ==================================================================================== + Sequence_ID collection_date country isolate organism lab_host host db_xref note + sample1-1 10-Mar-2014 Nigeria Ebola virus/H.sapiens-tc/GIN/2014/Makona-C05 Zaire ebolavirus Vero cells Homo sapiens taxon:186538 Harvest date: 01-Jan-2016; passaged 2x in cell culture (parent stock: SAMN01110234) + sample2-1 12-Mar-2014 Nigeria Lassa virus Macenta Lassa mammarenavirus Vero cells Homo sapiens taxon:11620 + sample2-2 12-Mar-2014 Nigeria Lassa virus Macenta Lassa mammarenavirus Vero cells Homo sapiens taxon:11620 + sample3-1 16-Mar-2014 Nigeria Ebola virus/H.sapiens-tc/GIN/2014/Makona-1121 Zaire ebolavirus Vero cells Homo sapiens taxon:186538 This sample was collected by Dr. Blood from a very sick patient. 
+ =========== =============== ======= ============================================= ===================== ========== ============ ============ ==================================================================================== + +2. The data in this table is what actually shows up on NCBI with the genome. In many cases, it is a subset of the metadata you submitted when you registered the BioSamples. +3. Save this table as sample_meta.txt. If you make the file in Excel, double-check that the date formatting is preserved when you save -- it should be in dd-mmm-yyyy format. +4. If preparing the file on a Mac computer in Microsoft Excel (which saves tab files in a 20th-century era OS9 format), ensure that tabs and newlines are entered correctly by opening the file (via the command line) in an editor such as Nano and unchecking the [Mac-format] option (in Nano: edit the file, save the file, then click OPTION-M). You can also opt to create this file directly in a text editor, ensuring there is exactly one tab character between columns (i.e., a single tab between each of the header fields in the first row). Command line converters such as ``mac2unix`` also work. + + +Prepare requisite input files for your submission batches +--------------------------------------------------------- + +1. Stage the above files you've prepared and other requisite inputs into the environment in which you plan to execute the :doc:`genbank` WDL workflow. If that is Terra, push these files into the appropriate GCS bucket; if DNAnexus, drop your files there. If you plan to execute locally (e.g. with ``miniwdl run``), move the files to an appropriate directory on your machine. The files you will need are the following: + a. The files you prepared above: the submission template (authors.sbt), the biosample map (biosample-map.txt), and the source modifier table (sample_meta.txt). + #. All of the assemblies you want to submit. These should be in fasta files, one per genome. 
Multi-segment/multi-chromosome genomes (such as Lassa virus, Influenza A, etc.) should contain all segments within one fasta file. + #. Your reference genome, as a fasta file. Multi-segment/multi-chromosome genomes should contain all segments within one fasta file. The fasta sequence headers should be Genbank accession numbers. + #. Your reference gene annotations, as a series of TBL files, one per segment/chromosome. These must correspond to the accessions in your reference genome. + #. A genome coverage table as a two-column tabular text file (optional, but helpful). + #. The organism name (which should match what NCBI taxonomy calls the species you are submitting for). This is a string input to the workflow, not a file. + #. The sequencing technology used. This is a string input, not a file. +#. The reference genome you provide should be annotated in the way you want your genomes annotated on NCBI. If one doesn't exist, see the addendum below about creating your own feature list. +#. Note that you will have to run the pipeline separately for each virus you are submitting AND separately for each author list. + + +Run the genbank submission pipeline +----------------------------------- + +1. Run the :doc:`genbank` WDL workflow. This performs the following steps: it aligns your assemblies against a Genbank reference sequence, transfers gene annotation from that Genbank reference into your assemblies' coordinate spaces, and then takes your genomes, the transferred annotations, and all of the sample metadata prepared above, and produces a zipped bundle that you send to NCBI. There are two zip bundles: ``sequins_only.zip`` is the file to email to NCBI. ``all_files.zip`` contains a full set of files for your inspection prior to submission. +#. In the ``all_files.zip`` output, for each sample, you will see a ``.sqn``, ``.gbf``, ``.val``, and ``.tbl`` file. 
You should also see an ``errorsummary.val`` file that you can use to check for annotation errors (or you can check the ``.val`` file for each sample individually). Ideally, your samples should be error-free before you submit them to NCBI. For an explanation of the cryptic error messages, see: https://www.ncbi.nlm.nih.gov/genbank/genome_validation/. +#. Note: we've recently had trouble running tbl2asn with a molType specified. TO DO: describe how to deal with this. +#. Check your ``.gbf`` files for a preview of what your genbank entries will look like. Once you are happy with your files, email the ``sequins_only.zip`` file to gb-sub@ncbi.nlm.nih.gov. +#. It often takes 2-8 weeks to receive a response and accession numbers for your samples. Do follow up if you haven’t heard anything for a few weeks! diff --git a/docs/pipes-wdl.rst b/docs/pipes-wdl.rst index 8b1cd2594..edb8a1146 100644 --- a/docs/pipes-wdl.rst +++ b/docs/pipes-wdl.rst @@ -1,12 +1,7 @@ Using the WDL pipelines ======================= -Rather than chaining together viral-ngs pipeline steps as series of tool -commands called in isolation, it is possible to execute them as a -complete automated pipeline, from processing raw sequencer output to -creating files suitable for GenBank submission. This utilizes the Workflow -Description Language, which is documented at: +Rather than chaining together viral-ngs pipeline steps as a series of tool commands called in isolation, it is possible to execute them as a complete automated pipeline, from processing raw sequencer output to creating files suitable for GenBank submission. This utilizes the Workflow Description Language, which is documented at: https://github.com/openwdl/wdl -There are various methods for executing these workflows on your infrastructure -which are more thoroughly documented in our `README `_. +There are various methods for executing these workflows on your infrastructure, which are more thoroughly documented in our `README `_. 
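The tab-and-newline pitfalls described in the BioSample map and source modifier table steps can be sidestepped entirely by generating the files from the command line. A minimal sketch using `printf` (the accessions are the ones from the example table in the docs; substitute your own):

```shell
# Build biosample-map.txt with literal tabs and Unix newlines, avoiding the
# Excel-on-Mac line-ending problem entirely. Accessions are the example
# values from the table above; replace with your own SAMN numbers.
printf 'sample\tBioSample\n'       >  biosample-map.txt
printf 'sample1-1\tSAMN04488486\n' >> biosample-map.txt
printf 'sample2-1\tSAMN04488657\n' >> biosample-map.txt
printf 'sample2-2\tSAMN04488657\n' >> biosample-map.txt

# sanity checks: no carriage returns, and every row contains a tab
! grep -q "$(printf '\r')" biosample-map.txt
test "$(grep -c "$(printf '\t')" biosample-map.txt)" -eq 4
```

The same approach (or `mac2unix` on an Excel export) works for sample_meta.txt.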
diff --git a/docs/workflows.rst b/docs/workflows.rst index bccc414ae..d2b4d9118 100644 --- a/docs/workflows.rst +++ b/docs/workflows.rst @@ -1,12 +1,13 @@ WDL Workflows ============= -Documentation for each workflow is provided here. Although there are many workflows -that serve different functions, some of the primary workflows we use most often include: - - demux_plus (on every sequencing run) - - classify_krakenuniq (included in demux_plus) - - assemble_denovo (for most viruses) - - assemble_refbased (for less diverse viruses, such as those from single point source human outbreaks) - - build_augur_tree (for nextstrain-based visualization of phylogeny) +Documentation for each workflow is provided here. Although there are many workflows that serve different functions, some of the primary workflows we use most often include: -.. toctree:: \ No newline at end of file + - :doc:`demux_plus` (on every sequencing run) + - :doc:`classify_krakenuniq` (included in demux_plus) + - :doc:`assemble_denovo` (for most viruses) + - :doc:`assemble_refbased` (for less diverse viruses, such as those from single point source human outbreaks) + - :doc:`build_augur_tree` (for nextstrain-based visualization of phylogeny) + - :doc:`genbank` (for NCBI Genbank submission) + +.. toctree:: diff --git a/pipes/WDL/tasks/tasks_interhost.wdl b/pipes/WDL/tasks/tasks_interhost.wdl index 3fdc83ace..69498a83f 100644 --- a/pipes/WDL/tasks/tasks_interhost.wdl +++ b/pipes/WDL/tasks/tasks_interhost.wdl @@ -4,7 +4,6 @@ task multi_align_mafft_ref { input { File reference_fasta Array[File]+ assemblies_fasta # fasta files, one per sample, multiple chrs per file okay - String fasta_basename = basename(reference_fasta, '.fasta') Int? mafft_maxIters Float? mafft_ep Float? 
mafft_gapOpeningPenalty @@ -13,6 +12,8 @@ task multi_align_mafft_ref { String docker="quay.io/broadinstitute/viral-phylo" } + String fasta_basename = basename(reference_fasta, '.fasta') + command { interhost.py --version | tee VERSION interhost.py multichr_mafft \ diff --git a/pipes/WDL/tasks/tasks_ncbi.wdl b/pipes/WDL/tasks/tasks_ncbi.wdl index 3b4848740..189283f2b 100644 --- a/pipes/WDL/tasks/tasks_ncbi.wdl +++ b/pipes/WDL/tasks/tasks_ncbi.wdl @@ -6,7 +6,6 @@ task download_fasta { Array[String]+ accessions String emailAddress - Int? machine_mem_gb String docker="quay.io/broadinstitute/viral-phylo" } @@ -26,9 +25,9 @@ task download_fasta { runtime { docker: "${docker}" - memory: select_first([machine_mem_gb, 3]) + " GB" + memory: "7 GB" cpu: 2 - dx_instance_type: "mem1_ssd1_v2_x2" + dx_instance_type: "mem2_ssd1_v2_x2" } } @@ -38,7 +37,6 @@ task download_annotations { String emailAddress String combined_out_prefix - Int? machine_mem_gb String docker="quay.io/broadinstitute/viral-phylo" } @@ -67,46 +65,50 @@ task download_annotations { runtime { docker: "${docker}" - memory: select_first([machine_mem_gb, 3]) + " GB" + memory: "7 GB" cpu: 2 - dx_instance_type: "mem1_ssd1_v2_x2" + dx_instance_type: "mem2_ssd1_v2_x2" } } task annot_transfer { + meta { + description: "Given a reference genome annotation in TBL format (e.g. from Genbank or RefSeq) and a multiple alignment of that reference to other genomes, produce new annotation files (TBL format with appropriate coordinate conversions) for each sequence in the multiple alignment. Resulting output can be fed to tbl2asn for Genbank submission." + } + input { - Array[File]+ multi_aln_fasta + File multi_aln_fasta File reference_fasta Array[File]+ reference_feature_table - Int? 
machine_mem_gb - String docker="quay.io/broadinstitute/viral-phylo" + String docker="quay.io/broadinstitute/viral-phylo" } - Array[Int] chr_nums=range(length(multi_aln_fasta)) - parameter_meta { - multi_aln_fasta: { description: "fasta; multiple alignments of sample sequences for each chromosome" } - reference_fasta: { description: "fasta; all chromosomes in one file" } - reference_feature_table: { description: "tbl; feature table corresponding to each chromosome in the alignment" } + multi_aln_fasta: { + description: "multiple alignment of sample sequences against a reference genome -- for a single chromosome", + patterns: ["*.fasta"] + } + reference_fasta: { + description: "Reference genome, all segments/chromosomes in one fasta file. Headers must be Genbank accessions.", + patterns: ["*.fasta"] + } + reference_feature_table: { + description: "NCBI Genbank feature tables, one file for each segment/chromosome described in reference_fasta.", + patterns: ["*.tbl"] + } } command { - set -ex -o pipefail + set -e ncbi.py --version | tee VERSION - echo ${sep=' ' multi_aln_fasta} > alignments.txt - echo ${sep=' ' reference_feature_table} > tbls.txt - for i in ${sep=' ' chr_nums}; do - _alignment_fasta=`cat alignments.txt | cut -f $(($i+1)) -d ' '` - _feature_tbl=`cat tbls.txt | cut -f $(($i+1)) -d ' '` - ncbi.py tbl_transfer_prealigned \ - $_alignment_fasta \ - ${reference_fasta} \ - $_feature_tbl \ - . \ - --oob_clip \ - --loglevel DEBUG - done + ncbi.py tbl_transfer_prealigned \ + ${multi_aln_fasta} \ + ${reference_fasta} \ + ${sep=' ' reference_feature_table} \ + . 
\ + --oob_clip \ + --loglevel DEBUG } output { @@ -116,49 +118,117 @@ task annot_transfer { runtime { docker: "${docker}" - memory: select_first([machine_mem_gb, 3]) + " GB" + memory: "3 GB" cpu: 2 dx_instance_type: "mem1_ssd1_v2_x2" } } task prepare_genbank { + meta { + description: "this task runs NCBI's tbl2asn" + } + input { Array[File]+ assemblies_fasta Array[File] annotations_tbl File authors_sbt - File biosampleMap - File genbankSourceTable - File? coverage_table # summary.assembly.txt (from Snakemake) -- change this to accept a list of mapped bam files and we can create this table ourselves - String sequencingTech - String comment # TO DO: make this optional - String organism - String molType = "cRNA" + File? biosampleMap + File? genbankSourceTable + File? coverage_table + String? sequencingTech + String? comment + String? organism + String? molType Int? machine_mem_gb String docker="quay.io/broadinstitute/viral-phylo" } + parameter_meta { + assemblies_fasta: { + description: "Assembled genomes. One chromosome/segment per fasta file.", + patterns: ["*.fasta"] + } + annotations_tbl: { + description: "Gene annotations in TBL format, one per fasta file. Filename basenames must match the assemblies_fasta basenames. These files are typically output from the ncbi.annot_transfer task.", + patterns: ["*.tbl"] + } + authors_sbt: { + description: "A genbank submission template file (SBT) with the author list, created at https://submit.ncbi.nlm.nih.gov/genbank/template/submission/", + patterns: ["*.sbt"] + } + biosampleMap: { + description: "A two column tab text file mapping sample IDs (first column) to NCBI BioSample accession numbers (second column). These typically take the format 'SAMN****' and are obtained by registering your samples first at https://submit.ncbi.nlm.nih.gov/", + patterns: ["*.txt", "*.tsv"] + } + genbankSourceTable: { + description: "A tab-delimited text file containing requisite metadata for Genbank (a 'source modifier table'). 
https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html", + patterns: ["*.txt", "*.tsv"] + } + coverage_table: { + description: "A two column tab text file mapping sample IDs (first column) to average sequencing coverage (second column, floating point number).", + patterns: ["*.txt", "*.tsv"] + } + sequencingTech: { + description: "The type of sequencer used to generate reads. NCBI has a controlled vocabulary for this value which can be found here: https://submit.ncbi.nlm.nih.gov/structcomment/nongenomes/" + } + organism: { + description: "The scientific name for the organism being submitted. This is typically the species name and should match the name given by the NCBI Taxonomy database. For more info, see: https://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#Organism" + } + molType: { + description: "The type of molecule being described. Any value allowed by the INSDC controlled vocabulary may be used here. Valid values are described at http://www.insdc.org/controlled-vocabulary-moltype-qualifier" + } + comment: { + description: "Optional comments that can be displayed in the COMMENT section of the Genbank record. This may include any disclaimers about assembly quality or notes about pre-publication availability or requests to discuss pre-publication use with authors." + } + + } + command { set -ex -o pipefail - cp ${sep=' ' annotations_tbl} . ncbi.py --version | tee VERSION - ncbi.py prep_genbank_files \ + cp ${sep=' ' annotations_tbl} . 
+ + touch special_args + if [ -n "${comment}" ]; then + echo "--comment" >> special_args + echo "${comment}" >> special_args + fi + if [ -n "${sequencingTech}" ]; then + echo "--sequencing_tech" >> special_args + echo "${sequencingTech}" >> special_args + fi + if [ -n "${organism}" ]; then + echo "--organism" >> special_args + echo "${organism}" >> special_args + fi + if [ -n "${molType}" ]; then + echo "--mol_type" >> special_args + echo "${molType}" >> special_args + fi + if [ -n "${coverage_table}" ]; then + echo -e "sample\taln2self_cov_median" > coverage_table.txt + cat ${coverage_table} >> coverage_table.txt + echo "--coverage_table" >> special_args + echo coverage_table.txt >> special_args + fi + + cat special_args | xargs -d '\n' ncbi.py prep_genbank_files \ ${authors_sbt} \ ${sep=' ' assemblies_fasta} \ . \ - --mol_type ${molType} \ - --organism "${organism}" \ - --biosample_map ${biosampleMap} \ - --master_source_table ${genbankSourceTable} \ - ${'--coverage_table ' + coverage_table} \ - --comment "${comment}" \ - --sequencing_tech "${sequencingTech}" \ + ${'--biosample_map ' + biosampleMap} \ + ${'--master_source_table ' + genbankSourceTable} \ --loglevel DEBUG + zip sequins_only.zip *.sqn + zip all_files.zip *.sqn *.cmt *.gbf *.src *.fsa *.val mv errorsummary.val errorsummary.val.txt # to keep it separate from the glob } output { + File submission_zip = "sequins_only.zip" + File archive_zip = "all_files.zip" Array[File] sequin_files = glob("*.sqn") Array[File] structured_comment_files = glob("*.cmt") Array[File] genbank_preview_files = glob("*.gbf") diff --git a/pipes/WDL/workflows/assemble_refbased.wdl b/pipes/WDL/workflows/assemble_refbased.wdl index 336067dcb..38dcd41d1 100644 --- a/pipes/WDL/workflows/assemble_refbased.wdl +++ b/pipes/WDL/workflows/assemble_refbased.wdl @@ -14,7 +14,8 @@ workflow assemble_refbased { parameter_meta { sample_name: { - description: "Base name of output files. 
The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters." + description: "Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.", + category: "common" } reads_unmapped_bams: { description: "Unaligned reads in BAM format", @@ -33,7 +34,8 @@ workflow assemble_refbased { } trim_coords_bed: { description: "optional primers to trim in reference coordinate space (0-based BED format)", - patterns: ["*.bed"] + patterns: ["*.bed"], + category: "common" } diff --git a/pipes/WDL/workflows/build_augur_tree.wdl b/pipes/WDL/workflows/build_augur_tree.wdl index cdc1df87e..3589ef960 100644 --- a/pipes/WDL/workflows/build_augur_tree.wdl +++ b/pipes/WDL/workflows/build_augur_tree.wdl @@ -9,7 +9,7 @@ workflow build_augur_tree { input { Array[File] assembly_fastas - File metadata + File sample_metadata String virus File ref_fasta File genbank_gb @@ -21,7 +21,7 @@ workflow build_augur_tree { description: "Set of assembled genomes to align and build trees. These must represent a single chromosome/segment of a genome only. Fastas may be one-sequence-per-individual or a concatenated multi-fasta (unaligned) or a mixture of the two. Fasta header records need to be pipe-delimited (|) for each metadata value.", patterns: ["*.fasta", "*.fa"] } - metadata: { + sample_metadata: { description: "Metadata in tab-separated text format. 
See https://nextstrain-augur.readthedocs.io/en/stable/faq/metadata.html for details.", patterns: ["*.txt", "*.tsv"] } @@ -61,14 +61,14 @@ workflow build_augur_tree { input: raw_tree = draft_augur_tree.aligned_tree, aligned_fasta = augur_mafft_align.aligned_sequences, - metadata = metadata, + metadata = sample_metadata, basename = virus } if(defined(ancestral_traits_to_infer) && length(select_first([ancestral_traits_to_infer,[]]))>0) { call nextstrain.ancestral_traits { input: tree = refine_augur_tree.tree_refined, - metadata = metadata, + metadata = sample_metadata, columns = select_first([ancestral_traits_to_infer,[]]), basename = virus } @@ -89,7 +89,7 @@ workflow build_augur_tree { call nextstrain.export_auspice_json { input: refined_tree = refine_augur_tree.tree_refined, - metadata = metadata, + metadata = sample_metadata, branch_lengths = refine_augur_tree.branch_lengths, traits = ancestral_traits.node_data_json, nt_muts = ancestral_tree.nt_muts_json, diff --git a/pipes/WDL/workflows/genbank.wdl b/pipes/WDL/workflows/genbank.wdl index f2f4d66da..8faa4cc4e 100644 --- a/pipes/WDL/workflows/genbank.wdl +++ b/pipes/WDL/workflows/genbank.wdl @@ -5,9 +5,73 @@ import "../tasks/tasks_ncbi.wdl" as ncbi workflow genbank { + meta { + description: "Prepare assemblies for Genbank submission. This includes annotation by simple coordinate transfer from Genbank annotations and a multiple alignment. See https://viral-pipelines.readthedocs.io/en/latest/ncbi_submission.html for details." + } + input { File reference_fasta - Array[File]+ assemblies_fasta # one per genome + Array[File]+ reference_annot_tbl + Array[File]+ assemblies_fasta + + File authors_sbt + File? biosampleMap + File? genbankSourceTable + File? coverage_table + String? sequencingTech + String? comment + String? organism + String? molType + } + + parameter_meta { + assemblies_fasta: { + description: "Genomes to prepare for Genbank submission. One file per genome: all segments/chromosomes included in one file. 
All fasta files must contain exactly the same number of sequences as reference_fasta (which must equal the number of files in reference_annot_tbl).", + patterns: ["*.fasta"] + } + reference_fasta: { + description: "Reference genome, all segments/chromosomes in one fasta file. Headers must be Genbank accessions.", + patterns: ["*.fasta"] + } + reference_annot_tbl: { + description: "NCBI Genbank feature tables, one file for each segment/chromosome described in reference_fasta.", + patterns: ["*.tbl"] + } + authors_sbt: { + description: "A genbank submission template file (SBT) with the author list, created at https://submit.ncbi.nlm.nih.gov/genbank/template/submission/", + patterns: ["*.sbt"] + } + biosampleMap: { + description: "A two column tab text file mapping sample IDs (first column) to NCBI BioSample accession numbers (second column). These typically take the format 'SAMN****' and are obtained by registering your samples first at https://submit.ncbi.nlm.nih.gov/", + patterns: ["*.txt", "*.tsv"], + category: "common" + } + genbankSourceTable: { + description: "A tab-delimited text file containing requisite metadata for Genbank (a 'source modifier table'). https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html", + patterns: ["*.txt", "*.tsv", "*.src"], + category: "common" + } + coverage_table: { + description: "A two column tab text file mapping sample IDs (first column) to average sequencing coverage (second column, floating point number).", + patterns: ["*.txt", "*.tsv"], + category: "common" + } + sequencingTech: { + description: "The type of sequencer used to generate reads. NCBI has a controlled vocabulary for this value which can be found here: https://submit.ncbi.nlm.nih.gov/structcomment/nongenomes/", + category: "common" + } + organism: { + description: "The scientific name for the organism being submitted. This is typically the species name and should match the name given by the NCBI Taxonomy database. 
For more info, see: https://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#Organism", + category: "common" + } + molType: { + description: "The type of molecule being described. This defaults to 'viral cRNA' as this pipeline is most commonly used for viral submissions, but any value allowed by the INSDC controlled vocabulary may be used here. Valid values are described at http://www.insdc.org/controlled-vocabulary-moltype-qualifier", + category: "common" + } + comment: { + description: "Optional comments that can be displayed in the COMMENT section of the Genbank record. This may include any disclaimers about assembly quality or notes about pre-publication availability or requests to discuss pre-publication use with authors." + } + } call interhost.multi_align_mafft_ref as mafft { @@ -16,32 +80,40 @@ workflow genbank { assemblies_fasta = assemblies_fasta } - call ncbi.annot_transfer as annot { - input: - multi_aln_fasta = mafft.alignments_by_chr, - reference_fasta = reference_fasta + scatter(alignment_by_chr in mafft.alignments_by_chr) { + call ncbi.annot_transfer as annot { + input: + multi_aln_fasta = alignment_by_chr, + reference_fasta = reference_fasta, + reference_feature_table = reference_annot_tbl + } } call ncbi.prepare_genbank as prep_genbank { input: assemblies_fasta = assemblies_fasta, - annotations_tbl = annot.transferred_feature_tables + annotations_tbl = flatten(annot.transferred_feature_tables), + authors_sbt = authors_sbt, + biosampleMap = biosampleMap, + genbankSourceTable = genbankSourceTable, + coverage_table = coverage_table, + sequencingTech = sequencingTech, + comment = comment, + organism = organism, + molType = molType } output { - Array[File] alignments_by_chr = mafft.alignments_by_chr + File submission_zip = prep_genbank.submission_zip + File archive_zip = prep_genbank.archive_zip + File errorSummary = prep_genbank.errorSummary - Array[File] transferred_feature_tables = annot.transferred_feature_tables - - Array[File] sequin_files = 
prep_genbank.sequin_files - Array[File] structured_comment_files = prep_genbank.structured_comment_files + Array[File] alignments_by_chr = mafft.alignments_by_chr + Array[File] transferred_feature_tables = flatten(annot.transferred_feature_tables) Array[File] genbank_preview_files = prep_genbank.genbank_preview_files - Array[File] source_table_files = prep_genbank.source_table_files - Array[File] fasta_per_chr_files = prep_genbank.fasta_per_chr_files Array[File] validation_files = prep_genbank.validation_files - File errorSummary = prep_genbank.errorSummary - String viral_phylo_version = mafft.viralngs_version + String viral_phylo_version = mafft.viralngs_version } } diff --git a/travis/install-miniwdl.sh b/travis/install-miniwdl.sh deleted file mode 100755 index e69de29bb..000000000
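The optional-input handling added to `prepare_genbank` above can be illustrated outside WDL. A standalone sketch of the same pattern, where each present optional input contributes a "--flag" line and a value line to a file that is then fed to the command via `xargs -d '\n'` (GNU xargs) so values may contain spaces; here `echo` stands in for `ncbi.py prep_genbank_files`, and the variable values are made up:

```shell
# Stand-ins for the interpolated WDL inputs: one set, one unset.
comment="draft assembly; contact authors before use"
organism=""

# Accumulate optional flags one-per-line, exactly as the WDL command does.
touch special_args
if [ -n "$comment" ]; then
  echo "--comment" >> special_args
  echo "$comment"  >> special_args
fi
if [ -n "$organism" ]; then
  echo "--organism" >> special_args
  echo "$organism"  >> special_args
fi

# Newline-delimited xargs keeps "draft assembly; contact authors before use"
# as a single argument; unset inputs contribute nothing.
cat special_args | xargs -d '\n' echo ncbi.py prep_genbank_files
```

The newline delimiter is the key design choice: word-splitting `$comment` on whitespace would break multi-word values, while one-line-per-argument survives them intact.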