Parameterize the record ID field for output sequences
Include a `--output-id-field` parameter in augur parse to indicate the ID
field for the output sequences file, such as using "accession" instead of
"strain".

The `--output-id-field` parameter is optional. If it is omitted, augur parse
falls back to the first of the DEFAULT_ID_COLUMNS (e.g. ('strain','name')) that
is present in `--fields`. If none of the DEFAULT_ID_COLUMNS are present in the
fields, it falls back to using the first field.

The `--output-id-field` parameter is designed to accept a single field name,
not multiple. Users are required to provide a `--fields` argument (e.g.
`--fields col1 col2 strain accession`) in the same invocation, so it seems
reasonable to expect the user to choose a specific field name for the ID.

To prevent unintended behaviors, if `--output-id-field` is not present in
`--fields` (e.g., due to a typo), augur parse will raise an error instead of
falling back to DEFAULT_ID_COLUMNS.
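
The selection order described above can be sketched as a standalone function. This is an illustration only, not the code from the commit: `choose_id_field` is a hypothetical helper, the `AugurError` class is a stand-in for the one in `augur.errors`, and the value of `DEFAULT_ID_COLUMNS` is assumed from the "e.g." in the message.

```python
# Assumed value for illustration; the real tuple is imported from augur/io/metadata.py.
DEFAULT_ID_COLUMNS = ("strain", "name")


class AugurError(Exception):
    """Stand-in for augur.errors.AugurError."""


def choose_id_field(fields, output_id_field=None, default_id_columns=DEFAULT_ID_COLUMNS):
    """Return the name of the field whose value becomes the output sequence ID."""
    if output_id_field:
        # An explicit choice must match a declared field; a typo is an error,
        # not a silent fallback to the defaults.
        if output_id_field not in fields:
            raise AugurError(
                f"Output id field '{output_id_field}' not found in fields {fields}."
            )
        return output_id_field

    # No explicit choice: take the first default ID column that is declared.
    for possible_id in default_id_columns:
        if possible_id in fields:
            return possible_id

    # None of the defaults are declared: use the first field.
    return fields[0]


print(choose_id_field(["strain", "virus", "accession"], "accession"))  # accession
print(choose_id_field(["virus", "name", "accession"]))                 # name
print(choose_id_field(["virus", "accession"]))                         # virus
```

Returning early from each branch means no variable has to be initialised before the loop. In the `run()` hunk as displayed on this page, no assignment to `strain_key` is visible before the `if not strain_key` check, so that path may need an initial `strain_key = None` when none of the default columns match.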
j23414 committed Jan 31, 2024
1 parent abc86a8 commit 8367e59
Showing 3 changed files with 2,181 additions and 7 deletions.
20 changes: 13 additions & 7 deletions augur/parse.py
@@ -6,6 +6,7 @@

 from .io.file import open_file
 from .io.sequences import read_sequences, write_sequences
+from .io.metadata import DEFAULT_ID_COLUMNS
 from .dates import get_numerical_date_from_value
 from .errors import AugurError

@@ -133,8 +134,6 @@ def parse_sequence(sequence, fields, strain_key="strain", separator="|", prettif
                 dayfirst=fix_dates_format=='dayfirst'
             )

-    metadata["strain"] = sequence.id
-
     return sequence, metadata


@@ -143,6 +142,8 @@ def register_parser(parent_subparsers):
     parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format")
     parser.add_argument('--output-sequences', required=True, help="output sequences file")
     parser.add_argument('--output-metadata', required=True, help="output metadata file")
+    parser.add_argument('--output-id-field', required=False,
+                        help="The record field to use as the sequence identifier in the FASTA output.")
     parser.add_argument('--fields', required=True, nargs='+', help="fields in fasta header")
     parser.add_argument('--prettify-fields', nargs='+', help="apply string prettifying operations (underscores to spaces, capitalization, etc) to specified metadata fields")
     parser.add_argument('--separator', default='|', help="separator of fasta header")
@@ -162,12 +163,17 @@ def run(args):
     # field to index the dictionary and the data frame
     meta_data = {}

-    if 'name' in args.fields:
-        strain_key = 'name'
-    elif 'strain' in args.fields:
-        strain_key = 'strain'
+    if args.output_id_field:
+        if args.output_id_field not in args.fields:
+            raise AugurError(f"Output id field '{args.output_id_field}' not found in fields {args.fields}.")
+        strain_key = args.output_id_field
     else:
-        strain_key = args.fields[0]
+        for possible_id in DEFAULT_ID_COLUMNS:
+            if possible_id in args.fields:
+                strain_key = possible_id
+                break
+        if not strain_key:
+            strain_key = args.fields[0]

     # loop over sequences, parse fasta header of each sequence
     with open_file(args.output_sequences, "wt") as handle:
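
The check `if args.output_id_field:` relies on standard argparse behaviour: the option `--output-id-field` is stored as the attribute `output_id_field`, and an optional argument with no default is `None` when omitted. A minimal sketch, using a throwaway parser rather than augur's real `register_parser` (the `prog` name is made up):

```python
import argparse

# Throwaway parser with only the two options relevant here.
parser = argparse.ArgumentParser(prog="parse-sketch")
parser.add_argument("--fields", required=True, nargs="+", help="fields in fasta header")
parser.add_argument("--output-id-field", required=False,
                    help="The record field to use as the sequence identifier in the FASTA output.")

with_flag = parser.parse_args(
    ["--fields", "strain", "accession", "--output-id-field", "accession"]
)
without_flag = parser.parse_args(["--fields", "strain", "accession"])

print(with_flag.fields)              # ['strain', 'accession']
print(with_flag.output_id_field)     # accession
print(without_flag.output_id_field)  # None
```

Because `--fields` uses `nargs='+'`, it stops consuming values at the next option string, so the two options can be given in either order.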
30 changes: 30 additions & 0 deletions tests/functional/parse.t
@@ -15,6 +15,7 @@ This should fail.
   .* (re)
   .* (re)
   .* (re)
+  .* (re)
   augur parse: error: the following arguments are required: --fields
   [2]

@@ -32,6 +33,35 @@ Parse Zika sequences into sequences and metadata.
   $ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
   $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

+Parse Zika sequences into sequences and metadata using a different metadata field as record id (e.g. accession)
+
+  $ ${AUGUR} parse \
+  >   --sequences parse/zika.fasta \
+  >   --output-sequences "$TMP/sequences.fasta" \
+  >   --output-metadata "$TMP/metadata.tsv" \
+  >   --output-id-field accession \
+  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
+  >   --prettify-fields region country division city \
+  >   --fix-dates monthfirst
+
+  $ diff -u "parse/sequences_acc.fasta" "$TMP/sequences.fasta"
+  $ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
+  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
+
+Try to parse Zika sequences with a misspelled field.
+This should fail.
+
+  $ ${AUGUR} parse \
+  >   --sequences parse/zika.fasta \
+  >   --output-sequences "$TMP/sequences.fasta" \
+  >   --output-metadata "$TMP/metadata.tsv" \
+  >   --output-id-field notexist \
+  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
+  >   --prettify-fields region country division city \
+  >   --fix-dates monthfirst
+  ERROR: Output id field 'notexist' not found in fields ['strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url'].
+  [2]
+
 Parse compressed Zika sequences into sequences and metadata.

   $ ${AUGUR} parse \
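
To make the effect of the first new test concrete, here is a rough sketch of how choosing a different ID field changes the FASTA record name. It is not augur's `parse_sequence`, which also handles prettifying and date fixing; `split_header` is a hypothetical helper, and the header values are made up rather than taken from `parse/zika.fasta`.

```python
def split_header(header, fields, separator="|"):
    """Map each declared field name to the matching piece of a FASTA header."""
    values = [value.strip() for value in header.split(separator)]
    return dict(zip(fields, values))


# Made-up header for illustration; only four of the fields are shown.
fields = ["strain", "virus", "accession", "date"]
header = "EXAMPLE/2016|zika|AB000000|2016-01-01"

record = split_header(header, fields)

# The same record written with two different ID fields.
for id_field in ("strain", "accession"):
    print(f">{record[id_field]}")
# >EXAMPLE/2016
# >AB000000
```

This matches what the test checks: the sequences file is compared against a separate expected file (`parse/sequences_acc.fasta`), while the metadata is still compared against the same `parse/metadata.tsv`.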