parse: Prefer strain over name as sequence ID field #1629

Merged 5 commits on Sep 17, 2024
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,6 +2,10 @@

## __NEXT__

+### Major Changes
+
+* parse: When both `strain` and `name` fields are present, the `strain` field will now be used as the sequence ID field. [#1629][] (@victorlin)
+
### Features

* merge: Generated source columns (e.g. `__source_metadata_{NAME}`) may now have their name template changed with `--source-columns=TEMPLATE` or may be omitted entirely with `--no-source-columns`. [#1625][] (@tsibley)
@@ -13,6 +17,7 @@
[#1588]: https://github.com/nextstrain/augur/issues/1588
[#1598]: https://github.com/nextstrain/augur/issues/1598
[#1625]: https://github.com/nextstrain/augur/issues/1625
+[#1629]: https://github.com/nextstrain/augur/pull/1629

## 25.4.0 (3 September 2024)

4 changes: 1 addition & 3 deletions DEPRECATED.md
@@ -10,9 +10,7 @@ available for backwards compatibility, but should not be used in new code.

## `augur parse` preference of `name` over `strain` as the sequence ID field

-*Deprecated in version 24.2.0 (February 2024). Planned to be reordered June 2024 or after.*
-
-Currently, `augur parse` checks for a 'name' field and then a 'strain' field to use as a sequence ID. This order will be changed in favor of searching for a 'strain' and then a 'name' field to be more consistent with the rest of Augur.
+*Deprecated in version 24.2.0 (February 2024). Reordered in version __NEXT__ (September 2024).*

Users who have both 'name' and 'strain' fields in their data, and want to favor using the 'name' field should add the following `augur parse` parameter `--output-id-field 'name'`.

27 changes: 17 additions & 10 deletions augur/parse.py
@@ -1,15 +1,17 @@
"""
Parse delimited fields from FASTA sequence names into a TSV and FASTA file.
"""
+import Bio.SeqRecord
import pandas as pd
import sys
+from typing import Dict, Sequence, Tuple

from .io.file import open_file
from .io.sequences import read_sequences, write_sequences
from .dates import get_numerical_date_from_value
from .errors import AugurError

-PARSE_DEFAULT_ID_COLUMNS = ("name", "strain")
+PARSE_DEFAULT_ID_COLUMNS = ("strain", "name")

forbidden_characters = str.maketrans(
{' ': None,
@@ -88,27 +90,34 @@ def prettify(x, trim=0, camelCase=False, etal=None, removeComma=False):
return res


-def parse_sequence(sequence, fields, strain_key="strain", separator="|", prettify_fields=None, fix_dates_format=None):
+def parse_sequence(
+    sequence: Bio.SeqRecord.SeqRecord,
+    fields: Sequence[str],
+    strain_key: str,
+    separator: str,
+    prettify_fields: Sequence[str],
+    fix_dates_format: str,
+) -> Tuple[Bio.SeqRecord.SeqRecord, Dict[str, str]]:
"""Parse a single sequence record into a sequence record and associated metadata.

Parameters
----------
-    sequence : Bio.SeqRecord.SeqRecord
+    sequence
         a BioPython sequence record to parse with metadata stored in its description field.

-    fields : list or tuple
+    fields
         a list of names for fields expected in the given record's description.

-    strain_key : str
+    strain_key
         name of the field to use as the given sequence's unique id

-    separator : str
+    separator
         delimiter to split record description by.

-    prettify_fields : list or tuple
+    prettify_fields
         a list of field names for which the values in those fields should be prettified.

-    fix_dates_format : str
+    fix_dates_format
parse "date" field into the requested canonical format ("dayfirst" or "monthfirst").

Returns
@@ -178,8 +187,6 @@ def run(args):
 for possible_id in PARSE_DEFAULT_ID_COLUMNS:
     if possible_id in args.fields:
         strain_key = possible_id
-        if possible_id == "name" and "strain" in args.fields:
-            print("DEPRECATED: The default search order for the ID field will be changing from ('name', 'strain') to ('strain', 'name').\nUsers who prefer to keep using 'name' instead of 'strain' should use the parameter: --output-id-field 'name'", file=sys.stderr)
         break
 if not strain_key:
     strain_key = args.fields[0]
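
For readers skimming the diff, the default lookup after this change can be sketched as a small standalone function. This is an illustration only, not code from the PR: `choose_id_field` is a hypothetical helper, and the handling of an explicit `--output-id-field` value (including the `ValueError`) is an assumption about the surrounding code in `run`, which this hunk does not show.

```python
from typing import Optional, Sequence

# Same order as the constant in augur/parse.py after this change.
PARSE_DEFAULT_ID_COLUMNS = ("strain", "name")


def choose_id_field(fields: Sequence[str], output_id_field: Optional[str] = None) -> str:
    """Hypothetical helper showing which field ends up as the sequence ID."""
    # Assumption: an explicit --output-id-field value overrides the default search.
    if output_id_field:
        if output_id_field not in fields:
            raise ValueError(f"{output_id_field!r} is not one of the declared fields")
        return output_id_field

    # Default search: 'strain' is now checked before 'name'.
    for possible_id in PARSE_DEFAULT_ID_COLUMNS:
        if possible_id in fields:
            return possible_id

    # Neither default is present: fall back to the first declared field.
    return fields[0]


if __name__ == "__main__":
    fields = ["strain", "virus", "name", "date"]
    print(choose_id_field(fields))          # default search order
    print(choose_id_field(fields, "name"))  # explicit override, as in the updated parse.t test
```

With both `strain` and `name` declared, the default search picks `strain`; passing `name` explicitly keeps the pre-change behavior, which is what the functional test below now does.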
3 changes: 1 addition & 2 deletions tests/functional/parse.t
@@ -69,10 +69,9 @@ Parse Zika sequences into sequences and metadata, preferred default ids is 'name
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain virus name date region country division city db segment authors url title journal paper_url \
+> --output-id-field 'name' \
> --prettify-fields region country division city \
> --fix-dates monthfirst
-DEPRECATED: The default search order for the ID field will be changing from ('name', 'strain') to ('strain', 'name').
-Users who prefer to keep using 'name' instead of 'strain' should use the parameter: --output-id-field 'name'

$ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
4 changes: 3 additions & 1 deletion tests/test_parse.py
@@ -71,7 +71,9 @@ def test_parse_sequence(self):
sequence_record,
fields=fields,
strain_key="strain",
-prettify_fields=["region"]
+separator="|",
+prettify_fields=["region"],
+fix_dates_format=None,
)

assert sequence.id == metadata["strain"]