Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a parameter to augur parse to specify a different record ID for output sequences FASTA #1403

Merged
merged 5 commits into from
Feb 12, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jan 29, 2024

Description of proposed changes

As discussed in #1402, include a parameter in augur parse to indicate the ID field for the output sequences file, such as using "accession" instead of "strain" or "name."

Open to other suggestions for parameter names or let me know if I missed adding a test somewhere.

Related issue(s)

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from 85bf1bf to 114c901 Compare January 29, 2024 23:21
@j23414 j23414 changed the title Add output-metadata-id-field to augur parse Add a parameter to augur parse to specify a different record ID for output sequences FASTA Jan 29, 2024
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from 114c901 to d1aef81 Compare January 29, 2024 23:26
@j23414 j23414 marked this pull request as ready for review January 29, 2024 23:27
augur/parse.py Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from 0e07091 to 7b45b2f Compare January 30, 2024 00:06
Copy link

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (499f0e9) 67.46% compared to head (d313f58) 67.47%.

❗ Current head d313f58 differs from pull request most recent head 9d96cad. Consider uploading reports for the commit 9d96cad to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1403   +/-   ##
=======================================
  Coverage   67.46%   67.47%           
=======================================
  Files          69       69           
  Lines        7484     7474   -10     
  Branches     1840     1840           
=======================================
- Hits         5049     5043    -6     
+ Misses       2161     2159    -2     
+ Partials      274      272    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

augur/parse.py Outdated
strain_key = 'strain'
else:
strain_key = None
for possible_id in [args.output_id_field, DEFAULT_ID_COLUMNS]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @victorlin's suggested alternative to this block, but I wanted to mention that this code on line 167 doesn't do what you want it to do, @j23414. Instead of looping through a list of the user-provided output id field and the default ids, this will loop through the user-provided field and the tuple of defaults:

>>> DEFAULT_ID_COLUMNS = ("strain", "name")
>>> output_id_field = "accession"
>>> for possible_id in [output_id_field, DEFAULT_ID_COLUMNS]:
...    print(possible_id)
...
accession
('strain', 'name')

You could make this work as expected a couple of ways including iterable unpacking with the * operator or list concatenation.

# Iterable unpacking with *
>>> for possible_id in [output_id_field, *DEFAULT_ID_COLUMNS]:
...    print(possible_id)
...
accession
strain
name

# List concatenation
>>> for possible_id in [output_id_field] + list(DEFAULT_ID_COLUMNS):
...    print(possible_id)
...
accession
strain
name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gah, thanks! 😅

@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from 7b45b2f to 8367e59 Compare January 31, 2024 18:46
@j23414 j23414 linked an issue Jan 31, 2024 that may be closed by this pull request
augur/parse.py Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch 2 times, most recently from fc6c6cc to f587119 Compare February 2, 2024 23:12
augur/parse.py Outdated Show resolved Hide resolved
@j23414 j23414 marked this pull request as draft February 7, 2024 21:03
@j23414 j23414 marked this pull request as ready for review February 7, 2024 23:31
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from 40e9a00 to 9810855 Compare February 7, 2024 23:40
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good pending some final touch ups

DEPRECATED.md Outdated Show resolved Hide resolved
augur/parse.py Show resolved Hide resolved
CHANGES.md Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch 3 times, most recently from e41dbb5 to d313f58 Compare February 9, 2024 10:02
j23414 and others added 5 commits February 12, 2024 13:38
Originally augur parse first checks for 'name' then 'strain' and then the first field
to figure out the sequence id to use. This test checks the default id column order.
During augur parse, instead of hardcoding a check for a 'name' field and
then a 'strain' field to use as a sequence ID, reuse and check against the
io.metadata DEFAULT_ID_COLUMNS list to be more consistent with the rest of
augur.

This will result in a breaking behavior change in that parse is now checking
for 'strain' before 'name' as the default ID column. However, the argument
to be more consistent with the rest of augur is that the DEFAULT_ID_COLUMNS
list is already used in other parts of augur and augur parse should be no
different.
Include a `--output-id-field` parameter in augur parse to indicate the ID
field for the output sequences file, such as using "accession" instead of
"strain".

It is important to note that the `--output-id-field` parameter is not required,
and augur parse will fall back to the DEFAULT_ID_COLUMNS (e.g. ('strain','name'))
if it is not present. If none of the DEFAULT_ID_COLUMNS are present in the fields,
fall back to using the first field.

The `--output-id-field` parameter is designed to accept a single field name,
not multiple. User are required to provide a `--fields col1 col2 strain accession`
argument in the same invocation. It seems reasonable to expect the user to
choose a specific field name for the ID.

To prevent unintended behaviors, if `--output-id-field` is not present in
`--fields` (e.g., due to a typo), augur parse will raise an error instead of
falling back to DEFAULT_ID_COLUMNS.
… future.

Instead of a refactor and breaking change from 8dc0694, this commit adds a deprecation warning to the `parse` step that the default `id` field used will be reordered from ('name', 'strain') to ('strain', 'name').

This will give users time to update their scripts and workflows with an `--output-id-field 'name'` before the change is made.

co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
@j23414 j23414 force-pushed the parse-output-metadata-id-columns branch from da2a23c to 9d96cad Compare February 12, 2024 18:40
@j23414 j23414 merged commit dd8a1cb into master Feb 12, 2024
19 checks passed
@j23414 j23414 deleted the parse-output-metadata-id-columns branch February 12, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feat request: add --metadata-id-columns param to augur parse?
3 participants