Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[translate] GFFs + node-data input produce invalid output #1346

Closed
jameshadfield opened this issue Dec 1, 2023 · 3 comments · Fixed by #1351
Closed

[translate] GFFs + node-data input produce invalid output #1346

jameshadfield opened this issue Dec 1, 2023 · 3 comments · Fixed by #1351
Assignees
Labels
bug Something isn't working

Comments

@jameshadfield
Copy link
Member

Current Behavior

GFF references used by augur translate do not have the nuc (~genome) coordinates parsed, and thus are not exported as part of the node-data file's annotations when the input --ancestral-sequences is a node-data JSON. Thus, the node-data file doesn't conform to our schema and fails validation.

Expected behavior

Export a "nuc" definition in the annotations block so that the node data JSON is valid.

Possible solution

The provided node-data JSON will have the annotation→nuc entry, so we can just re-export this. For completeness this can be checked against the GFFs ##sequence-region pragma or the "region" feature (if they exist).

Your environment: if running Nextstrain locally

augur 23.1.1

@jameshadfield jameshadfield added the bug Something isn't working label Dec 1, 2023
@jameshadfield jameshadfield self-assigned this Dec 1, 2023
@huddlej
Copy link
Contributor

huddlej commented Dec 1, 2023

I ran into this issue recently, too, and noticed that “nuc” feature only gets loaded properly from a GFF by load_features when it appears as a line in the GFF file with a type of “source”. For example, the mpox workflow uses augur translate with a GFF file that includes a “source” entry for the full nucleotide sequence.

Adding a "source" entry to the GFF solved the problem for me then, but I agree that it could be nicer to just pull the existing nuc entry from the input nt muts JSON. I wish we generally looked for the nucleotide range in the ##sequence-region pragma, but I don't see that being defined consistently across pathogens...

@corneliusroemer
Copy link
Member

In general, it's best to allow gffs that come straight from genbank to work - adding source is a workaround that's not ideal but it's what I've been doing when necessary.

@jameshadfield
Copy link
Member Author

Source isn't a valid GFF ID¹, I think we've just been applying the INSDC/GenBank term to GFF files.

¹ GFF types (what we call features / feature names) must be a term from the Sequence Ontology, and 'source' isn't one according to their browser

huddlej added a commit to blab/veme-2022 that referenced this issue Dec 4, 2023
Fixes a bug with the latest version of Augur where the "nuc" annotation
did not get added the output of `augur translate`, causing downstream
commands like `augur clades` to fail the validation of the
`aa_muts.json`. This is a workaround, but we should actually look at the
sequence-region pragma to get this information [1].

[1] nextstrain/augur#1346
jameshadfield added a commit that referenced this issue Dec 5, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
jameshadfield added a commit that referenced this issue Dec 5, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
jameshadfield added a commit that referenced this issue Dec 11, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
jameshadfield added a commit that referenced this issue Dec 11, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
jameshadfield added a commit that referenced this issue Dec 20, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
jameshadfield added a commit that referenced this issue Dec 20, 2023
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants