Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing of GenBank / GFF files #1351

Merged
merged 10 commits into from
Dec 20, 2023

Conversation

jameshadfield
Copy link
Member

See commit messages for full details. Lots of tests added and general improvements beyond just addressing the issues closed.

Closes #1349
Closes #1346
Closes #953

jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Dec 5, 2023
Including such information will become mandatory in Augur following
<nextstrain/augur#1351>.

Segment length from <https://www.ncbi.nlm.nih.gov/nuccore/CY114383>
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making such a clearly ordered and documented list of commits for this PR, @jameshadfield. It is a pleasure to read through them in order and easy to follow your logic.

I didn't make it through all of the commits (only halfway through), but I wanted to share my thoughts from the first half now, to get the conversation going. I'll continue reviewing tomorrow.

CHANGES.md Outdated
@@ -2,6 +2,14 @@

## __NEXT__

* ancestral, translate: Major improvements to how we parse GFF and GenBank reference files. [#1351][] (@jameshadfield)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: We should add the changes from this PR into "Major changes" or "Features", depending on whether we end up requiring sequence pragmas, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you normally categorise changes? e.g. would

For GenBank files we now require the (mandatory) source feature to be present

be a breaking change and therefore a "Major change". Similarly for things like

For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd classify anything that potentially breaks an existing interface as a major change. In following description, the first sentence describes a new feature since it's adding some new functionality that didn't exist (although you could interpret this as fixing a bug in our GFF parsing 😄). The second sentence makes the change a major change since the new requirement for the nucleotide coordinates could break an existing workflow.

For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good general advice. Updated changelog accordingly.

augur/utils.py Outdated Show resolved Hide resolved
augur/utils.py Show resolved Hide resolved
augur/utils.py Show resolved Hide resolved
augur/utils.py Outdated Show resolved Hide resolved
augur/utils.py Outdated Show resolved Hide resolved
Comment on lines +215 to +218
if feat.type == "region":
nuc['region'] = feat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there documentation we could cite here to explain why we look for (or prefer) a region type? I only see this referenced once in the GFF3 spec under the "circular genomes" section. The rest of the examples in the GFF spec use the sequence region pragma.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in this thread...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! But just to make sure I understand, did you find the region type when looking into circular genome annotations in the GFF spec or did it popup somewhere else outside of the spec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen it elsewhere, but I can't remember where exactly. It's also part of the SO list. (I've never seen source used outside of Nextstrain.) When working on these issues I was surprised that GFF and (especially) GenBank specs are hard to find / somewhat incomplete.

augur/utils.py Outdated
Comment on lines 233 to 231
if 'pragma' in nuc:
return nuc['pragma']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer the region entry over the pragma entry here? I'm interpreting this to mean that if the GFF file has both entries, we take the region entry, but maybe that's wrong?

Copy link
Member Author

@jameshadfield jameshadfield Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as per your earlier review comment I'm happy to prefer pragma over region. But it doesn't matter very much in practice - if there are multiple entries then we guarantee they agree with each other and export one. The only difference is that one will have "type: region" and the other "type: ##sequence-region pragma". Auspice doesn't read this information, and I'm guessing no scripts do either!

Similarly for the 'seqid' field in our annotations. I don't think anything uses this, and I'd like to remove it, but it feels like more trouble than it's worth as maybe there is something that uses it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. So, since it doesn't matter which one we pick, we could skip the check for the different keys and return the first value when the dict isn't empty:

if len(nuc) > 0:
    return list(nuc.values())[0]
else:
    raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to use the "##sequence-region pragma" if available (part of this diff). I'd like to keep the checks for contradicting coordinates (see my comment in another thread).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameshadfield I didn't mean to suggest replacing the check for contradicting coordinates. The code example above would replace the if/elif/else branching logic that checks for which coordinates to return. Since they are all the same, it doesn't matter which one we return and we can just pick the first one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - and in practice this'll be deterministic because dictionary key creation order is preserved, so we could guarantee the pragma is exported if it exists, but I prefer the explicit (but more verbose) form you suggested here.

augur/utils.py Outdated
return nuc['pragma']
if 'source' in nuc:
return nuc['source']
raise AugurError("Internal error (_read_nuc_annotation_from_gff fallthrough). This is a bug in Augur's code.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we could replace this exception with the exception associated with an empty nuc dictionary here and use an if/elif/else structure to choose what to return like so:

if 'region' in nuc:
    return nuc['region']
elif 'pragma' in nuc:
    return nuc['pragma']
elif 'source' in nuc:
    return nuc['source']
else:
    raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")

We shouldn't need a "fallthrough" exception then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (part of this diff]

augur/utils.py Show resolved Hide resolved
# Ensure the already-created nuc annotation coordinates match those parsed from the reference file
if (features['nuc'].location.start+1 != anc_seqs['annotations']['nuc']['start'] or
features['nuc'].location.end != anc_seqs['annotations']['nuc']['end']):
raise AugurError(f"The 'nuc' annotation coordinates parsed from {args.annotation!r} ({features['nuc'].location.start+1}..{features['nuc'].location.end})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bummer that this + 1 logic needs to exist in so many places (ancestral, translate, and utils) to communicate or compare coordinates between different systems. It's the kind of logic that is easy to forget to include in future code and then lead to subtle bugs. I don't have a great suggestion to fix this except maybe encapsulating or replacing the SeqFeature functionality so we can interact with objects that format coordinates in our expected "one-origin, inclusive" format. That's outside of the scope of this PR though.

Copy link
Member Author

@jameshadfield jameshadfield Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally. Short of adding another layer on top of SeqFeature (which might alleviate this problem but at the cost of another layer to understand) I've leaned heavily on writing tests to catch this. Auspice has this issue as well and tests were the only path to sanity!

If this were JavaScript you could monkeypatch SeqFeature to add the method (prototype) startOneBased, but in python you would have to create a separate subclass as far as I understand.

Copy link
Member Author

@jameshadfield jameshadfield Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - so you can monkeypatch things in Python. For instance, adding

from Bio.SeqFeature import FeatureLocation
def start_one_based(self):
    return self.start + 1
FeatureLocation.start_one_based = start_one_based

at the top of utils.py seemingly allows us to call <feature>.location.start_one_based(), even when those features are created by SeqIO.read(...). I found this a bit surprising because that code is (somewhere) going to have its own import FeatureLocation, so I presume Python's import logic de-duplicates those imports and thus the above code snippet refers to the one-and-only FeatureLocation class.

Adding (or rebinding) methods also flows through to already instantiated objects. My understanding of how this works in Python isn't great (JS is much clearer for me). E.g. <instance>.<method> is clearly pointing to the (one-and-only) class definition, but <instance>.<some_data> is clearly pointing to an instance-specific address. My guess would be that instances have a lookup table of attributes (which in this example would include <some_data>), and if an attribute's not there we go look for it in the class (and then the parent class etc). It's python, so it's probably using namespaces.

To complicate matters a bit more, .location.start is an instance-attribute (of .location) and is itself an instance of <class 'Bio.SeqFeature.ExactPosition'>, so it couldn't be monkeypatched the way I demonstrate above. We might be able to emulate enough of the behaviour by using something like:

FeatureLocation.start_one_based = property(start_one_based)

So long story short, I'm not going to do this in this PR, but we absolutely could do this across Augur to improve the code readability.

@huddlej
Copy link
Contributor

huddlej commented Dec 7, 2023

I didn't end up having any more major comments today. Thanks again for doing all of the heavy lifting to fix this parsing logic so rigorously, @jameshadfield. Thanks also for the fix to the NA gene coordinates.

@jameshadfield jameshadfield force-pushed the james/translate/nuc-annotation branch 2 times, most recently from 95e427a to 454c53c Compare December 11, 2023 04:04
@jameshadfield jameshadfield force-pushed the james/translate/nuc-annotation branch 2 times, most recently from d11ba4e to d1f3247 Compare December 20, 2023 00:40
Base automatically changed from james/ancestral-translate-fixes to master December 20, 2023 00:43
Copy link

codecov bot commented Dec 20, 2023

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (fc900aa) 65.96% compared to head (c91ca33) 66.14%.

Files Patch % Lines
augur/utils.py 81.05% 9 Missing and 9 partials ⚠️
augur/ancestral.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1351      +/-   ##
==========================================
+ Coverage   65.96%   66.14%   +0.17%     
==========================================
  Files          68       68              
  Lines        7172     7221      +49     
  Branches     1761     1777      +16     
==========================================
+ Hits         4731     4776      +45     
  Misses       2176     2176              
- Partials      265      269       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This follows my own advice in <#953 (comment)>
and sets the stage for subsequent commits to improve the code. There are
no functional changes here.
Both functions which call this would check for a return value of None
(indicating that the file didn't exist) and cause the augur function to
exit. It's cleaner to lift this into `load_features` and this makes it
easier for that function to raise errors in the future (e.g. on
malformed/empty reference files).
Also catches the edge case where a GFF has no valid rows. The printed
error messages should be helpful enough to identify the GFF formatting
error(s).
See comment added in code which details the previous behaviour. While
silently skipping genes without the necessary attributes isn't the best
solution in my opinion, it's at least now consistent (and also
consistent with how we handle GenBank parsing).

Closes #1349
This fixes a long-standing oversight in Augur where GFF files could only
define the nucleotide coordinates via a row with GFF type 'source'. We
now parse the (preferred) GFF type 'region' as well as the
'##sequence-region pragma'. This allows us to exit if the nuc
coordinates are not defined, and the error message should help users
correct their GFF files.

Note that the code is not yet implemented which guarantees the extracted
nuc coordinates will be exported in the JSON.

The changes here resulted in a number of tests needing updated, largely
due to us now parsing an additional feature from GFF files (the 'nuc'
feature)
A companion commit to the previous one, this time using GenBank files.
This name is reserved in our annotations schema to refer to the genome /
segment / sequence nucleotide annotation.
This builds off the preceding 3 commits which guarantee that a 'nuc'
feature will be parsed from the reference file. We now guarantee it'll
be exported in the node-data JSON.

Note that the change to the TB aa_muts.json test file was due to a bug
in the previous code, where `'type: feat['type']` would incorrectly
reuse the last defined `feat` in the preceding loop. (I think this is a
pitfall of using large "real-life" test files as it's impractical to
manually check the source-of-truth we are comparing against.)

Since the 'nuc' feature is guaranteed to exist, we can also check it
against the existing nuc annotation within `augur ancestral`, where
applicable.

Closes #953, although there is good commentary in that issue about
improving our parsing of GFFs more generally than that implemented here.

Closes #1346
These were previously either silently skipped or, in some cases an
unhandled exception was raised.
@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 20, 2023

Thanks again for the review @huddlej

Rebased onto master.

Merging despite CI failure, as that's unrelated to this PR - it's failing while running the pathogen-repo-rsv workflow (so did #1348). Issue created at nextstrain/rsv#50

@jameshadfield jameshadfield merged commit e346693 into master Dec 20, 2023
23 of 24 checks passed
@jameshadfield jameshadfield deleted the james/translate/nuc-annotation branch December 20, 2023 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants