Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing of GenBank / GFF files #1351

Merged
merged 10 commits into from
Dec 20, 2023
14 changes: 14 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,18 @@

## __NEXT__

### Major Changes

* ancestral, translate: GenBank files now require the (GFF mandatory) source feature to be present.[#1351][] (@jameshadfield)
* ancestral, translate: For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required. [#1351][] (@jameshadfield)

### Features

* ancestral, translate: A range of improvements to how we parse GFF and GenBank reference files. [#1351][] (@jameshadfield)
* translate will now always export a 'nuc' annotation in the output JSON, allowing it to pass validation
* Gene/CDS names of 'nuc' are now forbidden.
* If a Gene/CDS in the GFF/GenBank file is unparsed we now print a warning.
* utils::load_features: This function may now raise `AugurError`. [#1351][] (@jameshadfield)
* ancestral: For VCF alignments, a VCF output file is now only created when requested via `--output-vcf`. [#1344][] (@jameshadfield)
* ancestral: Improvements to command line arguments. [#1344][] (@jameshadfield)
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
Expand All @@ -16,9 +26,13 @@
* translate: Improvements to command line arguments. [#1348][] (@jameshadfield)
* `--tree` and `--ancestral-sequences` are now required arguments.
* separate VCF-only arguments into their own group
* translate: Fixes a bug in the parsing behaviour of GFF files whereby the presence of the `--genes` command line argument would change how we read individual GFF lines. Issue [#1349][], PR [#1351][] (@jameshadfield)


[#1344]: https://github.com/nextstrain/augur/pull/1344
[#1348]: https://github.com/nextstrain/augur/pull/1348
[#1351]: https://github.com/nextstrain/augur/pull/1351
[#1349]: https://github.com/nextstrain/augur/issues/1349

## 23.1.1 (7 November 2023)

Expand Down
8 changes: 6 additions & 2 deletions augur/ancestral.py
Original file line number Diff line number Diff line change
Expand Up @@ -323,8 +323,12 @@ def run(args):
from .utils import load_features
## load features; only requested features if genes given
features = load_features(args.annotation, args.genes)
if features is None:
raise AugurError("could not read features of reference sequence file")
# Ensure the already-created nuc annotation coordinates match those parsed from the reference file
if (features['nuc'].location.start+1 != anc_seqs['annotations']['nuc']['start'] or
features['nuc'].location.end != anc_seqs['annotations']['nuc']['end']):
raise AugurError(f"The 'nuc' annotation coordinates parsed from {args.annotation!r} ({features['nuc'].location.start+1}..{features['nuc'].location.end})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bummer that this + 1 logic needs to exist in so many places (ancestral, translate, and utils) to communicate or compare coordinates between different systems. It's the kind of logic that is easy to forget to include in future code and then lead to subtle bugs. I don't have a great suggestion to fix this except maybe encapsulating or replacing the SeqFeature functionality so we can interact with objects that format coordinates in our expected "one-origin, inclusive" format. That's outside of the scope of this PR though.

Copy link
Member Author

@jameshadfield jameshadfield Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally. Short of adding another layer on top of SeqFeature (which might alleviate this problem but at the cost of another layer to understand) I've leaned heavily on writing tests to catch this. Auspice has this issue as well and tests were the only path to sanity!

If this were JavaScript you could monkeypatch SeqFeature to add the method (prototype) startOneBased, but in python you would have to create a separate subclass as far as I understand.

Copy link
Member Author

@jameshadfield jameshadfield Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - so you can monkeypatch things in Python. For instance, adding

from Bio.SeqFeature import FeatureLocation
def start_one_based(self):
    return self.start + 1
FeatureLocation.start_one_based = start_one_based

at the top of utils.py seemingly allows us to call <feature>.location.start_one_based(), even when those features are created by SeqIO.read(...). I found this a bit surprising because that code is (somewhere) going to have its own import FeatureLocation, so I presume Python's import logic de-duplicates those imports and thus the above code snippet refers to the one-and-only FeatureLocation class.

Adding (or rebinding) methods also flows through to already instantiated objects. My understanding of how this works in Python isn't great (JS is much clearer for me). E.g. <instance>.<method> is clearly pointing to the (one-and-only) class definition, but <instance>.<some_data> is clearly pointing to an instance-specific address. My guess would be that instances have a lookup table of attributes (which in this example would include <some_data>), and if an attribute's not there we go look for it in the class (and then the parent class etc). It's python, so it's probably using namespaces.

To complicate matters a bit more, .location.start is an instance-attribute (of .location) and is itself an instance of <class 'Bio.SeqFeature.ExactPosition'>, so it couldn't be monkeypatched the way I demonstrate above. We might be able to emulate enough of the behaviour by using something like:

FeatureLocation.start_one_based = property(start_one_based)

So long story short, I'm not going to do this in this PR, but we absolutely could do this across Augur to improve the code readability.

f" don't match the provided sequence data coordinates ({anc_seqs['annotations']['nuc']['start']}..{anc_seqs['annotations']['nuc']['end']}).")

print("Read in {} features from reference sequence file".format(len(features)))
for gene in args.genes:
print(f"Processing gene: {gene}")
Expand Down
25 changes: 11 additions & 14 deletions augur/translate.py
Original file line number Diff line number Diff line change
Expand Up @@ -403,19 +403,16 @@ def run(args):

## load features; only requested features if genes given
features = load_features(args.reference_sequence, genes)
if features is None:
print("ERROR: could not read features of reference sequence file")
return 1
print("Read in {} features from reference sequence file".format(len(features)))

## Read in sequences & for each sequence translate each feature _except for_ the source (nuc) feature
## Note that `load_features` _only_ extracts {'gene', 'source'} for GFF files, {'CDS', 'source'} for GenBank.
## Read in sequences & for each sequence translate each feature _except for_ the 'nuc' feature name
## Note that except for the 'nuc' annotation, `load_features` _only_ looks for 'gene' (GFF files) or 'CDS' (GenBank files)
translations = {}
if is_vcf:
(sequences, ref) = sequences_vcf(args.vcf_reference, args.ancestral_sequences)
features_without_variation = []
for fname, feat in features.items():
if feat.type=='source':
if fname=='nuc':
continue
try:
translations[fname] = translate_vcf_feature(sequences, ref, feat, fname)
Expand All @@ -425,26 +422,26 @@ def run(args):
print("{} genes had no mutations and so have been be excluded.".format(len(features_without_variation)))
else:
sequences = sequences_json(args.ancestral_sequences, tree)
translations = {fname: translate_feature(sequences, feat) for fname, feat in features.items() if feat.type != 'source'}
translations = {fname: translate_feature(sequences, feat) for fname, feat in features.items() if fname!='nuc'}

## glob the annotations for later auspice export
#
# Note that BioPython FeatureLocations use
# "Pythonic" coordinates: [zero-origin, half-open)
# Starting with augur v6 we use GFF coordinates: [one-origin, inclusive]
annotations = {}
annotations = {
'nuc': {'start': features['nuc'].location.start+1,
'end': features['nuc'].location.end,
'strand': '+',
'type': features['nuc'].type, # (unused by auspice)
'seqid': args.reference_sequence} # (unused by auspice)
}
for fname, feat in features.items():
annotations[fname] = {'seqid':args.reference_sequence,
'type':feat.type,
'start':int(feat.location.start)+1,
'end':int(feat.location.end),
'strand': {+1:'+', -1:'-', 0:'?', None:None}[feat.location.strand]}
if is_vcf: #need to add our own nuc
annotations['nuc'] = {'seqid':args.reference_sequence,
'type':feat.type,
'start': 1,
'end': len(ref),
'strand': '+'}

## determine amino acid mutations for each node
try:
Expand Down
Loading
Loading