Allow users to specify arbitrary branch & clade labels #728

Merged
merged 13 commits into from
May 4, 2023

Conversation

jameshadfield
Member

@jameshadfield jameshadfield commented May 27, 2021

This PR consists of a pair of commits (see messages for details of each). These will allow us to specify the names of clades produced by augur clades and have augur export v2 export arbitrary branch labels without the need of ad-hoc scripts. This PR closes #720.

As an example for testing, the following patch shows how our ncov workflow can be simplified, as we can run multiple augur clades rules and pass their output directly to augur export, which removes the need for two extra rules.

diff --git a/workflow/snakemake_rules/main_workflow.smk b/workflow/snakemake_rules/main_workflow.smk
index f9cacf08..c978d321 100644
--- a/workflow/snakemake_rules/main_workflow.smk
+++ b/workflow/snakemake_rules/main_workflow.smk
@@ -915,6 +915,9 @@ rule clades:
         clades = rules.clade_files.output
     output:
         clade_data = "results/{build_name}/clades.json"
+    params:
+        trait_name = "clade_membership",
+        label_name = "clade"
     log:
         "logs/clades_{build_name}.txt"
     benchmark:
@@ -928,6 +931,7 @@ rule clades:
         augur clades --tree {input.tree} \
             --mutations {input.nuc_muts} {input.aa_muts} \
             --clades {input.clades} \
+            --trait-name {params.trait_name} --label-name {params.label_name} \
             --output-node-data {output.clade_data} 2>&1 | tee {log}
         """
 
@@ -940,7 +944,10 @@ rule emerging_lineages:
         emerging_lineages = config["files"]["emerging_lineages"],
         clades = config["files"]["clades"]
     output:
-        clade_data = "results/{build_name}/temp_emerging_lineages.json"
+        clade_data = "results/{build_name}/emerging_lineages.json"
+    params:
+        trait_name = "emerging_lineage",
+        label_name = "emerging_lineage"
     log:
         "logs/emerging_lineages_{build_name}.txt"
     benchmark:
@@ -954,28 +961,10 @@ rule emerging_lineages:
         augur clades --tree {input.tree} \
             --mutations {input.nuc_muts} {input.aa_muts} \
             --clades {input.emerging_lineages} \
+            --trait-name {params.trait_name} --label-name {params.label_name} \
             --output-node-data {output.clade_data} 2>&1 | tee {log}
         """
 
-rule rename_emerging_lineages:
-    input:
-        node_data = rules.emerging_lineages.output.clade_data
-    output:
-        clade_data = "results/{build_name}/emerging_lineages.json"
-    benchmark:
-        "benchmarks/rename_emerging_lineages_{build_name}.txt"
-    run:
-        import json
-        with open(input.node_data, 'r', encoding='utf-8') as fh:
-            d = json.load(fh)
-            new_data = {}
-            for k,v in d['nodes'].items():
-                if "clade_membership" in v:
-                    new_data[k] = {"emerging_lineage": v["clade_membership"]}
-        with open(output.clade_data, "w") as fh:
-            json.dump({"nodes": new_data}, fh, indent=2)
-
-
 rule colors:
     message: "Constructing colors file"
     input:
@@ -1124,7 +1113,7 @@ def _get_node_data_by_wildcards(wildcards):
         rules.refine.output.node_data,
         rules.ancestral.output.node_data,
         rules.translate.output.node_data,
-        rules.rename_emerging_lineages.output.clade_data,
+        rules.emerging_lineages.output.clade_data,
         rules.clades.output.clade_data,
         rules.recency.output.node_data,
         rules.traits.output.node_data,
@@ -1180,28 +1169,10 @@ rule export:
             --output {output.auspice_json} 2>&1 | tee {log}
         """
 
-rule add_branch_labels:
-    message: "Adding custom branch labels to the Auspice JSON"
-    input:
-        auspice_json = rules.export.output.auspice_json,
-        emerging_clades = rules.emerging_lineages.output.clade_data
-    output:
-        auspice_json = "results/{build_name}/ncov_with_branch_labels.json"
-    log:
-        "logs/add_branch_labels{build_name}.txt"
-    conda: config["conda_environment"]
-    shell:
-        """
-        python3 ./scripts/add_branch_labels.py \
-            --input {input.auspice_json} \
-            --emerging-clades {input.emerging_clades} \
-            --output {output.auspice_json}
-        """
-
 rule incorporate_travel_history:
     message: "Adjusting main auspice JSON to take into account travel history"
     input:
-        auspice_json = rules.add_branch_labels.output.auspice_json,
+        auspice_json = rules.export.output.auspice_json,
         colors = lambda w: config["builds"][w.build_name]["colors"] if "colors" in config["builds"][w.build_name] else ( config["files"]["colors"] if "colors" in config["files"] else rules.colors.output.colors.format(**w) ),
         lat_longs = config["files"]["lat_longs"]
     params:
@@ -1228,7 +1199,7 @@ rule incorporate_travel_history:
 rule finalize:
     message: "Remove extraneous colorings for main build and move frequencies"
     input:
-        auspice_json = lambda w: rules.add_branch_labels.output.auspice_json if config.get("skip_travel_history_adjustment", False) else rules.incorporate_travel_history.output.auspice_json,
+        auspice_json = lambda w: rules.export.output.auspice_json if config.get("skip_travel_history_adjustment", False) else rules.incorporate_travel_history.output.auspice_json,
         frequencies = rules.tip_frequencies.output.tip_frequencies_json,
         root_sequence_json = rules.export.output.root_sequence_json
     output:

I've tested this in a variety of settings, but more is needed. Unit tests (or similar) would be useful here, but it's been a while since I've written these for augur (anyone want to pair program these?).

@jameshadfield jameshadfield requested review from huddlej and a team May 27, 2021 23:46
augur/clades.py Outdated
def create_node_data_structure(basal_clade_nodes, clade_membership, args):
    node_data = {}

    if (not args.label_name and not args.trait_name):
Member Author

@jameshadfield jameshadfield May 28, 2021

I wanted to allow workflows to continue without needing changes. There were 2 ways I could think of allowing this:

  1. augur clades without these 2 arguments used the old behaviour & exported both clade_membership and clade_annotation as node traits. These would be picked up by augur export v2 and the latter turned into the branch label clade. The downside is that the file structure for augur clades is different if you don't provide arguments than if you do.
  2. augur clades without these 2 arguments now stores clade membership as before but stores branch labels in the new branch_labels structure under a key clade. This structure needs no special interpretation by augur export v2, and we will end up with identical Auspice JSONs as previously. The downside is that the format of the file produced by augur clades differs.

I went with option 2, but am open to other suggestions.

Contributor

@huddlej huddlej Jun 3, 2021

Can we make clade and clade_membership the default values for the label and attribute names and store their values in the new structure? Is there a reason to make these required arguments in the future?

As a user, I would be surprised to find I need to define these values when I've never needed to before and I'd probably just use the defaults anyway.

If we allow these arguments to have default values, then we only need five lines of this function and those can be moved into run.

Edit (james) - got confused with GitHub's inlining, hid this comment, and now can't unhide it...

Member Author

Good question! My reason for requiring users to specify is to allow users to call augur clades multiple times in a single workflow, sometimes with branch labels, sometimes without (and perhaps sometimes without trait names etc). If these had defaults, then the defaults will end up exported in the auspice JSON which may be undesired and potentially confusing as it wouldn't be clear which invocation of augur clades produced it.

Concrete examples for discussion:

augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur export ...

This is going to end up with "clade" branch labels representing pangolin clades, which wasn't the desired intention of the workflow.

augur clades ... --trait-label pango # no branch label - not guaranteed monophyletic
augur clades ... --branch-label emerging_lineage # no trait labels
augur clades ... --trait-label WHO
augur export ...

We're going to get branch labels "clade", which I think will be WHO clades (this is an implementation detail of augur export as to which one is picked - worst case they may be a mixture!). We're also going to get a colouring clade_membership, which actually refers to emerging lineage, but it isn't obvious why.

Contributor

Got it. Those examples helped! If I understand correctly, there are two separate issues that we’re trying to address by requiring these new arguments:

  1. Users can run augur clades multiple times with the same (default) attribute/label names and augur export will happily consume these and prefer one over the other in a surprising or unpredictable order (at least for the user).

  2. Users can run augur clades with one or both of the new arguments depending on what they want to annotate (clade attributes only, branch labels only, or both). Using default values would produce unwanted outcomes when users choose only attributes or only branch labels and also gets an annotation for the other possible representation.

Is that summary generally correct?

I can see how requiring these arguments tries to protect against conflicting node/branch attributes in augur export. This is similar to why augur distance requires --attribute-name. But this seems to be a general problem with the export logic where we don't check (I think?) for collisions in attribute names from different data sources. So, even though we require the user to specify attribute names, there is no reason they couldn’t specify the same names in separate commands and still get a surprise collision. Another way to address issue 1 would be to check for these types of collisions in augur export and either warn the user or throw an error. In addition to addressing Issue 1 here, this solution would also address other cases in the real world where people accidentally define the same attribute in separate runs of other augur commands. If issue 1 was the only issue, I'd still prefer to set sane defaults and not expect the user to change their behavior.

Issue 2 is one I missed on my initial read through the code (that you can define attributes or labels and not both). Still, I wonder about how bad it would be for users to get branch labels when they only request attributes. If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature? Is the worst case scenario here that the user is annoyed to get an annotation they don’t expect? We have already been providing these dual annotations, so would they actually be surprised? The main issue seems to be when the default names for the other representation conflict across multiple runs of the same command.

This example also makes me wonder about the value of using different names for attributes and branch labels. The name we use describes the data source of the clade annotations and not how Auspice represents clade annotations. That a clade appears as a color-by or branch label in Auspice is a separate technical consideration.

I also don’t see the harm in annotating both node and branch attributes with the same name by default. What if we use the same name for both attributes and keep a sane default value (e.g., “clade”)? This approach allows the user who only runs clades once in a workflow to change nothing and run:

# Annotate both node and branch attributes. The user gets
# output that differs in its JSON structure but appears the
# same way in Auspice as it always has. Augur export knows
# how to handle the new JSON structure in this same release
# of Augur, so we don't need any special checks for backward
# compatibility.
augur clades \
    --clades clades.tsv \
    --output clades.json

Then, the user who wants to run multiple instances of clades in a single workflow can run the following commands to be more explicit about their attribute names:

# Provide explicit node/branch attribute names.
augur clades \
    --clades clades.tsv \
    --attribute-name nextstrain_clade \
    --output clades.json

augur clades \
    --clades pango.tsv \
    --attribute-name pango \
    --output pango.json

If users specify the same attribute name in separate data sources, augur export should complain loudly:

# Use the default attribute name. Annotate both node and
# branch attributes.
augur clades \
    --clades clades.tsv \
    --output clades.json

# Accidentally reuse the same default attribute name.
augur clades \
    --clades pango.tsv \
    --output pango.json

# Validate attribute names from distinct data sources.
augur export v2 \
    --node-data clades.json pango.json
    ...
ERROR: Multiple node data files ("clades.json", "pango.json") provide the same attribute name ("clade"). Resolve conflicting attribute names (e.g., by specifying `--attribute-name`) for these data files and try again.

Allowing default values makes this a backward-compatible change where most users do not have to do anything. Using the same name for node and branch attributes allows the user to know which augur clades invocation produced those attributes and not have to think about how the clade annotation is represented in Auspice. Checking for collisions in node/branch attribute names in augur export alerts the user when they accidentally reuse the same attribute names in separate invocations and tells them how to correct the problem (and fixes a more general issue with augur export). I think this approach also simplifies the code in this PR a bit by reusing the same attribute name.

What do you think?
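The collision check proposed above could look roughly like the following. This is a minimal sketch, not augur's implementation: the function name `find_attribute_collisions` and the in-memory dict inputs are hypothetical, and it only inspects the `nodes` section of each file.

```python
from collections import defaultdict


def find_attribute_collisions(node_data_by_file):
    """Map each attribute name defined by more than one file to those files.

    `node_data_by_file` maps a file name to its parsed node-data dict.
    (Hypothetical helper; augur export does not necessarily work this way.)
    """
    sources = defaultdict(set)
    for fname, data in node_data_by_file.items():
        for attrs in data.get("nodes", {}).values():
            for attr in attrs:
                sources[attr].add(fname)
    return {attr: sorted(files) for attr, files in sources.items() if len(files) > 1}


# Two invocations of augur clades that both used the default name "clade".
clades = {"nodes": {"tipA": {"clade": "20A"}, "tipB": {"clade": "20B"}}}
pango = {"nodes": {"tipA": {"clade": "B.1"}, "tipB": {"clade": "B.1.1.7"}}}

collisions = find_attribute_collisions({"clades.json": clades, "pango.json": pango})
for attr, files in collisions.items():
    print(f"ERROR: Multiple node data files {files} provide the same attribute name ({attr!r}).")
```

A real check would also have to decide which attributes may legitimately be split across files, which this sketch ignores.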

Member Author

@jameshadfield jameshadfield Jun 5, 2021

Considering the scope of augur clades, I agree with these points (branch labels and node attrs stored under the same attribute name, both exported, one optional `--attribute-name` arg with a default of "clade"), and am happy to make the changes to that code.

Where it gets tricky is in augur export, because that is when we combine various pieces of data into a desired visualisation. To cleanly demarcate data generation vs visualisation, I do think these complexities are the remit of augur export. How we determine what's exported has always been somewhat poorly documented and without looking at the code I can't remember what happens in many cases:

  • What happens if pieces of (meta)data differ in the metadata TSV and a node-data file?
  • Is a node-data attribute always exported as a colouring, even if we provide a list of desired colourings in an auspice config JSON which doesn't specify it?
  • what about if we provide a list of colourings on the command line?
  • What about if we do both?
  • Are there special cases? (Yes, at least 18 and probably more.)

This relates to this PR as I think we want to have answers to the following questions:

  • Previously, clade colourings were exported as "clade_membership", and this was always set as a colouring if a node-data file provided it. This is easy to update to "clade" if we want to keep this behaviour.
  • If an auspice config JSON specified a colouring for key="clade_membership", which many do, but such an attribute is no longer provided in any node-data JSONs, what do we do?
  • Is there a way to limit the exported colourings from node-data produced by augur clades? i.e. is specifying a list of colourings in the config JSON able to prevent the export of such a node-data attribute?
  • Currently there's no general way to export branch labels (that's part of this PR). Do we extend the auspice config PR to allow these to be specified? Does this act the same way as colorings?

P.S.

If I ask for emerging lineage branch labels and I get an emerging lineage color-by as a side effect, is that a bug or a feature?

It's a bug. The dataset for visualisation should be completely customisable - if you believe such a colouring / branch label is scientifically not valid, you should be able to prevent it appearing in Auspice. I realise there's many cases in augur export where things like these happen; they're bugs.

Contributor

@huddlej huddlej left a comment

This is awesome, @jameshadfield! It's going to make the ncov workflow simpler, but it also paves the way for us to do cool things with custom branch labels or alternate clade annotations in other projects.

My main request below is that we provide default values for the new attribute/label variables, so users do not have to provide values if they don't want to.

As with the schema update PR, we could merge this as is, or we could pair-program some doctests. Whatever works best for you...

@jameshadfield
Member Author

jameshadfield commented Jun 5, 2021

Thanks for the great review @huddlej. Following on from #728 (comment), I've updated augur clades with your suggestions in 16280dd (I'll squash this with its parent before merge), but I haven't updated augur export v2 yet.

I started some overly simple functional tests of augur clades using a small tree and a few mutations:

[image: the small test tree used in these tests]

While creating these tests I noticed a bunch of little things which are all out of scope for this PR... should I create issues for these?

  • Despite the help indicating that nucleotide and/or amino-acid mutations are required, the node-data JSONs, when combined, must contain muts and aa_muts keys for each node because the augur clades code assumes their existence.
  • Every node in the tree must have a corresponding entry in a node-data JSON, even if it has no mutations (this is asserted in NodeReader)
  • A single branch can define multiple mutations at the same position without an error being thrown, but each mutation overrides the previous and the results are unexpected. We should probably exit in this case.
  • #-prefixed lines in the clades TSV work as comments, but they're actually read as potentially valid clade definitions! I suggest we add comment='#' to pd.read_csv here.
  • The behaviour of augur clades means that if there are multiple nodes containing clade-defining mutations (i.e. the clade is polyphyletic), then we only annotate clades on the biggest monophyly. We should warn when situations like this arise, or allow this to be relaxed. I expect it'll become common to want to define "clades" via a small set of constellation nCoV mutations, and expect polyphyletic colourings in Auspice.
  • Relatedly, how we calculate "biggest" took a bit of time for me to understand. As far as I can tell (it may be different for VCF inputs), we count the number of descendant nodes which have not mutated away from the clade-defining set of mutations, but don't require these nodes to actually be in the clade (e.g. tipE counted as within cladeDEF for this purpose, but in the output it is (correctly) annotated as cladeE).
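One of the points above, multiple mutations at the same position on a single branch, can be detected with a small check like the one below. This is a sketch under assumptions: mutations are taken to be strings such as "A5T" (reference state, 1-based position, derived state), and the function name `positions_mutated_more_than_once` is hypothetical rather than part of augur.

```python
import re
from collections import Counter

# Assumed mutation format: one state character, a position, one state character.
MUTATION_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def positions_mutated_more_than_once(muts):
    """Return the sorted positions that appear in more than one mutation string."""
    counts = Counter()
    for mut in muts:
        match = MUTATION_RE.match(mut)
        if match is None:
            raise ValueError(f"Unrecognised mutation string: {mut!r}")
        counts[int(match.group(2))] += 1
    return sorted(pos for pos, n in counts.items() if n > 1)


# Position 5 is mutated twice on this (made-up) branch, so it would be reported.
branch_muts = ["A5T", "T5G", "C10T"]
duplicates = positions_mutated_more_than_once(branch_muts)
if duplicates:
    print(f"Multiple mutations at the same position(s) on one branch: {duplicates}")
```

Whether to exit or only warn when such positions are found is a separate policy decision.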

@rneher
Member

rneher commented Jun 9, 2021

This looks pretty good to me. A few questions:

We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly. I think the current (and probably previous) behavior

augur/augur/clades.py

Lines 152 to 158 in 16280dd

# propagate 'clade_membership' to children nodes
# don't propagate if encountering 'clade_annotation'
for node in tree.find_clades(order = 'preorder'):
    for child in node:
        # if the child doesn't define the start of its own clade, but the parent belongs to a clade, then inherit that membership
        if child.name not in basal_clade_nodes and node.name in clade_membership:
            clade_membership[child.name] = clade_membership[node.name]

is fine. But might be good to stick in a comment.
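To make the root-node case concrete, here is a self-contained sketch of the same propagation rule on a toy tree. It is an illustration only: the tree is a plain dict of child lists rather than a Bio.Phylo tree, the function name `propagate_clade_membership` is hypothetical, and it assumes membership starts out as a copy of the basal clade nodes.

```python
def propagate_clade_membership(children, root, basal_clade_nodes):
    """Push clade membership from parents to children, parents first.

    `children` maps a node name to a list of child names and
    `basal_clade_nodes` maps the first node of each clade to the clade name.
    """
    clade_membership = dict(basal_clade_nodes)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            # Inherit from the parent unless the child starts its own clade.
            if child not in basal_clade_nodes and node in clade_membership:
                clade_membership[child] = clade_membership[node]
            stack.append(child)
    return clade_membership


# The root carries no clade-defining mutations, so it (and tipC) stay unassigned.
toy_tree = {"root": ["n1", "tipC"], "n1": ["tipA", "tipB"]}
membership = propagate_clade_membership(toy_tree, "root", {"n1": "cladeX"})
print(membership)
```

In this toy example neither `root` nor `tipC` ends up in the result, which is the implicit "no clade" case being discussed.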

I am wondering whether we should instead of branch_labels here

augur/augur/clades.py

Lines 212 to 215 in 16280dd

node_data = {
    'nodes': {node: {args.attribute_name: clade} for node,clade in clade_membership.items()},
    'branch_labels': {node: {args.attribute_name: clade} for node,clade in basal_clade_nodes.items()}
}

use a structure like this

{
  nodes: { node1: { key: value}...},
  branches: { branch1: {key:value}...}
}

the key: value in this case could be labels: { pango: B.1.1.7}.

The structure would be a bit more symmetrical in branches and nodes and might be more future proof because we could add additional branch attributes without cluttering the top level. In other commands, the top level is used for auxiliary info like version numbers, etc.

@jameshadfield
Member Author

jameshadfield commented Jun 9, 2021

Thanks @rneher

Current root node behaviour hasn't changed, but I'm not exactly sure what you mean. Are you saying that if a clade should be defined at the root, augur clades wouldn't do this? I would have expected it to do so if you provided a reference sequence.

re: updated nodes & branches structure, you're essentially proposing that the node-data structure for branches start to converge on the auspice dataset structure for branch_attrs. Would it be strange to have different structure for nodes & branches in node-data JSONs? cc @huddlej

jameshadfield added a commit to nextstrain/ncov that referenced this pull request Jun 10, 2021
This commit is a WIP commit to test the new functionality
being introduced in augur PR 728 [1]. This allows us to
simplify the nCoV workflow as we can explicitly define the
attribute names used for clade membership and branch
labelling.

These changes have only been tested for the "open" build,
which itself is a WIP.

[1] nextstrain/augur#728
@huddlej huddlej added this to the Feature release 12.1.0 milestone Jun 14, 2021
@huddlej
Contributor

huddlej commented Jun 14, 2021

From @rneher's review:

We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly.

@jameshadfield, I understood this to mean that Augur is not guaranteed to assign the root node to a clade (the root sequence might not have any of the defined mutations), so its clade membership is implicitly undefined. We could add a check for the root node in the clades dict and then explicitly assign it a value, to make this logic clearer.

From @rneher's review:

The structure would be a bit more symmetrical in branches and nodes and might be more future proof...[snip]

I like this symmetry, too, as a flexible way to annotate anything we like about branches and mostly for the parallel naming of "physical" objects.

From @jameshadfield:

Would it be strange to have different structure for nodes & branches in node-data JSONs?

When you and I talked about this on Zoom, @jameshadfield, I think this was why we didn't use branches instead of branch_labels, but now I don't fully understand the issue. Is the issue that the final Auspice JSONs produce node_attrs and branch_attrs that don't have the same structure, so it might be misleading to use node data JSON inputs that suggest those inputs will be structured similarly?

Even if this is the case, it seems that the node data JSON format is a kind of generic interface that could be decoupled from how the final output of augur export handles the data. Not knowing nearly as much about Auspice as you and Richard, I wouldn't be surprised if augur export transformed my node data into something Auspice-specific...

Contributor

@huddlej huddlej left a comment

Thank you for the edits to the main interface, @jameshadfield. This looks really good. The only bit to resolve before we merge is the branch_labels vs. branches naming question.

augur/clades.py Outdated

# third pass to propagate 'clade_membership'
# propagate 'clade_membership' to children nodes
Contributor

This is where we could check for the root node's clade membership and assign it something like "undefined", if we wanted to handle this case explicitly.

Member Author

@jameshadfield jameshadfield Jun 15, 2021

I'm still a bit unsure about all this.

Nodes not part of clades ("undefined") aren't part of the output of augur clades, so explicitly annotating the root node as such would be strange.

When the inputs define the sequence for the root node, then the root can be annotated with a clade - see the nCoV workflow where the root node+branch are assigned clade 19A. My understanding is that this needs the entire root sequence as an input, we don't infer this from the observed mutations, but this is something we should test / document.

@jameshadfield
Member Author

Rebased this onto master now that #737 is merged & updated the issues @huddlej pointed out.

My observation about the node-data structure doesn't involve auspice, rather the difference this would cause in nodes & branches structure within a single node-data JSON, with branches having a second level of hierarchy, e.g.

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "labels": {
      "NODE_0000000": {
        "clade_membership": "19A"
      }
    }
  }
}

As long as we are aware of this, I'm happy to shift to this structure.

@huddlej
Contributor

huddlej commented Jun 15, 2021

Ah, I see. That example clears it up. I think what @rneher is recommending looks like this instead:

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "NODE_0000000": {
      "labels": {
        "clade": "19A"
      }
    }
  }
}

How do you feel about this approach?

@rneher
Member

rneher commented Jun 15, 2021

yes, this is what I meant. I hope this is more generic and future proof (things like support values could live on branches). On the other hand, we do assign a bunch of things to nodes that really should be branch properties (like mutations or branch lengths). So I guess we could stick the branch label to the node structure as

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A",
      "branch_labels": {"clade": "20A"}
    }
  }
}

But I would prefer a top-level branches to a top-level branch_labels.

@jameshadfield
Member Author

Updated this PR to use the new structure from @huddlej / @rneher above:

{
  "nodes": {
    "ARG/Cordoba-12873-61/2020": {
      "clade_membership": "20A"
    }
  },
  "branches": {
    "NODE_0000000": {
      "labels": {
        "clade": "19A"
      }
    }
  }
}

And added some more functional tests. I think it'd be worth running nextstrain/ncov#660 with this (updated) PR as a final round of tests before merge. I'll start this run now.
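For reference, a node-data dict in the agreed `nodes` / `branches` layout can be assembled along these lines. This is a hedged sketch rather than the code merged in this PR: the function name `build_node_data` and its keyword arguments are hypothetical, and the key names are simply those used in the JSON example in this thread.

```python
import json


def build_node_data(clade_membership, basal_clade_nodes,
                    membership_key="clade_membership", label_key="clade"):
    """Assemble node data with per-node attributes and per-branch labels."""
    return {
        "nodes": {
            node: {membership_key: clade}
            for node, clade in clade_membership.items()
        },
        "branches": {
            node: {"labels": {label_key: clade}}
            for node, clade in basal_clade_nodes.items()
        },
    }


node_data = build_node_data(
    clade_membership={"ARG/Cordoba-12873-61/2020": "20A"},
    basal_clade_nodes={"NODE_0000000": "19A"},
)
print(json.dumps(node_data, indent=2))
```

Keeping `labels` one level below each branch name leaves room for other per-branch attributes later without changing the top-level keys.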

@jameshadfield
Member Author

jameshadfield commented Apr 11, 2023

After being on the agenda forever I'm finally going to get this merged. The overall summary is as per this comment above.

[@joverlee521] Based on conversation in Auspice, we should check that any arbitrary label_key is not "none" so that they don't clash with ?branchLabel=none to hide branch labels.

Good call - I've modified augur export and added this to a test to ensure such a key will not be exported.
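A guard of this kind might look like the following. This is only a sketch of the idea, not the change made to augur export: the names `RESERVED_LABEL_KEYS` and `exportable_labels` are hypothetical, and the choice to skip the key with a message (rather than fail) is an assumption.

```python
import sys

# "none" is reserved because Auspice uses ?branchLabel=none to hide branch labels.
RESERVED_LABEL_KEYS = {"none"}


def exportable_labels(labels):
    """Return a copy of `labels` without keys that would clash with Auspice."""
    kept = {}
    for key, value in labels.items():
        if key in RESERVED_LABEL_KEYS:
            print(f"WARNING: branch label key {key!r} is reserved and was skipped.",
                  file=sys.stderr)
            continue
        kept[key] = value
    return kept


filtered = exportable_labels({"clade": "19A", "none": "hidden?"})
print(filtered)
```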

[@trvrb] Start with labels as the only thing in branches and plan to migrate mutations etc... down the line.

I think this is the better direction - as per John & Richard's comments above. I do see the fear that mutations never get moved across, but I hope they do!

[@rneher] We don't seem to handle the case when the root node of the tree is not assigned to a clade explicitly.

I'm still wrapping my head around this and trying to construct a test to really understand what's going on here (and to understand if providing a reference changes things). I'll do that separately to this PR however as the behavior is unchanged here

This PR will close #720
This PR will close #1027

Our current implementation of read_node_data requires that every
node in the tree is specified in the (merged) node_data files. For
mutations this is overkill -- many nodes don't have mutations and it's
overkill to require node_data JSONs to specify things like
`"node_name": {"muts": []}`.

This may well be the general behaviour we want, but I didn't want to
modify the read_node_data function which sees extensive use.

A welcome side effect of these changes is that we no longer have to
supply both nuc and aa_muts.
See comments in tests/functional/clades.t

Also adds / updates comments and docstrings which were noticed as I
worked through the code relating to these tests.

Workflows may be using this so I elected to hide it rather than remove
it (and warn people it's a no-op if they do happen to be using it)

This function had a few subtle bugs in it which are fixed here, as well
as improving the warning message to explain how this may affect clade
inference.

Note that the presence of sequences on nodes other than the root is
not considered by augur clades.

We could check all of these up-front instead of exiting upon the first
error, and such a check should be part of validation within augur
clades, but this commit is a simple solution to fix a reported bug.

Closes #965

A fatal error is raised if no clades are defined, but if a clade is not
found on the tree it's only a warning.
Suggested in #735

Multiple mutations at the same position on a single branch are now a
fatal error. Previous behaviour was to overwrite such mutations when
parsing. Suggested by #735.
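
A minimal sketch of such a check, assuming mutations are strings like `A123T` (ancestral state, 1-based position, derived state). The names `parse_branch_mutations` and `DuplicatePositionError` are hypothetical and not taken from augur:

```python
import re

# Assumed mutation format for this sketch: one state character, digits, one state character.
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


class DuplicatePositionError(Exception):
    """Raised when one branch lists two mutations at the same position."""


def parse_branch_mutations(muts):
    """Map position -> (from_state, to_state), raising instead of overwriting duplicates."""
    by_position = {}
    for mut in muts:
        match = MUTATION.match(mut)
        if match is None:
            raise ValueError(f"Unrecognised mutation string: {mut!r}")
        from_state, pos, to_state = match.group(1), int(match.group(2)), match.group(3)
        if pos in by_position:
            raise DuplicatePositionError(
                f"Multiple mutations at position {pos} on a single branch"
            )
        by_position[pos] = (from_state, to_state)
    return by_position
```

Raising here, rather than keeping the last mutation seen, surfaces malformed input instead of silently changing the inferred state.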
Multiple improvements to augur clades
@jameshadfield jameshadfield merged commit 631feb6 into master May 4, 2023
@jameshadfield jameshadfield deleted the branch-labels branch May 4, 2023 03:50
corneliusroemer commented May 15, 2023

@jameshadfield It would be good to add to the Changelog how the internal representation of node data has changed. I couldn't find the info at a glance and this PR has many comments. See e.g. this failure where a script-created node-data JSON is no longer accepted by export: nextstrain/conda-base#27 (comment)

ERROR: results/europe/rbd_levels.json did not contain either `nodes` or `branches`. Please check the formatting of this JSON!

Also, I think this should be reclassified as a breaking change, given that we use a lot of custom scripts in our workflows.

corneliusroemer added a commit that referenced this pull request May 15, 2023
Resolves #1215

Warn instead of erroring when a node data JSON has no nodes, fixing an issue recently introduced in PR #728

In PR #728, extra node data validation was introduced. In particular, files without information for either `nodes` or `branches` caused an error.

This is problematic for test scripts that may produce empty node data in test cases.

This PR removes the eager validation. In the future we could reintroduce it as a warning, and possibly as an error with an opt-out.

This type of node data JSON previously caused an error in augur export; it is now accepted again:

```json
{
  "nodes": {},
  "rbd_level_details": {}
}
```
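
A rough sketch of the "warn rather than error" behaviour, for illustration. The function `check_node_data` is hypothetical and not augur's implementation; it warns when neither `nodes` nor `branches` carries any information, and still returns the parsed data:

```python
import json
import warnings


def check_node_data(raw, name="<node-data>"):
    """Parse a node-data JSON string, warning (not failing) if it has no node or branch info."""
    data = json.loads(raw)
    if not data.get("nodes") and not data.get("branches"):
        warnings.warn(f"{name} has no information under `nodes` or `branches`")
    return data


# The example above is accepted; only a warning is emitted.
data = check_node_data('{"nodes": {}, "rbd_level_details": {}}', "rbd_levels.json")
```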

Fixes the ncov pathogen-CI issue: nextstrain/conda-base#27 (comment)

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

- [x] nextstrain/conda-base#27 (comment) is fixed, export now accepts empty nodes dicts again
jameshadfield added a commit to nextstrain/ncov that referenced this pull request May 16, 2023
This updates the workflow to use the new clades interface from augur
v22 (see nextstrain/augur#728). In the process we can remove two rules
from the workflow.

If this workflow is run with augur prior to v22, the emerging_lineages
rule will error due to unknown arguments.

The script add_branch_labels.py is no longer used, but not removed here,
as it contains logic to export spike mutations as branch labels
which may be useful at some point. If we do use this, it would be better
to produce an intermediate node-data JSON with a custom branch label
to avoid modifying the auspice JSON after export.
jameshadfield added a commit that referenced this pull request May 16, 2023
The intention of the coloring logic is that if an auspice-config provides
the clade_membership key then it is exported at that position in the
colorings list. If clade_membership is not explicitly set in the config
(but is present in a node-data file) then we have (for a very long time)
added it as the very first entry in the colorings list.

PR #728 (augur v22.0.0) erroneously modified the behavior of the second
case described above, which has now been restored by this commit.
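
The intended ordering can be sketched as follows. The function `order_colorings` is a hypothetical stand-in for the real export logic, which handles far more than key order:

```python
def order_colorings(config_keys, node_data_keys):
    """Return coloring keys in export order.

    Keys from the auspice config keep their position. If clade_membership is
    present in node data but not listed in the config, it goes first.
    """
    keys = list(config_keys)
    if "clade_membership" in node_data_keys and "clade_membership" not in keys:
        keys.insert(0, "clade_membership")
    return keys


explicit = order_colorings(["region", "clade_membership"], {"clade_membership"})
implicit = order_colorings(["region"], {"clade_membership"})
```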
jameshadfield added a commit to nextstrain/ncov that referenced this pull request May 16, 2023
This updates the workflow to use the new clades interface from augur
v22.0.1 (see nextstrain/augur#728). In the process we can remove two
rules from the workflow.

If this workflow is run with augur prior to v22, the emerging_lineages
rule will error due to unknown arguments.

The script add_branch_labels.py is no longer used, but not removed here,
as it contains logic to export spike mutations as branch labels
which may be useful at some point. If we do use this, it would be better
to produce an intermediate node-data JSON with a custom branch label
to avoid modifying the auspice JSON after export.
jameshadfield added a commit to nextstrain/ncov that referenced this pull request May 16, 2023
This updates the workflow to use the new clades interface from augur
v22 (see nextstrain/augur#728). In the process we can remove two
rules from the workflow. The minimum augur version is bumped to 22.0.1,
as that includes a couple of important bug-fixes.

If this workflow is run with augur prior to v22, the emerging_lineages
rule will error due to unknown arguments.

The script add_branch_labels.py is no longer used and thus removed here
(as recommended in code review: #1000 (comment))
Note that it contained unused functionality to export spike mutations;
if we reinstate this in the future we should update the output format
to produce a node-data JSON with a custom branch label to avoid modifying
the auspice JSON after export.
joverlee521 added a commit to nextstrain/ncov that referenced this pull request Jan 29, 2024
The JSON output from `augur clades` was updated to separate `nodes`
and `branches` in nextstrain/augur#728 so now
the `assign_rbd_levels` script needs to parse the `branches` in order
to find the basal node.
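
A minimal sketch of reading basal nodes from such a file. It assumes a layout in which `branches` maps node names to `{"labels": {<label name>: <clade>}}`; the helper `basal_nodes_by_clade` is hypothetical and not the actual `assign_rbd_levels` code:

```python
def basal_nodes_by_clade(clades_json, label_name="clade"):
    """Map each clade label to the node whose branch carries that label."""
    basal = {}
    for node_name, branch in clades_json.get("branches", {}).items():
        label = branch.get("labels", {}).get(label_name)
        if label is not None:
            basal[label] = node_name
    return basal


# Assumed example layout: membership lives under `nodes`, labels under `branches`.
clades_json = {
    "nodes": {
        "NODE_1": {"clade_membership": "20A"},
        "tip_a": {"clade_membership": "20A"},
    },
    "branches": {
        "NODE_1": {"labels": {"clade": "20A"}},
    },
}
basal = basal_nodes_by_clade(clades_json)
```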
Successfully merging this pull request may close these issues.

Ability to export branch labels
7 participants