Allow users to specify arbitrary branch & clade labels #728

Merged 13 commits on May 4, 2023

Commits on Apr 11, 2023

  1. Allow branch labels in node-data JSONs

    Previously branch labels could not be specified in data passed to
    `augur export v2` except for two "special cases":
    (i) AA mutations (stored in node-data-json -> nodes) would create branch
    labels "aa", if applicable.
    (ii) `clade_annotation` (stored in node-data-json -> nodes) was
    interpreted to be the "clade" branch label, and exported as such.
    
    Here we extend the allowed node-data structure to include a top-level
    key `branches` as described in [1] and the test data added here [2].
    This data is exported in the appropriate format for Auspice (unchanged).
    This paves the way for pipelines to define a range of branch labels for
    export. Currently the only usable key in this dict is 'labels'.
    
    If a branch label (via node-data-json -> branches -> node_name -> labels)
    is provided for 'aa' or 'clade' then this will overwrite the values
    generated above (i, ii).
    
    A side-effect of this work is that the requirement for node-data JSONs
    to specify "nodes" has been relaxed (see [2] for an example); however
    if neither "nodes" nor "branches" are defined then we raise a validation
    error.
    
    [1] #720
    [2] ./tests/functional/export_v2/branch-labels.json
    jameshadfield committed Apr 11, 2023
    90d1a5f
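The precedence described in this commit can be sketched roughly as follows. This is an illustration only, not augur's implementation: the function name `collect_branch_labels` and the sample values are hypothetical, and only the `nodes` / `branches` / `labels` layout and the `clade_annotation` key are taken from the commit message. The AA-mutation special case (i) is left out for brevity.

```python
def collect_branch_labels(node_data):
    """Return {node_name: {label_name: value}} (hypothetical helper)."""
    nodes = node_data.get("nodes")
    branches = node_data.get("branches")
    # "nodes" is no longer mandatory, but at least one of the two keys must exist.
    if nodes is None and branches is None:
        raise ValueError("node-data must define 'nodes' and/or 'branches'")

    labels = {}
    # Special case (ii): clade_annotation is interpreted as the 'clade' label.
    for name, attrs in (nodes or {}).items():
        if "clade_annotation" in attrs:
            labels.setdefault(name, {})["clade"] = attrs["clade_annotation"]
    # Explicitly provided branch labels overwrite any generated value.
    for name, attrs in (branches or {}).items():
        labels.setdefault(name, {}).update(attrs.get("labels", {}))
    return labels


example = {
    "nodes": {"NODE_1": {"clade_annotation": "A"}},
    "branches": {"NODE_1": {"labels": {"clade": "A.1", "custom": "x"}}},
}
print(collect_branch_labels(example))
```

Here the explicit `clade` label from `branches` wins over the value derived from `clade_annotation`, and the extra `custom` label is passed through unchanged.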
  2. [clades] export labels as specific branch labels

    Previously clade membership (i.e. the coloring) and the branch labels
    defining the root of the clade were defined via:
    <OUTPUT_NODE_DATA> → nodes → <node_name> → clade_membership, and
    <OUTPUT_NODE_DATA> → nodes → <node_name> → clade_annotation.
    `augur export` would then convert the clade_annotation into a
    branch label named 'clade'.
    
    Here we change the format of the OUTPUT_NODE_DATA written by
    `augur clades` so that the membership and labels are now stored via:
    <OUTPUT_NODE_DATA> → nodes → <node_name> → clade_membership, and
    <OUTPUT_NODE_DATA> → branches → <node_name> → labels → clade.
    The previous commit modified augur export to handle this format.
    
    Augur pipelines should remain fully backwards compatible provided this
    change ships in a new major version of augur, since we ensure that
    node-data files are created by the same augur (major) version. Scripts
    which relied on the format of this node-data file may be affected.
    
    Note that we keep the key 'clade_membership' deliberately: this is
    used in auspice-config JSONs and auspice URLs, and so changing it will
    cause lots of downstream issues for a minimal syntax improvement. (The
    `clade_annotation` key name was never exported in auspice JSONs.)
    
    This commit paves the way for allowing custom key names.
    jameshadfield committed Apr 11, 2023
    5ba7cf1
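For scripts that relied on the previous layout, the format change could be mirrored with a small conversion along these lines. This is a sketch under the assumption that the node data has already been loaded into a dict; `upgrade_clade_node_data` is a hypothetical name and is not part of augur.

```python
def upgrade_clade_node_data(old):
    """Convert the old clades layout to the new one (hypothetical helper).

    Old: nodes -> <node_name> -> clade_annotation
    New: branches -> <node_name> -> labels -> clade
    The clade_membership key is kept under nodes unchanged.
    """
    new = {"nodes": {}, "branches": {}}
    for name, attrs in old.get("nodes", {}).items():
        # Everything except clade_annotation stays where it was.
        kept = {k: v for k, v in attrs.items() if k != "clade_annotation"}
        if kept:
            new["nodes"][name] = kept
        if "clade_annotation" in attrs:
            new["branches"][name] = {"labels": {"clade": attrs["clade_annotation"]}}
    return new


old_format = {
    "nodes": {
        "NODE_1": {"clade_membership": "A", "clade_annotation": "A"},
        "tip_1": {"clade_membership": "A"},
    }
}
print(upgrade_clade_node_data(old_format))
```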
  3. [clades] allow custom membership / label names

    These arguments shouldn't need to be used in most cases but are really
    useful for pipelines which run `augur clades` multiple times (e.g.
    nCoV's emerging lineages). This will allow _n_ node-data files to be
    passed to `augur export` with a resulting _n_ colorings and labels.
    (Currently you need multiple extra steps: the node-data JSON needs to
    have the key names changed, and then you need to manually set branch
    labels in the auspice JSON.)
    jameshadfield committed Apr 11, 2023
    fd88aa7
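To illustrate why custom key names help when clades are inferred more than once, the sketch below builds two node-data dicts with different key names and merges them without collisions. Both helpers, their parameters and the sample clade names are hypothetical; they only mimic the data layout described in these commits and do not reflect the actual command-line interface.

```python
def make_clade_node_data(membership, clade_roots,
                         membership_name="clade_membership", label_name="clade"):
    """Build a node-data dict with configurable key names (hypothetical helper)."""
    nodes = {name: {membership_name: clade} for name, clade in membership.items()}
    branches = {name: {"labels": {label_name: clade}}
                for name, clade in clade_roots.items()}
    return {"nodes": nodes, "branches": branches}


def merge_node_data(*datasets):
    """Merge several node-data dicts; distinct key names avoid overwriting."""
    merged = {"nodes": {}, "branches": {}}
    for data in datasets:
        for name, attrs in data["nodes"].items():
            merged["nodes"].setdefault(name, {}).update(attrs)
        for name, attrs in data["branches"].items():
            target = merged["branches"].setdefault(name, {}).setdefault("labels", {})
            target.update(attrs["labels"])
    return merged


clades = make_clade_node_data({"NODE_1": "20A", "tip_1": "20A"}, {"NODE_1": "20A"})
lineages = make_clade_node_data({"tip_1": "X.1"}, {"tip_1": "X.1"},
                                membership_name="emerging_lineage",
                                label_name="emerging_lineage")
merged = merge_node_data(clades, lineages)
print(merged["nodes"]["tip_1"])
```

With two runs, each node can carry two memberships and each branch two labels, which is the _n_ colorings and labels outcome described above.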
  4. 007cb47
  5. [clades] allow node-data nodes to be a subset of tree nodes

    Our current implementation of read_node_data requires that every
    node in the tree is specified in the (merged) node_data files. For
    mutations this is overkill: many nodes have no mutations, and
    node_data JSONs should not need to specify things like
    `"node_name": {"muts": []}`.
    
    This may well be the general behaviour we want, but I didn't want to
    modify the read_node_data function which sees extensive use.
    
    A welcome side effect of these changes is that we no longer have to
    supply both nuc and aa_muts.
    jameshadfield committed Apr 11, 2023
    4316a7d
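One way to picture the relaxed requirement is a lookup that falls back to empty mutations for any tree node absent from the node data. This is a minimal sketch with a hypothetical function name; the `muts` and `aa_muts` keys follow the commit message, and the choice of an empty list and an empty dict as defaults is an assumption.

```python
def mutations_per_node(tree_node_names, node_data_nodes):
    """Return mutations for every tree node, defaulting to none (hypothetical)."""
    result = {}
    for name in tree_node_names:
        # Nodes missing from the node data simply have no mutations.
        entry = node_data_nodes.get(name, {})
        result[name] = {
            "muts": entry.get("muts", []),        # nucleotide mutations, optional
            "aa_muts": entry.get("aa_muts", {}),  # amino-acid mutations, optional
        }
    return result


tree_nodes = ["root", "NODE_1", "tip_1"]
node_data_nodes = {"tip_1": {"muts": ["A10T"]}}
print(mutations_per_node(tree_nodes, node_data_nodes))
```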
  6. [clades] tests for clades set at the root node

    See comments in tests/functional/clades.t
    
    Also adds / updates comments and docstrings which were noticed as I
    worked through the code relating to these tests.
    jameshadfield committed Apr 11, 2023
    22e2444
  7. [clades] suppress unused --references arg

    Workflows may be using this so I elected to hide it rather than remove
    it (and warn people it's a no-op if they do happen to be using it)
    jameshadfield committed Apr 11, 2023
    0cb841d
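Hiding an argument while still accepting it is a standard argparse pattern, sketched below. The parser name, the warning text and the helper functions are illustrative assumptions and are not copied from augur; only the `--references` flag name comes from the commit title.

```python
import argparse
import warnings


def build_parser():
    parser = argparse.ArgumentParser(prog="clades-sketch")
    # help=argparse.SUPPRESS hides the option from --help output,
    # but it is still parsed, so existing workflows do not break.
    parser.add_argument("--references", help=argparse.SUPPRESS)
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    if args.references is not None:
        # Tell the user the option is a no-op rather than failing.
        warnings.warn("--references is unused and has no effect")
    return args


print(build_parser().format_help())
```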
  8. [clades] improve reference sequence parsing

    This function had a few subtle bugs in it which are fixed here, as well
    as improving the warning message to explain how this may affect clade
    inference.
    
    Note that the presence of sequences on nodes other than the root is
    not considered by augur clades.
    jameshadfield committed Apr 11, 2023
    0aaf6a7
  9. [clades] catch error where pos is beyond ref length

    We could check all of these up-front instead of exiting upon the first
    error, and such a check should be part of validation within augur
    clades, but this commit is a simple solution to fix a reported bug.
    
    Closes #965
    jameshadfield committed Apr 11, 2023
    a356a9e
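The kind of bounds check described here might look like the following sketch. The exception class, the function name and the assumption that positions are 1-based are all hypothetical, chosen for illustration rather than taken from augur's code.

```python
class CladeDefinitionError(Exception):
    """Raised when a clade definition refers to an invalid position."""


def reference_state(gene, position, ref_seqs):
    """Return the reference state at a 1-based position (assumed convention)."""
    ref = ref_seqs[gene]
    if position < 1 or position > len(ref):
        # Exit on the first bad position instead of failing with an IndexError.
        raise CladeDefinitionError(
            f"position {position} is outside the reference for {gene!r} "
            f"(length {len(ref)})"
        )
    return ref[position - 1]


ref_seqs = {"nuc": "ACGT"}
print(reference_state("nuc", 3, ref_seqs))
```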
  10. [clades] require required arguments

    Closes #1153
    jameshadfield committed Apr 11, 2023
    2c6b662
  11. [clades] warnings for unfound clades

    A fatal error is raised if no clades are defined, but if a clade is not
    found on the tree it's only a warning.
    Suggested in #735
    jameshadfield committed Apr 11, 2023
    40e549d
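The split between a fatal error and a warning could be expressed roughly as below. The function name, its arguments and the message wording are assumptions made for this sketch.

```python
import warnings


def report_unfound_clades(defined_clades, clades_on_tree):
    """Fail if no clades are defined; warn for each clade absent from the tree."""
    if not defined_clades:
        raise ValueError("no clades were defined")
    missing = sorted(set(defined_clades) - set(clades_on_tree))
    for clade in missing:
        # Not fatal: the remaining clades can still be assigned.
        warnings.warn(f"clade {clade!r} was not found on the tree")
    return missing


print(report_unfound_clades(["A", "B", "C"], ["A"]))
```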
  12. [clades] check for multiple mutations at same pos

    Multiple mutations at the same position on a single branch are now a
    fatal error. Previous behaviour was to overwrite such mutations when
    parsing. Suggested by #735.
    jameshadfield committed Apr 11, 2023
    e5cfc3a
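A duplicate-position check of the sort described in the last commit above can be sketched as follows. The function name and the assumed mutation string form (ancestral state, position, derived state, e.g. `A10T`) are illustrative; the actual parsing in augur may differ.

```python
def parse_branch_mutations(muts):
    """Parse strings like 'A10T' into {position: (ancestral, derived)}.

    Raises ValueError if two mutations on the same branch share a position,
    rather than silently letting the later one overwrite the earlier one.
    """
    parsed = {}
    for mut in muts:
        ancestral, position, derived = mut[0], int(mut[1:-1]), mut[-1]
        if position in parsed:
            raise ValueError(
                f"multiple mutations at position {position} on a single branch"
            )
        parsed[position] = (ancestral, derived)
    return parsed


print(parse_branch_mutations(["A10T", "C25G"]))
```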

Commits on May 4, 2023

  1. Merge pull request #1199 from nextstrain/clade-fixes

    Multiple improvements to augur clades
    jameshadfield authored May 4, 2023
    dd318ba