Merge pull request #1384 from nextstrain/export-additional-metadata
export v2: Add --metadata-columns option
joverlee521 authored Feb 8, 2024
2 parents 8678ae9 + 32c9740 commit e4353b4
Showing 5 changed files with 285 additions and 7 deletions.
6 changes: 4 additions & 2 deletions CHANGES.md
@@ -25,6 +25,7 @@

* `augur.io.read_metadata`: A new optional `dtype` argument allows custom data types for all columns. Automatic type inference still happens by default, so this is not a breaking change. [#1252][] (@victorlin)
* `augur.io.read_vcf` has been removed and usage replaced with TreeTime's function of the same name which has improved validation of the VCF file. [#1366][] (@jameshadfield)
* export v2: Add support to specify metadata columns to export without using them as colorings. This can be done with the `metadata_columns` property in the Auspice config JSON or via the `--metadata-columns` flag in the command line. [#1384][] (@joverlee521)

### Bug Fixes

@@ -34,13 +35,14 @@

[#1252]: https://github.com/nextstrain/augur/pull/1252
[#1366]: https://github.com/nextstrain/augur/pull/1366
[#1384]: https://github.com/nextstrain/augur/pull/1384
[#1400]: https://github.com/nextstrain/augur/pull/1400

## 24.0.0 (22 January 2024)

### Major Changes

* ancestral, translate: For VCF inputs please ensure you are using TreeTime 0.11.2 or later. A large number of bugfixes and improvements have been added in both Augur and TreeTime. [#1355][] and [TreeTime #263][] (@jameshadfield)
* ancestral, translate: For VCF inputs please ensure you are using TreeTime 0.11.2 or later. A large number of bugfixes and improvements have been added in both Augur and TreeTime. [#1355][] and [TreeTime #263][] (@jameshadfield)
* ancestral, translate: GenBank files now require the (GFF mandatory) source feature to be present. [#1351][] (@jameshadfield)
* ancestral, translate: For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required. [#1351][] (@jameshadfield)

@@ -57,7 +59,7 @@
* If a Gene/CDS in the GFF/GenBank file is unparsed we now print a warning.
* ancestral: For VCF alignments, a VCF output file is now only created when requested via `--output-vcf`. [#1344][] (@jameshadfield)
* ancestral: Improvements to command line arguments. [#1344][] (@jameshadfield)
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
* `--vcf-reference` and `--root-sequence` are now mutually exclusive.
* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
* utils::load_features: This function may now raise `AugurError`. [#1351][] (@jameshadfield)
7 changes: 7 additions & 0 deletions augur/data/schema-auspice-config-v2.json
@@ -260,6 +260,13 @@
}
}
},
"metadata_columns": {
"description": "Metadata TSV columns to export in addition to columns provided as colorings.",
"$comment": "These columns will not be used as coloring options in Auspice but will be visible in the tree.",
"type": "array",
"uniqueItems": true,
"items": {"type": "string"}
},
"extensions": {
"description": "Data to be passed through to the the resulting dataset JSON",
"$comment": "Any type is accepted"
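The `metadata_columns` schema entry above constrains the value to an array of unique strings. The following is a hand-written sketch of those three constraints, not how Augur validates a config; the helper name `check_metadata_columns` is hypothetical.

```python
def check_metadata_columns(config):
    """Return a list of problems with config["metadata_columns"] (empty if none)."""
    columns = config.get("metadata_columns", [])
    # "type": "array"
    if not isinstance(columns, list):
        return ["metadata_columns must be an array"]
    # "items": {"type": "string"}
    if not all(isinstance(col, str) for col in columns):
        return ["every item must be a string"]
    # "uniqueItems": true
    if len(set(columns)) != len(columns):
        return ["items must be unique"]
    return []


print(check_metadata_columns({"metadata_columns": ["field_A", "field_B"]}))
print(check_metadata_columns({"metadata_columns": ["field_A", "field_A"]}))
```

A config that omits the key passes, which matches the property being optional in the schema.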
44 changes: 39 additions & 5 deletions augur/export_v2.py
@@ -157,7 +157,7 @@ def convert_tree_to_json_structure(node, metadata, get_div, div=0):
Returns
-------
dict:
See schema-export-v2.json#/$defs/tree for full details.
See schema-export-v2.json#/$defs/tree for full details.
Node names are always set, and divergence is set if applicable
"""
node_struct = {'name': node.name, 'node_attrs': {}, 'branch_attrs': {}}
@@ -394,7 +394,7 @@ def _is_valid(coloring):
warn("[colorings] You asked for mutations (\"gt\"), but none are defined on the tree. They cannot be used as a coloring.")
return False
if key != "gt" and not trait_values:
warn("You asked for a color-by for trait '{}', but it has no values on the tree. It has been ignored.".format(key))
warn(f"Requested color-by field {key!r} does not exist and will not be used as a coloring or exported.")
return False
return True

@@ -734,7 +734,7 @@ def _recursively_set_data(node):
_recursively_set_data(data_json["tree"])


def set_node_attrs_on_tree(data_json, node_attrs):
def set_node_attrs_on_tree(data_json, node_attrs, additional_metadata_columns):
'''
Assign desired colorings, metadata etc to the `node_attrs` of nodes in the tree
@@ -743,10 +743,17 @@ def set_node_attrs_on_tree(data_json, node_attrs):
data_json : dict
node_attrs: dict
keys: strain names. values: dict with keys -> all available metadata (even "excluded" keys), values -> data (string / numeric / bool)
additional_metadata_columns: list
Requested additional metadata columns to export
'''

author_data = create_author_data(node_attrs)

    def _transfer_additional_metadata_columns(node, raw_data):
        for col in additional_metadata_columns:
            if is_valid(raw_data.get(col, None)):
                node["node_attrs"][col] = {"value": raw_data[col]}

def _transfer_vaccine_info(node, raw_data):
if raw_data.get("vaccine"):
node["node_attrs"]['vaccine'] = raw_data['vaccine']
@@ -798,6 +805,9 @@ def _transfer_author_data(node):
def _recursively_set_data(node):
# get all the available information for this particular node
raw_data = node_attrs[node["name"]]
# transfer requested metadata columns first so that the "special cases"
# below can overwrite them as necessary
_transfer_additional_metadata_columns(node, raw_data)
# transfer "special cases"
_transfer_vaccine_info(node, raw_data)
_transfer_hidden_flag(node, raw_data)
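The ordering in this hunk matters: requested columns are written first, so a later "special case" with the same key replaces them. A minimal sketch of that behaviour follows. It assumes a simplified `is_valid` (a hypothetical stand-in for Augur's helper of the same name) and collapses the transfer helpers into a single function called `transfer`, which is not part of Augur.

```python
def is_valid(value):
    # Hypothetical stand-in: treat None and blank strings as missing.
    return value is not None and str(value).strip() != ""


def transfer(node, raw_data, additional_metadata_columns):
    # 1. Requested metadata columns go in first, wrapped as {"value": ...}.
    for col in additional_metadata_columns:
        if is_valid(raw_data.get(col)):
            node["node_attrs"][col] = {"value": raw_data[col]}
    # 2. A "special case" runs afterwards and overwrites a same-named key.
    if raw_data.get("vaccine"):
        node["node_attrs"]["vaccine"] = raw_data["vaccine"]
    return node


node = {"name": "tipA", "node_attrs": {}, "branch_attrs": {}}
raw = {"field_A": "AA", "field_B": "", "vaccine": {"serum": True}}
transfer(node, raw, ["field_A", "field_B", "vaccine"])
print(node["node_attrs"])
```

Here `field_B` is skipped because its value is blank, and `vaccine` ends up with the special-case structure rather than the generic `{"value": ...}` wrapper.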
@@ -853,7 +863,7 @@ def register_parser(parent_subparsers):
required.add_argument('--tree','-t', metavar="newick", required=True, help="Phylogenetic tree, usually output from `augur refine`")
required.add_argument('--output', metavar="JSON", required=True, help="Output file (typically for visualisation in auspice)")

config = parser.add_argument_group(
config = parser.add_argument_group(
title="DISPLAY CONFIGURATION",
description="These control the display settings for auspice. \
You can supply a config JSON (which has all available options) or command line arguments (which are more limited but great to get started). \
@@ -866,6 +876,9 @@ def register_parser(parent_subparsers):
config.add_argument('--description', metavar="description.md", help="Markdown file with description of build and/or acknowledgements to be displayed by Auspice")
config.add_argument('--geo-resolutions', metavar="trait", nargs='+', help="Geographic traits to be displayed on map")
config.add_argument('--color-by-metadata', metavar="trait", nargs='+', help="Metadata columns to include as coloring options")
config.add_argument('--metadata-columns', nargs="+",
help="Metadata columns to export in addition to columns provided by --color-by-metadata or colorings in the Auspice configuration file. " +
"These columns will not be used as coloring options in Auspice but will be visible in the tree.")
config.add_argument('--panels', metavar="panels", nargs='+', choices=['tree', 'map', 'entropy', 'frequencies', 'measurements'], help="Restrict panel display in auspice. Options are %(choices)s. Ignore this option to display all available panels.")
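Because `--metadata-columns` is declared with `nargs="+"` and no default, `argparse` yields a list when the flag is given and `None` when it is omitted. The standalone sketch below shows this with a throwaway parser (the `prog` name is made up); it does not import Augur.

```python
import argparse

# Throwaway parser that mirrors only the two relevant options.
parser = argparse.ArgumentParser(prog="export-sketch")
parser.add_argument("--color-by-metadata", metavar="trait", nargs="+")
parser.add_argument("--metadata-columns", nargs="+")

given = parser.parse_args(["--metadata-columns", "field_A", "field_B"])
omitted = parser.parse_args([])

print(given.metadata_columns)
print(omitted.metadata_columns)
```

The `None` value is falsy, which is what lets a plain truthiness check fall back to the config file when the flag is absent.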

optional_inputs = parser.add_argument_group(
@@ -1096,6 +1109,26 @@ def get_config(args):
del config["vaccine_choices"]
return config


def get_additional_metadata_columns(config, command_line_metadata_columns, metadata_names):
    # Command line args override what is set in the config file
    if command_line_metadata_columns:
        potential_metadata_columns = command_line_metadata_columns
    else:
        potential_metadata_columns = config.get("metadata_columns", [])

    additional_metadata_columns = []
    for col in potential_metadata_columns:
        # Match the column names corrected within parse_node_data_and_metadata
        corrected_col = update_deprecated_names(col)
        if corrected_col not in metadata_names:
            warn(f"Requested metadata column {col!r} does not exist and will not be exported")
            continue
        additional_metadata_columns.append(corrected_col)

    return additional_metadata_columns
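To exercise the precedence rule outside Augur, the same logic can be restated with stand-ins. In this sketch `resolve_columns` is a hypothetical name, `update_deprecated_names` is replaced by an identity function (an assumption; Augur's version may rename some columns), and `warnings.warn` stands in for Augur's `warn`.

```python
import warnings


def update_deprecated_names(name):
    # Hypothetical stand-in: Augur's helper may rename deprecated columns.
    return name


def resolve_columns(config, cli_columns, metadata_names):
    # Command-line columns win; otherwise fall back to the config file.
    requested = cli_columns if cli_columns else config.get("metadata_columns", [])
    resolved = []
    for col in requested:
        corrected = update_deprecated_names(col)
        if corrected not in metadata_names:
            warnings.warn(f"Requested metadata column {col!r} does not exist and will not be exported")
            continue
        resolved.append(corrected)
    return resolved


names = {"strain", "field_A", "field_B"}
config = {"metadata_columns": ["overridden_field"]}

# The config value is ignored because command-line columns are present.
print(resolve_columns(config, ["field_A", "field_B"], names))
# With no command-line columns, the config is used and the unknown column is dropped.
print(resolve_columns(config, None, names))
```

Note that the override is all-or-nothing: the two sources are never merged.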


def run(args):
configure_warnings()
data_json = {"version": "v2", "meta": {"updated": time.strftime('%Y-%m-%d')}}
@@ -1141,6 +1174,7 @@ def run(args):
node_data, node_attrs, node_data_names, metadata_names, branch_attrs = \
parse_node_data_and_metadata(T, node_data_file, metadata_file)
config = get_config(args)
additional_metadata_columns = get_additional_metadata_columns(config, args.metadata_columns, metadata_names)

# set metadata data structures
set_title(data_json, config, args.title)
@@ -1169,7 +1203,7 @@ def run(args):

# set tree structure
data_json["tree"] = convert_tree_to_json_structure(T.root, node_attrs, node_div(T, node_attrs))
set_node_attrs_on_tree(data_json, node_attrs)
set_node_attrs_on_tree(data_json, node_attrs, additional_metadata_columns)
set_branch_attrs_on_tree(data_json, branch_attrs)

set_geo_resolutions(data_json, config, args.geo_resolutions, read_lat_longs(args.lat_longs), node_attrs)
114 changes: 114 additions & 0 deletions tests/functional/export_v2/cram/metadata-columns.t
@@ -0,0 +1,114 @@
Setup

$ source "$TESTDIR"/_setup.sh

Create files for testing.

$ cat >metadata.tsv <<~~
> strain field_A field_B
> tipA AA AAA
> tipB BB BBB
> tipC CC CCC
> tipD DD DDD
> tipE EE EEE
> tipF FF FFF
> ~~

$ cat >tree.nwk <<~~
> (tipA:1,(tipB:1,tipC:1)internalBC:2,(tipD:3,tipE:4,tipF:1)internalDEF:5)ROOT:0;
> ~~

$ cat >auspice-config.json <<~~
> {"metadata_columns": ["field_A", "field_B"]}
> ~~

$ cat >auspice-config-overridden.json <<~~
> {"metadata_columns": ["overridden_field"]}
> ~~

Run export with tree and metadata with additional columns.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --metadata-columns "field_A" "field_B" \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{}

Missing columns are skipped with a warning.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --metadata-columns "field_A" "field_B" "missing_field" \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null
WARNING: Requested metadata column 'missing_field' does not exist and will not be exported
\s{0} (re)

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{}

Specifying a field with both --metadata-columns and --color-by-metadata should result in the field being used as a coloring and a filter.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --metadata-columns "field_A" "field_B" \
> --color-by-metadata "field_B" \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{'iterable_item_added': {"root['meta']['colorings'][0]": {'key': 'field_B', 'title': 'field_B', 'type': 'categorical'}, "root['meta']['filters'][0]": 'field_B'}}

Missing columns are skipped with a warning when specified by both --metadata-columns and --color-by-metadata.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --metadata-columns "field_A" "field_B" "missing_field" \
> --color-by-metadata "missing_field" \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null
WARNING: Requested metadata column 'missing_field' does not exist and will not be exported
\s{0} (re)
WARNING: Requested color-by field 'missing_field' does not exist and will not be used as a coloring or exported.
\s{0} (re)

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{}

Specifying additional metadata columns via the Auspice configuration file.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --auspice-config auspice-config.json \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{}

Specifying additional metadata columns via command line overrides the Auspice configuration file.

$ ${AUGUR} export v2 \
> --tree tree.nwk \
> --metadata metadata.tsv \
> --auspice-config auspice-config-overridden.json \
> --metadata-columns "field_A" "field_B" \
> --maintainers "Nextstrain Team" \
> --output dataset.json > /dev/null

$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" "$TESTDIR/../data/dataset-with-additional-metadata-columns.json" dataset.json \
> --exclude-paths "root['meta']['updated']" "root['meta']['maintainers']"
{}
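The tests above compare whole JSON files. When inspecting a single exported column by hand, a small tree walk can be handier. The sketch below assumes the dataset tree nests nodes under `children` and stores exported columns as `node_attrs[column]["value"]`, as in the `export_v2.py` changes in this commit. The helper `collect_attr` and the sample dictionary are illustrative, not real Augur output.

```python
def collect_attr(node, key, found=None):
    """Walk a dataset tree and map node name -> node_attrs[key]["value"]."""
    found = {} if found is None else found
    attr = node.get("node_attrs", {}).get(key)
    if isinstance(attr, dict) and "value" in attr:
        found[node["name"]] = attr["value"]
    for child in node.get("children", []):
        collect_attr(child, key, found)
    return found


# Hand-written fragment in the shape of an exported dataset.
dataset = {"tree": {"name": "ROOT", "node_attrs": {}, "children": [
    {"name": "tipA", "node_attrs": {"field_A": {"value": "AA"}}},
    {"name": "internalBC", "node_attrs": {}, "children": [
        {"name": "tipB", "node_attrs": {"field_A": {"value": "BB"}}},
    ]},
]}}

print(collect_attr(dataset["tree"], "field_A"))
```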