Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect ancestors in feature metadata coloring/propagation #473

Closed
fedarko opened this issue Jan 19, 2021 · 2 comments · Fixed by #487
Closed

Respect ancestors in feature metadata coloring/propagation #473

fedarko opened this issue Jan 19, 2021 · 2 comments · Fixed by #487

Comments

@fedarko
Copy link
Collaborator

fedarko commented Jan 19, 2021

I brought this up in #471 (comment), but for the sake of everyone's sanity this should probably be its own issue. This isn't a very pressing issue (the only common case it should run into is stuff like s__ for Level 7), but there are corner-cases where real levels of taxa can have the same name but different ancestors. In these cases, we should assign these values distinct colors and not treat them as "equal" when doing propagation.

As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Level 7 (species) in a 16S dataset using the default QIIME color map, you'll probably see a lot of clades of the tree colored as red due to all of the tips in the clade sharing a species classification of s__, even if they're from different genera/families/etc:

yike

Addressing this would definitely be possible, by for example representing the values in each Level N string as the full taxonomy to that point (e.g. setting Level 7 to k__Bacteria; p__Firmicutes; c__Somecoolclass; o__Ogeezimrunningoutoftaxonomynamesiknow; f__Isanyonereadingthis; g__Himom; s__ instead of just s__) -- in some ways this is similar to a point @antgonza raised a few weeks ago in #422.

@ElDeveloper
Copy link
Member

Hierarchical metadata should certainly be treated differently. This makes sense although I think we'll need to "special case" this to columns labeled as Level * right? And also be very clear to the user about what's going on.

@fedarko
Copy link
Collaborator Author

fedarko commented Jan 19, 2021

Agreed. What should help a bit with the special casing is that the Level * columns are created by EMPress' Python code (in this module), so we should be able to pass some information about these columns specifically to the JS code saying something like "Hey this is a taxonomy metadata column, look at its ancestors". (I don't think it will be feasible to detect already-split-up taxonomy columns and try to disambiguate them -- that sounds like it would be challenging and prone to errors.)

One solution that should work, with the added benefit of only modifying the Python code: we automatically could go through the split-up taxonomy (produced by the code linked above) and identify all of the ambiguous values in each level (so both the s__ cases in Level 7, and also stuff like g__ in Level 6, Unspecified at any level, etc.) We should then be able to add text to these values (from the higher levels) to disambiguate these levels as needed: so, for example, for Level 7 this would probably involve prefixing each s__ with the respective genus label, producing things like g__Streptococcus; s__ as the metadata values where needed. This would cause distinct colors to be assigned to each taxonomy value, as well as avoid ambiguous values from being treated as equal during propagation. (This would increase the metadata size somewhat, but speedups like #337 should offset this at least partially.)

fedarko added a commit to fedarko/empress that referenced this issue Feb 6, 2021
Still need to do a lot, incl:
-Color "Other" values and show in legend. Will likely require
 modifying the Legend class to pass stuff into there.

-Don't sort legend by name -- instead, if the mostFreq checkbox is
 checked, the values should be in descending order of frequency.
 This way experimentally increasing the value of N will not change
 previously assigned colors. (Note that this'll cause the color
 assignments to differ with the "normal" color assignments in the
 case where the mostFreq checkbox is checked -- not really a way i
 see to satisfy both cases at once, so eh.)

-Make Continuous and mostFreq mutually exclusive. Checking one
 checkbox should uncheck the other, and if both are somehow true then
 Empress.js should throw an error.

-Apply this stuff to feature metadata coloring (just for the tree)?

-I should independently take care of biocore#473, which will prevent s__,
 Unspecified, etc. from being "common" values by themselves.
fedarko added a commit to fedarko/empress that referenced this issue Feb 6, 2021
fedarko added a commit to fedarko/empress that referenced this issue Feb 6, 2021
fedarko added a commit to fedarko/empress that referenced this issue Feb 6, 2021
The specific error code was E501. Annoyingly, ignoring it in the
Makefile's invocation of flake8 triggered MORE errors of a different
type for some reason lmao.

ANYWAY this should close biocore#473. There is still the question of if
the space tradeoff is worth it or not -- but at the very least this
functionality is ready.
fedarko added a commit to fedarko/empress that referenced this issue Feb 10, 2021
See docs enclosed for discussion vs. bool flag approach
fedarko added a commit to fedarko/empress that referenced this issue Feb 17, 2021
Closes biocore#473 and closes biocore#422.

Avoids having to store this stuff in the QZV, which will save
a lot of space. This does still involve storing the "expanded"
taxonomy for a given level all at once in memory, though -- I do
not think this will be horrendous, since it is not that much extra
data all things considered.
fedarko added a commit to fedarko/empress that referenced this issue Feb 17, 2021
kwcantrell pushed a commit that referenced this issue Apr 8, 2021
… ancestor info in the JS interface) (#487)

* DOC: various match_inputs() comments/doc fixes

* ENH: Pass ordered tax col names to Empress JS #473

See docs enclosed for discussion vs. bool flag approach

* ENH: Implement ancestor fm retrieval in JS

Closes #473 and closes #422.

Avoids having to store this stuff in the QZV, which will save
a lot of space. This does still involve storing the "expanded"
taxonomy for a given level all at once in memory, though -- I do
not think this will be horrendous, since it is not that much extra
data all things considered.

* STY: prettify

* MNT: abstract fm retrieval func generation

still gotta test (manually then automatically) tho...

* TST: unbreak js tests

* TST: unbreak some of the python tests

not done yet

* TST: unbreak all python tests

* TST: abstract python taxcol checking

* STY: fix flake8 long lines

* STY: more flake8 fixes

* TST: test #473 special fm stuff in JS

* MNT: pass length fm retrieval through new funcs

for sake of future consistency. really shouldn't change anything,
behavior- or performance-wise

* BUG: Change max width to be in vw units, not vh

makes this slightly larger, i think

* PERF/DOC: Add note abt repeated work - fm barplots
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment