-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487
Conversation
See docs enclosed for discussion vs. bool flag approach
Closes biocore#473 and closes biocore#422. Avoids having to store this stuff in the QZV, which will save a lot of space. This does still involve storing the "expanded" taxonomy for a given level all at once in memory, though -- I do not think this will be horrendous, since it is not that much extra data all things considered.
still gotta test (manually then automatically) tho...
not done yet
for sake of future consistency. really shouldn't change anything, behavior- or performance-wise
makes this slightly larger, i think
One thing worth noting: the taxonomy information shown for biplot arrows is still stored as before, so now the two representations are different: I don't think this is a big problem, esp. since the colors were already different between biplot arrows and EMPress' assigned colors (since biplot arrows are [usually] a subset of the tips in the tree), but it may be worth trying to address this in a future issue or something (maybe doing the unwieldy Python metadata operation in #482, but just for the biplot arrow taxonomies...?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @fedarko! I definitely like concept of the PR, I just have a couple questions regarding performance.
for (var i = 0; i < taxIdx; i++) { | ||
var currTaxCol = this._splitTaxonomyColumns[i]; | ||
var currTaxColFMIdx = _.indexOf( | ||
this._featureMetadataColumns, | ||
currTaxCol | ||
); | ||
ancestorFMIndices.push(currTaxColFMIdx); | ||
} | ||
// We already know the index of the column we end at, so just put | ||
// it here at the end manually. (Saving this extra work probably | ||
// won't make an appreciable time difference, but it feels nice :) | ||
ancestorFMIndices.push(fmIdx); | ||
return function (fmRow) { | ||
var totalFMVal = ""; | ||
_.each(ancestorFMIndices, function (ancestorFMIdx, ii) { | ||
// Separate adjacent levels in the resulting f.m. value | ||
// shown: e.g. "k__Bacteria; p__Cyanobacteria" | ||
if (ii > 0) { | ||
totalFMVal += "; "; | ||
} | ||
totalFMVal += fmRow[ancestorFMIdx]; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of wasted computation (i.e. worst case, for each node we have to iterate though the feautreMetadataColumn to find the location of each individual level, making this an O(n*L^2) where n = # nodes and L = # levels).
Since the taxonomy columns are in order, would we be able to just find the location of the first tax column and the location of the level we want to show? For example, the pseudo-code to find the Level 5 would look something like:
level1Idx = location of Level 1 in this._featureMetadata
fmIdx = location of Level 5 in this._featureMetadata
for ii=level1Idx; i <= fmIdx
----if (ii != level1Idx)
-------- totalFMVal += "; "
----totalFMVal += fmRow[ii]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worst case, for each node we have to iterate though the feautreMetadataColumn to find the location of each individual level, making this an O(n*L^2) where n = # nodes and L = # levels
I am not sure about this: the function returned here is part of a closure which includes ancestorFMIndices
, and this returned function is what is called on every node (more precisely, for every row in the feature metadata being considered). So finding the location of the level columns in the feature metadata should only be done once*.
The only work that is repeated for each row in the feature metadata is the stuff in function (fmRow) { ... }
, which only involves retrieving L specific elements from an array (and we already know the indices of these elements, so each retrieval is ~O(1) -- so the total is O(nL), I think?)
Since the taxonomy columns are in order, would we be able to just find the location of the first tax column and the location of the level we want to show?
Yeah, this would be doable. I think your point is more that the _.each()
(as well as assembling ancestorFMIndices
in the first place) is inefficient, compared to a basic for loop like the one you showed? I guess so -- the reason for doing things this way was to avoid relying on the assumption that split-up taxonomy columns (Level 1, Level 2, ...) will always occur one after another in the feature metadata columns. This is currently always the case, but I guess it could change in the future (hence this function being written in such a roundabout way).
If you'd like I can rewrite things so that the assumption of split-up levels being in order one after another is explicitly tested in Python, and relied on here in JS. Does this sound good?
* "Once" for every coloring operation using feature metadata, at least. Now that I think about it we could probably store an object mapping split taxonomy column names to indices in the feature metadata columns and avoid all the indexOf
s, but I'm not convinced this will save much time for most datasets with a normal amount of taxonomy levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this would be doable. I think your point is more that the _.each() (as well as assembling ancestorFMIndices in the first place) is inefficient, compared to a basic for loop like the one you showed? I guess so -- the reason for doing things this way was to avoid relying on the assumption that split-up taxonomy columns (Level 1, Level 2, ...) will always occur one after another in the feature metadata columns. This is currently always the case, but I guess it could change in the future (hence this function being written in such a roundabout way).
I see what you mean. Your current method is definitely future proof and relies on less assumptions which would make it easy for future developers to understand + make the code base more stable if we were to change things around. I wouldn't worry about changing anything unless we find out that this does greatly impact performance then we can start think about different ways to quickly locate the location of each level without having to iterate.
// NOTE that, for drawing a barplot layer representing a taxonomy | ||
// column, we are essentially doing the work here of "reassembling" | ||
// the ancestor taxonomy with the child taxonomy info twice -- once | ||
// when we call this.getUniqueFeatureMetadataInfo() above, and again | ||
// as we go through this loop and look at each tip (this is done for | ||
// each tip by the "retrieval function" mentioned below). | ||
// | ||
// It would be ideal to use the mapping information returned by | ||
// this.getUniqueFeatureMetadataInfo() to avoid having to repeat this | ||
// work, although this would likely require restructuring the rest | ||
// of this function -- might be too much work for its own good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fedarko by chance did you profile this on a large tree? I worry that this might impact performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet, but I agree that that is a well-founded concern. I guess checking this with the EMP tree used in the paper would make sense -- @ElDeveloper, do you have any other suggestions for good benchmark datasets?
Looking at the Fig. 1 notebook for the paper, it seems like the taxonomy is already split up in the EMP BIOM table -- I guess the proper way to benchmark this would be:
-
Merge the various taxonomy levels in the feature metadata file produced by the Fig. 1 notebook into a single
Taxonomy
column in a new feature metadata file, which EMPress' python code will then be able to automatically detect and split up -
Run
empress tree-plot -t [newick file] -fm [feature metadata file generated above]
on two versions of EMPress:- The version of EMPress currently on the master branch
- The version of EMPress in this PR
Then, for each approach, try using feature metadata coloring to color the tree by each level in the taxonomy (1, 2, 3, 4, 5, 6, 7), and benchmark how long each level takes in each visualization (a total of 14 measurements). The goal is that this PR's EMPress version shouldn't take dramatically longer, I guess.
That being said, for what it's worth - using the Fig. 1 EMPire plot QZV, I can already see that when coloring by Level 7 (species) the majority of tips are colored either the color of s__
or Unspecified
. This PR will break these all up into separate categories based on their ancestral taxonomy, which will degrade performance due to adding more stuff to the legend / Colorer / etc., but I think this tradeoff is worth it. (Maybe we could add an option to the settings to disable this if users want?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EMP data should be good to test with.
Come to think about it. Do we even need to split the taxonomy in Python anymore? I seem to recall the only reason we did this was to show the feature metadata summaries in the Node Info Pop-up (IIRC). Can we do that on the fly or do we have any other use for the split levels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come to think about it. Do we even need to split the taxonomy in Python anymore? I seem to recall the only reason we did this was to show the feature metadata summaries in the Node Info Pop-up (IIRC). Can we do that on the fly or do we have any other use for the split levels?
I don't think the pop-up was the only reason we added splitting -- it just generally makes life easier. I suspect that doing splitting on the fly might be more trouble than it is worth. Currently the splitting operation in Python is done using pandas vectorization, and I'm not sure if doing this quickly + on the fly in JS will be feasible.
Pre-splitting the taxonomy, like we do currently, has the advantage that it stores the taxonomic levels in separate "columns" of the feature metadata, so reconstructing these is just a matter of gluing together some of these columns for each metadata row as needed -- to me it seems both conceptually easier and more efficient than having to split up the taxonomy on the fly, which seems harder (since we don't know in advance where the ;
characters are for a given taxonomy string -- we could probably use some fancy heuristics to speed that up, but again might be more trouble than it's worth).
That said, there are two main benefits I see to moving more of this to JS. One is that not doing splitting in Python would allow us to avoid the zillion Unspecified
string values being stored in QZVs -- however, this should be (mostly) addressed by #337. The other benefit I see is that adding "splitting" utilities to the JS will make it easier to let users control this behavior themselves in the app -- for example if their taxonomy column is named something weird and wasn't treated as an actual taxonomy column by the current Python code. I think it'd make sense to defer that sort of work to the future, since it will likely require some significant restructuring of the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, that makes sense. Thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmarking is done! Ran both this PR's version of EMPress and the master version of EMPress on the EMP EMPire plot QZV (Fig. 1A) from the paper. This was done using my laptop with 8 GB of RAM, so fancier computers will probably be much faster (it already takes its sweet time just loading the EMP dataset).
Both versions of EMPress were provided a modified EMP taxonomy file that was created by just merging all 7 columns (so that it would be handled specially as a "Taxonomy column" -- only really makes a difference for this PR), and were run using the circular layout and otherwise default settings as far as I can tell. As a disclaimer, some of these measurements were taken using my phone's stopwatch and some were done by just eyeballing my laptop clock, and I only took each measurement once (...so as with most things in life we would probably need a larger sample size).
Using the master branch of EMPress (as of writing)
Feature metadata coloring
Starting at Level 4, I started getting the standard "page unresponsive" dialog, which I either clicked "Wait" on or ignored.
Feature metadata column | Time taken (seconds; approximate) |
---|---|
Level 1 | 4.73 seconds |
Level 2 | 10.83 seconds |
Level 3 | 22.44 seconds |
Level 4 | 34 seconds |
Level 5 | 42 seconds |
Level 6 | 1 minute and 38 seconds |
Level 7 | 52 seconds (???) |
Level 7 screenshot
To prove that my computer didn't explode ;)
I suspect the reason coloring by Level 7 went comparatively quickly is that a ton of the tips here (145,171 / 317,314 [45.7%] of the entries in the EMP taxonomy file) have either s__
or Unspecified
as their species-level classification; this is why the tree looks overwhelmingly either red (the color that got assigned for s__
) or pale greyish (the color that got assigned for Unspecified
). As mentioned earlier, although this does make drawing faster, it is not a good thing for interpretation (intuitively, the tree shouldn't look so homogenous -- all of these s__
s span a really wide portion of the taxonomy, so they shouldn't all have the same color / label).
Using this PR's version of EMPress (up to date with the master branch as of writing)
Feature metadata coloring
The "page unresponsive" dialogs started coming up at Level 3 (rather than Level 4) here.
Feature metadata column | Time taken (seconds; approximate) | Extra time compared with master branch |
---|---|---|
Level 1 | 5 seconds | 0.27 seconds |
Level 2 | 14 seconds | 3.17 seconds |
Level 3 | 50 seconds | 27.56 seconds |
Level 4 | 1 minute and 40 seconds | 1 minute and 6 seconds |
Level 5 | 2 minutes and 6 seconds | 1 minute and 24 seconds |
Level 6 | 4 minutes and 37 seconds | 2 minutes and 59 seconds |
Level 7 | 5 minutes and 58 seconds | 5 minutes and 6 seconds |
Level 7 screenshot
The reason the tree has a lot of light blue is that this is the color assigned to the feature metadata string k__Bacteria; Unspecified; Unspecified; Unspecified; Unspecified; Unspecified; Unspecified
(i.e. for ASVs where the original taxonomy string was just "k__Bacteria"
-- there aren't any levels beyond "kingdom" given). What is nice about this PR is we can be more sure about this -- and also we can see flashes of other colors interspersed with the blue, since all of the s__
taxa are no longer colored identically.
Discussion
As expected, this PR slows down coloring operations using taxonomy columns, especially for more specific taxonomic levels. However, I'm happy that it at least works on my laptop with the full EMP dataset -- IMO this dataset is right around the upper limit of what EMPress can handle right now, at least including all of the extra data (sample and feature metadata, BIOM table, PCoA) that is being loaded alongside the bare tree structure.
I guess the key question is are these slowdowns worth it? It is not entirely my call, but I vote yes -- I think the interpretation / safety benefits (of avoiding the awkward situations described in #473) vastly outweigh the slowdowns (which should only be a big deal for really large datasets like this one?). I'd argue that although coloring by e.g. Level 7 is much faster in the master branch, the resulting visualization is not very useful.
There are lots of ways we can counteract the slowdowns -- #337 and #371 are both issues I have prototypes of, and I want to get to when I have some time. We could also maybe add an option to the settings that toggles this behavior, so that if users want the original behavior (faster coloring operations at the cost of collapsing certain unrelated taxonomies) they can get that.
...Anyway, please let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fedarko for bench marking this! I like your idea of providing options to toggle between this method and the current method.
I think what we can do, at least for now, is set this method as our default. However, I think the majority of users would not think about going into the settings menu on their own to look for a Fast Taxonomy Method
option which may lead to a lot of frustration if it takes 1+ minutes to color by taxonomy. So, it would be nice if we provide some sort of warning to users to let them it may take a while to process and that they can change to old faster method. Or better yet, in the feature/barplot panels have a Fast Taxonomy Method
checkbox appear whenever a taxonomy column is selected (similar to how the continuous
checkbox is handled). Essentially have some mechanism in place that makes it obvious that there is a quick and dirty method to color by taxonomy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the key question is are these slowdowns worth it?
My take is that they are worth it. This is the correct way to group features and color the tree. And also, this is the strategy used in QIIME2's taxonomy barplots. Having inconsistent ways to display data within a platform is something we should always try to avoid.
My opinion would be to not focus on adding a "fast taxonomy UI component" but instead to focus on fixing #371. Because that is likely the BEST solution to this problem. By default we color the top N most frequent groups and if the user wants they can change the value of N. N can be represented as a slider in the feature coloring tab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @kwcantrell and @ElDeveloper!
I agree that focusing on #371 might make sense, at least right now. Something that might complement #371 well is -- in Python -- figuring out the number of unique values in each feature / sample metadata field (or at least seeing if a given field has above a certain number of unique values), and passing these values to the HTML. Then, we can use these values to adjust the way the various coloring options' "defaults" show up in the UI.
I guess, less from a performance standpoint and more from a aesthetic one, some users might prefer to see just the values for a specific taxonomy column (e.g. only seeing phylum names without the k__
stuff unless they actually need it to disambiguate things) -- I have opened a new issue (#489) to discuss this, since I think accommodating this in a reasonable way will be nontrivial.
Is this PR ok to merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me. I will go ahead an merge it. Quick note on #371, everything will need to be done in js since python will be unable to account for tree shearing.
…473-but-not-slow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fedarko. @kwcantrell feel free to merge if everything looks good to you.
@fedarko, could you resolve conflicts? @kwcantrell can you take another look at this PR and merge if it looks good to you? |
…473-but-not-slow
Thanks, just resolved the conflict; PR should be ok to merge pending approval from @kwcantrell. After barplot clicking, the next thing I have on my plate for EMPress is #371 (drawing only the top K most frequent values in a field) -- as Yoshiki suggested, I think that will work really well with this PR's changes. |
The following artifacts were built for this PR: empire-biplot.qzv, empire.qzv, empress-tree.qzv, just-fm.qzv, plain.qzv |
Thanks so much @fedarko!
…On (Mar-30-21|14:30), Marcus Fedarko wrote:
Thanks, just resolved the conflict; PR should be ok to merge pending approval from @kwcantrell.
After barplot clicking, the next thing I have on my plate for EMPress is #371 (drawing only the top K most frequent values in a field) -- as Yoshiki suggested, I think that will work really well with this PR's changes.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://urldefense.com/v3/__https://github.com/biocore/empress/pull/487*issuecomment-810590583__;Iw!!Mih3wA!QbAwxX-sKZ5BEx4jz39Guti_-OzcXaDrco44iIpdXgygHpK6VEcEgFTOnABfMAM$
|
@kwcantrell any chance you can look at this PR, it's pending your approval so feel free to merge after you approve. 👍 |
Thanks so much @fedarko and @kwcantrell ! |
This is an alternate version of #482 that avoids dramatically increasing the size of the taxonomy information stored in the QZV -- instead, each entry in the feature metadata is "expanded" as needed in the JS. This means that in the worst case (visualizing the lowest level of a taxonomy, e.g. "Level 7" / species) we just need to stitch back together the original taxonomy for every tip described in the feature metadata (this process could be made more efficient -- see comment added in a95b652).
This works by passing a list of the "split taxonomy columns", in descending rank order (i.e.
Level 1
,Level 2
, ...) from the Python to the JS code, which (in certain contexts) uses this list to determine how much if any ancestor data to get for a given feature metadata column. Doing things in this way may make life easier in the future, since it'll let us name the taxonomy columns (after creating them) whatever we want, so we can do things like use 0-indexing / use a different term thanLevel
/ etc. if desired without breaking this functionality.As with #482, this closes #473 and closes #422.