Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design of Phylo #14

Closed
mkborregaard opened this issue Sep 11, 2018 · 30 comments
Closed

design of Phylo #14

mkborregaard opened this issue Sep 11, 2018 · 30 comments

Comments

@mkborregaard
Copy link
Member

mkborregaard commented Sep 11, 2018

Here are some comments on Phylo that came up while doing the plotting code, to start a conversation.

First, and most importantly, the duality between nodes and nodenames is quite confusing and difficult to navigate, and 90% of my coding troubles have stemmed from difficulties in understanding when you'd need a name and when you'd need the actual node. (Like, to get the root node I'd have to go first(nodenamefilter(isroot, tree)) :-O). Also a lot of your interface functions seem geared for navigating just this, and could be avoided. And using the node names seems perfectly superfluous. I think your idea of reimplementing them as a linked list is a very big improvement, but I suspect using MetaGraphs might be even better.

This would also bridge the gap to Phylogenies in Bio.jl which is built on LightGraphs, in fact we could probably use that package to replace the "guts" of Phylo, resulting in a new, stronger, hybrid to rule the juliaverse. And still keep all parts of your interface that you do want to keep. The fact that they are largely complementary should help.

I wouldn't mind helping with this.

The other issue is that the node iterators seem to not follow the traditional "depth-first" or "breath-first" patterns from the root to the tips, instead they just seem to start from some tip and go on! Algorithmically there are some great advantages to using those patterns, and it's less confusing. But I may be wrong here, maybe I'm confused. Oh and they're already implemented in MetaGraphs.

Finally I've put some recursive root-to-tips and tips-to-roots functions in the code, which are really useful for doing certain things.

@mkborregaard
Copy link
Member Author

Profiling the plotting calls shows most of the time is spent in nodeiter and getchildren, i.e. Phylo's iteration code.

@richardreeve
Copy link
Member

No surprise there. As I said before, none of this was designed at all, much less designed for speed or elegance... I just needed something quickly once I'd given up on the other packages and I knew nothing about phylogenetics - hence not addressing all of these traversal issues either - it's just using the order from the Dict. I'm happy to go with the MetaGraph stuff to fix this if it works out.

A minor point though - as far as first(nodenamefilter(isroot, tree)) is concerned, for me it's just nodenamefilter(isroot, tree) - there's no reason for there only to be one root in a tree. If the tips form two independent trees, that's fine by me... I don't know if this makes sense to phylogeneticists, but it's necessary for me as I sometimes need to do surgery on my trees :)

Anyway, making this a linked list or using MetaGraphs will almost certainly make a lot of these speed and name vs node problems go away. I'm completely comfortable with that, though the (obviously unsuccessful!) intention was that it should be easy because you could always use (tree, name, ...), although where possible you could call (node, ...) to speed things up. The linked list would get rid of that ambiguity so you could always use either (or just stick with node) - though I'm not sure (because I haven't looked yet) how metadata is stored - that may go back to requiring the nodenames unless the information is stored locally in the nodes... I'll have another look at Phylogenies and MetaGraphs tomorrow.

@richardreeve
Copy link
Member

PS Thanks a lot for all of these comments by the way - they’re very useful. No-one (including me!) has cared about how this code worked before, so it’s just sat there looking ugly and unloved!

@mkborregaard
Copy link
Member Author

I don't know if this makes sense to phylogeneticists

It really doesn't 😂 I bet there's a better way of expressing that.

To me the great epiphany (and I may be wrong so I'm looking forward to your opinion) is that MetaGraphs would allow to merge Phylogenies into this package. Given that we can most likely ignore PhyloNetworks and PhyloTrees, this means that there could be scope for one unified concrete implementation of phylogenies in Julia. With the strong visual identity made possible by the plotting I am adding, this could be the basis of a Software Note paper that might become very highly cited.

@richardreeve
Copy link
Member

Okay... but why does R allow this then?

> library(ape)
> read.tree(text="(((a,b),(c,d)),(e,f));")

Phylogenetic tree with 6 tips and 5 internal nodes.

Tip labels:
[1] "a" "b" "c" "d" "e" "f"

Rooted; no branch lengths.
> read.tree(text="((a,b),(c,d)),(e,f);")

Phylogenetic tree with 6 tips and 4 internal nodes.

Tip labels:
[1] "a" "b" "c" "d" "e" "f"

Rooted; no branch lengths.
> read.tree(text="(a,b),(c,d),(e,f);")

Phylogenetic tree with 6 tips and 3 internal nodes.

Tip labels:
[1] "a" "b" "c" "d" "e" "f"

Rooted; no branch lengths.

I think what these newick strings represent is one, two and three trees inside a single file - as evidenced by the decreasing number of internal nodes in the later trees. Is that right? Cutting trees at a certain time in the past (because history back to the overall root is meaningless) is a useful thing for me to be able to do (and googling, people do it using cutree in R), and this may require multiple trees in order to represent all of the tips. Is this such a strange thing to want to represent in one object?

@richardreeve
Copy link
Member

Okay, it seems pretty clear that trees should be or use some kind of AbstractMetaGraph to allow all of the functionality that provides. I've played around a bit with a MetaTree as a container for a MetaGraph, as a copy of MetaGraph containing a different SimpleGraph that allows non-integer labels, as a completely new AbstractMetaGraph subtype, and a few other things. None of them have quite got off the drawing board, but I'll keep playing. It's annoying that AbstractMetaGraph isn't really very abstract at all - it makes lots of assumptions about the fields of its subtypes - but at the same time MetaGraph assumes that all node names are integers...

@mkborregaard
Copy link
Member Author

Okay... but why does R allow this then?

Because R sucks? No, seriously, cutting trees is important, but I think it's much clearer if one tree corresponds to one object. I guess you might have a BagOfNodes type which is just a lot of nodes and connections, but that's not a tree IMHO. Also, as your example shows, trees may have a core property of being rooted or unrooted, but what if some of the trees in an object are rooted?

The cutree function does in fact not return a single tree object with multiple trees in it, it returns an index vector into the tree - which is the classical R way of doing anything so as to avoid too much copying of objects. Similarly you could imagine, in Julia, views into a tree, implemented essentially as a range with multiple stopping nodes on the tree.

In Ben Ward's original linked list implementation of Phylogenies, he'd solved that elegantly by a Phylogeny object simply being a wrapper around the link to the root node. You could then create a new subtending Phylogeny, with the same nodes, by creating a Phylogeny pointing at a different root node. That makes for very elegant tree surgery, and always gives you an easy handle on the root, which is the natural starting point of all tree iteration. I don't know for sure that this approach could be extended to a MetaGraphs approach if that's what we choose to go with.

@mkborregaard
Copy link
Member Author

Sounds interesting with your experiments with MetaGraphs. But I don't understand

but at the same time MetaGraph assumes that all node names are integers...

MetaGraph should allow you to use any kind of name object for indexing, no? Or am I not understanding what you mean?

@mkborregaard
Copy link
Member Author

mkborregaard commented Sep 13, 2018

Maybe this comment is relevant: https://discourse.julialang.org/t/lightgraphs-vs-simplegraphs/14643/4

@richardreeve
Copy link
Member

As you've seen, the underlying problem is the AbstractSimpleGraph, which requires integer (and possibly contiguous?) labels, and the fact that MetaGraphs uses a SimpleGraph as its underlying graph type. Your thread does link to a possibility but SimpleGraphs is dead (and not in the JuliaGraphs family) and the other proposed solution has problems of its own - just getting information about a single node becomes a search over all of the values in a Dict, which makes my current code look positively efficient... though maybe you're talking about the fact that set_indexing_prop! creates an index on the values (and so essentially creates a bijective map), which is a kind of solution. But then we're back to something very similar to R's depressing representation of a tree as an n x 2 matrix of connections between nodes and then a huge number of additional coping strategies to handle the fact that nobody cares about these numbers. I thought when we chatted about this at JuliaCon that was exactly what we were trying to avoid?

I'm increasingly thinking about creating a new AbstractGraph or AbstractMetaGraph type unless we want to go down this R-like route.

@richardreeve
Copy link
Member

As far as your comment on legal trees is concerned, I think your solution means we can never build a tree from the ground up using the tree structure. If we start with the tips on their own, it's already illegal and can't be manipulated (or even stored at all from what you said?) because there's more than one root, and the pointer is only to the root apparently. If we delete a branch, we have two trees and it's illegal. If we add a tip, ditto. I don't find that an elegant solution tbh unless I'm misunderstanding something important.

None of these problems exist for me (though lots of others do, don't get me wrong!) - I can store intermediate groups-of-subtrees in my tree, and that's a useful thing to be able to do. Using an underlying graph package obviously easily allows us to hold multiple sub-graphs in a single object, so nothing will change there. There are also obvious advantages to having them all together - apart from building and manipulating trees using the tree structure we create, I have datasets where I want to be able to represent all of my species, but I don't have a supertree, only a series of trees at the family or order level, for instance. They should be stored and plotted and generally represented together. You can always make a strict tree type like my BinaryTree that is fussy about this.

Incidentally, this is quite distinct from my TreeSets, where I was enforcing the tips being the same, and the metadata about the tips (not the rest) being shared to save space - they are really intended to store samples from the posterior of trees or trees for different chromosomes / genes / whatever of the samples.

Fundamentally, I don't understand what the problem is with having multiple subtrees in a single tree object?

PS I would enforce no common node names, and define the tree object as being rooted only if all subtrees were rooted I imagine, but those are pretty minor things.

@mkborregaard
Copy link
Member Author

mkborregaard commented Sep 13, 2018

Hm, that sounds like MetaGraphs don't quite live up to my expectation. Maybe the simple linked-list design you suggested, with metadata stored in a dict, might be easier to deal with after all. I am not 100% sure why @benjward abandoned that approach over using LightGraphs but probably for the same reasons that leads me to think it might be a good idea.

@richardreeve
Copy link
Member

The good news is that so long as I implement the AbstractGraph interface, we get all of the goodness (if not the speed) anyway...

@mkborregaard
Copy link
Member Author

mkborregaard commented Sep 13, 2018

It's frankly hard to imagine anything being faster than a linked Node list (using StaticArrays and dropping the notion of "branch") with a Dict{String, Node} lookup for the name anyway I guess.

@jangevaare
Copy link

jangevaare commented Sep 13, 2018

It's frankly hard to imagine anything being faster than a linked Node list (using StaticArrays and dropping the notion of "branch") with a Dict{String, Node} lookup for the name anyway I guess.

Based on the diminishing returns on large arrays with using StaticArrays, and the growing returns of some kind of SparseArray (and it being a stdlib module), may be best to explore that direction instead.

Dicts are okay for node lookup by name, but a formal bidirectional map would be even better. Not sure what exists in Julia for that... maybe there isn't a better way of doing it than just two dicts? (UniqueVectors.jl is probably worth looking at).

@mkborregaard
Copy link
Member Author

Well I mean that the connections of a node should be in a StaticArray, they are going to be very small, mostly just n = 2.

@mkborregaard
Copy link
Member Author

@richardreeve you'd just start the tree with an arbitrary node. Trees need not be rooted, and in fact in many tree-building algorithms you build an unrooted tree and just specify a node as the root at the very end (or don't, if you don't have a proper outgroup). That's not doable in the current design, because your division into "outbounds" and "inbounds" explicitly assumes that there's a root on the tree. In fact, an "inbound" node is just one that will be traversed earlier in a traversal starting at the root.

I agree that a BagOfNodes is distinct from a TreeSet - I like the latter type a lot. But to me, having an AbstractTree be a bag of nodes that may or may not be connected really doesn't feel right.

@richardreeve
Copy link
Member

Okay. In any event, I don't think this is any kind of dealbreaker, because we can just have two types of AbstractTree (well, three, because TreeSet is also an AbstractTree at the moment), one with exactly one tree in it, and one with one or more (BagOfTrees may be too frivolous - or maybe not? - but along those lines).

As far as the inbound and outbound business is concerned, every other in-memory and on-disk tree implementation I looked at had at least implicit directionality in its edges (though I only looked at half a dozen or so) - the whole newick/nexus format is basically explicitly rooted for instance. I couldn't really understand how that fitted in with the concept of unrootedness, but I just shrugged and copied it since this isn't my research area. If you don't have directionality in the edges for unrooted trees, you essentially need to have two different tree formats (for rooted and unrooted), or you can't work out locally from a node of a rooted tree which its descendants are without global knowledge of the tree structure, surely?

@TransGirlCodes
Copy link

TransGirlCodes commented Sep 14, 2018

I'm all for preserving and being compatible with the AbstractGraph interface, i.e. AbstractPhylogeny <: AbstractGraph, and then defining specific phylogeny structs (maybe even more than one for different use cases).

Regarding tree editing and intermediate invalid trees, what we have in assembly graphs at the moment is similar - an assembly has to start building from some point. So we allow for invalid structures to occur, but when plugging in such a graph into some analysis function that assumes the thing is valid, then a sanity check gets performed.

The reason I stopped using a linked list in the end was the allocation of gc'd objects, and then the indirections when traversing the linked list.
My go to structure now for anything graph like, including assembly graphs is a vector of nodes (in the case of assembly graphs they have a name and some sequence attached), and then some form of adjecency list, often a vector of vectors containing the conncections.

Pushing the boat out, I read in "The Art of Computer Programming" that you can actually store the entirety of a tree's structure (providing some constraints) in a single bit-vector. But that's some next level micro-optimisation trip that is likely not worth the human hours!

@mkborregaard
Copy link
Member Author

I'd think that an immutable Node struct wrapping a StaticArray of connections would be stack-allocated, no?

@TransGirlCodes
Copy link

TransGirlCodes commented Sep 14, 2018

It could be, we'd have to check, but when I was using linked-lists this was mcuh older version of julia. We can probably easily get around the issues I had before now.

@mkborregaard
Copy link
Member Author

If you don't have directionality in the edges for unrooted trees, you essentially need to have two different tree formats (for rooted and unrooted), or you can't work out locally from a node of a rooted tree which its descendants are without global knowledge of the tree structure, surely?

I think that's true. I was just (I think) adressing this concern that you'd build invalid trees if you didn't store all the nodes somewhere.

@mkborregaard
Copy link
Member Author

There's an ROpenSci package that aims to replace ape with a structure based on linked lists https://github.com/ropensci/phylogram

@TransGirlCodes
Copy link

TransGirlCodes commented Sep 14, 2018 via email

@richardreeve
Copy link
Member

Yes, the garbage collector was my concern too - hopefully that's no longer a problem. I'll give it a try anyway. Fortunately the Node being immutable doesn't stop us mutating the metadata inside, but obviously the StaticArrays will mean there a lot of reallocating going on if we allow polytomies (and modify trees). I still don't really understand gc in Julia as I've said before, so I'll start with normal vectors and go from there...

@mkborregaard
Copy link
Member Author

I checked and types wrapping a StaticArray seem to be heap allocated...

@mkborregaard
Copy link
Member Author

You can't use an SVector, as they are immutable, and if your linked list needs to be linked backwards and forwards you'd need all other nodes to be defined before defining your node, which can't be done... But you could use an MVector, where only the size is immutable. That would allow the Node type to be parametric on the number of descendants, fixing that in the type information. That would make it visible to the compiler whether something is a dichotomous node, a tip node etc. Does that make sense?

@richardreeve
Copy link
Member

Quick note to say master now does traversals preorder and postorder (as well as inorder and breadthfirst) using traversal(tree, order) starting at the root by default, or at the node passed or named in the optional third argument.

@mkborregaard
Copy link
Member Author

Awesome! wow you really are a nightowl btw

@mkborregaard
Copy link
Member Author

I'm wondering if #20 shouldn't close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants