-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design of Phylo #14
Comments
Profiling the plotting calls shows most of the time is spent in |
No surprise there. As I said before, none of this was designed at all, much less designed for speed or elegance... I just needed something quickly once I'd given up on the other packages and I knew nothing about phylogenetics - hence not addressing all of these traversal issues either - it's just using the order from the A minor point though - as far as Anyway, making this a linked list or using |
PS Thanks a lot for all of these comments by the way - they’re very useful. No-one (including me!) has cared about how this code worked before, so it’s just sat there looking ugly and unloved! |
It really doesn't 😂 I bet there's a better way of expressing that. To me the great epiphany (and I may be wrong so I'm looking forward to your opinion) is that MetaGraphs would allow to merge Phylogenies into this package. Given that we can most likely ignore PhyloNetworks and PhyloTrees, this means that there could be scope for one unified concrete implementation of phylogenies in Julia. With the strong visual identity made possible by the plotting I am adding, this could be the basis of a Software Note paper that might become very highly cited. |
Okay... but why does R allow this then? > library(ape)
> read.tree(text="(((a,b),(c,d)),(e,f));")
Phylogenetic tree with 6 tips and 5 internal nodes.
Tip labels:
[1] "a" "b" "c" "d" "e" "f"
Rooted; no branch lengths.
> read.tree(text="((a,b),(c,d)),(e,f);")
Phylogenetic tree with 6 tips and 4 internal nodes.
Tip labels:
[1] "a" "b" "c" "d" "e" "f"
Rooted; no branch lengths.
> read.tree(text="(a,b),(c,d),(e,f);")
Phylogenetic tree with 6 tips and 3 internal nodes.
Tip labels:
[1] "a" "b" "c" "d" "e" "f"
Rooted; no branch lengths. I think what these newick strings represent is one, two and three trees inside a single file - as evidenced by the decreasing number of internal nodes in the later trees. Is that right? Cutting trees at a certain time in the past (because history back to the overall root is meaningless) is a useful thing for me to be able to do (and googling, people do it using |
Okay, it seems pretty clear that trees should be or use some kind of |
Because R sucks? No, seriously, cutting trees is important, but I think it's much clearer if one tree corresponds to one object. I guess you might have a BagOfNodes type which is just a lot of nodes and connections, but that's not a tree IMHO. Also, as your example shows, trees may have a core property of being rooted or unrooted, but what if some of the trees in an object are rooted? The In Ben Ward's original linked list implementation of Phylogenies, he'd solved that elegantly by a |
Sounds interesting with your experiments with MetaGraphs. But I don't understand
MetaGraph should allow you to use any kind of |
Maybe this comment is relevant: https://discourse.julialang.org/t/lightgraphs-vs-simplegraphs/14643/4 |
As you've seen, the underlying problem is the AbstractSimpleGraph, which requires integer (and possibly contiguous?) labels, and the fact that MetaGraphs uses a SimpleGraph as its underlying graph type. Your thread does link to a possibility but SimpleGraphs is dead (and not in the JuliaGraphs family) and the other proposed solution has problems of its own - just getting information about a single node becomes a search over all of the values in a Dict, which makes my current code look positively efficient... though maybe you're talking about the fact that I'm increasingly thinking about creating a new |
As far as your comment on legal trees is concerned, I think your solution means we can never build a tree from the ground up using the tree structure. If we start with the tips on their own, it's already illegal and can't be manipulated (or even stored at all from what you said?) because there's more than one root, and the pointer is only to the root apparently. If we delete a branch, we have two trees and it's illegal. If we add a tip, ditto. I don't find that an elegant solution tbh unless I'm misunderstanding something important. None of these problems exist for me (though lots of others do, don't get me wrong!) - I can store intermediate groups-of-subtrees in my tree, and that's a useful thing to be able to do. Using an underlying graph package obviously easily allows us to hold multiple sub-graphs in a single object, so nothing will change there. There are also obvious advantages to having them all together - apart from building and manipulating trees using the tree structure we create, I have datasets where I want to be able to represent all of my species, but I don't have a supertree, only a series of trees at the family or order level, for instance. They should be stored and plotted and generally represented together. You can always make a strict tree type like my BinaryTree that is fussy about this. Incidentally, this is quite distinct from my Fundamentally, I don't understand what the problem is with having multiple subtrees in a single tree object? PS I would enforce no common node names, and define the tree object as being rooted only if all subtrees were rooted I imagine, but those are pretty minor things. |
Hm, that sounds like MetaGraphs don't quite live up to my expectation. Maybe the simple linked-list design you suggested, with metadata stored in a dict, might be easier to deal with after all. I am not 100% sure why @benjward abandoned that approach over using LightGraphs but probably for the same reasons that leads me to think it might be a good idea. |
The good news is that so long as I implement the AbstractGraph interface, we get all of the goodness (if not the speed) anyway... |
It's frankly hard to imagine anything being faster than a linked Node list (using StaticArrays and dropping the notion of "branch") with a |
Based on the diminishing returns on large arrays with using StaticArrays, and the growing returns of some kind of SparseArray (and it being a stdlib module), may be best to explore that direction instead.
|
Well I mean that the connections of a node should be in a StaticArray, they are going to be very small, mostly just n = 2. |
@richardreeve you'd just start the tree with an arbitrary node. Trees need not be rooted, and in fact in many tree-building algorithms you build an unrooted tree and just specify a node as the root at the very end (or don't, if you don't have a proper outgroup). That's not doable in the current design, because your division into "outbounds" and "inbounds" explicitly assumes that there's a root on the tree. In fact, an "inbound" node is just one that will be traversed earlier in a traversal starting at the root. I agree that a BagOfNodes is distinct from a TreeSet - I like the latter type a lot. But to me, having an AbstractTree be a bag of nodes that may or may not be connected really doesn't feel right. |
Okay. In any event, I don't think this is any kind of dealbreaker, because we can just have two types of AbstractTree (well, three, because TreeSet is also an AbstractTree at the moment), one with exactly one tree in it, and one with one or more (BagOfTrees may be too frivolous - or maybe not? - but along those lines). As far as the inbound and outbound business is concerned, every other in-memory and on-disk tree implementation I looked at had at least implicit directionality in its edges (though I only looked at half a dozen or so) - the whole newick/nexus format is basically explicitly rooted for instance. I couldn't really understand how that fitted in with the concept of unrootedness, but I just shrugged and copied it since this isn't my research area. If you don't have directionality in the edges for unrooted trees, you essentially need to have two different tree formats (for rooted and unrooted), or you can't work out locally from a node of a rooted tree which its descendants are without global knowledge of the tree structure, surely? |
I'm all for preserving and being compatible with the AbstractGraph interface, i.e. AbstractPhylogeny <: AbstractGraph, and then defining specific phylogeny structs (maybe even more than one for different use cases). Regarding tree editing and intermediate invalid trees, what we have in assembly graphs at the moment is similar - an assembly has to start building from some point. So we allow for invalid structures to occur, but when plugging in such a graph into some analysis function that assumes the thing is valid, then a sanity check gets performed. The reason I stopped using a linked list in the end was the allocation of gc'd objects, and then the indirections when traversing the linked list. Pushing the boat out, I read in "The Art of Computer Programming" that you can actually store the entirety of a tree's structure (providing some constraints) in a single bit-vector. But that's some next level micro-optimisation trip that is likely not worth the human hours! |
I'd think that an immutable Node struct wrapping a StaticArray of connections would be stack-allocated, no? |
It could be, we'd have to check, but when I was using linked-lists this was mcuh older version of julia. We can probably easily get around the issues I had before now. |
I think that's true. I was just (I think) adressing this concern that you'd build invalid trees if you didn't store all the nodes somewhere. |
There's an ROpenSci package that aims to replace ape with a structure based on linked lists https://github.com/ropensci/phylogram |
I think I reviewed it :)
… On 14 Sep 2018, at 16:32, Michael Krabbe Borregaard ***@***.***> wrote:
There's an ROpenSci package that aims to replace ape with a structure based on linked lists https://github.com/ropensci/phylogram <https://github.com/ropensci/phylogram>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#14 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADPejWkVzIw7XvLLOsgCEcTCuB8jq-79ks5ua8wZgaJpZM4WjNRd>.
|
Yes, the garbage collector was my concern too - hopefully that's no longer a problem. I'll give it a try anyway. Fortunately the Node being immutable doesn't stop us mutating the metadata inside, but obviously the StaticArrays will mean there a lot of reallocating going on if we allow polytomies (and modify trees). I still don't really understand gc in Julia as I've said before, so I'll start with normal vectors and go from there... |
I checked and types wrapping a StaticArray seem to be heap allocated... |
You can't use an SVector, as they are immutable, and if your linked list needs to be linked backwards and forwards you'd need all other nodes to be defined before defining your node, which can't be done... But you could use an MVector, where only the size is immutable. That would allow the Node type to be parametric on the number of descendants, fixing that in the type information. That would make it visible to the compiler whether something is a dichotomous node, a tip node etc. Does that make sense? |
Quick note to say master now does traversals preorder and postorder (as well as inorder and breadthfirst) using |
Awesome! wow you really are a nightowl btw |
I'm wondering if #20 shouldn't close this? |
Here are some comments on Phylo that came up while doing the plotting code, to start a conversation.
First, and most importantly, the duality between nodes and nodenames is quite confusing and difficult to navigate, and 90% of my coding troubles have stemmed from difficulties in understanding when you'd need a name and when you'd need the actual node. (Like, to get the root node I'd have to go
first(nodenamefilter(isroot, tree))
:-O). Also a lot of your interface functions seem geared for navigating just this, and could be avoided. And using the node names seems perfectly superfluous. I think your idea of reimplementing them as a linked list is a very big improvement, but I suspect using MetaGraphs might be even better.This would also bridge the gap to Phylogenies in Bio.jl which is built on LightGraphs, in fact we could probably use that package to replace the "guts" of Phylo, resulting in a new, stronger, hybrid to rule the juliaverse. And still keep all parts of your interface that you do want to keep. The fact that they are largely complementary should help.
I wouldn't mind helping with this.
The other issue is that the node iterators seem to not follow the traditional "depth-first" or "breath-first" patterns from the root to the tips, instead they just seem to start from some tip and go on! Algorithmically there are some great advantages to using those patterns, and it's less confusing. But I may be wrong here, maybe I'm confused. Oh and they're already implemented in MetaGraphs.
Finally I've put some recursive root-to-tips and tips-to-roots functions in the code, which are really useful for doing certain things.
The text was updated successfully, but these errors were encountered: