matUtils updates: new subcommand "fix"; improvements to MAT::Tree:move_node and mask --move-nodes #357
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This bundles several improvements to matUtils. Let me know if you would prefer for them to be pulled out into separate PRs.
I added a new subcommand
fix
to find and fix a recurring pattern that causes trouble for lineage designation: often there is a node whose mutation is simply a reversion of its grandparent's mutation, for example the final node in a path like this:... > A1C > G2T > C3A > T2G
In that case it would be preferable to move that final node to
... > A1C > C3A
, i.e. to move it to become a child of its great-grandparent with only its parent's mutation. As long as the grandparent, parent and final node all have only one mutation each, that operation is parsimony-preserving. It represents two independent occurrences of C3A on the A1C branch, as opposed to only one occurrence of C3A followed by an immediate reversion of the previous mutation in the path. This is motivated by the Pango lineage team's experience of SARS-CoV-2 with advantageous Spike mutations occurring multiple times and causing an increase in transmissions.I found that MAT::Tree::collapse_tree was moving nodes without first looking to see if the new parent node already had a child with the same mutation(s) as the incoming node. This could result in the new parent node having multiple children with the same mutation(s), which causes incorrect structure and trouble for correctly annotating lineages. So I updated MAT::Tree:move_node to first search the new parent node's children for the same mutations as the moved node. Both the existing child and the moved node could be either an internal node or a leaf node (sample); all combinations are handled. When both nodes are internal nodes, the children of the moved node become children of the new parent's existing child, and the moved node is then removed. Since the moved node might be removed, MAT::Tree::collapse_tree could no longer use BFS order, so I changed it to start with the deepest nodes and work back towards root.
mask --move-nodes
used to require that the new parent have exactly the same set of mutations as the original parent. I added support for a new parent with a strict subset of the original parent's mutations, adding the original parent's extra mutations to the moved node. [Then I could manually make moves like the newfix
subcommand's moves. I implemented this beforefix
, for testing.] I also found thatmask --mask-mutations
was doing more copying than necessary and sped it up a bit.