Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype of JunctionTree based haplotype finding #6034

Merged
merged 7 commits into from
Oct 21, 2019

Conversation

jamesemery
Copy link
Collaborator

@jamesemery jamesemery commented Jul 12, 2019

This is a prototype of the basic infrastructure that must go in to make the junction tree based Haplotype finding work. I have pulled out a toggle for the HaplotypeCaller that that enables a separate ReadThreadingAssembler codepath for haplotype finding. Right now when this mode is enabled ExperimentalReadThreadingAssembler is used in conjunction with JuncitonTreeKBestHalotypeFinder to extract only haplotypes that show up in our junction trees with evidence of > 3 reads. This still poses problems with dangling end recovery as definitionally those branches never include complete junction tree data.

I will continue to work on this branch (as it is in a somewhat rough state still) but I would like to at least get some eyes on it before i get too deep in the weeds to at least validate the structural approach I have chosen.

Currently known issues in this branch:

  • Tests are failing due to resolution of non-unique reference sink vertexes, I would solicit help as to how best to resolve the case where junction trees point to both a reference stop allele and a continued path.
  • There is at least one very degenerate edge case that might cause the code to hang, I would also ask after what is the best way to close out of looping assembly structures that never have reads to close them (i.e. a "dangling end" hom-var that happens to point to a non-unique reference base).
  • Probably after discussion the threshold for discarding junction trees will be changed to instead use paths from the discarded tree first.

Resolves #5925

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I'm not at any particular stopping point, just submitting what I have so far.

@jamesemery
Copy link
Collaborator Author

@davidbenjamin @jonn-smith I have pushed the latest version of the code. Most of the changes since this was last shown are adding checks to about every level of the code for infinite loops (several places in the dangling end recovery code, and several places the new BestHaplotypeFinder). Additionally tests have been updated to capture these cases as well as several of the changes to functionality (no longer forcing reference start kmer to have a junction tree, limiting the cases where we actually attempt to follow an unsupported reference path when constructing a haplotype, reintroducing reference path weight to the calculation, etc...).

// Graph to be operated on, in this case cast as an ExperimentalReadThreadingGraph
ExperimentalReadThreadingGraph experimentalReadThreadingGraph;

public JunctionTreeKBestHaplotypeFinder(final BaseGraph<V, E> graph, final Set<V> sources, Set<V> sinks, final int branchWeightThreshold) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to self: this branchWeightThreshold in the constructor is currently not actually hooked up, please fix when responding to the comments in the branch.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a somewhat coherent stopping point. I have finished reviewing haplotype finding and have yet to do graph construction. That is, I reviewed the use of junction trees but not their generation.

*/
public class JTBestHaplotype<V extends BaseVertex, E extends BaseEdge> extends KBestHaplotype<V, E> {
private JunctionTreeView treesInQueue; // An object for storing and managing operations on the queue of junction trees active for this path
private int decisionEdgesTakenSinceLastJunctionTreeEvidence;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to both of us that we decided to replace this logic with "no revisiting edges without a junction tree telling us to do so"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed this decision into my next branch dependent on this one. It will come with the additional changes made to recover paths dropped by the junction tree traversals. For what its worth it is already implemented there.

throw new GATKException("While constructing graph, there was an incongruity between a JunctionTree edge and the edge present on graph traversal");
}

// Don't add edges to the symbolic end vertex here at all, thats handled elsewhere, also don't add the same edge again if we pulled it in from a younger tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put an {@ link} in the comment pointing to the "elsewhere"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats -> that's

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, actually, why not handle it here?

if (totalOut <= 0) {
treesInQueue.removeEldestTree();
} else { // Otherwise look at the next tree in the list
currentActiveNodeIndex++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that JunctionTreeView manages junction tree-based traversal it doesn't seem right to me that you have to keep track of this extra index externally. Is there any way to encapsulate all the state representing traversal inside JunctionTreeView?

I guess that what I mean is that JunctionTreeView could encapsulate not just the oldest tree but also the peeking ahead to younger trees above the evidence threshold.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. Why don't I make an iterator that manages deleting empty trees and fetching newer trees?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second round done. Back to @jamesemery.

* @param kmer the query kmer.
* @return {@code true} if we can start thread the sequence at this kmer, {@code false} otherwise.
*/
protected boolean isThreadingStart(final Kmer kmer, final boolean startThreadingOnlyAtExistingVertex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than stretch the definition of uniqueKmers and nonUniqueKmers, I would make this abstract and just implement it differently between the two child classes.

Copy link
Collaborator Author

@jamesemery jamesemery Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I have done the following. I pushed the code for "tracking" kmers down to the graph implementations. This means that there is now a trackKmer method as well as getNextKmerVertexForChainExtension where the actual logical differences between the two paths are resolved. Now there is a global kmer map that is owned by the interface but its up to the implementation what gets added to that map and the concept of having non-unique kmers or kmers that are otherwise disqualified is owned by the implementation classes.

I THINK this involves the least code duplication, many of these abstract methods are tiny methods which I suspect could lead to confusion. To that end I am going to ask @lbergelson to take a look at these 3 classes for his opinion once my response has been pushed.

* @return {@code true} to indicate that the some sink vertex is reachable by {@code currentVertex},
* {@code false} otherwise.
*/
private boolean findGuiltyVerticesAndEdgesToRemoveCycles(final BaseGraph<V, E> graph,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Indeed I would also comment that the vertexes that don't go to sinks can't actually occur at this stage in the code and would have been pruned before. Ultimately my answer to this would probably be to remove them entirely though since it doesn't matter in the new assembler....

@jamesemery
Copy link
Collaborator Author

@davidbenjamin Responded to most of your comments.

@jamesemery jamesemery force-pushed the je_ExperimentalModeMovedToEdgesInstead branch from 7c91e9a to fe36a8f Compare September 18, 2019 15:06
@jamesemery
Copy link
Collaborator Author

@davidbenjamin Rebased the branch

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I have to run for now but I have some initial comments. They're mostly nitpicky. Looks sane to me but there are some things that could be cleaner.

protected boolean startThreadingOnlyAtExistingVertex = false;
protected List<MultiDeBruijnVertex> referencePath = null;

private int maxMismatchesInDanglingHead = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general comments. There are a bunch of methods here that could be made private. Make them private. There are a some VisibleForTesting methods that could be made package protected, do it.

Group the abstract methods together so it's clear what's implemented in downstream classes.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test class should go into TestUtils instead of the main source packages I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, though its going to involve making a bunch of methods and fields protected rather than package protected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Eh, either way I guess.

final Collection<Kmer> nonUniquesFromSeq = determineNonUniqueKmers(sequenceForKmers, kmerSize);
if ( nonUniquesFromSeq.isEmpty() ) {
// remove this sequence from future consideration
it.remove();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remove looks weird and buggy, the iterator is on a copy of the sequences and as far as I can tell nothing looks at it again so the remove is pointless. If it was supposed to do something it probably isn't.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I think my solution would be to avoid touching that code because it was handed down by our forefathers and I would rather not affect the old graph code wherever possible.

* @return a non-null NonUniqueResult
*/
private NonUniqueResult determineKmerSizeAndNonUniques(final int minKmerSize, final int maxKmerSize) {
private Set<Kmer> determineNonUniques(final int kmerSize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could be simplified to be static and the kmers as input.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, though given that its private anyway and will only ever be called from a non-static context being fed all of the sequences i'm not sure what making it static really accomplishes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes it more clear that there's something wrong with the method when you do that. It's fine to do nothing here though. I just noticed this method was weird and wanted to make a note of it.

@davidbenjamin
Copy link
Contributor

@jamesemery Good to go. Suggestions for that argument:

  • --use-junction-trees
  • --find-haplotypes-with-phasing
  • --linked-de-bruijn-graph
  • --junction-tree-graph-traversal

I don't have anything particularly catchy.

@jamesemery jamesemery force-pushed the je_ExperimentalModeMovedToEdgesInstead branch from 566afdb to 8a45c50 Compare October 18, 2019 16:24
@jamesemery
Copy link
Collaborator Author

Given that this branch has been reviewed and that I would like to get this in so the next round of changes can be pushed into a PR this branch can be merged I think. No guarantees that there aren't lurking issues that might cause this code not to work as-is but the argument to enable this is hidden I'm willing to live with the consequences for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use junction trees to find best haplotypes
4 participants