-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ValidateVariants gvcf validation #3445
Conversation
… dont fail when not using an interval argument
8abe461
to
d292401
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice ideas. a few suggestions.
@@ -195,7 +201,15 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina | |||
final Set<String> rsIDs = getRSIDs(featureContext); | |||
|
|||
if (VALIDATE_GVCF) { | |||
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true); | |||
SimpleInterval refInterval = ref.getInterval(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finals
@@ -168,10 +168,16 @@ | |||
|
|||
private GenomeLocSortedSet genomeLocSortedSet; | |||
|
|||
// information to keep track of when validating a GVCF | |||
private int currentEnd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of keeping the end and the contig, keep a currentInterval and use the available math functions to check if the next interval abuts the current.
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true); | ||
SimpleInterval refInterval = ref.getInterval(); | ||
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set. | ||
if (refInterval.getContig().equals(currentContig) && currentEnd == (refInterval.getStart() - 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a Interval::abuts() function. please use it.
@@ -211,7 +225,13 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina | |||
@Override | |||
public Object onTraversalSuccess() { | |||
if (VALIDATE_GVCF) { | |||
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(getBestAvailableSequenceDictionary()))); | |||
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
Codecov Report
@@ Coverage Diff @@
## master #3445 +/- ##
==============================================
+ Coverage 80.287% 80.31% +0.023%
- Complexity 17633 17647 +14
==============================================
Files 1178 1178
Lines 63847 63854 +7
Branches 9928 9930 +2
==============================================
+ Hits 51261 51281 +20
+ Misses 8631 8628 -3
+ Partials 3955 3945 -10
|
back to you @yfarjoun |
final SimpleInterval refInterval = ref.getInterval(); | ||
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set. | ||
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) { | ||
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(refInterval.getContig(), previousInterval.getEnd(), vc.getEnd()), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused ad to why you are adding an interval in both cases of the if statment. I thought that the logic should be
if( contiguous)
merge currentInterval into previous one
else
add previous to set
replace previous with current
end
Specifically, only one branch add the interval.
ofcourse, you still need to remember to add at the end..of the traversal. Are you doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic I was going for was always add the current interval to the set, but use the mergeIfIntervalOverlaps
boolean in set.add()
in our favor if the previous and current interval are adjacent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically because i wanted all the set creation logic in one place and not have a straggler in onTraversalSuccess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it. Could you add comments to that effect? (I wonder who didn't comment about this idea previously?!!)
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true); | ||
final SimpleInterval refInterval = ref.getInterval(); | ||
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set. | ||
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's some pre-optimizing going on here. Please create currentInterval
as a local final variable
Use it for checking overlap and for merging or replacing as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that I understand what you were trying to do, I think that the comment is not clarifying. you need to state that we would like to merge abutting intervals, but that the genomeLocSortedSet only knows how to merge overlapping intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that the difference between these two lines is so small, perhaps you should define the start with a ternary:
final long start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ?
previousInterval.getEnd() : refInterval.getStart()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nits and then you're good to go!
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true); | ||
final SimpleInterval refInterval = ref.getInterval(); | ||
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set. | ||
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that the difference between these two lines is so small, perhaps you should define the start with a ternary:
final long start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ?
previousInterval.getEnd() : refInterval.getStart()
Oh I really like that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this looks good. Please check that the speed is improved before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsotobroad I'm confused by this pr since it seems like it's just adding the same number of things to the set, just changing the boundary on them. It's not clear to me that it does what it's meant to.
@Override | ||
public void onTraversalStart() { | ||
if (VALIDATE_GVCF) { | ||
SAMSequenceDictionary dict = getBestAvailableSequenceDictionary(); | ||
dict = getBestAvailableSequenceDictionary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't store this, I would just call getBestAvailableSequenceDictionary() again when you need it again. It's not used in any sort of tight loop, so the expense is negligible. Adding fields should always be a last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set. | ||
final int start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ? | ||
previousInterval.getEnd() : refInterval.getStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this change, doesn't this still add the same number of elements to the set? Does the set do internal merging somehow? If so, why is doing the merging externally necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does the same number of .add()
calls but the number of elements in the set is smaller. The set internally merges things that overlap with the true
in the .add()
call. In a gvcf most records are adjacent to each other so nothing was being merged and we had a huge set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if we notice the current interval is adjacent to the previous one, we "overlap" them before we add to the set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for the explanation, that makes sense. Could you add a more detailed comment explaining that, because i suspect the inner workings of the GenomeLocSortedSet
are not well known to most people.
if (intervalArgumentCollection.intervalsSpecified()){ | ||
intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(dict))); | ||
} else { | ||
intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromSequenceDictionary(dict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add an integration test that exercises this new path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
positive and negative tests added
@yfarjoun this is a memory optimization but I have confirmed it uses less memory and runs at the same speed |
// GenomeLocSortedSet will automatically merge intervals that are overlapping when setting `mergeIfIntervalOverlaps` | ||
// to true. In a GVCF most blocks are adjacent to each other so they wouldn't normally get merged. We check | ||
// if the current record is adjacent to the previous record and "overlap" them if they are so our set is as | ||
// small as possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...while still containing the same bases."
@Override | ||
public void onTraversalStart() { | ||
if (VALIDATE_GVCF) { | ||
SAMSequenceDictionary dict = getBestAvailableSequenceDictionary(); | ||
SAMSequenceDictionary seqDictionary = getBestAvailableSequenceDictionary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
@@ -211,7 +224,14 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina | |||
@Override | |||
public Object onTraversalSuccess() { | |||
if (VALIDATE_GVCF) { | |||
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(getBestAvailableSequenceDictionary()))); | |||
final GenomeLocSortedSet intervalArgumentGenomeLocSortedSet; | |||
SAMSequenceDictionary seqDictionary = getBestAvailableSequenceDictionary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
I'm happy with this except for the missing |
final SimpleInterval refInterval = ref.getInterval(); | ||
|
||
// GenomeLocSortedSet will automatically merge intervals that are overlapping when setting `mergeIfIntervalOverlaps` | ||
// to true. In a GVCF most blocks are adjacent to each other so they wouldn't normally get merged. We check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, this is helpful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* merge adjacent blocks when validating GVCFs so we use less memory and dont fail when not using an interval argument
Merge adjacent blocks when validating GVCFs so we use less memory and dont fail when not using an interval argument