Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValidateVariants gvcf validation #3445

Merged
merged 9 commits into from
Aug 18, 2017
Merged

Conversation

jsotobroad
Copy link
Contributor

Merge adjacent blocks when validating GVCFs so we use less memory and dont fail when not using an interval argument

@jsotobroad jsotobroad force-pushed the js_optimize_validate_variants branch from 8abe461 to d292401 Compare August 16, 2017 19:36
Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ideas. a few suggestions.

@@ -195,7 +201,15 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina
final Set<String> rsIDs = getRSIDs(featureContext);

if (VALIDATE_GVCF) {
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true);
SimpleInterval refInterval = ref.getInterval();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finals

@@ -168,10 +168,16 @@

private GenomeLocSortedSet genomeLocSortedSet;

// information to keep track of when validating a GVCF
private int currentEnd;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of keeping the end and the contig, keep a currentInterval and use the available math functions to check if the next interval abuts the current.

genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true);
SimpleInterval refInterval = ref.getInterval();
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set.
if (refInterval.getContig().equals(currentContig) && currentEnd == (refInterval.getStart() - 1)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a Interval::abuts() function. please use it.

@@ -211,7 +225,13 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina
@Override
public Object onTraversalSuccess() {
if (VALIDATE_GVCF) {
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(getBestAvailableSequenceDictionary())));
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

@codecov-io
Copy link

codecov-io commented Aug 16, 2017

Codecov Report

Merging #3445 into master will increase coverage by 0.023%.
The diff coverage is 91.667%.

@@              Coverage Diff               @@
##              master    #3445       +/-   ##
==============================================
+ Coverage     80.287%   80.31%   +0.023%     
- Complexity     17633    17647       +14     
==============================================
  Files           1178     1178               
  Lines          63847    63854        +7     
  Branches        9928     9930        +2     
==============================================
+ Hits           51261    51281       +20     
+ Misses          8631     8628        -3     
+ Partials        3955     3945       -10
Impacted Files Coverage Δ Complexity Δ
...r/tools/walkers/variantutils/ValidateVariants.java 82.353% <91.667%> (+2.353%) 32 <0> (+4) ⬆️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 74.342% <0%> (-1.974%) 38% <0%> (ø)
...bender/tools/spark/sv/evidence/ReadClassifier.java 86.667% <0%> (+1.333%) 33% <0%> (+1%) ⬆️
...er/tools/spark/sv/evidence/BreakpointEvidence.java 80.702% <0%> (+2.193%) 13% <0%> (ø) ⬇️
...te/hellbender/tools/spark/sv/utils/SVInterval.java 90.244% <0%> (+2.439%) 27% <0%> (+1%) ⬆️
...ools/spark/sv/evidence/IntervalCoverageFinder.java 88.462% <0%> (+3.846%) 8% <0%> (+1%) ⬆️
...bender/tools/spark/sv/evidence/KSWindowFinder.java 98.039% <0%> (+3.922%) 18% <0%> (ø) ⬇️
...ellbender/tools/spark/sv/evidence/QNameFinder.java 95.455% <0%> (+4.545%) 9% <0%> (+1%) ⬆️
...lbender/tools/spark/sv/evidence/QNameKmerizer.java 95.238% <0%> (+4.762%) 9% <0%> (+1%) ⬆️
... and 2 more

@jsotobroad
Copy link
Contributor Author

back to you @yfarjoun

@jsotobroad jsotobroad requested a review from lbergelson August 16, 2017 20:50
final SimpleInterval refInterval = ref.getInterval();
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set.
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) {
genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(refInterval.getContig(), previousInterval.getEnd(), vc.getEnd()), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused ad to why you are adding an interval in both cases of the if statment. I thought that the logic should be

if( contiguous)
  merge currentInterval into previous one
else
  add previous to set 
  replace previous with current
end

Specifically, only one branch add the interval.

ofcourse, you still need to remember to add at the end..of the traversal. Are you doing that?

Copy link
Contributor Author

@jsotobroad jsotobroad Aug 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic I was going for was always add the current interval to the set, but use the mergeIfIntervalOverlaps boolean in set.add() in our favor if the previous and current interval are adjacent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically because i wanted all the set creation logic in one place and not have a straggler in onTraversalSuccess

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Could you add comments to that effect? (I wonder who didn't comment about this idea previously?!!)

genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true);
final SimpleInterval refInterval = ref.getInterval();
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set.
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some pre-optimizing going on here. Please create currentInterval as a local final variable

Use it for checking overlap and for merging or replacing as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that I understand what you were trying to do, I think that the comment is not clarifying. you need to state that we would like to merge abutting intervals, but that the genomeLocSortedSet only knows how to merge overlapping intervals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that the difference between these two lines is so small, perhaps you should define the start with a ternary:

final long start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ? 
previousInterval.getEnd() : refInterval.getStart()

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nits and then you're good to go!

genomeLocSortedSet.add(genomeLocSortedSet.getGenomeLocParser().createGenomeLoc(ref.getInterval().getContig(), ref.getInterval().getStart(), vc.getEnd()), true);
final SimpleInterval refInterval = ref.getInterval();
// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set.
if (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that the difference between these two lines is so small, perhaps you should define the start with a ternary:

final long start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ? 
previousInterval.getEnd() : refInterval.getStart()

@jsotobroad
Copy link
Contributor Author

Oh I really like that

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this looks good. Please check that the speed is improved before merging.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsotobroad I'm confused by this pr since it seems like it's just adding the same number of things to the set, just changing the boundary on them. It's not clear to me that it does what it's meant to.

@Override
public void onTraversalStart() {
if (VALIDATE_GVCF) {
SAMSequenceDictionary dict = getBestAvailableSequenceDictionary();
dict = getBestAvailableSequenceDictionary();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't store this, I would just call getBestAvailableSequenceDictionary() again when you need it again. It's not used in any sort of tight loop, so the expense is negligible. Adding fields should always be a last resort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// Take advantage of adjacent blocks and just merge them so we dont have to keep so many objects in the set.
final int start = (previousInterval != null && previousInterval.overlapsWithMargin(refInterval, 1)) ?
previousInterval.getEnd() : refInterval.getStart();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this change, doesn't this still add the same number of elements to the set? Does the set do internal merging somehow? If so, why is doing the merging externally necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does the same number of .add() calls but the number of elements in the set is smaller. The set internally merges things that overlap with the true in the .add() call. In a gvcf most records are adjacent to each other so nothing was being merged and we had a huge set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if we notice the current interval is adjacent to the previous one, we "overlap" them before we add to the set

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you for the explanation, that makes sense. Could you add a more detailed comment explaining that, because i suspect the inner workings of the GenomeLocSortedSet are not well known to most people.

if (intervalArgumentCollection.intervalsSpecified()){
intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(dict)));
} else {
intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromSequenceDictionary(dict);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add an integration test that exercises this new path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positive and negative tests added

@jsotobroad
Copy link
Contributor Author

@yfarjoun this is a memory optimization but I have confirmed it uses less memory and runs at the same speed

// GenomeLocSortedSet will automatically merge intervals that are overlapping when setting `mergeIfIntervalOverlaps`
// to true. In a GVCF most blocks are adjacent to each other so they wouldn't normally get merged. We check
// if the current record is adjacent to the previous record and "overlap" them if they are so our set is as
// small as possible.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...while still containing the same bases."

@Override
public void onTraversalStart() {
if (VALIDATE_GVCF) {
SAMSequenceDictionary dict = getBestAvailableSequenceDictionary();
SAMSequenceDictionary seqDictionary = getBestAvailableSequenceDictionary();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

@@ -211,7 +224,14 @@ public void apply(final VariantContext vc, final ReadsContext readsContext, fina
@Override
public Object onTraversalSuccess() {
if (VALIDATE_GVCF) {
GenomeLocSortedSet intervalArgumentGenomeLocSortedSet = GenomeLocSortedSet.createSetFromList(genomeLocSortedSet.getGenomeLocParser(), IntervalUtils.genomeLocsFromLocatables(genomeLocSortedSet.getGenomeLocParser(), intervalArgumentCollection.getIntervals(getBestAvailableSequenceDictionary())));
final GenomeLocSortedSet intervalArgumentGenomeLocSortedSet;
SAMSequenceDictionary seqDictionary = getBestAvailableSequenceDictionary();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

@yfarjoun
Copy link
Contributor

I'm happy with this except for the missing finals. @lbergelson ?

final SimpleInterval refInterval = ref.getInterval();

// GenomeLocSortedSet will automatically merge intervals that are overlapping when setting `mergeIfIntervalOverlaps`
// to true. In a GVCF most blocks are adjacent to each other so they wouldn't normally get merged. We check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this is helpful!

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jsotobroad jsotobroad merged commit a1292ee into master Aug 18, 2017
@jsotobroad jsotobroad deleted the js_optimize_validate_variants branch August 18, 2017 16:25
jonn-smith pushed a commit that referenced this pull request Aug 29, 2017
* merge adjacent blocks when validating GVCFs so we use less memory and dont fail when not using an interval argument
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants