-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replaced several for-each loops in VariantContext.Make() based on HaplotypeCaller profiling #1234
Replaced several for-each loops in VariantContext.Make() based on HaplotypeCaller profiling #1234
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1234 +/- ##
===============================================
- Coverage 69.329% 69.323% -0.006%
- Complexity 8104 8105 +1
===============================================
Files 543 543
Lines 32477 32480 +3
Branches 5500 5501 +1
===============================================
Hits 22516 22516
- Misses 7753 7755 +2
- Partials 2208 2209 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery Some comments. It makes me sad to turn clean readable code into gross doubly indexed loops...
|
||
boolean sawNoCall = false, sawMultipleAlleles = false; | ||
Allele observedAllele = null; | ||
Allele firstAllele = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this rename makes sense. It's not the first allele, but the first non-no-call allele.
@@ -823,8 +823,8 @@ private boolean hasAllele(final Allele allele, final boolean ignoreRefState, fin | |||
return true; | |||
|
|||
final List<Allele> allelesToConsider = considerRefAllele ? getAlleles() : getAlternateAlleles(); | |||
for ( Allele a : allelesToConsider ) { | |||
if ( a.equals(allele, ignoreRefState) ) | |||
for ( int i = 0; i < allelesToConsider.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole block can probably be replaced with a call to allelesToConsider.contains() instead which would be just as fast but less gross.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats a good point, the code is ~identical too it looks like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, though I agree it would be nicer you can't provide a lambda to contains, and it just uses the stock equality method, but here we must use the overload with ignoreRefState
@@ -1687,7 +1688,12 @@ public boolean hasSymbolicAlleles() { | |||
} | |||
|
|||
public static boolean hasSymbolicAlleles( final List<Allele> alleles ) { | |||
return alleles.stream().anyMatch(Allele::isSymbolic); | |||
for (int i = 0; i < alleles.size(); i++ ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if performance is really that critical here, I would pull the call to size into the initializer so it isn't repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thats probably the right choice. This method in particular was showing up egregiously in the profiler considering what it does.
@@ -1485,9 +1485,10 @@ public String toStringWithoutGenotypes() { | |||
|
|||
boolean sawRef = false; | |||
for ( final Allele a : alleles ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be made into a doubly indexed array? wouldn't that be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would, but the input to that method is a collection, not a list so I don't actually have the means to access it by index and I don't want to change the VariantContext API to only accept a list. Getting rid of the iteration over the second list seems to have accounted for ~half of the runtime this method was taking up.
if ( g.isAvailable() ) { | ||
for ( Allele gAllele : g.getAlleles() ) { | ||
for ( int i = 0; i < genotypes.size(); i++ ) { | ||
if ( genotypes.get(i).isAvailable() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull out the get(i) as a variable if speed is so essential here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the inner loop be made index because that will be the most essential one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I missed that. Though its worth mentioning that most of the improvement in this method came from genotypes.isAvailible which was more expensive than the inner loop. I'll extract it out however since we are here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say this is probably the grosses of the unreadable code after this change...
…h slightly faster alternatives where it made a difference for the Haplotype Caller.
5dca768
to
b68d1ac
Compare
@lbergelson Responded to your comments. Let me know if there is anything else I should do for the branch. And I am very sorry about the readability cost, the doubly indexed list is a sad change that I wasn't necessarily going to implement but you did want it... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate it but it seems like a real, significant, performance improvement.
These are micro optimizations that add up to a pretty small change in the HaplotypeCaller (Exome +GVCF) profile when they are implemented. Here are some profiling snippets:
Before:
After:
Then some runtime examples on my laptop just chromosome 15 of the same bam:
Before:
4m26s, 4m16s, 4m18s, 4m15s, 4m28s
After:
4m04s, 4m09s, 4m15s, 4m11s, 4m07