Replaced several for-each loops in VariantContext.Make() based on HaplotypeCaller profiling #1234

jamesemery · 2018-12-03T21:47:23Z

These are micro-optimizations that together add up to a pretty small change in the HaplotypeCaller (Exome + GVCF) profile. Here are some profiling snippets:

Before: (profiler screenshot not preserved)

After: (profiler screenshot not preserved)

Then some runtime examples on my laptop, running just chromosome 15 of the same BAM:
Before: 4m26s, 4m16s, 4m18s, 4m15s, 4m28s
After: 4m04s, 4m09s, 4m15s, 4m11s, 4m07s

codecov-io · 2018-12-03T22:01:22Z

Codecov Report

Merging #1234 into master will decrease coverage by 0.006%.
The diff coverage is 96.154%.

@@               Coverage Diff               @@
##              master     #1234       +/-   ##
===============================================
- Coverage     69.329%   69.323%   -0.006%     
- Complexity      8104      8105        +1     
===============================================
  Files            543       543               
  Lines          32477     32480        +3     
  Branches        5500      5501        +1     
===============================================
  Hits           22516     22516               
- Misses          7753      7755        +2     
- Partials        2208      2209        +1

Impacted Files	Coverage Δ	Complexity Δ
.../htsjdk/variant/variantcontext/VariantContext.java	`77.714% <100%> (+0.128%)`	`246 <3> (+2)`	⬆️
...n/java/htsjdk/variant/variantcontext/Genotype.java	`59.036% <90.909%> (ø)`	`80 <0> (ø)`	⬇️
...samtools/util/AsyncBlockCompressedInputStream.java	`72% <0%> (-4%)`	`12% <0%> (-1%)`

lbergelson

@jamesemery Some comments. It makes me sad to turn clean readable code into gross doubly indexed loops...

lbergelson · 2018-12-03T22:32:23Z

src/main/java/htsjdk/variant/variantcontext/Genotype.java


        boolean sawNoCall = false, sawMultipleAlleles = false;
-        Allele observedAllele = null;
+        Allele firstAllele = null;


I'm not sure this rename makes sense. It's not the first allele, but the first non-no-call allele.

lbergelson · 2018-12-03T22:38:53Z

src/main/java/htsjdk/variant/variantcontext/VariantContext.java

@@ -823,8 +823,8 @@ private boolean hasAllele(final Allele allele, final boolean ignoreRefState, fin
            return true;

        final List<Allele> allelesToConsider = considerRefAllele ? getAlleles() : getAlternateAlleles();
-        for ( Allele a : allelesToConsider ) {
-            if ( a.equals(allele, ignoreRefState) )
+        for ( int i = 0; i < allelesToConsider.size(); i++) {


This whole block can probably be replaced with a call to allelesToConsider.contains() instead which would be just as fast but less gross.

That's a good point, the code looks ~identical too.

Unfortunately, though I agree it would be nicer, you can't provide a lambda to contains(). It just uses the stock equality method, but here we must use the equals overload that takes ignoreRefState.
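To illustrate the point about `contains()`, here is a minimal sketch. `SketchAllele` is a hypothetical stand-in, not htsjdk's `Allele`; it only assumes a stock `equals(Object)` that respects reference status plus a two-argument overload that can ignore it, as described in the thread.

```java
import java.util.Arrays;
import java.util.List;

final class HasAlleleSketch {
    // Hypothetical stand-in for an allele type: bases plus a reference flag.
    static final class SketchAllele {
        private final String bases;
        private final boolean isRef;

        SketchAllele(final String bases, final boolean isRef) {
            this.bases = bases;
            this.isRef = isRef;
        }

        // Stock equality: reference status is part of identity.
        @Override
        public boolean equals(final Object other) {
            return other instanceof SketchAllele && equals((SketchAllele) other, false);
        }

        @Override
        public int hashCode() {
            return 31 * bases.hashCode() + Boolean.hashCode(isRef);
        }

        // Overload that can optionally ignore reference status.
        boolean equals(final SketchAllele other, final boolean ignoreRefState) {
            return bases.equals(other.bases) && (ignoreRefState || isRef == other.isRef);
        }
    }

    // Indexed loop: needed because List.contains() can only call equals(Object).
    static boolean hasAllele(final List<SketchAllele> alleles,
                             final SketchAllele query,
                             final boolean ignoreRefState) {
        for (int i = 0; i < alleles.size(); i++) {
            if (alleles.get(i).equals(query, ignoreRefState)) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        final List<SketchAllele> alleles = Arrays.asList(
                new SketchAllele("A", true), new SketchAllele("T", false));
        final SketchAllele query = new SketchAllele("A", false);
        System.out.println(alleles.contains(query));         // false: ref state differs
        System.out.println(hasAllele(alleles, query, true)); // true: ref state ignored
    }
}
```

With `ignoreRefState` set to `false` the two approaches agree in this sketch; they diverge only when the flag is `true`, which is the case `contains()` cannot express.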

lbergelson · 2018-12-03T22:41:27Z

src/main/java/htsjdk/variant/variantcontext/VariantContext.java

@@ -1687,7 +1688,12 @@ public boolean hasSymbolicAlleles() {
    }

    public static boolean hasSymbolicAlleles( final List<Allele> alleles ) {
-        return alleles.stream().anyMatch(Allele::isSymbolic);
+        for (int i = 0; i < alleles.size(); i++ ) {


If performance is really that critical here, I would pull the call to size() into the loop initializer so it isn't repeated.

Sure, that's probably the right choice. This method in particular was showing up egregiously in the profiler considering what it does.
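A minimal sketch of the suggested loop shape, with `size()` evaluated once in the initializer. It uses plain strings and a simplified, hypothetical `looksSymbolic` check rather than htsjdk's `Allele::isSymbolic`, and it assumes the list is not modified while the loop runs.

```java
import java.util.Arrays;
import java.util.List;

final class HasSymbolicSketch {
    // Simplified, hypothetical stand-in for Allele::isSymbolic.
    static boolean looksSymbolic(final String allele) {
        return allele.startsWith("<") && allele.endsWith(">");
    }

    // Indexed replacement for alleles.stream().anyMatch(...):
    // no stream or iterator is allocated, and size() is read only once.
    static boolean hasSymbolicAlleles(final List<String> alleles) {
        for (int i = 0, size = alleles.size(); i < size; i++) {
            if (looksSymbolic(alleles.get(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        System.out.println(hasSymbolicAlleles(Arrays.asList("A", "<DEL>"))); // true
        System.out.println(hasSymbolicAlleles(Arrays.asList("A", "T")));     // false
    }
}
```

Indexed access like this is only cheap for lists with constant-time `get(i)`, such as `ArrayList`; on a `LinkedList` it would be slower than the for-each version.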

lbergelson · 2018-12-04T16:25:23Z

src/main/java/htsjdk/variant/variantcontext/VariantContext.java

@@ -1485,9 +1485,10 @@ public String toStringWithoutGenotypes() {

        boolean sawRef = false;
        for ( final Allele a : alleles ) {


Should this be made into a doubly indexed loop? Wouldn't that be faster?

It would, but the input to that method is a Collection, not a List, so I don't actually have the means to access it by index, and I don't want to change the VariantContext API to only accept a List. Getting rid of the iteration over the second list seems to have accounted for ~half of the runtime this method was taking up.
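For context on the Collection-versus-List constraint: `java.util.Collection` has no `get(int)`, so indexed access is only possible after checking the runtime type. The sketch below shows one way that could look; it is not what this PR does (the for-each loop was kept), and the trailing `*` used to mark a reference allele is a convention assumed only for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.RandomAccess;

final class CollectionIndexSketch {
    // Counts entries flagged as reference (hypothetical "*" suffix convention).
    static int countRefs(final Collection<String> alleles) {
        int refs = 0;
        if (alleles instanceof List && alleles instanceof RandomAccess) {
            // Fast path: constant-time get(i) is available.
            final List<String> list = (List<String>) alleles;
            for (int i = 0, size = list.size(); i < size; i++) {
                if (list.get(i).endsWith("*")) {
                    refs++;
                }
            }
        } else {
            // General path: a plain Collection can only be iterated.
            for (final String allele : alleles) {
                if (allele.endsWith("*")) {
                    refs++;
                }
            }
        }
        return refs;
    }

    public static void main(final String[] args) {
        System.out.println(countRefs(Arrays.asList("A*", "T", "G")));                     // 1
        System.out.println(countRefs(new LinkedHashSet<>(Arrays.asList("A*", "T", "G")))); // 1
    }
}
```

Whether the extra branch pays for itself would need to be measured; it also duplicates the loop body, which is part of the readability cost discussed in this review.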

lbergelson · 2018-12-04T16:25:41Z

src/main/java/htsjdk/variant/variantcontext/VariantContext.java

-            if ( g.isAvailable() ) {
-                for ( Allele gAllele : g.getAlleles() ) {
+        for ( int i = 0; i < genotypes.size(); i++ ) {
+            if ( genotypes.get(i).isAvailable() ) {


Pull out the get(i) as a variable if speed is so essential here.

Shouldn't the inner loop be made indexed too, since that will be the most essential one?

Yes, I missed that. Though it's worth mentioning that most of the improvement in this method came from genotypes.isAvailable, which was more expensive than the inner loop. I'll extract it out anyway since we are here.

I would say this is probably the grossest of the unreadable code after this change...
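A minimal sketch of the doubly indexed shape being discussed, with `get(i)` pulled into a local and both loops indexed. `SketchGenotype` and the allele-counting body are hypothetical; the actual method body in VariantContext is not shown in this thread.

```java
import java.util.Arrays;
import java.util.List;

final class NestedIndexSketch {
    // Hypothetical stand-in for a genotype: an availability flag plus its alleles.
    static final class SketchGenotype {
        private final boolean available;
        private final List<String> alleles;

        SketchGenotype(final boolean available, final List<String> alleles) {
            this.available = available;
            this.alleles = alleles;
        }

        boolean isAvailable() { return available; }
        List<String> getAlleles() { return alleles; }
    }

    // Counts called (non ".") alleles across available genotypes.
    static int countCalledAlleles(final List<SketchGenotype> genotypes) {
        int count = 0;
        for (int i = 0, nGenotypes = genotypes.size(); i < nGenotypes; i++) {
            final SketchGenotype g = genotypes.get(i); // fetched once per outer iteration
            if (g.isAvailable()) {
                final List<String> alleles = g.getAlleles();
                for (int j = 0, nAlleles = alleles.size(); j < nAlleles; j++) {
                    if (!".".equals(alleles.get(j))) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(final String[] args) {
        final List<SketchGenotype> genotypes = Arrays.asList(
                new SketchGenotype(true, Arrays.asList("A", "T")),
                new SketchGenotype(false, Arrays.asList("A", "A")),
                new SketchGenotype(true, Arrays.asList("A", ".")));
        System.out.println(countCalledAlleles(genotypes)); // 3
    }
}
```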

jamesemery · 2018-12-04T17:43:36Z

@lbergelson Responded to your comments. Let me know if there is anything else I should do for the branch. And I am very sorry about the readability cost, the doubly indexed list is a sad change that I wasn't necessarily going to implement but you did want it...

lbergelson

I hate it but it seems like a real, significant, performance improvement.

lbergelson requested changes Dec 4, 2018

lbergelson assigned jamesemery Dec 4, 2018

lbergelson added the "Waiting for revisions" label Dec 4, 2018

jamesemery added 2 commits December 4, 2018 12:42

Replaced several for-each loops in VariantContext.Make() codepath with slightly faster alternatives where it made a difference for the Haplotype Caller.

0b5e3d8

responded to first round of comments

b68d1ac

jamesemery force-pushed the je_speedupVariantContextBuilding branch from 5dca768 to b68d1ac Compare December 4, 2018 17:42

sorcery

363a69e

lbergelson approved these changes Dec 4, 2018

lbergelson merged commit d2360ff into samtools:master Dec 4, 2018
