-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert AF calculation bug and yet another reblocking fix #7670
Conversation
6f61408
to
483ec41
Compare
@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype | |||
if (maxAltAlleles < vc.getAlternateAlleles().size()) { | |||
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles); | |||
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) : | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false); | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These go with reverting the boolean variable to force PLs to be output
@@ -94,7 +94,7 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs, | |||
newLog10GQ = -0.1*g.getGQ(); | |||
} | |||
|
|||
final boolean useNewLikelihoods = !suppressUninformativePLs || | |||
final boolean useNewLikelihoods = forceEmitUninformativePLs || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old conditional wasn't right -- I trusted the IntelliJ refactoring too much. This is also a more logical name.
@@ -77,7 +77,7 @@ public static AlleleFrequencyCalculator makeCalculator(final DragstrParams drags | |||
final double[] log10Likelihoods; | |||
if (g.hasLikelihoods()) { | |||
log10Likelihoods = g.getLikelihoods().getAsVector(); | |||
} else if (g.isHomRef()) { | |||
} else if ( g.isHomRef() || g.isNoCall()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking out the no-call shouldn't have happened in 4.2.5.0, but it did because of poor test coverage. Added new multi-allelic tests to GGVCFs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Back to you with my comments
@@ -55,7 +55,7 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs, | |||
final GenotypePriorCalculator gpc, | |||
final GenotypeAssignmentMethod assignmentMethod, | |||
final int depth, | |||
final boolean suppressUninformativePLs) { | |||
final boolean forceEmitUninformativePLs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check all usages of this method to make sure that the booleans have been flipped when necessary, since this arg now has the opposite meaning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this out instead. The reason it was still there was because I didn't want to change the SelectVariants integration test expected outputs, but that's not a good reason. I want to prefer explicitness and consistency over space savings or whatever the motivation was for dropping PLs sometimes.
@@ -77,7 +77,7 @@ public static AlleleFrequencyCalculator makeCalculator(final DragstrParams drags | |||
final double[] log10Likelihoods; | |||
if (g.hasLikelihoods()) { | |||
log10Likelihoods = g.getLikelihoods().getAsVector(); | |||
} else if (g.isHomRef()) { | |||
} else if ( g.isHomRef() || g.isNoCall()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a brief comment here documenting the reason why hom refs and no calls are handled in the same case here, to discourage our future selves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just too dangerous. One of the integration tests was doing the wrong thing, so who knows what QUAL came out. Instead I still prohibit no-calls here but call genotypes in the internal subsetting so we should always have calls when we have the right information.
@@ -1033,7 +1033,7 @@ private VariantContext subsetRecord(final VariantContext vc, final boolean prese | |||
newGC = sub.getNAlleles() == vc.getNAlleles() ? subGenotypesWithOldAlleles : | |||
AlleleSubsettingUtils.subsetAlleles(subGenotypesWithOldAlleles, 0, vc.getAlleles(), | |||
sub.getAlleles(), null, GenotypeAssignmentMethod.DO_NOT_ASSIGN_GENOTYPES, | |||
vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false); | |||
vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the force emit arg is critical here too, add a comment to that effect
@@ -143,7 +143,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v | |||
* @return true if we have enough info for AF calculation | |||
*/ | |||
public static boolean genotypeIsUsableForAFCalculation(Genotype g) { | |||
return g.hasLikelihoods() || (g.isHomRef() && g.hasGQ() && 2 == g.getPloidy()); | |||
return g.hasLikelihoods() || g.hasGQ() || g.getAlleles().stream().anyMatch(a -> a.isCalled() && a.isNonReference() && !a.isSymbolic()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this condition got reverted to its previous state too -- can you add a brief comment here explaining the logic for posterity?
@@ -1687,7 +1687,7 @@ public static void splitASFilters(VariantContext vc, List<Map<String, Object>> a | |||
genotypeAssignmentMethodUsed != GenotypeAssignmentMethod.SET_TO_NO_CALL) | |||
AlleleSubsettingUtils.addInfoFieldAnnotations(vc, builder, keepOriginalChrCounts); | |||
|
|||
builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, null, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0), true)); | |||
builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, null, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
force emit is false here -- is that correct?
//23 highly multi-allelic sites across 54 1000G exomes to test allele subsetting and QUAL calculation | ||
{getTestFile("multiallelicQualRegression.vcf "), | ||
getTestFile("multiallelicQualRegression.expected.vcf"), | ||
NO_EXTRA_ARGS, hg38Reference} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've confirmed that this new test case fails with 4.2.5.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- it fails like the Hindenburg.
" -do-qual-approx --floor-blocks -GQB 20 -GQB 30 -GQB 40 " + | ||
" --" + StandardArgumentDefinitions.ADD_OUTPUT_VCF_COMMANDLINE + " false", | ||
Arrays.asList(getToolTestDataDir() + "prodWesOutput.g.vcf")); | ||
spec3.executeTest("testWithExactComparison", this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this test method should use a DataProvider
to feed it the three command lines, instead of embedding the three cases within one method like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@droazen there might be some new test failures (which I will address if/when they arise), but this is ready for re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Back to you -- just needs a few additional comments and then I think we should be good, provided the WARP and GATK tests pass
@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype | |||
if (maxAltAlleles < vc.getAlternateAlleles().size()) { | |||
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles); | |||
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) : | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false); | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment documenting why we want to use GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL
here
@@ -391,11 +391,12 @@ boolean isVcCoveredByDeletion(final VariantContext vc) { | |||
* {@link GenotypeLikelihoods#MAX_DIPLOID_ALT_ALLELES_THAT_CAN_BE_GENOTYPED}. | |||
*/ | |||
protected final boolean cannotBeGenotyped(final VariantContext vc) { | |||
// protect against too many alternate alleles that we can't even run AF on: | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fill in or delete empty comment
} else if (g.isHomRef()) { | ||
} else if ( g.isHomRef()) { | ||
//no-call with no PLs seems risky, but there are a few places in the QUAL/AF code where we subset alleles, | ||
// but then leave the genotypes as no-calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems misplaced/confusing, since no-calls with no PLs will now get handled by the else
block below, not this block, and will always produce the Genotype does not contain likelihoods necessary to calculate posteriors
exception, right?
else {
throw new IllegalStateException("Genotype " + g + " does not contain likelihoods necessary to calculate posteriors.");
}
Recommend adding a comment to the else
block below explaining clearly why we want to always throw an IllegalStateException
here for no calls with no PLs.
@@ -568,8 +566,7 @@ VariantContext cleanUpHighQualityVariant(final VariantContext variant) { | |||
if(allelesNeedSubsetting && !keepAllAlts) { | |||
newAlleleSetUntrimmed.removeAll(allelesToDrop); | |||
final GenotypesContext gc = AlleleSubsettingUtils.subsetAlleles(variant.getGenotypes(), genotype.getPloidy(), variant.getAlleles(), | |||
newAlleleSetUntrimmed, null, GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN, | |||
variant.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true); | |||
newAlleleSetUntrimmed, null, GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment documenting why we're using GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN
here instead of the GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL
we're using everywhere else.
@@ -139,6 +139,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v | |||
/** | |||
* Do we have (or can we infer) likelihoods necessary for allele frequency calculation? | |||
* Some reblocked and/or DRAGEN GVCFs omit likelihoods for ref blocks, but we can estimate them | |||
* If GenomicsDB max alt threshold is too low, variants may also be missimg PLs -- reject those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missimg -> missing
…re's no non-ref Reenable EM in multi-allelic QUAL calculation
because that can cause problems
08150df
to
a245499
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Back to you with a couple of final comments
@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype | |||
if (maxAltAlleles < vc.getAlternateAlleles().size()) { | |||
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles); | |||
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) : | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false); | |||
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirming that this should be hardcoded to BEST_MATCH_TO_ORIGINAL
rather than respecting configuration.genotypeArgs.genotypeAssignmentMethod
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, with no PLs no-calls are just going to cause problems, so keep any 0/0 genotypes as such. I added a comment to that extent.
src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/GenotypingEngine.java
Outdated
Show resolved
Hide resolved
} else if (g.isHomRef()) { | ||
} else if ( g.isHomRef()) { | ||
//no-call with no PLs seems risky, but there are a few places in the QUAL/AF code where we subset alleles, | ||
// but then leave the genotypes as no-calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment still seems confusing/misplaced. It talks about handling no-calls in the case that now handles hom-ref (no-calls now fall through to the else
block below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -139,6 +139,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v | |||
/** | |||
* Do we have (or can we infer) likelihoods necessary for allele frequency calculation? | |||
* Some reblocked and/or DRAGEN GVCFs omit likelihoods for ref blocks, but we can estimate them | |||
* If GenomicsDB max alt threshold is too low, variants may also be missimg PLs -- reject those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missimg -> missing
Is this comment still relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed typo. Yes, still relevant. I clarified that here variants means non-reference genotypes, where we can't estimate PLs.
@Test | ||
public void testGDBMaxAltsGreaterThanGGVCFsMaxAlts() throws IOException { | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fill in empty test method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized this case was redundant with testMaxAltsToCombineInGenomicsDB
. Deleted and added comment to testMaxAlts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Merge when tests pass @ldgauthier
The warp pipeline tests caught some cases that we apparently didn't have in our integration tests, but now we do!