Obsolete hard-coded default base quals for overlapping mates #4958

davidbenjamin · 2018-06-27T18:06:15Z

Every overlapping read pair in HaplotypeCaller and Mutect2
goes through FragmentUtils.adjustQualsOfOverlappingPairedFragments, which has the following hard-coded logic:

if ( firstReadBase == secondReadBase ) {
   firstReadQuals[firstReadIndex] = Math.min(firstReadQuals[firstReadIndex], HALF_OF_DEFAULT_PCR_ERROR_QUAL);
   secondReadQuals[i] = Math.min(secondReadQuals[i], HALF_OF_DEFAULT_PCR_ERROR_QUAL);
} else {
   firstReadQuals[firstReadIndex] = 0;
   secondReadQuals[i] = 0;
}

This makes sense -- we modify the quals to reflect that sequencing errors are independent but PCR errors are not. However, HALF_OF_DEFAULT_PCR_ERROR_QUAL is 20, so if we start out with high-quality UMI reads we basically kill the quality and with it our SNV sensitivity. Note that in the very important application of cfDNA basically every base is an overlap.

What's worse is that Mutect2 later throws out one of the reads, so that we could start out with two BQ = 60 reads and end up with one BQ = 20 read!

@fleharty @yfarjoun This is a problem. I could just make the PCR error quality an adjustable parameter, but can you think of anything else?

The text was updated successfully, but these errors were encountered:

yfarjoun · 2018-06-27T18:55:11Z

first of all, this looks like a nice easter-egg to find! congrats!

I agree with you that this is a real problem. I don't understand the logic, why HALF of PCR_ERROR_QUAL? if that's really 20, then it's way too low!!

I also object to the second part (when the bases disagree.) imagine that one base is A@Q2 and the other is T@Q60...why would you put both bases to Q0 in that case?

I think that when the bases agree they should essentially add up, but no more than "PCR_ERROR" (but not lose q-score if they are already higher then PCR_ERROR). If they differ, the bigger one should lose the points that the lower one has.....though that doesn't take into account the PCR error which needs to be thought out more carefully. We should take some time to figure out the model that allows for PCR error and then derive the posterior posteriors from that....

davidbenjamin · 2018-06-27T19:20:16Z

I agree with you that this is a real problem. I don't understand the logic, why HALF of PCR_ERROR_QUAL? if that's really 20, then it's way too low!!

The intent of this code is that downstream code will not do anything special to avoid over-counting
overlapping mates, so that assigning half of the PCR qual effectively gives the full PCR qual for the fragment. Of course, this assumption is wrong in the case of M2.

I also object to the second part (when the bases disagree.) imagine that one base is A@Q2 and the other is T@Q60...why would you put both bases to Q0 in that case?

Good point.

We should take some time to figure out the model that allows for PCR error and then derive the posterior posteriors from that...

Are you suggesting something like there are binary indicators for PCR error, read 1 sequencing error, read 2 sequencing error, with priors given by the PCR and base qualities, and we want the posteriors of these indicators given that the bases agree / disagree?

yfarjoun · 2018-06-27T20:19:49Z

Yes. We should expect that PCR error shouldn't affect the base-quality, so two high quality, disagreeing bases are an indication of a PCR error, while one low-quality base, and one high quality base that have differing qualities looks more like a sequencing error. We might be able to obtain a data-driven model for that using the overlapping bases themselves (over monomorphic sites).

The only problem is that this is only true when the reads haven't been processed by Consensus calling....but if we have a good model for consensus calling within haplotype caller we could avoid doing that upfront and simply deal with everything within haplotype caller. That would be ideal!

magicDGS · 2018-06-28T09:01:57Z

Time ago, I added a method to fix the qualities of overlapping mates in the same way as samtools because I needed it for other reasons:

gatk/src/main/java/org/broadinstitute/hellbender/utils/pileup/ReadPileup.java

Lines 310 to 349 in 8d929ed

    
               /** 
        
                * Fixes the quality of two elements that come from an overlapping pair in the same way as 
        
                * samtools does {@see tweak_overlap_quality function in 
        
                * <a href="https://github.com/samtools/htslib/blob/master/sam.c">samtools</a>}. 
        
                * The only difference with the samtools API is the cap for high values ({@link QualityUtils#MAX_SAM_QUAL_SCORE}). 
        
                */ 
        
               @VisibleForTesting 
        
               static void fixPairOverlappingQualities(final PileupElement firstElement, 
        
                                                       final PileupElement secondElement) { 
        
                   // only if they do not represent deletions 
        
                   if (!secondElement.isDeletion() && !firstElement.isDeletion()) { 
        
                       final byte[] firstQuals = firstElement.getRead().getBaseQualities(); 
        
                       final byte[] secondQuals = secondElement.getRead().getBaseQualities(); 
        
                       if (firstElement.getBase() == secondElement.getBase()) { 
        
                           // if both have the same base, extra confidence in the firts of them 
        
                           firstQuals[firstElement.getOffset()] = 
        
                                   (byte) (firstQuals[firstElement.getOffset()] + secondQuals[secondElement 
        
                                           .getOffset()]); 
        
                           // cap to maximum byte value 
        
                           if (firstQuals[firstElement.getOffset()] < 0 
        
                                   || firstQuals[firstElement.getOffset()] > QualityUtils.MAX_SAM_QUAL_SCORE) { 
        
                               firstQuals[firstElement.getOffset()] = QualityUtils.MAX_SAM_QUAL_SCORE; 
        
                           } 
        
                           secondQuals[secondElement.getOffset()] = 0; 
        
                       } else { 
        
                           // if not, we lost confidence in the one with higher quality 
        
                           if (firstElement.getQual() >= secondElement.getQual()) { 
        
                               firstQuals[firstElement.getOffset()] = 
        
                                       (byte) (SAMTOOLS_OVERLAP_LOW_CONFIDENCE * firstQuals[firstElement.getOffset()]); 
        
                               secondQuals[secondElement.getOffset()] = 0; 
        
                           } else { 
        
                               secondQuals[secondElement.getOffset()] = 
        
                                       (byte) (SAMTOOLS_OVERLAP_LOW_CONFIDENCE * secondQuals[secondElement.getOffset()]); 
        
                               firstQuals[firstElement.getOffset()] = 0; 
        
                           } 
        
                       } 
        
                       firstElement.getRead().setBaseQualities(firstQuals); 
        
                       secondElement.getRead().setBaseQualities(secondQuals); 
        
                   } 
        
               }

Maybe you can have a look to it as a starting point. Although I am not sure about which one is the logic for the agreement between bases (sum of qualities) or disagreement (SAMTOOLS_OVERLAP_LOW_CONFIDENCE factor), I think that what they are doing is more reasonable the FragmentUtils.adjustQualsOfOverlappingPairedFragments.

gmagoon · 2018-11-14T06:34:16Z

As a user of HaplotypeCaller, I've found that the GQs for homozygous genotypes can be significantly overestimated (by as much as a factor of two, on phred scale) in the presence of overlapping read pairs. The fragment gets double-counted (once for each read) in the genotype likelihood calculation, as HaplotypeCaller doesn't seem to apply anything like the filterOverlappingReads step used in Mutect2. I haven't been able to figure out the rationale behind the FragmentUtils.adjustQualsOfOverlappingPairedFragments approach, at least in the context of the genotype likelihood calculation (maybe it helps somehow for the preceding read likelihood calculation). The approach of @magicDGS / samtools makes much more sense to me, as does the filterOverlappingReads mentioned by @davidbenjamin apparently used in Mutect2, and the separate consideration of overlapping reads when calculating genotype likelihoods apparently used in UnifiedGenotyper.

davidbenjamin added question HaplotypeCaller bogus Mutect labels Jun 27, 2018

davidbenjamin added this to the Mutect 3 milestone Jun 27, 2018

gmagoon mentioned this issue Nov 20, 2018

HaplotypeCaller homozygous call GQ is overreported in the presence of overlapping read pairs #5436

Open

davidbenjamin mentioned this issue Feb 19, 2019

Overhaul of Mutect2 filtering #5688

Merged

davidbenjamin self-assigned this Feb 25, 2019

davidbenjamin mentioned this issue Mar 13, 2019

M2 modifies base quals instead of discarding overlapping read pairs #5794

Merged

davidbenjamin closed this as completed in #5794 Mar 13, 2019

sleyn mentioned this issue Sep 6, 2023

Mutect2 drops MBQ to 20 in overlapped reads lima1/PureCN#320

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Obsolete hard-coded default base quals for overlapping mates #4958

Obsolete hard-coded default base quals for overlapping mates #4958

davidbenjamin commented Jun 27, 2018

yfarjoun commented Jun 27, 2018

davidbenjamin commented Jun 27, 2018

yfarjoun commented Jun 27, 2018

magicDGS commented Jun 28, 2018

gmagoon commented Nov 14, 2018

Obsolete hard-coded default base quals for overlapping mates #4958

Obsolete hard-coded default base quals for overlapping mates #4958

Comments

davidbenjamin commented Jun 27, 2018

yfarjoun commented Jun 27, 2018

davidbenjamin commented Jun 27, 2018

yfarjoun commented Jun 27, 2018

magicDGS commented Jun 28, 2018

gmagoon commented Nov 14, 2018