-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Obsolete hard-coded default base quals for overlapping mates #4958
Comments
first of all, this looks like a nice easter-egg to find! congrats! I agree with you that this is a real problem. I don't understand the logic, why HALF of PCR_ERROR_QUAL? if that's really 20, then it's way too low!! I also object to the second part (when the bases disagree.) imagine that one base is A@Q2 and the other is T@Q60...why would you put both bases to Q0 in that case? I think that when the bases agree they should essentially add up, but no more than "PCR_ERROR" (but not lose q-score if they are already higher then PCR_ERROR). If they differ, the bigger one should lose the points that the lower one has.....though that doesn't take into account the PCR error which needs to be thought out more carefully. We should take some time to figure out the model that allows for PCR error and then derive the posterior posteriors from that.... |
The intent of this code is that downstream code will not do anything special to avoid over-counting
Good point.
Are you suggesting something like there are binary indicators for PCR error, read 1 sequencing error, read 2 sequencing error, with priors given by the PCR and base qualities, and we want the posteriors of these indicators given that the bases agree / disagree? |
Yes. We should expect that PCR error shouldn't affect the base-quality, so two high quality, disagreeing bases are an indication of a PCR error, while one low-quality base, and one high quality base that have differing qualities looks more like a sequencing error. We might be able to obtain a data-driven model for that using the overlapping bases themselves (over monomorphic sites). The only problem is that this is only true when the reads haven't been processed by Consensus calling....but if we have a good model for consensus calling within haplotype caller we could avoid doing that upfront and simply deal with everything within haplotype caller. That would be ideal! |
Time ago, I added a method to fix the qualities of overlapping mates in the same way as gatk/src/main/java/org/broadinstitute/hellbender/utils/pileup/ReadPileup.java Lines 310 to 349 in 8d929ed
Maybe you can have a look to it as a starting point. Although I am not sure about which one is the logic for the agreement between bases (sum of qualities) or disagreement ( |
As a user of HaplotypeCaller, I've found that the GQs for homozygous genotypes can be significantly overestimated (by as much as a factor of two, on phred scale) in the presence of overlapping read pairs. The fragment gets double-counted (once for each read) in the genotype likelihood calculation, as HaplotypeCaller doesn't seem to apply anything like the |
Every overlapping read pair in HaplotypeCaller and Mutect2
goes through
FragmentUtils.adjustQualsOfOverlappingPairedFragments
, which has the following hard-coded logic:This makes sense -- we modify the quals to reflect that sequencing errors are independent but PCR errors are not. However,
HALF_OF_DEFAULT_PCR_ERROR_QUAL
is 20, so if we start out with high-quality UMI reads we basically kill the quality and with it our SNV sensitivity. Note that in the very important application of cfDNA basically every base is an overlap.What's worse is that Mutect2 later throws out one of the reads, so that we could start out with two BQ = 60 reads and end up with one BQ = 20 read!
@fleharty @yfarjoun This is a problem. I could just make the PCR error quality an adjustable parameter, but can you think of anything else?
The text was updated successfully, but these errors were encountered: