-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HsMetrics "coverage cap" parameter affects more than theoretical sensitivity calculations #1767
Comments
BTW I didn't check but I wouldn't be surprised if this also affects CollectTargetedPcrMetrics or any other "Metrics" tools. |
Also I learned that this bug doesn't affect version 2.7.1 (which also doesn't seem to include MAX and MIN coverage). So clearly this bug was introduced at some point, possibly when those additional metrics were added. MAX coverage was introduced in version 2.9.1 and it seems like many other modifications and upgrades were made to CollectHsMetrics around that time, so it's likely that the bug was introduced around that time. |
Thanks @eboyden if you are up for it, we would welcome a PR to address this issue. |
It looks like this bug was introduced in commit dffec7a, version 2.7.2, when the coverage histogram code was changed. The problematic code seems to be in TargetMetricsCollector.java. There is even already a comment that explicitly exposes the reason that highQualityDepthHistogram is not used for mean coverage (because it imposes a coverage cap), yet it nevertheless is still used for median and percentile coverages (which affects Fold_80).
It also looks like a similar error may be present elsewhere, e.g. in WgsMetrics.java:
Perhaps the solution is to simply revert to the way that median and percentile coverages were calculated as of 2.7.1? |
Closed by #1913 |
Bug Report
Affected tool(s)
CollectHsMetrics
Affected version(s)
at least through 2.26.6
Description
The description of the
--COVERAGE_CAP
parameter indicates that it's used for Theoretical Sensitivity calculations, with a default value of 200. But it actually affects several other metrics; these include Median coverage and Fold-80 (because the 20th percentile is also apparently capped), but not Mean coverage or Max coverage (not sure about Min coverage).This is related to these issues:
#938
#1361
as well as this GATK inquiry:
https://gatk.broadinstitute.org/hc/en-us/community/posts/360078460592-Is-there-a-max-value-for-Picard-CollectHsMetrics-MEDIAN-TARGET-COVERAGE-
Steps to reproduce
Run any dataset with any coverage greater than COVERAGE_CAP. The calculated Median coverage and Fold-80 values will be wrong, possibly severely. The documentation of CollectHsMetrics generally, and this parameter specifically, do not indicate why these values would be wrong.
Expected behavior
Ideally, Median coverage and Fold-80 (and any other affected metrics) should be calculated from the actual data, not coverage-capped data; the coverage cap should only be used for the theoretical sensitivity documentation, as documented. Alternatively, ALL metrics should be affected, including Mean and Max coverage (which may often result in a Fold-80 <1 because the Mean will be < COVERAGE_CAP but the 20th percentile may = COVERAGE_CAP). Either way, the documentation of this parameter should be updated to reflect how the coverage cap is actually implemented and which metrics it affects.
Actual behavior
When real coverage exceeds the coverage cap, the Median coverage is capped but the Mean (and Max) are not, which is confusing. Fold-80 is also typically (and confusingly) much too high, because the 20th percentile is also capped, so a real Mean coverage is divided by an artificially capped 20th percentile.
The text was updated successfully, but these errors were encountered: