
HsMetrics "coverage cap" parameter affects more than theoretical sensitivity calculations #1767

Closed
eboyden opened this issue Dec 22, 2021 · 6 comments



eboyden commented Dec 22, 2021

Bug Report

Affected tool(s)

CollectHsMetrics

Affected version(s)

at least through 2.26.6

Description

The documentation for the --COVERAGE_CAP parameter indicates that it is used for Theoretical Sensitivity calculations, with a default value of 200. In practice, however, it affects several other metrics: Median coverage and Fold-80 are affected (because the 20th percentile is apparently also capped), but Mean coverage and Max coverage are not (I'm not sure about Min coverage).

This is related to these issues:
#938
#1361
as well as this GATK inquiry:
https://gatk.broadinstitute.org/hc/en-us/community/posts/360078460592-Is-there-a-max-value-for-Picard-CollectHsMetrics-MEDIAN-TARGET-COVERAGE-

Steps to reproduce

Run CollectHsMetrics on any dataset in which coverage exceeds COVERAGE_CAP. The calculated Median coverage and Fold-80 values will be wrong, possibly severely, and neither the general CollectHsMetrics documentation nor the documentation of this parameter indicates why.
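
For example, something like the following (paths are placeholders; any aligned BAM with on-target depth above the cap will do):

    java -jar picard.jar CollectHsMetrics \
        I=sample.bam \
        O=hs_metrics.txt \
        R=reference.fasta \
        BAIT_INTERVALS=baits.interval_list \
        TARGET_INTERVALS=targets.interval_list \
        COVERAGE_CAP=200

Comparing MEDIAN_TARGET_COVERAGE and FOLD_80_BASE_PENALTY against a second run with COVERAGE_CAP set well above the real maximum depth makes the discrepancy obvious.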

Expected behavior

Ideally, Median coverage and Fold-80 (and any other affected metrics) should be calculated from the actual data, not the coverage-capped data; the coverage cap should be used only for the theoretical sensitivity calculations, as documented. Alternatively, ALL metrics should be capped, including Mean and Max coverage (though this may often produce a Fold-80 < 1, because the capped Mean will be < COVERAGE_CAP while the 20th percentile may equal COVERAGE_CAP). Either way, the documentation of this parameter should be updated to reflect how the coverage cap is actually implemented and which metrics it affects.

Actual behavior

When real coverage exceeds the coverage cap, the Median coverage is capped but the Mean (and Max) are not, which is confusing. Fold-80 is also typically (and confusingly) much too high, because the 20th percentile is capped as well, so the real Mean coverage is divided by an artificially capped 20th percentile.
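
To make the distortion concrete, here is a minimal toy sketch (my own illustration, not Picard code) using htsjdk's Histogram, the same class whose getMedian()/getPercentile() calls appear in the snippets further down:

    import htsjdk.samtools.util.Histogram;

    public class CoverageCapDemo {
        public static void main(String[] args) {
            final int coverageCap = 200;
            final Histogram<Integer> uncapped = new Histogram<>();
            final Histogram<Integer> capped = new Histogram<>();
            long totalCoverage = 0;

            // Toy target: 500 bases with depths 300..799 (true mean 549.5, true median ~550).
            for (int depth = 300; depth < 800; depth++) {
                totalCoverage += depth;
                uncapped.increment(depth);
                capped.increment(Math.min(depth, coverageCap)); // the cap, applied per base
            }

            final double mean = totalCoverage / 500.0; // 549.5; like Picard, computed from raw counts
            System.out.println("true median:      " + uncapped.getMedian());               // ~550
            System.out.println("reported median:  " + capped.getMedian());                 // 200
            System.out.println("true Fold-80:     " + mean / uncapped.getPercentile(0.2)); // ~1.4
            System.out.println("reported Fold-80: " + mean / capped.getPercentile(0.2));   // ~2.7
        }
    }

Every base lands in the 200 bin of the capped histogram, so the median and the 20th percentile both saturate at the cap while the mean does not.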


eboyden commented Dec 22, 2021

BTW, I didn't check, but I wouldn't be surprised if this also affects CollectTargetedPcrMetrics or any other "Metrics" tools.


eboyden commented Dec 23, 2021

Also, I learned that this bug doesn't affect version 2.7.1 (which also doesn't seem to include MAX and MIN coverage), so the bug was clearly introduced at some point after that, possibly when those additional metrics were added. MAX coverage was introduced in version 2.9.1, and many other modifications and upgrades to CollectHsMetrics seem to have been made around that time, so that is a likely candidate.


gbggrant commented Feb 2, 2022

Thanks @eboyden. If you are up for it, we would welcome a PR to address this issue.


eboyden commented Sep 1, 2023

It looks like this bug was introduced in commit dffec7a, version 2.7.2, when the coverage histogram code was changed.

The problematic code seems to be in TargetMetricsCollector.java. There is already a comment that explicitly acknowledges why highQualityDepthHistogram is not used for the mean coverage (because it imposes a coverage cap), yet the same histogram is still used for the median and percentile coverages (which affects Fold_80):

            // we do this instead of highQualityDepthHistogram.getMean() because the histogram imposes a coverage cap
            metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY;
            metrics.MEDIAN_TARGET_COVERAGE = highQualityDepthHistogram.getMedian();
            metrics.MAX_TARGET_COVERAGE = maxDepth;
            // Use Math.min() to account for edge case where highQualityCoverageByTarget is empty (minDepth=Long.MAX_VALUE)
            metrics.MIN_TARGET_COVERAGE = Math.min(minDepth, maxDepth);

            // compute the coverage value such that 80% of target bases have better coverage than it i.e. 20th percentile
            // this roughly measures how much we must sequence extra such that 80% of target bases have coverage at least as deep as the current mean coverage
            metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);
            metrics.ZERO_CVG_TARGETS_PCT = zeroCoverageTargets / (double) allTargets.getIntervals().size();

It also looks like a similar error may be present elsewhere, e.g. in WgsMetrics.java:

        GENOME_TERRITORY = (long) highQualityDepthHistogram.getSumOfValues();
        MEAN_COVERAGE    = highQualityDepthHistogram.getMean();
        SD_COVERAGE      = highQualityDepthHistogram.getStandardDeviation();
        MEDIAN_COVERAGE  = highQualityDepthHistogram.getMedian();
        MAD_COVERAGE     = highQualityDepthHistogram.getMedianAbsoluteDeviation();

        PCT_1X   = MathUtil.sum(depthHistogramArray, 1, depthHistogramArray.length)   / (double) GENOME_TERRITORY;
        PCT_5X   = MathUtil.sum(depthHistogramArray, 5, depthHistogramArray.length)   / (double) GENOME_TERRITORY;
        PCT_10X  = MathUtil.sum(depthHistogramArray, 10, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_15X  = MathUtil.sum(depthHistogramArray, 15, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_20X  = MathUtil.sum(depthHistogramArray, 20, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_25X  = MathUtil.sum(depthHistogramArray, 25, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_30X  = MathUtil.sum(depthHistogramArray, 30, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_40X  = MathUtil.sum(depthHistogramArray, 40, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_50X  = MathUtil.sum(depthHistogramArray, 50, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_60X  = MathUtil.sum(depthHistogramArray, 60, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_70X  = MathUtil.sum(depthHistogramArray, 70, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_80X  = MathUtil.sum(depthHistogramArray, 80, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_90X  = MathUtil.sum(depthHistogramArray, 90, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_100X = MathUtil.sum(depthHistogramArray, 100, depthHistogramArray.length) / (double) GENOME_TERRITORY;


        // This roughly measures by how much we must over-sequence so that xx% of bases have coverage at least as deep as the current mean coverage:
        if (highQualityDepthHistogram.getCount() > 0) {
            FOLD_80_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);
            FOLD_90_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.1);
            FOLD_95_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.05);
        } else {
            FOLD_80_BASE_PENALTY = 0;
            FOLD_90_BASE_PENALTY = 0;
            FOLD_95_BASE_PENALTY = 0;
        }
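
Note that in this WGS path the mean itself also comes from the capped histogram, so the fold penalties fail in the opposite direction: when most bases exceed the cap, the mean and the percentile both collapse to the cap value and the penalty reads as ~1 no matter how uneven the coverage really is. A toy illustration (again mine, not Picard code; I believe 250 is CollectWgsMetrics' default cap):

    import htsjdk.samtools.util.Histogram;

    public class WgsCapDemo {
        public static void main(String[] args) {
            final int coverageCap = 250;
            final Histogram<Integer> capped = new Histogram<>();

            // Toy genome: half the bases at depth 300, half at depth 900 (very uneven;
            // true mean = 600, true 20th percentile = 300, so true fold-80 = 2.0).
            for (int i = 0; i < 1000; i++) {
                capped.increment(Math.min(i < 500 ? 300 : 900, coverageCap));
            }

            // Every base lands in the cap bin, so mean == 20th percentile == 250:
            System.out.println(capped.getMean() / capped.getPercentile(0.2)); // 1.0
        }
    }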

Perhaps the solution is simply to revert to the way that median and percentile coverages were calculated as of 2.7.1?
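
For illustration only (a hypothetical sketch of that direction, not the change actually made; all names here are made up, not Picard's), one fix would be to tally a second, uncapped histogram and take the order statistics from it, leaving the capped histogram solely for the theoretical sensitivity calculation:

    import htsjdk.samtools.util.Histogram;

    // Hypothetical sketch: compute median and Fold-80 from uncapped depths.
    public class UncappedOrderStats {
        static double[] medianAndFold80(final int[] perBaseDepths, final int coverageCap,
                                        final double meanFromRawCounts) {
            final Histogram<Integer> cappedHist = new Histogram<>();   // keep for theoretical sensitivity
            final Histogram<Integer> uncappedHist = new Histogram<>(); // use for order statistics
            for (final int depth : perBaseDepths) {
                cappedHist.increment(Math.min(depth, coverageCap));
                uncappedHist.increment(depth);
            }
            final double median = uncappedHist.getMedian();                            // no longer capped
            final double fold80 = meanFromRawCounts / uncappedHist.getPercentile(0.2); // uncapped 20th pct
            return new double[] { median, fold80 };
        }
    }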

@JoeVieira

@gbggrant: @eboyden and I are colleagues at Molecular Loop. I've raised a PR that fixes this issue: #1913


kockan commented Feb 15, 2024

Closed by #1913

kockan closed this as completed Feb 15, 2024