
HsMetrics "coverage cap" parameter affects more than theoretical sensitivity calculations #1767

Closed
eboyden opened this issue Dec 22, 2021 · 6 comments



eboyden commented Dec 22, 2021

Bug Report

Affected tool(s)

CollectHsMetrics

Affected version(s)

at least through 2.26.6

Description

The documentation for the --COVERAGE_CAP parameter indicates that it is used for Theoretical Sensitivity calculations, with a default value of 200. In practice, however, it affects several other metrics: Median coverage and Fold-80 are affected (because the 20th percentile is apparently also capped), but Mean coverage and Max coverage are not (I'm not sure about Min coverage).

This is related to these issues:
#938
#1361
as well as this GATK inquiry:
https://gatk.broadinstitute.org/hc/en-us/community/posts/360078460592-Is-there-a-max-value-for-Picard-CollectHsMetrics-MEDIAN-TARGET-COVERAGE-

Steps to reproduce

Run CollectHsMetrics on any dataset in which coverage exceeds COVERAGE_CAP. The calculated Median coverage and Fold-80 values will be wrong, possibly severely, and neither the general CollectHsMetrics documentation nor the documentation of this parameter indicates why.
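
For example, something like the following (paths are placeholders; any aligned BAM with on-target depth above the cap will do):

    java -jar picard.jar CollectHsMetrics \
        I=sample.bam \
        O=hs_metrics.txt \
        R=reference.fasta \
        BAIT_INTERVALS=baits.interval_list \
        TARGET_INTERVALS=targets.interval_list \
        COVERAGE_CAP=200

Comparing MEDIAN_TARGET_COVERAGE and FOLD_80_BASE_PENALTY against a second run with COVERAGE_CAP set well above the real maximum depth makes the discrepancy obvious.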

Expected behavior

Ideally, Median coverage and Fold-80 (and any other affected metrics) should be calculated from the actual data, not the coverage-capped data; the coverage cap should be used only for the theoretical sensitivity calculations, as documented. Alternatively, ALL metrics should be capped, including Mean and Max coverage (though this may often produce a Fold-80 < 1, because the capped Mean will be < COVERAGE_CAP while the 20th percentile may equal COVERAGE_CAP). Either way, the documentation of this parameter should be updated to reflect how the coverage cap is actually implemented and which metrics it affects.

Actual behavior

When real coverage exceeds the coverage cap, the Median coverage is capped but the Mean (and Max) are not, which is confusing. Fold-80 is also typically (and confusingly) much too high, because the 20th percentile is capped as well, so the real Mean coverage is divided by an artificially capped 20th percentile.
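
To make the distortion concrete, here is a minimal toy sketch (my own illustration, not Picard code) using htsjdk's Histogram, the same class whose getMedian()/getPercentile() calls appear in the snippets further down:

    import htsjdk.samtools.util.Histogram;

    public class CoverageCapDemo {
        public static void main(String[] args) {
            final int coverageCap = 200;
            final Histogram<Integer> uncapped = new Histogram<>();
            final Histogram<Integer> capped = new Histogram<>();
            long totalCoverage = 0;

            // Toy target: 500 bases with depths 300..799 (true mean 549.5, true median ~550).
            for (int depth = 300; depth < 800; depth++) {
                totalCoverage += depth;
                uncapped.increment(depth);
                capped.increment(Math.min(depth, coverageCap)); // the cap, applied per base
            }

            final double mean = totalCoverage / 500.0; // 549.5; like Picard, computed from raw counts
            System.out.println("true median:      " + uncapped.getMedian());               // ~550
            System.out.println("reported median:  " + capped.getMedian());                 // 200
            System.out.println("true Fold-80:     " + mean / uncapped.getPercentile(0.2)); // ~1.4
            System.out.println("reported Fold-80: " + mean / capped.getPercentile(0.2));   // ~2.7
        }
    }

Every base lands in the 200 bin of the capped histogram, so the median and the 20th percentile both saturate at the cap while the mean does not.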


eboyden commented Dec 22, 2021

BTW, I didn't check, but I wouldn't be surprised if this also affects CollectTargetedPcrMetrics or any other "Metrics" tools.


eboyden commented Dec 23, 2021

Also, I learned that this bug doesn't affect version 2.7.1 (which also doesn't seem to include MAX and MIN coverage), so the bug was clearly introduced at some point after that, possibly when those additional metrics were added. MAX coverage was introduced in version 2.9.1, and many other modifications and upgrades to CollectHsMetrics seem to have been made around that time, so that is a likely candidate.


gbggrant commented Feb 2, 2022

Thanks @eboyden. If you are up for it, we would welcome a PR to address this issue.


eboyden commented Sep 1, 2023

It looks like this bug was introduced in commit dffec7a, version 2.7.2, when the coverage histogram code was changed.

The problematic code seems to be in TargetMetricsCollector.java. There is already a comment that explicitly acknowledges why highQualityDepthHistogram is not used for the mean coverage (because it imposes a coverage cap), yet the same histogram is still used for the median and percentile coverages (which affects Fold_80):

            // we do this instead of highQualityDepthHistogram.getMean() because the histogram imposes a coverage cap
            metrics.MEAN_TARGET_COVERAGE = (double) totalCoverage / metrics.TARGET_TERRITORY;
            metrics.MEDIAN_TARGET_COVERAGE = highQualityDepthHistogram.getMedian();
            metrics.MAX_TARGET_COVERAGE = maxDepth;
            // Use Math.min() to account for edge case where highQualityCoverageByTarget is empty (minDepth=Long.MAX_VALUE)
            metrics.MIN_TARGET_COVERAGE = Math.min(minDepth, maxDepth);

            // compute the coverage value such that 80% of target bases have better coverage than it i.e. 20th percentile
            // this roughly measures how much we must sequence extra such that 80% of target bases have coverage at least as deep as the current mean coverage
            metrics.FOLD_80_BASE_PENALTY = metrics.MEAN_TARGET_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);
            metrics.ZERO_CVG_TARGETS_PCT = zeroCoverageTargets / (double) allTargets.getIntervals().size();

It also looks like a similar error may be present elsewhere, e.g. in WgsMetrics.java:

        GENOME_TERRITORY = (long) highQualityDepthHistogram.getSumOfValues();
        MEAN_COVERAGE    = highQualityDepthHistogram.getMean();
        SD_COVERAGE      = highQualityDepthHistogram.getStandardDeviation();
        MEDIAN_COVERAGE  = highQualityDepthHistogram.getMedian();
        MAD_COVERAGE     = highQualityDepthHistogram.getMedianAbsoluteDeviation();

        PCT_1X   = MathUtil.sum(depthHistogramArray, 1, depthHistogramArray.length)   / (double) GENOME_TERRITORY;
        PCT_5X   = MathUtil.sum(depthHistogramArray, 5, depthHistogramArray.length)   / (double) GENOME_TERRITORY;
        PCT_10X  = MathUtil.sum(depthHistogramArray, 10, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_15X  = MathUtil.sum(depthHistogramArray, 15, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_20X  = MathUtil.sum(depthHistogramArray, 20, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_25X  = MathUtil.sum(depthHistogramArray, 25, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_30X  = MathUtil.sum(depthHistogramArray, 30, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_40X  = MathUtil.sum(depthHistogramArray, 40, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_50X  = MathUtil.sum(depthHistogramArray, 50, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_60X  = MathUtil.sum(depthHistogramArray, 60, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_70X  = MathUtil.sum(depthHistogramArray, 70, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_80X  = MathUtil.sum(depthHistogramArray, 80, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_90X  = MathUtil.sum(depthHistogramArray, 90, depthHistogramArray.length)  / (double) GENOME_TERRITORY;
        PCT_100X = MathUtil.sum(depthHistogramArray, 100, depthHistogramArray.length) / (double) GENOME_TERRITORY;


        // This roughly measures by how much we must over-sequence so that xx% of bases have coverage at least as deep as the current mean coverage:
        if (highQualityDepthHistogram.getCount() > 0) {
            FOLD_80_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.2);
            FOLD_90_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.1);
            FOLD_95_BASE_PENALTY = MEAN_COVERAGE / highQualityDepthHistogram.getPercentile(0.05);
        } else {
            FOLD_80_BASE_PENALTY = 0;
            FOLD_90_BASE_PENALTY = 0;
            FOLD_95_BASE_PENALTY = 0;
        }
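
Note that in this WGS path the mean itself also comes from the capped histogram, so the fold penalties fail in the opposite direction: when most bases exceed the cap, the mean and the percentile both collapse to the cap value and the penalty reads as ~1 no matter how uneven the coverage really is. A toy illustration (again mine, not Picard code; I believe 250 is CollectWgsMetrics' default cap):

    import htsjdk.samtools.util.Histogram;

    public class WgsCapDemo {
        public static void main(String[] args) {
            final int coverageCap = 250;
            final Histogram<Integer> capped = new Histogram<>();

            // Toy genome: half the bases at depth 300, half at depth 900 (very uneven;
            // true mean = 600, true 20th percentile = 300, so true fold-80 = 2.0).
            for (int i = 0; i < 1000; i++) {
                capped.increment(Math.min(i < 500 ? 300 : 900, coverageCap));
            }

            // Every base lands in the cap bin, so mean == 20th percentile == 250:
            System.out.println(capped.getMean() / capped.getPercentile(0.2)); // 1.0
        }
    }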

Perhaps the solution is simply to revert to the way that median and percentile coverages were calculated as of 2.7.1?
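
For illustration only (a hypothetical sketch of that direction, not the change actually made; all names here are made up, not Picard's), one fix would be to tally a second, uncapped histogram and take the order statistics from it, leaving the capped histogram solely for the theoretical sensitivity calculation:

    import htsjdk.samtools.util.Histogram;

    // Hypothetical sketch: compute median and Fold-80 from uncapped depths.
    public class UncappedOrderStats {
        static double[] medianAndFold80(final int[] perBaseDepths, final int coverageCap,
                                        final double meanFromRawCounts) {
            final Histogram<Integer> cappedHist = new Histogram<>();   // keep for theoretical sensitivity
            final Histogram<Integer> uncappedHist = new Histogram<>(); // use for order statistics
            for (final int depth : perBaseDepths) {
                cappedHist.increment(Math.min(depth, coverageCap));
                uncappedHist.increment(depth);
            }
            final double median = uncappedHist.getMedian();                            // no longer capped
            final double fold80 = meanFromRawCounts / uncappedHist.getPercentile(0.2); // uncapped 20th pct
            return new double[] { median, fold80 };
        }
    }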

@JoeVieira

@gbggrant: @eboyden and I are colleagues at Molecular Loop. I've raised a PR that fixes this issue: #1913


kockan commented Feb 15, 2024

Closed by #1913

kockan closed this as completed Feb 15, 2024