Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CollectHSMetrics - Don't use Coverage Cap on fold score #1913

Merged

Conversation

JoeVieira
Copy link
Contributor

@JoeVieira JoeVieira commented Sep 5, 2023

This PR resolves issue #1767

Per documentation COVERAGE CAP was intended to only apply to Theoretical Sensitivity calculations. But it was applied to MEDIAN_COVERAGE, which then applied it to all statistics which that is used for.
https://gatk.broadinstitute.org/hc/en-us/articles/360036856051-CollectHsMetrics-Picard-#--COVERAGE_CAP

Changes with test coverage are

  • Update Histogram directly, rather than via an intermediate data structure
  • Which uses the full density passed, rather than limiting to the CAP
  • Test FOLD_80 score when passing a VERY low cap.
  • Coverage MEAN, MEDIAN & FOLD should still indicate perfect coverage.

@kockan
Copy link
Contributor

kockan commented Sep 12, 2023

Hi @JoeVieira , thanks for this PR! I don't believe the user you've tagged is working on Picard right now so we've asked for a review from someone else.

@JoeVieira
Copy link
Contributor Author

@kockan appreciated.

Copy link
Contributor

@kachulis kachulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix @JoeVieira! Just a few small questions/comments.

Also, have you checked how speed and memory usage are affected by this change? presumably they aren't significant, but since you're replacing an array being indexed into with a map just would be good to confirm that there isn't a massive performance hit.

@JoeVieira
Copy link
Contributor Author

JoeVieira commented Sep 13, 2023

@kachulis Appreciate the feedback! I've updated this based on your feedback. I also shared your concern previously about additional memory usage - now with this change, it's only possible to have reduced memory usage & time.

@kachulis
Copy link
Contributor

I agree about the memory usage now, but I don't think it's guaranteed this will only ever lead to a speedup. We are replacing 10's of millions of long[] operations and 100's of Histogram operations with 10's of millions of Histogram operations. If the Histogram operations are very slow, this could result in a slowdown. I think the risk is quite low, but I just want to make sure this is run on some normal sized data and nothing crazy happens.

@JoeVieira
Copy link
Contributor Author

Fair enough - i'll run some heuristics

@JoeVieira
Copy link
Contributor Author

JoeVieira commented Sep 14, 2023

@kachulis

This isn't a perfect comparison, since the histogram now (correctly) has ~3x as many bins in it as the bugged current version. But at least it shows that performance is on par.

For a bam with 539688 reads:

PR:
[Thu Sep 14 15:22:40 EDT 2023] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 0.21 minutes. Runtime.totalMemory()=1090519040 java -jar picard.jar CollectHsMetrics -I S2.disarmed.sorted.bam -TI -BI -O 14.89s user 0.27s system 116% cpu 13.003 total
Current:
[Thu Sep 14 15:23:06 EDT 2023] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 0.20 minutes. Runtime.totalMemory()=1090519040 java -jar main.jar CollectHsMetrics -I S2.disarmed.sorted.bam -TI -BI -O 15.21s user 0.28s system 122% cpu 12.647 total

@JoeVieira JoeVieira requested a review from kachulis September 18, 2023 15:27
@JoeVieira
Copy link
Contributor Author

@kachulis Anything else before approving?

@kachulis
Copy link
Contributor

Hi @JoeVieira, sorry for the delayed response.

I think the performance test is reasonably convincing, but your comment about there being 3x as many bins in the output histogram lead me to think about some other issues as well. Specifically, writing out the full histogram (instead of the capped histogram) is a change from previous behavior, even compared to before the initial bug was introduced.

Looking into this a bit more, there are a two issues I'm wrestling with. The first is that as currently written this PR will introduce a behavior difference between highQualityDepthHistogram, which will not be capped, and unfilteredDepthHistogram, which will be capped. I think the behavior for both histograms should be the same when written out. However, I will note that unfilteredDepthHistogram is used in the TheoreticalSensitivity calculation, so naively uncapping it would uncap that calculation as well.

The other issue is that (and this is a long standing issue, not something introduced in this PR), there seems to be some differences between how COVERAGE_CAP is utilized in CollectWgsMetrics vs in CollectHsMetrics (and other targeted metrics). In wgs, the metrics are truly calculated in a capped fashion, and this is specified well in the documentation.

@Argument(shortName = "CAP", doc = "Treat positions with coverage exceeding this value as if they had coverage at this value (but calculate the difference for PCT_EXC_CAPPED).")
)

However, for targeted metrics, the situation seems a bit messier. It looks like, historically, target metrics were initially uncapped, and then some (such and median target coverage and the fold_x metrics) but not all became capped, and this PR would then uncap them again. I think this different behavior between wgs and targeted is reasonable, since for wgs very high coverage spikes are likely artifactual, while for targeted sequencing very high coverage is often the entire point (@yfarjoun do you remember if this was the intention?). It is a bit problematic, IMO, that the same parameter name (COVERAGE_CAP) is used in different ways for similar tools, but changing a parameter name that has been in use for years is rough, and anyway that's not the responsibility of your PR.

After all that, my current feeling is we have three different places in targeted metrics that are relevant to your PR, where we need to decide whether the calculation should be coverage capped or not:

  1. Theoretical Sensitivity
  2. Other coverage metrics (MEAN/MEDIAN_TARGET_COVERAGE, FOLD_80_BASE_PENALTY, etc)
  3. histograms as written to metrics files

currently, 1 and 3 are capped, while 2 is a mix of capped and uncapped (although originally was all uncapped). I think from the perspective of this PR, this best option is to leave 1 capped, make 2 uncapped, and then take your pick on 3 (I think either choice is reasonable). I think this means you just need to align the behaviors of the two histograms, while simultaneously keeping 1 capped and 2 uncapped, and then I will feel comfortable merging this.

@yfarjoun @takutosato do either of you have any thoughts about the COVERAGE_CAP behavior in CollectHsMetrics?

@JoeVieira
Copy link
Contributor Author

JoeVieira commented Sep 29, 2023

@kachulis I totally agree with this - this region of the code really became tangled a few years back.

Does anyone on this thread understand the intention of using the unfiltered data for the theoretical ( simply to get "all" data for simulation i presumed? )

The capped data set being called unfiltered only doesn't seem right, because it's also normalized.

Do we want to have the (uncapped, raw) unfiltered & high quality histograms outputted?

Do we need to have the histogram used for theoretical calculations also output?

Speaking of - is it correct to filter the datas used for (MEAN/MEDIAN_TARGET_COVERAGE, FOLD_80_BASE_PENALTY, etc) to just the "high quality coverage" ?

Edited based on reviewing the code again & updated my thinking on this.

@JoeVieira
Copy link
Contributor Author

I would propose

1.) Theoretical can no be uncapped - as that defeats it's purpose.
2.) Each histogram should be outputted ( Uncapped HighQuality, Uncapped Unflitered )
3.) outputting the exact histograms used for the stat generation seems important as well & therefore we should also output a new Capped Unfiltered Histograph called "Theoretical Capped Unfiltered ... "

I'm happy to update this PR to do so, if folks are in agreement.

@JoeVieira
Copy link
Contributor Author

@kachulis, @lbergelson Any thoughts here? I'm happy to update with the above proposed logic. I would love to wrap this up.

@kachulis
Copy link
Contributor

@JoeVieira sorry for the delayed response, have gotten distracted by other things.

I don't think the new histogram would be necessary, since the uncapped histogram contains all the information in the capped histogram (and the metrics would be calculated based on the uncapped data after this PR, wouldn't they).

For understanding the values used to calculate theoretical sensitivity the uncapped histogram can be easily converted into capped just by collapsing the top bins, so I would prefer to leave that as something a user could do themselves if they are investigating some data over adding another histogram.

otherwise I think your proposals are reasonable.

@eboyden
Copy link

eboyden commented Jan 25, 2024

Bumping this - we're very eager for a fixed version of CollectHsMetrics that outputs correct median coverage and Fold-80 scores.

@kockan
Copy link
Contributor

kockan commented Jan 29, 2024

Hi @eboyden , I'd like to get this merged as soon as possible as well. I was going through the previous comments and it seems like @kachulis was happy with the changes proposed by @JoeVieira , except the addition of a new capped unfiltered histogram. If @JoeVieira would like to make these changes I could ask for a quick re-review.

create unfiltered, uncapped histograph for output
close TODO of normalizing array directly, to avoid the extra flip between array & histograph.
@eboyden
Copy link

eboyden commented Jan 30, 2024

Appreciate everyone's help on this. For the Picard developers: this would be outside the scope of this PR, but I suggest looking into whether CollectTargetedPcrMetrics or CollectWgsMetrics or any other Collect...Metrics tools are affected by this same bug. Seems like at least CollectWgsMetrics might be affected, based on a brief investigation I did for the original bug report #1767

keep capped data for theoretical stats
move calculation of min / max depth into base loading
ensure histograms are not sparsely populated
@kockan
Copy link
Contributor

kockan commented Jan 31, 2024

Thanks a lot @eboyden ! If this is a bug that affects multiple metric collection tools, they should definitely be fixed as well. I'll relay your findings to the team and find out what can be done.

@JoeVieira
Copy link
Contributor Author

Okay all - i've updated this a lot, a lot of threads going on in this code.
I've made a solid effort to clean up where i touched.

This builds histograms directly, and ensured they aren't sparsely populated - which using getDepths directly would cause, this removes the need for another loop over the coverage data

The changes to output uncapped histograms does cause us to create an additional histograms, the quality histogram needs to be uncapped. The unfiltered coverage histogram is never outputted, so doesn't need to be created as an uncapped histogram.

@kockan @kachulis - lmk what you think - this was a bit of a rats nest to solve.

WRT @eboyden 's question - I'm not clear on why WGSmetris would be impacted by this specific bug, since it's a different code path - but a similar logical issue might exist in that collector

@kockan
Copy link
Contributor

kockan commented Jan 31, 2024

@JoeVieira Thanks for the updates! Looks good to me, but I am not as familiar with the intricacies of this tool as @kachulis (who returns in about a week and a half), so just to be safe I'd like to wait for his final review before merging this.

@JoeVieira
Copy link
Contributor Author

@kachulis @kockan - Hoping everyone is back from holiday & can have a look?

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kachulis is out on leave for a few more weeks so I took a look at this. I tried to follow the conversation and I think this PR does what was agreed upon in the comments, so 👍

@kockan kockan removed the request for review from kachulis February 15, 2024 18:46
@kockan kockan merged commit 6d33c94 into broadinstitute:master Feb 15, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants