Add XGBoostGqVariantFilter, a tool to recalibrate GQ #7705

Open: wants to merge 2 commits into base: master

TedBrookings
Contributor

  • Added MinGqVariantFilterBase
    • loads VCF, pedigree, UCSC genome tract, and truth data
    • calculates variant overlap with genome tracts
    • forms matrices, tensors, and other helping data for machine learning
    • provides for TRAIN and FILTER modes
    • provides functions for calculating loss given assigned min GQ values
    • computes best estimate of truth data used for training xgboost model
  • Added XGBoostMinGqVariantFilter
    • calculates new GQ based on gradient boosting
  • Added PropertiesTable for loading VCF properties into tensors
  • Added TractOverlapDetector for computing overlap properties with
    UCSC genome tracts
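
The overlap property mentioned above can be illustrated with a minimal sketch. It is not the TractOverlapDetector implementation: the function name `overlap_fraction`, the half-open coordinates, and the assumption that the intervals are sorted and non-overlapping are all hypothetical choices made for illustration.

```python
from bisect import bisect_right


def overlap_fraction(start, end, tracks):
    """Fraction of the half-open interval [start, end) covered by `tracks`.

    Assumption: `tracks` is a list of non-overlapping, half-open (start, end)
    intervals on one contig, sorted by start position.
    """
    if end <= start:
        raise ValueError("interval must have positive length")
    starts = [s for s, _ in tracks]
    # Begin at the last track that starts at or before `start`; any earlier
    # track ends before it and so cannot overlap the query.
    i = max(bisect_right(starts, start) - 1, 0)
    covered = 0
    while i < len(tracks) and tracks[i][0] < end:
        s, e = tracks[i]
        covered += max(0, min(end, e) - max(start, s))
        i += 1
    return covered / (end - start)


# Hypothetical track intervals, for illustration only.
example_tracks = [(100, 200), (300, 400)]
half_covered = overlap_fraction(150, 350, example_tracks)
```

A variant spanning [150, 350) overlaps 50 bases of each example interval, so half of its length is covered.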

Training loss is based on weighted combination of heredity and truth
data, broken down by variant category.
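
One way to read "weighted combination ... broken down by variant category" is sketched below. The weighting scheme, the equal averaging across categories, the category labels, and the names `combined_loss` and `truth_weight` are assumptions made for illustration; the PR description does not specify the actual formula.

```python
def combined_loss(per_category, truth_weight=0.5):
    """Combine heredity-based and truth-based losses across variant categories.

    Assumptions (not taken from the PR): `per_category` maps a category name
    to a pair (inheritance_loss, truth_loss), the two losses are mixed with a
    single weight, and categories are averaged with equal weight.
    """
    if not 0.0 <= truth_weight <= 1.0:
        raise ValueError("truth_weight must be in [0, 1]")
    if not per_category:
        raise ValueError("need at least one variant category")
    total = 0.0
    for inheritance_loss, truth_loss in per_category.values():
        total += truth_weight * truth_loss + (1.0 - truth_weight) * inheritance_loss
    return total / len(per_category)


# Hypothetical category names and loss values.
example_losses = {"DEL_small": (0.2, 0.4), "DUP_large": (0.6, 0.0)}
example_total = combined_loss(example_losses, truth_weight=0.5)
```

Averaging per category rather than per variant keeps a rare category from being swamped by a common one, which is one plausible motivation for breaking the loss down this way.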

@TedBrookings TedBrookings requested a review from mwalker174 March 2, 2022 17:46

TedBrookings commented Mar 2, 2022

There's a lot of stuff that I know is wrong here:

  1. This is based on a master that's super out of date (I don't want to rebase at this juncture, because I'd need to stop and verify that behavior didn't change due to something else changing in GATK)
  2. No unit tests. Up to this point, the basic structure has been changing a lot. It should be pretty well settled now though.
  3. Probably the main classes should be renamed to indicate that they are recalibrating GQ, not just filtering.
  4. I should probably put in a soft-filter option (just recalibrate GQ, don't set GT to no-call)
  5. Probably the output should be called something other than GQ. Phred-scaling is a bad match to probabilities near 50%, but people expect GQ to be Phred-scaled.
  6. Many of the default values are set at non-optimal values. I didn't want to rebuild the docker image each time I tweaked values, so those were tweaked from WDL settings instead. They should be set to something resembling "optimal" before final merge.
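
To make item 5 concrete, here is a small sketch comparing Phred scaling with logit (log-odds) scaling of the probability that a genotype is correct. The function names are hypothetical, and the logit here is the plain natural-log odds; any scaling factor the tool applies to its output is not shown.

```python
import math


def phred(p_correct):
    """Phred-scaled quality: -10 * log10 of the error probability."""
    return -10.0 * math.log10(1.0 - p_correct)


def logit(p_correct):
    """Natural-log odds that the call is correct (symmetric around 50%)."""
    return math.log(p_correct / (1.0 - p_correct))


# Print both scales side by side for a few probabilities.
for p in (0.1, 0.3, 0.5, 0.9, 0.99):
    print(f"p={p:<5} phred={phred(p):6.2f} logit={logit(p):6.2f}")
```

On the Phred scale, every probability at or below 50% lands between 0 and about 3, so integer-rounded values cannot distinguish a likely-wrong call from a coin flip. The logit scale is zero at 50% and symmetric on either side, which spreads that region out.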

@TedBrookings TedBrookings force-pushed the tb_recalibrate_gq branch 3 times, most recently from 780af4f to 1b19b2e Compare July 5, 2022 14:09

gatk-bot commented Jul 5, 2022

Github actions tests reported job failures from actions build 2616727886
Failures in the following jobs:

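| Test Type      | JDK | Job ID        | Logs |
|----------------|-----|---------------|------|
| cloud          | 8   | 2616727886.10 | logs |
| unit           | 8   | 2616727886.1  | logs |
| conda          | 8   | 2616727886.3  | logs |
| variantcalling | 8   | 2616727886.2  | logs |
| integration    | 8   | 2616727886.0  | logs |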

* Added MinGqVariantFilterBase
  * loads VCF, pedigree, UCSC genome tract, and truth data
  * calculates variant overlap with genome tracts
  * forms matrices, tensors, and other helping data for machine learning
  * provides for TRAIN and FILTER modes
  * provides functions for calculating loss given assigned min GQ values
  * computes best estimate of truth data used for training xgboost model
  * adds probabilities to VCF as phred-scaled GQ and also logit-scaled SL
* Added XGBoostMinGqVariantFilter
  * calculates new GQ based on gradient boosting
* Added PropertiesTable for loading VCF properties into tensors
* Added TractOverlapDetector for computing overlap properties with
  UCSC genome tracts

Training loss is based on weighted combination of heredity and truth
data, broken down by variant category.
* Fix genome track overlap calculations
* Get concordance annotations
* Changes to help with out-of-memory crashes, better debugging output
* Fix GQ and SL calculations to be more standard

gatk-bot commented Sep 9, 2022

Github actions tests reported job failures from actions build 3024497902
Failures in the following jobs:

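| Test Type      | JDK | Job ID        | Logs |
|----------------|-----|---------------|------|
| cloud          | 8   | 3024497902.10 | logs |
| unit           | 8   | 3024497902.1  | logs |
| conda          | 8   | 3024497902.3  | logs |
| integration    | 8   | 3024497902.0  | logs |
| variantcalling | 8   | 3024497902.2  | logs |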

gatk-bot commented Sep 9, 2022

Github actions tests reported job failures from actions build 3024517679
Failures in the following jobs:

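| Test Type      | JDK | Job ID        | Logs |
|----------------|-----|---------------|------|
| cloud          | 8   | 3024517679.10 | logs |
| unit           | 8   | 3024517679.1  | logs |
| conda          | 8   | 3024517679.3  | logs |
| variantcalling | 8   | 3024517679.2  | logs |
| integration    | 8   | 3024517679.0  | logs |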
