Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: variant-index refactoring #29

Merged
merged 11 commits into from
Oct 4, 2024
Merged

feat: variant-index refactoring #29

merged 11 commits into from
Oct 4, 2024

Conversation

project-defiant
Copy link
Collaborator

@project-defiant project-defiant commented Sep 24, 2024

Context

This PR closes opentargets/issues#3431

image

The main development being done was moving the code from variant_index dag to the genetics_etl dag to port new tasks (wrapped into variant annotation task group) that are required to run before the variantIndexStep can be successfully executed.

Things implemented:

  • Operator for annotating variant source with VEP command
  • Operator for partitionning the variant sources into chunks required by VEP batch job
  • TaskGroup in the configuration
  • Addition of biosample index step
  • New filters for validation steps
  • Deprecation of old hydra configuration

Limitations

The orchestration of variant_index and variant_annotation currently runs as:

  1. convert variant source to vcf (batch job) - 4 tasks | ~ 4min
  2. sort and partition vcfs into small chunks (python operator) - partitionned to 850 vcfs | very heavy in computations on local machine that runs airflow and requires pandas as a dependency ~ 3min
  3. annotate partitionned vcfs (batch job) - 850 tasks ~ 25 min
  4. perform variantIndexStep (merge variants from gnomad and variant sources) ~25min

The steps 1 and 2 should be merged and run as a dataproc job (the 1. requires only 4 batch tasks) for silplifying the configuration, although this is out of scope of this PR.

Test perfomed on successful run of the pipeline ending at the variant_index step (including) resulted in ~1h10min run

@project-defiant project-defiant changed the title Variant index feat: variant-index refactoring Sep 24, 2024
@project-defiant project-defiant marked this pull request as ready for review September 27, 2024 15:24
Copy link
Member

@javfg javfg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good start!

I haven't gone in detail into it. As we discussed yesterday, we should try to figure out a way to separate concerns in a similar way as it is being done in the platform, with a configuration file for the pipeline itself and other files for the settings for the internals of running an application inside a step.

We could do this by either overriding settings in a hierarchical way (tool setting < pipeline settings), by templating, or by trying to isolate settings in one of the two places.

I realize this is harder in Gentropy due to it being cloud-agnostic. That is a target for the platform part too. Even if it's impossible to do completely, we should strive for it. I think this separation will enable us to separate concerns in the orchestrator code too.

@project-defiant project-defiant merged commit 7306a67 into dev Oct 4, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unify Gentropy ETL and Variant Index DAGs
2 participants