-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: variant-index refactoring #29
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good start!
I haven't gone in detail into it. As we discussed yesterday, we should try to figure out a way to separate concerns in a similar way as it is being done in the platform, with a configuration file for the pipeline itself and other files for the settings for the internals of running an application inside a step.
We could do this by either overriding settings in a hierarchical way (tool setting < pipeline settings), by templating, or by trying to isolate settings in one of the two places.
I realize this is harder in Gentropy due to it being cloud-agnostic. That is a target for the platform part too. Even if it's impossible to do completely, we should strive for it. I think this separation will enable us to separate concerns in the orchestrator code too.
Context
This PR closes opentargets/issues#3431
The main development being done was moving the code from variant_index dag to the genetics_etl dag to port new tasks (wrapped into
variant annotation
task group) that are required to run before thevariantIndexStep
can be successfully executed.Things implemented:
Limitations
The orchestration of
variant_index
andvariant_annotation
currently runs as:The steps 1 and 2 should be merged and run as a dataproc job (the 1. requires only 4 batch tasks) for silplifying the configuration, although this is out of scope of this PR.
Test perfomed on successful run of the pipeline ending at the variant_index step (including) resulted in ~1h10min run