-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: refactoring vep annotation step #67
Conversation
@@ -205,134 +248,3 @@ def _vep_command(self) -> list[str]: | |||
--plugin AlphaMissense,file={self.pm.cache_dir}/AlphaMissense_hg38.tsv.gz,transcript_match=1 \ | |||
--plugin CADD,snv={self.pm.cache_dir}/CADD_GRCh38_whole_genome_SNVs.tsv.gz", | |||
] | |||
|
|||
|
|||
class ConvertVariantsToVcfOperator(BaseOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand,this is no longer needed, because this functionality is done by a separate step, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed, as it is a part of the variant_to_vcf
step -> https://github.com/opentargets/gentropy/blob/4d8e7c42e8e94473495d12d50efb7d7942afbe73/src/gentropy/variant_index.py#L114
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand the bits, all the changes are well thought out. And looks good. However I haven't tested them.
Context
This PR is a part of effort to reduce the complexity of genetics_etl dag. Due to the fact that we want to submerge the genetics_etl to the platform etl dag, we need to simplify the VariantAnnotation task group.
Implementations
This PR changes the logic of
variant_annotation
node in the genetics etl dag.variant_to_vcf
dag node implementation:Node was previously running as a google batch job with 4
variant_to_vcf
gentropy steps (one per each datasource* uniprot
* eva
* credible sets
* pharmgkb)
that was followed by
local
(run with python operator -ConvertVariantsToVcfOperator
) partitioning of the vcf based data sources.With refactoring of
variant_to_vcf
step implemented in refactor(convert to vcf): allow multiple input sources gentropy#891 the step can now take n datasources that are running in sequence and produce a single partitioned dataset. This allows for the:ConvertVariantsToVcfOperator
is no longer there#
in the column name without quoting the column name - issue comes whenvariant_to_vcf
wants to save the#CHROM
column as"#CHROM"
creating the following issue in VEP:Note
This PR requires the opentargets/gentropy#896 to be merged, as the
#CHROM
issue was resolved by dynamically change theCHROM
to#CHROM
, so thevariant_to_vcf
step should drop the hash.