Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactoring vep annotation step #67

Merged
merged 8 commits into from
Nov 5, 2024

Conversation

project-defiant
Copy link
Collaborator

@project-defiant project-defiant commented Nov 4, 2024

Context

This PR is a part of effort to reduce the complexity of genetics_etl dag. Due to the fact that we want to submerge the genetics_etl to the platform etl dag, we need to simplify the VariantAnnotation task group.

Implementations

This PR changes the logic of variant_annotation node in the genetics etl dag.

  • Change the variant_to_vcf dag node implementation:
    Node was previously running as a google batch job with 4 variant_to_vcf gentropy steps (one per each datasource
    * uniprot
    * eva
    * credible sets
    * pharmgkb)
    that was followed by local (run with python operator - ConvertVariantsToVcfOperator) partitioning of the vcf based data sources.
    With refactoring of variant_to_vcf step implemented in refactor(convert to vcf): allow multiple input sources gentropy#891 the step can now take n datasources that are running in sequence and produce a single partitioned dataset. This allows for the:
  • removal of google batch task for variant to vcf convertion
  • removal of the local partitioning of vcf data sources
  • drop pandas as dependency (no longer needed, as ConvertVariantsToVcfOperator is no longer there
  • removal of the entire TaskGroup and adding a single dataproc node for variant annotation instead
  • Refactoring of the VepPathManager and addition of the tests
  • Preparation for freeze_6
  • Handling issue with spark not able to write csv with # in the column name without quoting the column name - issue comes when variant_to_vcf wants to save the #CHROM column as "#CHROM" creating the following issue in VEP:
WARNING: line 1 skipped ("#CHROM" POS ID REF ALT QUAL FILTER INFO): Invalid start 'POS' or end '2' coordinate

Note

This PR requires the opentargets/gentropy#896 to be merged, as the #CHROM issue was resolved by dynamically change the CHROM to #CHROM, so the variant_to_vcf step should drop the hash.

@project-defiant project-defiant marked this pull request as ready for review November 5, 2024 10:09
@@ -205,134 +248,3 @@ def _vep_command(self) -> list[str]:
--plugin AlphaMissense,file={self.pm.cache_dir}/AlphaMissense_hg38.tsv.gz,transcript_match=1 \
--plugin CADD,snv={self.pm.cache_dir}/CADD_GRCh38_whole_genome_SNVs.tsv.gz",
]


class ConvertVariantsToVcfOperator(BaseOperator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand,this is no longer needed, because this functionality is done by a separate step, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@DSuveges DSuveges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand the bits, all the changes are well thought out. And looks good. However I haven't tested them.

@DSuveges DSuveges merged commit 062be3b into dev Nov 5, 2024
2 checks passed
@DSuveges DSuveges deleted the szsz-refactor-variant-annot-step branch November 5, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants