Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(variant index): improved data structure #710

Merged
merged 18 commits into from
Sep 3, 2024

Conversation

DSuveges
Copy link
Contributor

✨ Context

Upon iterating on the variant page a number of points have been identified that can be improved. For more details see this comment

πŸ›  What does this PR implement

πŸ™ˆ Missing

Purely business logic, there's no integration with airflow.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@DSuveges DSuveges changed the title feature(variant index): improved data structure feat(variant index): improved data structure Jul 21, 2024
@DSuveges DSuveges marked this pull request as ready for review July 29, 2024 08:34
@project-defiant project-defiant self-requested a review September 3, 2024 09:19
Copy link
Contributor

@project-defiant project-defiant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments

@@ -20,6 +20,12 @@
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate with line 326?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Not exactly. The field "hgvsg" indeed repeated twice, but in the first instance it is within the intergenic_consequences object, the other one is in the transcript_consequences object. Unfortunately, because the VEP output is not perfectly modelled, we need both: when there's no transcript consequence we need to access this value from intergenic_conseqence and vice versa.

@@ -375,6 +376,78 @@ def order_array_of_structs_by_field(column_name: str, field_name: str) -> Column
)


def order_array_of_structs_by_two_fields(
array_name: str, col1: str, col2: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Could you specify the ordering for col1 and col2 ? Otherwise the function name IMO is too generic. The docstring explains more what happens tho

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed variables + added an example. I think this would help, however the sorting expression extremely complicated. It's not my expertise to write these kind of expression, so if you have ideas how to refactor, that would be great.

src/gentropy/common/spark_helpers.py Show resolved Hide resolved
when left.{col2} is null then 1
when right.{col2} is null then -1

when left.{col1} < right.{col1} then 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it work on string types as well ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it doesn't need to be numeric types.


when left.{col1} < right.{col1} then 1
when left.{col1} > right.{col1} then -1
when left.{col1} == right.{col1} and left.{col2} > right.{col2} then 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would incline that the ordering is done firstly by the col1, and in case they are equal, then by col2. You do not compare the case when left.col1 is Null and right.col1 is Null, but left.col2 > right.col2 and others as far as I understand it. It's quiet complex operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's why the docstring says:

This function doesn't deal with null values, assumes the sort columns are not nullable.

I didn't want to make this expression even more elaborate. The method is used when I need to sort object in the vep column and I can be confident these values are always present.

@DSuveges DSuveges merged commit 1a7b0d7 into dev Sep 3, 2024
4 checks passed
@DSuveges DSuveges deleted the ds_3350_variant_index_schema_update branch September 18, 2024 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants