-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(variant index): improved data structure #710
Conversation
β¦_3350_variant_index_schema_update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments
@@ -20,6 +20,12 @@ | |||
"containsNull": true, | |||
"elementType": { | |||
"fields": [ | |||
{ | |||
"metadata": {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate with line 326?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Not exactly. The field "hgvsg"
indeed repeated twice, but in the first instance it is within the intergenic_consequences
object, the other one is in the transcript_consequences
object. Unfortunately, because the VEP output is not perfectly modelled, we need both: when there's no transcript consequence we need to access this value from intergenic_conseqence and vice versa.
src/gentropy/common/spark_helpers.py
Outdated
@@ -375,6 +376,78 @@ def order_array_of_structs_by_field(column_name: str, field_name: str) -> Column | |||
) | |||
|
|||
|
|||
def order_array_of_structs_by_two_fields( | |||
array_name: str, col1: str, col2: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Could you specify the ordering for col1 and col2 ? Otherwise the function name IMO is too generic. The docstring explains more what happens tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed variables + added an example. I think this would help, however the sorting expression extremely complicated. It's not my expertise to write these kind of expression, so if you have ideas how to refactor, that would be great.
src/gentropy/common/spark_helpers.py
Outdated
when left.{col2} is null then 1 | ||
when right.{col2} is null then -1 | ||
|
||
when left.{col1} < right.{col1} then 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it work on string types as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it doesn't need to be numeric types.
src/gentropy/common/spark_helpers.py
Outdated
|
||
when left.{col1} < right.{col1} then 1 | ||
when left.{col1} > right.{col1} then -1 | ||
when left.{col1} == right.{col1} and left.{col2} > right.{col2} then 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would incline that the ordering is done firstly by the col1, and in case they are equal, then by col2. You do not compare the case when left.col1 is Null and right.col1 is Null, but left.col2 > right.col2
and others as far as I understand it. It's quiet complex operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's why the docstring says:
This function doesn't deal with null values, assumes the sort columns are not nullable.
I didn't want to make this expression even more elaborate. The method is used when I need to sort object in the vep column and I can be confident these values are always present.
β¨ Context
Upon iterating on the variant page a number of points have been identified that can be improved. For more details see this comment
π What does this PR implement
π Missing
Purely business logic, there's no integration with airflow.
π¦ Before submitting
dev
branch?make test
)?poetry run pre-commit run --all-files
)?