feat: implement posterior prob filter for COLOC at small overlaps N<10 #977
base: dev
@addramir Please let me know if you'd like to discuss changing the parameters (PP > 0.9 and N < 10).
Per @addramir's request, I investigated the effect of the minimum overlap cutoff on the proportion of significant H4s.
Looks good to me. Let's use N=5 as the threshold.
@ireneisdoomed the code and logic look good to me. Please have a look at the code; if it is fine, let's approve.
f.aggregate(
    f.transform(
        f.arrays_zip(
            fml.vector_to_array(f.col("left_posteriorProbability")),
I am not sure that converting to vectors and back is the correct way to handle the posterior probabilities. I would like to understand why VectorUDT was used here initially. @ireneisdoomed, do you know the reason for vectorizing the logBF values (is it sparsity, or the default numpy conversion)?
        ),
        # row["0"] = left PP, row["1"] = right PP, row["tagVariantSourceList"]
        lambda row: f.when(
            (row["tagVariantSourceList"] == "both")
I think the check for `both` needs to happen before you `arrays_zip`; otherwise you will end up with mixed results (not coming from the same variant). I may be wrong, though. Can you provide a specific test for this part? You could add it on top of the colocalisation step test. I would like to see test cases for:
- The case where the left (or right) overlap does not exist, so that the algorithm orders the lists (`arrays_zip`) correctly, taking into account only the `both` entries
- The case where all PIPs from the left side are low and all PIPs from the right side are high
- The case where at least one PIP from the left and one PIP from the right is high
To do this you would have to make an overlap example with at least 2 variants on one side and 3 variants on the other side.
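The requested cases can be sketched in plain Python, outside Spark. This is only an illustration of the concern, not the PR's code: the variant IDs, the dictionaries, and the helpers `positional_hit` and `keyed_hit` are hypothetical, and whether the arrays in the PR are already aligned per variant is not established here. It contrasts pairing posterior probabilities by list position with pairing them by variant, using 2 variants on one side and 3 on the other.

```python
CUTOFF = 0.9  # assumed to mirror Coloc.POSTERIOR_CUTOFF


def positional_hit(left, right, cutoff=CUTOFF):
    """Pair PPs by list position, ignoring which variant they belong to."""
    pairs = zip(left.values(), right.values())
    return any(lp > cutoff and rp > cutoff for lp, rp in pairs)


def keyed_hit(left, right, cutoff=CUTOFF):
    """Pair PPs by variant ID, considering only variants present on both sides."""
    shared = left.keys() & right.keys()
    return any(left[v] > cutoff and right[v] > cutoff for v in shared)


# Case 1: variant A exists only on the left; C and D only on the right.
# The high PPs belong to different variants (A on the left, B on the right).
left_1 = {"A": 0.95, "B": 0.01}
right_1 = {"B": 0.95, "C": 0.02, "D": 0.01}

# Case 2: all left PPs are low, all right PPs are high.
left_2 = {"A": 0.01, "B": 0.02}
right_2 = {"A": 0.95, "B": 0.97, "C": 0.99}

# Case 3: one shared variant (A) is high on both sides.
left_3 = {"A": 0.95, "B": 0.01}
right_3 = {"A": 0.92, "B": 0.03, "C": 0.01}
```

In case 1 the positional pairing reports a hit even though no single variant is high on both sides, which is the mixed result the comment warns about; the keyed pairing does not.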
            & (row["0"] > Coloc.POSTERIOR_CUTOFF)
            & (row["1"] > Coloc.POSTERIOR_CUTOFF),
            1.0,
        ).otherwise(0.0),
I am not 100% sure you are comparing the same two variants here (since they can be left- or right-oriented as well).
Please take a look at the comments. As this is a crucial part, let's use this chance to test it a bit more.
✨ Context
COLOC is unstable at small overlaps because the likelihood term calculation is inflated; the problem is described in the issue.
🛠 What does this PR implement
We now only pass small overlaps to COLOC if the posterior probabilities of the overlapping variants are > 0.9 on both sides, which means the inflated H4s are likely to be true colocalisation results. The parameters chosen here, PP > 0.9 and N < 10, achieve a proportion of significant H4 results comparable to the previous OTG portal.
Tests related to COLOC have been adjusted accordingly to include posterior probabilities of > 0.9 on both sides, so that the test overlaps are not filtered out.
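As a minimal sketch of the rule described above, in plain Python rather than the PR's Spark code: the `OverlapRow` type, the `keep_overlap` function, and the constant names are hypothetical. It assumes N counts the rows of an overlap and that one shared variant with PP above the cutoff on both sides is enough to keep a small overlap. The size cutoff follows the PR title (N < 10); the review thread suggests N = 5.

```python
from typing import List, NamedTuple, Optional

POSTERIOR_CUTOFF = 0.9     # PP threshold from the PR description
OVERLAP_SIZE_CUTOFF = 10   # N < 10 per the PR title (assumption: N = number of rows)


class OverlapRow(NamedTuple):
    variant_id: str
    left_pp: Optional[float]   # None if the variant is absent from the left credible set
    right_pp: Optional[float]  # None if the variant is absent from the right credible set


def keep_overlap(
    rows: List[OverlapRow],
    pp_cutoff: float = POSTERIOR_CUTOFF,
    size_cutoff: int = OVERLAP_SIZE_CUTOFF,
) -> bool:
    """Return True if the overlap should be passed on to COLOC."""
    # Overlaps at or above the size cutoff are not filtered.
    if len(rows) >= size_cutoff:
        return True
    # Small overlaps need a variant present on both sides with high PP on both.
    return any(
        r.left_pp is not None
        and r.right_pp is not None
        and r.left_pp > pp_cutoff
        and r.right_pp > pp_cutoff
        for r in rows
    )


small_supported = [OverlapRow("v1", 0.95, 0.93), OverlapRow("v2", 0.01, None)]
small_unsupported = [OverlapRow("v1", 0.95, 0.10), OverlapRow("v2", 0.02, 0.95)]
large = [OverlapRow(f"v{i}", 0.05, 0.05) for i in range(12)]
```

Here `small_supported` is kept, `small_unsupported` is dropped, and `large` bypasses the filter because it has 12 rows.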
🙈 Missing
Future work will involve dynamically penalising the prior term in H4 calculations when the overlaps are small.
Implement parameters into COLOC method
🚦 Before submitting
- `dev` branch?
- (`make test`)?
- (`poetry run pre-commit run --all-files`)?