
feat: implement posterior prob filter for COLOC at small overlaps N<10 #977

Open · wants to merge 11 commits into base: dev
Conversation

@xyg123 (Contributor) commented Jan 22, 2025

✨ Context

COLOC is unstable at small overlaps due to inflation of the likelihood term calculation; the problem is described in the linked issue.

🛠 What does this PR implement

We now only pass small overlaps (N < 10) to COLOC if the overlapping variants' posterior probabilities are > 0.9 on both sides, so that the inflated H4 values are likely to reflect true colocalisation. The parameters (PP > 0.9 and N < 10) were chosen to achieve a proportion of significant H4 results comparable to the previous OTG portal.

| | Total colocalisation results | Number of COLOC H4 > 0.8 | Percentage of COLOC H4 > 0.8 |
| --- | --- | --- | --- |
| Gentropy release 24.12 | 23,709,155 | 17,553,867 | 74.04% |
| OTG portal release 22.10 | 7,408,493 | 4,357,079 | 58.81% |
| Gentropy 24.12, N > 10 | 8,566,692 | 4,530,616 | 52.88% |

Tests related to COLOC have been adjusted accordingly to include posterior probabilities > 0.9 on both sides, so that the test overlaps are not filtered out.
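The filtering rule described above can be sketched in plain Python. This is a minimal illustration of the logic, not the actual gentropy implementation; the names `keep_overlap`, `N_CUTOFF`, and `POSTERIOR_CUTOFF` here are illustrative:

```python
# Illustrative sketch of the small-overlap filter (hypothetical names,
# not the actual gentropy code). An overlap is a set of variants shared
# by the left and right credible sets, each with a posterior probability
# (PP) on each side.
N_CUTOFF = 10           # overlaps with fewer variants are candidates for removal
POSTERIOR_CUTOFF = 0.9  # both sides must exceed this PP for a small overlap to survive

def keep_overlap(left_pps: list[float], right_pps: list[float]) -> bool:
    """Return True if this overlap should be passed to COLOC."""
    if len(left_pps) >= N_CUTOFF:
        return True  # large overlaps are always kept
    # Small overlap: keep only if at least one shared variant has a high
    # posterior probability in BOTH studies.
    return any(
        left > POSTERIOR_CUTOFF and right > POSTERIOR_CUTOFF
        for left, right in zip(left_pps, right_pps)
    )

print(keep_overlap([0.95, 0.03], [0.92, 0.05]))  # True: one variant is high on both sides
print(keep_overlap([0.95, 0.03], [0.05, 0.92]))  # False: no variant is high on both sides
```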

🙈 Missing

  • Future work will involve dynamically penalising the prior term in the H4 calculation when overlaps are small.
  • Implement the parameters into the COLOC method.

🚦 Before submitting

  • [x] Do these changes cover one single feature (one change at a time)?
  • [x] Did you read the contributor guideline?
  • [x] Did you make sure to update the documentation with your changes?
  • [x] Did you make sure there is no commented out code in this PR?
  • [x] Did you follow conventional commits standards in PR title and commit messages?
  • [x] Did you make sure the branch is up-to-date with the dev branch?
  • [x] Did you write any new necessary tests?
  • [x] Did you make sure the changes pass local tests (make test)?
  • [x] Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@xyg123 (Contributor, Author) commented Jan 22, 2025

@addramir Please let me know if you'd like to discuss changing the parameters (PP > 0.9 and N<10).

@xyg123 (Contributor, Author) commented Jan 23, 2025

Per @addramir's request, investigating the effect of minimum overlap cutoff on the proportion of significant H4s:

| N | Total count | H4 > 0.8 count | Ratio |
| --- | --- | --- | --- |
| 1 | 16,627,720 | 11,303,929 | 0.6798 |
| 2 | 14,355,138 | 9,404,867 | 0.6552 |
| 3 | 12,896,260 | 8,159,638 | 0.6327 |
| 4 | 11,851,917 | 7,271,878 | 0.6136 |
| 5 | 11,081,132 | 6,626,729 | 0.5980 |
| 6 | 10,409,353 | 6,058,734 | 0.5820 |
| 7 | 9,846,088 | 5,586,696 | 0.5674 |
| 8 | 9,345,295 | 5,168,813 | 0.5531 |
| 9 | 8,930,353 | 4,828,156 | 0.5406 |
| 10 | 8,566,692 | 4,530,616 | 0.5289 |
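The Ratio column is simply the H4 > 0.8 count divided by the total count; spot-checking two rows from the table:

```python
# Spot-check the Ratio column above (counts copied from the table).
rows = {
    5: (11_081_132, 6_626_729),
    10: (8_566_692, 4_530_616),
}
for n, (total, h4_count) in rows.items():
    print(n, round(h4_count / total, 4))  # 5 -> 0.598, 10 -> 0.5289
```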

@addramir (Contributor) commented:

Looks good to me. Let's use N=5 as the threshold.

@addramir (Contributor) commented:

@ireneisdoomed the code and logic look good to me. Please have a look at the code; if it is fine, let's approve.

@project-defiant project-defiant self-requested a review January 30, 2025 11:01
```python
f.aggregate(
    f.transform(
        f.arrays_zip(
            fml.vector_to_array(f.col("left_posteriorProbability")),
```
Contributor review comment:

I am not so sure that converting to vectors and back is the correct way to handle the posteriorProbabilities. I wish to understand the logic of why VectorUDT was used here initially. @ireneisdoomed do you know the reason behind the vectorization of the logBF values (is it the sparsity, or a default numpy conversion)?

```python
        ),
        # row["0"] = left PP, row["1"] = right PP, row["tagVariantSourceList"]
        lambda row: f.when(
            (row["tagVariantSourceList"] == "both")
```
@project-defiant (Contributor) commented Jan 30, 2025:

I think checking for both needs to happen before you arrays_zip, otherwise you will end up with mixed results (not coming from the same variant). I may be wrong though; can you provide a specific test for this part? You could add it on top of the colocalisation step test. I would like to see test cases for:

  1. The case when a left (or right) overlap does not exist, so the algorithm orders the lists (arrays_zip) correctly, taking into account only the variants tagged both.
  2. The case when all PIPs from the left side are low and all PIPs from the right side are high.
  3. The case when at least one PIP from the left and one PIP from the right is high.

To do this you would have to make an overlap example with at least 2 variants on one side and 3 variants on the other side.
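The alignment concern above can be illustrated with a plain-Python toy model (variant IDs and PP values are made up for illustration; this is not the gentropy API): restricting to variants tagged "both" before pairing the PPs guarantees that each pair comes from the same variant.

```python
# Toy model of the review comment: pairing left/right posterior
# probabilities (PPs) is only safe after restricting to variants present
# on BOTH sides. Variant IDs and PP values are hypothetical.
left_pps = {"v1": 0.95, "v2": 0.50}                  # 2 variants on the left
right_pps = {"v3": 0.88, "v1": 0.92, "v2": 0.40}     # 3 variants on the right

# Naive zip of the raw PP lists misaligns pairs when one side has extra
# variants (here v3 exists only on the right).
naive_pairs = list(zip(left_pps.values(), right_pps.values()))
print(naive_pairs)     # [(0.95, 0.88), (0.5, 0.92)] <- PPs from different variants

# Safer: intersect first ("both"-tagged variants), then pair by variant ID.
both = sorted(set(left_pps) & set(right_pps))
aligned_pairs = [(left_pps[v], right_pps[v]) for v in both]
print(aligned_pairs)   # [(0.95, 0.92), (0.5, 0.4)]

# Apply the PP > 0.9 filter only to correctly aligned pairs.
high_on_both = [v for v in both if left_pps[v] > 0.9 and right_pps[v] > 0.9]
print(high_on_both)    # ['v1']
```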

Comment on lines +365 to +368
```python
            & (row["0"] > Coloc.POSTERIOR_CUTOFF)
            & (row["1"] > Coloc.POSTERIOR_CUTOFF),
            1.0,
        ).otherwise(0.0),
```
Contributor review comment:

I am not 100% sure you are comparing the same two variants here (since they can be left- or right-oriented as well).

@project-defiant (Contributor) left a review comment:

Please take a look at the comments; as this is a crucial part, let's use this chance to test it a bit more.
