feat(gwas catalog sumstats): finemapping #51

Merged
merged 16 commits into from
Oct 25, 2024


project-defiant
Collaborator

@project-defiant project-defiant commented Oct 18, 2024

Context

We want to perform locus breaker clumping and SuSiE finemapping on harmonised summary statistics coming from the GWAS Catalog.

Implementations

This PR implements:

  • DAG for performing locus_breaker_clumping and study_index generation for GWAS Catalog summary statistics.
  • DAG for finemapping the clumped results.
  • Next iteration of documentation for the GWAS Catalog.
  • Refactored finemapping operator that takes already finemapped loci into account and applies a limit on the number of batch jobs to submit.

Note

Locus Breaker Clumping performance
The performance of LB clumping was not ideal. The step took ~2h to compute the StudyLocus starting from 69K harmonised summary statistics. See Dataproc job.
This is partly the result of the highly distributed dataset: see the first spike in nodes, which represents the first job listing all parquet files in subdirectories.

image
The number of loci resulting from clumping oscillated around 440K.

Running the code from this branch, we were able to fine-map the 441K loci in 7h.

How the finemapping works:

  1. list all loci output by locus breaker
  2. list all log files from previous finemapping runs
  3. make a diff and submit jobs for the loci that have no log yet
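The diff in step 3 can be sketched as a set difference. This is a minimal illustration, not the operator's actual code: the function names, the `limit` parameter, and the assumption that a locus and its log share the same file stem are all hypothetical, and the two path lists stand in for the results of the bucket listings.

```python
from pathlib import PurePosixPath


def locus_id(path: str) -> str:
    """Return the last path component without its extension (assumed locus key)."""
    return PurePosixPath(path).stem


def loci_to_submit(locus_paths, log_paths, limit=None):
    """Return loci that have no log from a previous run, capped at `limit`."""
    # A locus counts as finemapped if a log with the same stem exists (assumption).
    done = {locus_id(p) for p in log_paths}
    pending = [p for p in locus_paths if locus_id(p) not in done]
    return pending if limit is None else pending[:limit]


# Hypothetical listings of the two bucket prefixes.
loci = ["gs://bucket/loci/l1.parquet", "gs://bucket/loci/l2.parquet", "gs://bucket/loci/l3.parquet"]
logs = ["gs://bucket/logs/l2.log"]
print(loci_to_submit(loci, logs, limit=1))
```

Both listings must be complete before the diff can be computed, so every run pays for listing all loci and all logs.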

This approach is not ideal because of the number of Google API calls we need to make when running list.objects (knowledge from the post mortem: see the distribution of the calls in the buckets on the 23rd of October). A better solution would be to:

  1. generate the manifests
  2. submit the batch jobs in consecutive order, following the manifests
  3. cache whether each manifest has already been used by a finemapping job

This could be implemented as an enhancement in the future.
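A rough sketch of what the manifest-based flow might look like, under stated assumptions: manifests are fixed-size chunks of locus paths, the cache is a local JSON file holding the indices of manifests already used, and `submit` is a placeholder callback for the batch job submission. None of these names or formats come from this PR.

```python
import json
from pathlib import Path


def build_manifests(locus_paths, size):
    """Split the sorted locus paths into consecutive manifests of `size` entries."""
    ordered = sorted(locus_paths)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def submit_pending(manifests, cache_file, submit):
    """Submit unused manifests in order and record each one in the cache."""
    used = set(json.loads(cache_file.read_text())) if cache_file.exists() else set()
    submitted = []
    for index, manifest in enumerate(manifests):
        if index in used:
            continue  # this manifest was already handed to a finemapping job
        submit(index, manifest)  # placeholder for the real batch submission
        used.add(index)
        # Persist after every submission so an interrupted run can resume.
        cache_file.write_text(json.dumps(sorted(used)))
        submitted.append(index)
    return submitted


if __name__ == "__main__":
    demo = build_manifests([f"locus_{i}" for i in range(5)], size=2)
    print(submit_pending(demo, Path("used_manifests.json"), lambda i, m: print(i, m)))
```

With this layout a rerun only reads the cache file instead of listing every locus and log object again.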

@project-defiant
Collaborator Author

@addramir ignore the docs for now. I will update them as soon as the DAG runs successfully.

@project-defiant project-defiant force-pushed the szsz-gwas-catalog-sumstat-locus-breaker branch from d949832 to dcde45e Compare October 18, 2024 13:12
@project-defiant project-defiant changed the title feat(gwas catalog sumstats): dag for locus breaker feat(gwas catalog sumstats): finemapping Oct 22, 2024
@project-defiant project-defiant marked this pull request as ready for review October 22, 2024 14:33
Contributor

@DSuveges DSuveges left a comment

Thanks for addressing my comment. All looks good and sensible, however I have to admit, I haven't run them. :D

@project-defiant
Collaborator Author

@DSuveges thank you!

@project-defiant project-defiant merged commit 0ef3545 into dev Oct 25, 2024
2 checks passed