
feat(ukb_ppp_eur) dags refactoring #39

Merged · 10 commits · Oct 14, 2024

Conversation

project-defiant (Collaborator) commented Oct 11, 2024

Context

This PR closes #3493
We want to be able to run the harmonisation and the SuSiE fine-mapping batch job within the orchestration for the ukb_ppp_eur data.

This PR summarizes the development of two new DAGs:

  • ukb_ppp_eur_harmonisation (harmonisation + locus breaker), which produces the study_locus dataset partitioned by studyLocusId (run via Dataproc)
  • ukb_ppp_eur_finemapping (fine-mapping manifest generation + SuSiE fine-mapping), which produces the credible_set dataset partitioned by studyLocusId (run via Google Batch)

To run the harmonisation, some steps need to be executed beforehand; I have described these steps in the docs along with the data structure.

Additional tasks

  • Consolidated structure for batch operators (VEP and fine-mapping) in the package.
  • New function to extract partition from path.
  • Added an AllocationPolicy parameter to the create_batch_task function.
  • Cleaned old data in ukb_ppp_eur_data and uploaded new datasets from the last successful run of the DAGs.
  • Updated README documentation for the ukb_ppp_eur_data bucket.
  • Added a brief description of the fine-mapping step.
  • Fixed an issue with Dataproc jobs that used local[*] as the Spark master by default instead of yarn.
  • Re-added the idle timeout for cluster deletion.
  • Added graphviz as a dev dependency to generate dag svgs.
  • Added instruction on how to generate dag svg.
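The partition-extraction helper mentioned above could look roughly like this; the actual function name and signature in the package may differ, so treat this as a sketch:

```python
import re


def extract_partition_from_path(path: str, partition_key: str = "studyLocusId") -> str:
    """Extract the value of a Hive-style `key=value` partition from a path.

    Hypothetical sketch of the new helper described in this PR; the real
    function may have a different name or signature.
    """
    match = re.search(rf"{re.escape(partition_key)}=([^/]+)", path)
    if match is None:
        raise ValueError(f"Partition {partition_key!r} not found in {path!r}")
    return match.group(1)


# e.g. extract_partition_from_path(
#     "gs://bucket/susie/studyLocusId=abc123/part-00000.parquet"
# ) returns "abc123"
```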

Note

The yarn configuration option is now added automatically to the Dataproc steps.
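As a sketch of what that means in practice (the property name `spark.master` is standard Spark configuration; the job-dict shape and helper name here are illustrative, not the actual orchestration code):

```python
# Illustrative sketch: inject spark.master=yarn into a Dataproc pyspark_job
# config so the job no longer falls back to local[*]. An explicitly set
# spark.master in the job's own properties still wins.
DEFAULT_SPARK_PROPERTIES = {"spark.master": "yarn"}


def with_yarn_master(pyspark_job: dict) -> dict:
    """Return a copy of the job config with the yarn default applied."""
    properties = {**DEFAULT_SPARK_PROPERTIES, **pyspark_job.get("properties", {})}
    return {**pyspark_job, "properties": properties}
```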

@project-defiant project-defiant marked this pull request as ready for review October 11, 2024 19:22
d0choa (Collaborator) commented Oct 11, 2024

I tried to read the outputs, but the study_locus_validation step struggles to read the directory, most likely because the directory contains the logs:

❯ gsutil ls 'gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5*'
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c.log

gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/:
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/_SUCCESS
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00000-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00003-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet

Error in study validation:
https://console.cloud.google.com/dataproc/jobs/02873e9e-e3f2-4618-9ad5-03d1bb417396/monitoring?region=europe-west1&project=open-targets-genetics-dev

py4j.protocol.Py4JJavaError: An error occurred while calling o114.load.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
	gs://ukb_ppp_eur_data/credible_set_datasets/susie
	gs://gwas_catalog_sumstats_susie/credible_sets
	gs://eqtl_catalogue_data/credible_set_datasets/eqtl_catalogue_susie
	gs://gwas_catalog_sumstats_pics/credible_sets
	gs://gwas_catalog_top_hits/credible_sets
	gs://finngen_data/r11/credible_set_datasets/susie

d0choa (Collaborator) left a comment

Looking good. Have a look at the comment about the outputs.

project-defiant (Collaborator, Author) commented Oct 11, 2024

Thanks @d0choa, I think the logs were moved manually before.

For now I have added a single DAG step to move the log files, although it is very slow due to the number of files to transfer. I think it would be better to update the gentropy command to accept the log file path as an input, and to store that path in the partial manifests that are generated before the batch job runs.
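A sketch of that manifest idea (the function and field names are hypothetical): each partial manifest row would carry its own log path outside the dataset directory, so Batch never writes a `.log` file where the parquet reader will look:

```python
def build_manifest_entry(study_locus_id: str, output_root: str, log_root: str) -> dict:
    """Hypothetical manifest row for one fine-mapping task.

    Keeping log_path under a separate root means the credible_set
    dataset directory only ever contains parquet partitions.
    """
    return {
        "studyLocusId": study_locus_id,
        "output_path": f"{output_root}/studyLocusId={study_locus_id}",
        "log_path": f"{log_root}/studyLocusId={study_locus_id}.log",
    }
```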

Not ideal, but Batch is way less flexible than Dataproc :/

EDIT:
I have isolated the issue:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=[("a", 1), ("b", 2)])

# Write one partitioned and one non-partitioned parquet dataset
df.write.partitionBy("_2").parquet("dataset_1")
df.write.parquet("dataset_2")

# Reading both datasets together raises the
# "Conflicting directory structures detected" assertion error
df2 = spark.read.parquet(*["dataset_1", "dataset_2"])

When one of the parquet datasets is partitioned and the others are not, reading them together with a single spark.read.parquet call fails.

Successfully merging this pull request may close these issues.

The DAG for harmonisation and fine-mapping of UKBB-PPP