
feat(ukb_ppp_eur) dags refactoring #39

Merged · 10 commits · Oct 14, 2024

Conversation

project-defiant (Collaborator) commented Oct 11, 2024

Context

This PR closes #3493
We want to be able to run the harmonisation and the SuSiE fine-mapping batch job within the orchestration for the ukb_ppp_eur data.

This PR summarizes the development of two new DAGs:

  • ukb_ppp_eur_harmonisation (harmonisation + locus breaker), which produces the study_locus dataset partitioned by studyLocusId (run via Dataproc)
  • ukb_ppp_eur_finemapping (fine-mapping manifest generation + SuSiE fine-mapping), which produces the credible_set dataset partitioned by studyLocusId (run via Google Batch)

To run the harmonisation, some steps need to be executed beforehand; I have described these steps in the docs along with the data structure.

Additional tasks

  • Consolidated structure for batch operators (VEP and fine-mapping) in the package.
  • New function to extract partition from path.
  • Added an AllocationPolicy parameter to the create_batch_task function.
  • Cleaned old data in ukb_ppp_eur_data and uploaded new datasets from the last successful run of the DAGs.
  • Updated README documentation for the ukb_ppp_eur_data bucket.
  • Added a brief description of the fine-mapping step.
  • Fixed an issue with Dataproc jobs that used local[*] as the Spark master by default instead of yarn.
  • Re-added the idle timeout for cluster deletion.
  • Added graphviz as a dev dependency to generate dag svgs.
  • Added instruction on how to generate dag svg.
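The partition-extraction helper mentioned above could look roughly like this; the actual function name and signature in the package may differ, so treat this as a sketch:

```python
import re


def extract_partition_from_path(path: str, partition_key: str = "studyLocusId") -> str:
    """Extract the value of a Hive-style `key=value` partition from a path.

    Hypothetical sketch of the new helper described in this PR; the real
    function may have a different name or signature.
    """
    match = re.search(rf"{re.escape(partition_key)}=([^/]+)", path)
    if match is None:
        raise ValueError(f"Partition {partition_key!r} not found in {path!r}")
    return match.group(1)


# e.g. extract_partition_from_path(
#     "gs://bucket/susie/studyLocusId=abc123/part-00000.parquet"
# ) returns "abc123"
```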

Note

The yarn configuration option is now added automatically to the Dataproc steps.
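As a sketch of what that means in practice (the property name `spark.master` is standard Spark configuration; the job-dict shape and helper name here are illustrative, not the actual orchestration code):

```python
# Illustrative sketch: inject spark.master=yarn into a Dataproc pyspark_job
# config so the job no longer falls back to local[*]. An explicitly set
# spark.master in the job's own properties still wins.
DEFAULT_SPARK_PROPERTIES = {"spark.master": "yarn"}


def with_yarn_master(pyspark_job: dict) -> dict:
    """Return a copy of the job config with the yarn default applied."""
    properties = {**DEFAULT_SPARK_PROPERTIES, **pyspark_job.get("properties", {})}
    return {**pyspark_job, "properties": properties}
```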

@project-defiant project-defiant marked this pull request as ready for review October 11, 2024 19:22
d0choa (Collaborator) commented Oct 11, 2024

I tried to read the outputs, but the study_locus_validation step struggles to read the directory, most likely because the directory contains the logs:

❯ gsutil ls 'gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5*'
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c.log

gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/:
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/_SUCCESS
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00000-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet
gs://ukb_ppp_eur_data/credible_set_datasets/susie/studyLocusId=f6c6edb08ed02502869faea2d46e7c5c/part-00003-e1d26427-69b6-473b-a6fb-d5c470f28280-c000.snappy.parquet

Error in study validation:
https://console.cloud.google.com/dataproc/jobs/02873e9e-e3f2-4618-9ad5-03d1bb417396/monitoring?region=europe-west1&project=open-targets-genetics-dev

py4j.protocol.Py4JJavaError: An error occurred while calling o114.load.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
	gs://ukb_ppp_eur_data/credible_set_datasets/susie
	gs://gwas_catalog_sumstats_susie/credible_sets
	gs://eqtl_catalogue_data/credible_set_datasets/eqtl_catalogue_susie
	gs://gwas_catalog_sumstats_pics/credible_sets
	gs://gwas_catalog_top_hits/credible_sets
	gs://finngen_data/r11/credible_set_datasets/susie

d0choa (Collaborator) left a comment

Looking good. Have a look at the comment about the outputs.

project-defiant (Collaborator, Author) commented Oct 11, 2024

Thanks @d0choa, I think the logs were moved manually before.

For now I have added a single DAG step to move the log files, although it is very slow due to the number of files to transfer. I think it would be better to update the gentropy command to accept the log file path as an input, and to store that path in the partial manifests that are generated before the batch job runs.
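A sketch of that manifest idea (the function and field names are hypothetical): each partial manifest row would carry its own log path outside the dataset directory, so Batch never writes a `.log` file where the parquet reader will look:

```python
def build_manifest_entry(study_locus_id: str, output_root: str, log_root: str) -> dict:
    """Hypothetical manifest row for one fine-mapping task.

    Keeping log_path under a separate root means the credible_set
    dataset directory only ever contains parquet partitions.
    """
    return {
        "studyLocusId": study_locus_id,
        "output_path": f"{output_root}/studyLocusId={study_locus_id}",
        "log_path": f"{log_root}/studyLocusId={study_locus_id}.log",
    }
```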

Not ideal, but Batch is way less flexible than Dataproc :/

EDIT:
I have isolated the issue:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=[("a", 1), ("b", 2)])

# Write one partitioned and one non-partitioned parquet dataset
df.write.partitionBy("_2").parquet("dataset_1")
df.write.parquet("dataset_2")

# Reading both datasets together raises the
# "Conflicting directory structures detected" assertion error
df2 = spark.read.parquet(*["dataset_1", "dataset_2"])

When one of the parquet datasets is partitioned and the others are not, reading them together with a single spark.read.parquet call fails.

Successfully merging this pull request may close these issues.

The DAG for harmonisation and fine-mapping of UKBB-PPP