feat(ukb_ppp_eur) dags refactoring #39
Conversation
I tried to read the outputs, but got an error in study validation:
Looking good. Have a look at the comment about the outputs.
Thanks @d0choa, I think that the logs were moved manually before. For now I have added a single dag step to move the log files, although it is very slow due to the number of files to transfer. I think it would be better to update the gentropy command to take the log file path as an input and save the log file path in the partial manifests that are generated before the batch job is run. Not ideal, but Batch is way less flexible than Dataproc :/

EDIT:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=[("a", 1), ("b", 2)])

# write one dataset partitioned by "_2" and the other unpartitioned
df.write.partitionBy("_2").parquet("dataset_1")
df.write.parquet("dataset_2")

# read both datasets at once
df2 = spark.read.parquet(*["dataset_1", "dataset_2"])
```

In case one of the parquet datasets is partitioned and the others are not, the result is broken when one attempts to read them together.
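A minimal sketch of a possible workaround, assuming the two datasets written above: read each dataset separately and align the schemas explicitly before unioning, instead of passing both paths to a single `spark.read.parquet` call. The `unionByName` approach and the cast are my suggestion, not code from this PR.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.read.parquet("dataset_1")  # partitioned: "_2" is rebuilt from directory names
df_b = spark.read.parquet("dataset_2")  # unpartitioned: "_2" is a regular data column

# Cast the partition-derived column so both sides have identical types,
# then union by column name rather than by position.
df_a = df_a.withColumn("_2", F.col("_2").cast("long"))
df_b = df_b.withColumn("_2", F.col("_2").cast("long"))

df_all = df_a.unionByName(df_b)
```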
Context
This PR closes #3493
We want to be able to run the harmonisation and the SuSiE fine-mapping batch job within the orchestration for the ukb_ppp_eur data.
This PR summarizes the developments on the processing of the ukb_ppp_eur data.
To run the harmonisation, some steps need to be executed beforehand. I have described these steps in the docs along with the data structure.
Additional tasks
- `partition` from path.
- `AllocationPolicy` added to the `create_batch_task` function (see the sketch after this list).
- `ukb_ppp_eur_data`: uploaded new datasets from the last successful run of the dags to the `ukb_ppp_eur_data` bucket.
- Fixed `dataproc` jobs that used `local[*]` by default instead of `yarn`.
- Added `graphviz` as a dev dependency to generate dag svgs.
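As a rough illustration of the `AllocationPolicy` item above, here is a minimal sketch using the `google.cloud.batch_v1` types. The `create_batch_task` signature shown is an assumption based on the name used in this PR, not its actual interface.

```python
from google.cloud import batch_v1

# Sketch only: the real create_batch_task lives in this repo and its actual
# signature is an assumption here.
def create_batch_task(
    task_group: batch_v1.TaskGroup,
    allocation_policy: batch_v1.AllocationPolicy,
) -> batch_v1.Job:
    # Attach the allocation policy to the Batch job so instance settings
    # (machine type, provisioning model, ...) are configurable per job.
    return batch_v1.Job(task_groups=[task_group], allocation_policy=allocation_policy)

# Example allocation policy requesting a specific machine type.
policy = batch_v1.AllocationPolicy(
    instances=[
        batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
            policy=batch_v1.AllocationPolicy.InstancePolicy(machine_type="n2-standard-4")
        )
    ]
)
```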
Note: The `yarn` configuration option will now be added automatically to the dataproc steps.
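A minimal sketch of what "added automatically" could look like, assuming each dataproc step carries its Spark properties in a plain dict; the helper name and dict shape are assumptions, not code from this PR.

```python
# Hypothetical helper: ensure every Dataproc step runs on YARN instead of
# falling back to local[*]. Explicit per-step properties still take precedence.
def with_yarn_master(pyspark_job: dict) -> dict:
    properties = {"spark.master": "yarn", **pyspark_job.get("properties", {})}
    return {**pyspark_job, "properties": properties}

job = with_yarn_master({"main_python_file_uri": "gs://bucket/step.py"})
# job["properties"] == {"spark.master": "yarn"}
```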