[qob] Support loading RGs from FASTA files #12736
Conversation
This looks great! Will give it another look on Monday.
@tpoterba I just realized I forgot to propagate the FUSE config through to worker jobs. Should I be propagating it, and did I just get lucky that the singular test does everything driver-side? Or is there a test we can write to ensure that worker jobs access the FASTA data?
@daniel-goldstein, BackendUtils.scala
Admittedly a debatable choice, but worthwhile in the context of tests which generate tiny pipelines dominated by CDA overhead.
… have the fuse mount
@tpoterba fixed the config issue and changed n_partitions to ensure workers are scheduled for the FASTA reading. I tested this on a single batch worker so the jobs overlapped and exercised the shared mount code, but we don't really have a guarantee in our test setup because Batch has no way to force collocation of jobs (and even then we can't exactly force the runtimes to overlap). I suppose if there's an issue here it will bubble up as a nondeterministic failure. Not great, but perhaps good enough for now?
Right now we run dataproc tests only on release, not on every commit, because they're too expensive/slow. That way we never release a version that can't pass. I wonder if that's also the right strategy here -- adding QoB release tests for things that only go wrong at scale. That said, I don't want to block on that. Awesome change, thank you!
Give me a day to look this over one more time.
Just some small things.
@@ -157,6 +159,9 @@ class ServiceBackend(
      if (backendContext.workerMemory != "None") {
        resources = resources.merge(JObject(("memory" -> JString(backendContext.workerMemory))))
      }
      if (backendContext.storageRequirement != "0Gi") {
Just double checking -- do we control this number and are certain it will be "0Gi" and not "0", "0Mi", etc.?
Yes, this number is set in service_backend.py. There is also no need for any compatibility requirements between service_backend.py and ServiceBackend.scala, so this can change as we wish as long as they agree with each other in the same commit.
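The string comparison against "0Gi" relies on both sides being set in the same commit, as noted above. For illustration only, a hypothetical normalizer (not a helper in the Hail codebase) could make such a guard insensitive to whether the zero value is spelled "0", "0Mi", or "0Gi":

```python
import re

# Binary-suffix multipliers. This is a hypothetical sketch, not Hail's parser.
_SUFFIXES = {'': 1, 'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4}

def storage_bytes(s: str) -> int:
    """Parse strings like '0', '0Gi', or '10Mi' into a byte count."""
    m = re.fullmatch(r'(\d+)(Ki|Mi|Gi|Ti)?', s.strip())
    if m is None:
        raise ValueError(f'bad storage string: {s!r}')
    return int(m.group(1)) * _SUFFIXES[m.group(2) or '']
```

With this, a guard like `storage_bytes(req) > 0` treats every spelling of zero the same, at the cost of a parser that both sides must agree on.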
    # and there is no persistent backend to keep in sync.
    # Sequence and liftover information are passed on RPC
    def add_sequence(self, name, fasta_file, index_file):  # pylint: disable=unused-argument
        # FIXME Not only should this be in the cloud, it should be in the *right* cloud
I thought I had made it easy to check this. Can we modify self.validate_file_scheme or add a new method that asserts certain URI schemes for a particular cloud? My VEP branch exposes a cloud endpoint to the front end, so maybe we do this separately...
Ya I don't think we can do this until the client knows which cloud it is submitting jobs to, so I think best to come back to this after your VEP PR.
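Once the client does know which cloud it is submitting to, the check discussed above could be a simple scheme lookup. A minimal sketch, assuming a hypothetical cloud-to-scheme mapping (the names and the real validate_file_scheme in Hail may differ):

```python
from urllib.parse import urlparse

# Hypothetical mapping; illustrative only, not Hail's actual configuration.
CLOUD_SCHEMES = {
    'gcp': {'gs'},
    'azure': {'https', 'hail-az'},
    'aws': {'s3'},
}

def validate_file_scheme(cloud: str, uri: str) -> None:
    """Raise if `uri` does not use a storage scheme served by `cloud`."""
    scheme = urlparse(uri).scheme
    allowed = CLOUD_SCHEMES[cloud]
    if scheme not in allowed:
        raise ValueError(
            f'{uri!r} has scheme {scheme!r}; cloud {cloud!r} expects one of {sorted(allowed)}'
        )
```

For example, `validate_file_scheme('gcp', 's3://bucket/ref.fasta')` would raise, catching a wrong-cloud FASTA path at submission time rather than at job runtime.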
    if config['mounted']:
        bucket = config['bucket']
        assert bucket
        mount_path = self.cloudfuse_data_path(bucket)
Can we set config['mounted'] to False here?
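One reason to flip the flag: it makes a retried cleanup idempotent. A toy sketch of that shape (hypothetical class and helper names, not Batch's actual worker code):

```python
# Sketch of an idempotent cloudfuse cleanup step. Hypothetical shape only.
class Worker:
    def __init__(self):
        self.unmounted = []  # records unmount calls, stand-in for real syscalls

    def cloudfuse_data_path(self, bucket: str) -> str:
        return f'/cloudfuse/{bucket}'  # assumed path convention

    def unmount(self, path: str) -> None:
        self.unmounted.append(path)

    def cleanup_cloudfuse(self, config: dict) -> None:
        if config['mounted']:
            bucket = config['bucket']
            assert bucket
            mount_path = self.cloudfuse_data_path(bucket)
            self.unmount(mount_path)
            # Flip the flag so a retried cleanup is a no-op rather than
            # a second unmount of the same path.
            config['mounted'] = False
```

Running `cleanup_cloudfuse` twice on the same config then performs exactly one unmount.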
@@ -2509,6 +2668,8 @@ def __init__(self, client_session: httpx.ClientSession):
        self.headers: Optional[Dict[str, str]] = None
        self.cloudfuse_mount_manager = ReadOnlyCloudfuseManager()
Do we need a shutdown method for the mount_manager?
I don't think so. When the worker exits, all jobs that were running on the worker should have gone through their cleanup
step and unmounted their respective buckets, so there should be no remaining work for the ReadOnlyCloudfuseManager
to do.
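The reasoning above — that a shared read-only mount needs no shutdown hook if every job pairs its mount with an unmount — can be illustrated with a toy refcounted manager (hypothetical; the real ReadOnlyCloudfuseManager is async and performs actual mounts):

```python
import threading
from collections import defaultdict

# Toy refcounted mount manager: first user mounts, last user unmounts.
class RefCountedMounts:
    def __init__(self):
        self._lock = threading.Lock()
        self._refs = defaultdict(int)
        self.mounted = set()

    def mount(self, bucket: str) -> None:
        with self._lock:
            if self._refs[bucket] == 0:
                self.mounted.add(bucket)  # first user performs the mount
            self._refs[bucket] += 1

    def unmount(self, bucket: str) -> None:
        with self._lock:
            self._refs[bucket] -= 1
            if self._refs[bucket] == 0:
                self.mounted.discard(bucket)  # last user unmounts
```

If every job's cleanup calls `unmount` exactly once per `mount`, the refcounts drain to zero before the worker exits, which is why no separate shutdown method is needed.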
I left the changes to Query and Batch in separate commits for ease of review. I put these in the same PR because we don't really have standalone testing for JVM Jobs outside of Query-on-Batch, so the FASTA use-case serves as a test here that cloudfuse is working properly for JVM Jobs. Would be great if Jackie could review the batch commit and Tim could review the query commit.
Hail Query
- FROM_FASTA_FILE rpc: the service backend now passes sequence file information from RGs in every rpc.
- FastaSequenceIndex just loads the whole file on construction so might as well stream it in from whatever storage it's in.

Hail Batch
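The point about FastaSequenceIndex loading the whole file on construction holds because a .fai index is just a small tab-separated table (contig name, length, byte offset, bases per line, bytes per line). A minimal sketch of parsing one from any text stream, whether a local file or a cloud blob (illustrative only, not htsjdk's or Hail's implementation):

```python
import io
from typing import Dict, NamedTuple

class FaiEntry(NamedTuple):
    length: int      # total bases in the contig
    offset: int      # byte offset of the contig's first base in the FASTA
    line_bases: int  # bases per FASTA line
    line_width: int  # bytes per FASTA line (bases plus newline)

def parse_fai(stream: io.TextIOBase) -> Dict[str, FaiEntry]:
    """Parse a .fai index from any text stream, local or remote."""
    index = {}
    for line in stream:
        name, length, offset, line_bases, line_width = line.split('\t')[:5]
        index[name] = FaiEntry(int(length), int(offset), int(line_bases), int(line_width))
    return index
```

Since the whole table is consumed up front, there is no need for seekable local storage; streaming the bytes from wherever the index lives is enough.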