
[qob] Support loading RGs from FASTA files #12736

Merged: 12 commits merged into hail-is:main on Apr 10, 2023

Conversation

@daniel-goldstein (Contributor) commented Feb 27, 2023

I left the changes to Query and Batch in separate commits for ease of review. I put them in the same PR because we don't really have standalone testing for JVM jobs outside of Query-on-Batch, so the FASTA use case serves here as a test that cloudfuse works properly for JVM jobs. It would be great if Jackie could review the Batch commit and Tim could review the Query commit.

Hail Query

  • Added support for the FROM_FASTA_FILE RPC; the service backend now passes sequence-file information from reference genomes in every RPC (see the sketch after this list).
  • Refactored liftover handling in service_backend so liftover maps are no longer stored redundantly; they are taken directly from the ReferenceGenome objects, as is now done for sequence files. This means the add/remove liftover/sequence functions on the Backend exist only to keep the backend in sync with Python, which is a no-op for the service backend.
  • Don't localize the index file in fromFASTAFile/addSequence before creating the index object. FastaSequenceIndex loads the whole file on construction, so we might as well stream it in from whatever storage it lives in.
  • FASTA caching is left alone because those files are mounted into and unmounted from the JVM container over the life of the job. The JVM doesn't have to worry about disk usage because that is handled by Batch XFS quotas, as long as the service backend requests enough storage to fit the FASTA file. Batch makes sure that a given bucket (and therefore a given FASTA file) is mounted once per user on a Batch worker.
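
Here is a minimal sketch of the per-RPC reference-genome payload described above. The function name and the ReferenceGenome attributes (_config, _sequence_files, _liftovers) are illustrative assumptions, not the actual names in service_backend.py:

from typing import Any, Dict


def rpc_reference_genome_payload(reference_genomes) -> Dict[str, Any]:
    # Illustrative sketch: every RPC ships the full reference-genome
    # configuration, including FASTA/index paths and liftover chain files,
    # so the JVM side can be stateless and nothing has to be "kept in sync".
    payload: Dict[str, Any] = {
        'custom_references': [],
        'sequences': {},   # rg name -> (fasta_file, index_file)
        'liftovers': {},   # rg name -> {destination rg name: chain file}
    }
    for rg in reference_genomes:
        payload['custom_references'].append(rg._config)       # assumed attribute
        if rg._sequence_files is not None:                     # assumed attribute
            payload['sequences'][rg.name] = rg._sequence_files
        payload['liftovers'][rg.name] = rg._liftovers          # assumed attribute
    return payload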

Hail Batch

  • Added support for read-only cloudfuse mounts for JVM jobs.
  • These mounts are shared between jobs from the same user on the same machine (see the sketch after this list).
  • I did not change DockerJobs, but they could easily be adapted to use this new mount-sharing code.
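
A rough sketch of the mount-sharing idea for JVM jobs, assuming a reference-counting design. The class name matches the ReadOnlyCloudfuseManager mentioned later in the review, but the structure and method signatures here are guesses, not the actual worker.py implementation:

import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, Tuple


class ReadOnlyCloudfuseManager:
    """Hypothetical sketch: share one read-only FUSE mount per (user, bucket)."""

    def __init__(self):
        self._refcounts: Dict[Tuple[str, str], int] = defaultdict(int)
        self._locks: Dict[Tuple[str, str], asyncio.Lock] = defaultdict(asyncio.Lock)

    async def mount(self, user: str, bucket: str,
                    do_mount: Callable[[], Awaitable[None]]):
        key = (user, bucket)
        async with self._locks[key]:
            if self._refcounts[key] == 0:
                await do_mount()          # only the first job actually mounts
            self._refcounts[key] += 1

    async def unmount(self, user: str, bucket: str,
                      do_unmount: Callable[[], Awaitable[None]]):
        key = (user, bucket)
        async with self._locks[key]:
            self._refcounts[key] -= 1
            if self._refcounts[key] == 0:
                await do_unmount()        # only the last job actually unmounts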

@jigold (Contributor) previously requested changes Mar 17, 2023 and left a comment:

This looks great! Will give it another look on Monday.

(4 resolved review threads on batch/batch/worker/worker.py)
@daniel-goldstein (Contributor, Author) commented Mar 21, 2023

@tpoterba I just realized I forgot to propagate the FUSE config through to worker jobs. Should I be doing that, and did I just get lucky that the single existing test does everything driver-side? Or is there a test we can write to ensure that worker jobs access the FASTA data?

@danking (Contributor) commented Mar 27, 2023

@daniel-goldstein, collectDArray in BackendUtils.scala has an optimization for single-partition jobs: they are run on the driver.

@danking (Contributor) commented Mar 27, 2023

Admittedly a debatable choice, but worthwhile in the context of tests that generate tiny pipelines dominated by CDA overhead.
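
Schematically, the optimization being described looks like this. The real code is Scala, in collectDArray in BackendUtils.scala; this Python-style sketch with made-up function names only illustrates the dispatch decision:

def collect_d_array(contexts, run_partition_on_driver, submit_batch_jobs):
    # Hypothetical sketch: with a single partition, the cost of submitting a
    # Batch job is pure overhead, so the partition runs in the driver JVM.
    if len(contexts) == 1:
        return [run_partition_on_driver(contexts[0])]
    # Otherwise fan out to worker JVM jobs via Batch.
    return submit_batch_jobs(contexts)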

@daniel-goldstein (Contributor, Author) commented:

@tpoterba I fixed the config issue and changed n_partitions to ensure that workers are scheduled for the FASTA reading. I tested this on a single Batch worker so the jobs overlapped and exercised the shared-mount code, but our test setup gives no real guarantee of that, because Batch has no way to force collocation of jobs (and even then we can't force their runtimes to overlap). I suppose if there's an issue here it will bubble up as a nondeterministic failure. Not great, but perhaps good enough for now?
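
For reference, the kind of test being described looks roughly like the following. The bucket paths are placeholders, and this is not the literal test added in the PR:

import hail as hl

rg = hl.get_reference('GRCh38')
# Hypothetical cloud paths; the real test points at Hail's hosted FASTA files.
rg.add_sequence('gs://some-bucket/GRCh38.fasta.gz',
                'gs://some-bucket/GRCh38.fasta.fai')

# More than one partition, so partitions are dispatched to worker JVM jobs
# instead of being absorbed by the single-partition driver-side optimization.
ht = hl.utils.range_table(100, n_partitions=10)
ht = ht.annotate(
    base=hl.locus('chr1', 100_000 + ht.idx, reference_genome='GRCh38').sequence_context()
)
ht._force_count()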

@tpoterba (Contributor) commented Apr 6, 2023

Right now we run dataproc tests only on release, not on every commit, because they're too expensive/slow. That way we never release a version that can't pass. I wonder if that's also the right strategy here -- adding QoB release tests for things that only go wrong at scale.

That said, I don't want to block on that. Awesome change, thank you!

@jigold (Contributor) left a comment:


Give me a day to look this over one more time.

@jigold (Contributor) previously requested changes Apr 7, 2023 and left a comment:


Just some small things.

@@ -157,6 +159,9 @@ class ServiceBackend(
    if (backendContext.workerMemory != "None") {
      resources = resources.merge(JObject(("memory" -> JString(backendContext.workerMemory))))
    }
    if (backendContext.storageRequirement != "0Gi") {
jigold (Contributor):

Just double-checking -- do we control this number, and are we certain it will be "0Gi" and not "0", "0Mi", etc.?

daniel-goldstein (Contributor, Author):

Yes, this number is set in service_backend.py. There are also no compatibility requirements between service_backend.py and ServiceBackend.scala, so this can change however we like, as long as the two agree with each other in the same commit.
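
A small sketch of the convention described above, with an illustrative helper name (the actual code in service_backend.py may differ):

def storage_requirement_str(gib: int) -> str:
    # Always emit the 'Gi' suffix, even for zero, so the literal sentinel
    # check in ServiceBackend.scala (!= "0Gi") stays in lockstep with the
    # Python side that produced the string.
    return f'{gib}Gi'


assert storage_requirement_str(0) == '0Gi'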

    # and there is no persistent backend to keep in sync.
    # Sequence and liftover information are passed on RPC
    def add_sequence(self, name, fasta_file, index_file):  # pylint: disable=unused-argument
        # FIXME Not only should this be in the cloud, it should be in the *right* cloud
jigold (Contributor):

I thought I had made it easy to check this. Can we modify self.validate_file_scheme or add a new method that asserts certain URI schemes for a particular cloud? My VEP branch exposes a cloud endpoint to the front end so maybe we do this separately....

daniel-goldstein (Contributor, Author):

Ya, I don't think we can do this until the client knows which cloud it is submitting jobs to, so I think it's best to come back to this after your VEP PR.
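
For what it's worth, the kind of check being floated might look like the sketch below. The names (CLOUD_SCHEMES, validate_file_scheme_for_cloud) and the scheme table are hypothetical, and as the reply notes, this only becomes possible once the client knows which cloud it is submitting to:

# Hypothetical mapping; the real set of allowed schemes per cloud may differ.
CLOUD_SCHEMES = {
    'gcp': ('gs',),
    'azure': ('https',),
}


def validate_file_scheme_for_cloud(uri: str, cloud: str) -> None:
    scheme = uri.split('://', 1)[0] if '://' in uri else ''
    allowed = CLOUD_SCHEMES[cloud]
    if scheme not in allowed:
        raise ValueError(
            f'{uri!r} uses scheme {scheme!r}, but jobs submitted to {cloud} '
            f'can only read files with schemes {allowed}'
        )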

if config['mounted']:
    bucket = config['bucket']
    assert bucket
    mount_path = self.cloudfuse_data_path(bucket)
jigold (Contributor):

Can we set config['mounted'] to False here?
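
Rendered as a standalone function, the suggestion is essentially the following (a hypothetical helper, with the actual unmount call passed in); clearing the flag after a successful unmount keeps cleanup idempotent if it is retried:

async def cleanup_cloudfuse_mount(config: dict, unmount) -> None:
    # Sketch of the suggested change: after unmounting, record that the
    # bucket is no longer mounted so a re-run of cleanup is a no-op.
    if config['mounted']:
        bucket = config['bucket']
        assert bucket
        await unmount(bucket)
        config['mounted'] = False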

@@ -2509,6 +2668,8 @@ def __init__(self, client_session: httpx.ClientSession):

        self.headers: Optional[Dict[str, str]] = None

        self.cloudfuse_mount_manager = ReadOnlyCloudfuseManager()
jigold (Contributor):

Do we need a shutdown method for the mount_manager?

daniel-goldstein (Contributor, Author):

I don't think so. When the worker exits, all jobs that were running on the worker should have gone through their cleanup step and unmounted their respective buckets, so there should be no remaining work for the ReadOnlyCloudfuseManager to do.

@danking merged commit 39b1606 into hail-is:main on Apr 10, 2023