[batch] Dont use Batch identity on workers #12611

Merged: 3 commits merged into hail-is:main on Jan 27, 2023

Conversation

@daniel-goldstein (Contributor) commented Jan 19, 2023:

  • Use the worker VM's credentials for the FS and compute client instead of requesting credentials from the Batch driver
  • This allows creating the FS before worker activation, meaning that the same FS can be passed to JVMContainers instead of needing to create a new LocalAsyncFS for JVM objects
  • Thread the task_manager through more of worker.py so the worker can properly cancel all tasks on shutdown

@daniel-goldstein changed the title from "use one session everywhere" to "[batch] Dont use Batch identity on workers" on Jan 19, 2023
jigold previously requested changes on Jan 23, 2023
instance_name: str,
size_in_gb: int,
mount_path: str,
compute_client: aiogoogle.GoogleComputeClient, # BORROWED
@jigold (Contributor) commented Jan 23, 2023:

Can you explain the changes you made for borrowing the compute client for creating the attached GCPDisks? I thought we created a new compute client per disk for a reason. I can try and go back through the git history to find it.

daniel-goldstein (Author):

I wasn't aware that there was a specific reason to create a new ComputeClient per disk. The reason I made this change was so that we wouldn't have O(disks) GoogleSessions, each with their own access tokens. If you can find the reason, I'll undo this change and add a comment.
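A minimal sketch of the borrowed-client pattern under discussion; the GCPDisk name follows the diff context above, while the constructor shape and the delete_disk call are assumptions for illustration rather than the actual worker.py code:

```python
# Sketch only: illustrates a shared ("borrowed") compute client, not the real worker.py code.

class GCPDisk:
    def __init__(self, name: str, size_in_gb: int, compute_client):  # compute_client is BORROWED
        self.name = name
        self.size_in_gb = size_in_gb
        # The disk does not own the client: it must never close it, and the owner
        # (the Worker) must keep the client alive until every disk is done with it.
        self.compute_client = compute_client

    async def delete(self):
        await self.compute_client.delete_disk(self.name)  # hypothetical call for illustration


# Before: O(disks) clients, each with its own GoogleSession and access token.
#   disks = [GCPDisk(name, size, new_compute_client()) for name, size in specs]
# After: one client owned by the Worker, borrowed by every disk.
#   disks = [GCPDisk(name, size, worker_compute_client) for name, size in specs]
```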

jigold:

I'm still looking, but the very original implementation had a borrowed compute_client:
https://github.com/hail-is/hail/pull/10090/files#diff-f8ba97f763395908a5b67f47a630c98e8d223ca5914f18d588f405d629d52197R958

jigold:

Found it! #10811

daniel-goldstein (Author):

Oh look it's me!

daniel-goldstein (Author):

I think this was happening because, at the time (and maybe until this PR), we had a lot of job cleanup that wasn't being waited on by the task manager during worker shutdown. So the worker object closed the compute client while it was still being used by Disks. I think the proper fix here is to make sure the Worker properly waits for all usages of the compute client to finish before deleting it, instead of just using more clients.

jigold:

I'm pretty sure you are correct that this change will be fine, but any change to the Disk handling code gives me anxiety because of the potential financial cost if we mess up and overlook something. Can we make sure the disk monitoring and alerting in Grafana is up to date? At one point, I had a chart that alerted if there was more than X amount of storage in disks in use or X number of disks. I think it's $0.17 / GB / month, so we probably want to be alerted if we are spending more than $20 per hour on disk costs, which is about 87 TB (please double-check the math). So we could probably alert on 25 TB or 1,000 disks, as that will catch slower accumulations over time. What do you think?
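A quick back-of-the-envelope check of those numbers, using the $0.17/GB/month figure from the comment and an assumed ~730-hour month:

```python
# Rough check of the figures above; the $0.17/GB/month rate and the $20/hour
# threshold come from the comment, the 730 hours/month average is an assumption.
price_per_gb_month = 0.17                               # USD per GB per month
hours_per_month = 730                                   # ~365 * 24 / 12

budget_per_hour = 20.0                                  # USD/hour alert threshold
budget_per_month = budget_per_hour * hours_per_month    # ≈ $14,600 / month
gb_at_budget = budget_per_month / price_per_gb_month    # ≈ 85,882 GB
print(f"{gb_at_budget / 1000:.0f} TB")                  # ≈ 86 TB, close to the quoted 87 TB
```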

daniel-goldstein (Author):

I'm all for making us feel confident in the change, so I'll put in whatever metric we need. Could you elaborate, though, on what could go wrong? I'd rather have an alert mean that something is wrong in our system, not just that our scale is large.

jigold:

What went wrong before is that we were not cleaning up disks, and they slowly accumulated until we were paying for exabytes of disks -- I think we had over 100k 10Gi disks and didn't realize it. I'm worried that if the compute client suddenly stops working because the client session is closed (or because of some unknown issue with this change), then we're going to slowly accumulate orphaned disks. There's a garbage collection loop on the driver that also checks for orphaned disks and deletes them, so there is redundancy here. Maybe we put an alert on how many orphaned disks we are cleaning up?

@daniel-goldstein (Author) commented Jan 24, 2023:

Maybe we put an alert on how many orphaned disks we are cleaning up?

This sounds great, I'll set up this alert.
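A rough illustration of what such an alert could be built on; the metric name, the GC helper, and the delete call are made up for the sketch and are not the actual Batch driver code:

```python
# Sketch only: assumes prometheus_client is available; the metric name and the
# GC loop shown here are illustrative, not the actual Batch driver code.
from prometheus_client import Counter

ORPHANED_DISKS_CLEANED = Counter(
    'batch_orphaned_disks_cleaned_total',
    'Number of orphaned disks deleted by the driver GC loop',
)

async def gc_orphaned_disks(compute_client, list_orphaned_disks):
    # list_orphaned_disks is a hypothetical helper returning the names of disks to delete
    for disk_name in await list_orphaned_disks():
        await compute_client.delete_disk(disk_name)  # hypothetical call for illustration
        ORPHANED_DISKS_CLEANED.inc()

# A Grafana/Prometheus alert could then fire on something like:
#   increase(batch_orphaned_disks_cleaned_total[1h]) > SOME_THRESHOLD
```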

jigold previously requested changes on Jan 24, 2023
@@ -23,21 +23,6 @@ def get_identity_client(credentials_file: Optional[str] = None):
return aiogoogle.GoogleIAmClient(project, credentials_file=credentials_file)


def get_compute_client(credentials_file: Optional[str] = None):
jigold:

What is your rationale for refactoring this code? I thought it would be good to have all of the cloud abstractions for getting a compute client in the same place as the storage client. Otherwise, I thought it would be prone to error to have this same code called in multiple places.

daniel-goldstein (Author):

I thought it would be good to have all of the cloud abstractions for getting a compute client in the same place as the storage client.

I personally don't see the benefit here. I'm not sure what you mean by "this same code", or why the storage and compute client getters need to be colocated. If there's some similarity between the two, I'd rather codify it in the types of the code rather than its position. get_compute_client is only called in the worker process, so I don't see a reason at the moment for this code to have a higher level of visibility than that.

However, my main motivations for moving this are the following two (a small sketch of the contrast follows this list):

  1. I personally find this pattern of repeatedly checking get_global_config()['cloud'] very noisy. It increases the surface area that this environment variable reaches, and in the case of the worker it mixes two design choices:
     • having an object-oriented WorkerAPI class that knows cloud-specific implementation details, and
     • a more function-based approach like this one that switches on global state to produce different implementations.
     I'd rather have only one of these approaches, and I like the former in the case of the worker because worker.py is already very object-oriented and because we only have to check the cloud once. When I trace the flow of the program from the worker main method through to some small detail, it is very noisy to switch between cloud-specific and cloud-agnostic methods in the same call stack.

  2. I wanted to share the cloud credentials object between the FS and the compute client. I think it would be awkward to introduce a CloudSession input to this method because it would involve an upcast followed by a downcast, which worsens my issue with multiple spots on the same codepath trying to figure out which cloud we're in. So at that point I didn't see much value in keeping the function, and I moved the appropriate parts of this code closer to where the corresponding inputs are.
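A self-contained sketch of the two styles contrasted above; the class and function names are illustrative stand-ins rather than the actual Hail code:

```python
# Sketch only: contrasts the two styles described in the comment above.
import os

GLOBAL_CONFIG = {'cloud': os.environ.get('CLOUD', 'gcp')}

class GCPComputeClient:
    pass

class AzureComputeClient:
    pass

# Style 1 (function-based): every helper re-checks global state to pick an implementation.
def get_compute_client():
    cloud = GLOBAL_CONFIG['cloud']       # the global switch, repeated per helper
    if cloud == 'gcp':
        return GCPComputeClient()
    assert cloud == 'azure'
    return AzureComputeClient()

# Style 2 (object-oriented): the cloud is resolved once; cloud-specific details live
# behind a WorkerAPI object that the rest of worker.py uses uniformly.
class GCPWorkerAPI:
    def __init__(self):
        self.compute_client = GCPComputeClient()    # created once, shared thereafter

class AzureWorkerAPI:
    def __init__(self):
        self.compute_client = AzureComputeClient()

worker_api = GCPWorkerAPI() if GLOBAL_CONFIG['cloud'] == 'gcp' else AzureWorkerAPI()
```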

jigold:

OK, if this makes more sense to you. I think if we have multiple applications (not just Batch) that need a compute client in the future, then we can revisit what the right abstraction is.

@@ -2104,15 +2104,6 @@ def __str__(self):
)


@contextmanager
jigold:

Was this not used at all?

daniel-goldstein (Author):

Not that I could find by grepping around.

assert instance_config.cores == CORES
assert instance_config.cloud == CLOUD

task_manager = aiotools.BackgroundTaskManager()
jigold:

I don't see where the task manager is closed on shutdown.

daniel-goldstein (Author):

Ah, it used to be owned by the worker, so it was shut down in Worker.shutdown, but now that I've moved it out I should have the worker borrow it instead and shut it down in the main method.

daniel-goldstein (Author):

OK, so the reason I had initially moved the task_manager out of the Worker was so that I could give it to the NetworkAllocator as well. However, that created a mess because the TaskManager shutdown was hard to separate cleanly from the Worker shutdown, since the ordering is so important. So I undid my change of moving the TaskManager out of the Worker and instead just created a separate task manager for the NetworkAllocator.

I'm going to follow this up with another PR that sorts out the whole shutdown situation and properly uses one task manager with context managers, but that got too big to also include in this PR.
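A self-contained sketch (plain asyncio rather than hailtop.aiotools, with made-up class names) of why that shutdown ordering matters:

```python
# Sketch only: background tasks that borrow the compute client must finish
# before the client is closed, otherwise they run against a dead client.
import asyncio

class FakeComputeClient:
    def __init__(self):
        self.closed = False

    async def delete_disk(self, name):
        assert not self.closed, 'client used after close'
        await asyncio.sleep(0.1)  # stand-in for a real API call

    async def close(self):
        self.closed = True

async def main():
    client = FakeComputeClient()
    background_tasks = [asyncio.create_task(client.delete_disk(f'disk-{i}')) for i in range(3)]

    # Correct order: wait for everything that borrows the client, then close it.
    await asyncio.gather(*background_tasks)
    await client.close()
    # Reversing those two steps is the failure mode discussed above:
    # the disk deletions would hit a closed client and the disks would leak.

asyncio.run(main())
```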

@danking merged commit 9f03cf4 into hail-is:main on Jan 27, 2023