[batch] Dont use Batch identity on workers #12611

Merged: 3 commits merged into hail-is:main on Jan 27, 2023

Conversation

@daniel-goldstein (Contributor) commented Jan 19, 2023:

  • Use the worker VM's credentials for the FS and compute client instead of requesting credentials from the Batch driver
  • This allows creating the FS before worker activation, meaning that the same FS can be passed to JVMContainers instead of needing to create a new LocalAsyncFS for JVM objects
  • Thread the task_manager through more of worker.py so the worker can properly cancel all tasks on shutdown

@daniel-goldstein changed the title from "use one session everywhere" to "[batch] Dont use Batch identity on workers" on Jan 19, 2023
jigold previously requested changes on Jan 23, 2023
instance_name: str,
size_in_gb: int,
mount_path: str,
compute_client: aiogoogle.GoogleComputeClient, # BORROWED
@jigold (Contributor) commented Jan 23, 2023:

Can you explain the changes you made for borrowing the compute client for creating the attached GCPDisks? I thought we created a new compute client per disk for a reason. I can try and go back through the git history to find it.

daniel-goldstein (Author):

I wasn't aware that there was a specific reason to create a new ComputeClient per disk. The reason I made this change was so that we wouldn't have O(disks) GoogleSessions, each with their own access tokens. If you can find the reason, I'll undo this change and add a comment.
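A minimal sketch of the borrowed-client pattern under discussion; the GCPDisk name follows the diff context above, while the constructor shape and the delete_disk call are assumptions for illustration rather than the actual worker.py code:

```python
# Sketch only: illustrates a shared ("borrowed") compute client, not the real worker.py code.

class GCPDisk:
    def __init__(self, name: str, size_in_gb: int, compute_client):  # compute_client is BORROWED
        self.name = name
        self.size_in_gb = size_in_gb
        # The disk does not own the client: it must never close it, and the owner
        # (the Worker) must keep the client alive until every disk is done with it.
        self.compute_client = compute_client

    async def delete(self):
        await self.compute_client.delete_disk(self.name)  # hypothetical call for illustration


# Before: O(disks) clients, each with its own GoogleSession and access token.
#   disks = [GCPDisk(name, size, new_compute_client()) for name, size in specs]
# After: one client owned by the Worker, borrowed by every disk.
#   disks = [GCPDisk(name, size, worker_compute_client) for name, size in specs]
```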

jigold:

I'm still looking, but the very original implementation had a borrowed compute_client:
https://github.com/hail-is/hail/pull/10090/files#diff-f8ba97f763395908a5b67f47a630c98e8d223ca5914f18d588f405d629d52197R958

jigold:

Found it! #10811

daniel-goldstein (Author):

Oh look it's me!

daniel-goldstein (Author):

I think this was happening because, at the time (and maybe until this PR), we had a lot of job cleanup that wasn't being waited on by the task manager during worker shutdown. So the worker object closed the compute client while it was still being used by Disks. I think the proper fix here is to make sure the Worker properly waits for all usages of the compute client to finish before deleting it, instead of just using more clients.

jigold:

I'm pretty sure you are correct that this change will be fine, but any change to the Disk handling code gives me anxiety because of the potential financial cost if we mess up and overlook something. Can we make sure the disk monitoring and alerting in Grafana is up to date? At one point, I had a chart that alerted if there was more than X amount of storage in disks in use or X number of disks. I think it's $0.17 / GB / month, so we probably want to be alerted if we are spending more than $20 per hour on disk costs, which is about 87 TB (please double-check the math). So we could probably alert on 25 TB or 1,000 disks, as that will catch slower accumulations over time. What do you think?
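A quick back-of-the-envelope check of those numbers, using the $0.17/GB/month figure from the comment and an assumed ~730-hour month:

```python
# Rough check of the figures above; the $0.17/GB/month rate and the $20/hour
# threshold come from the comment, the 730 hours/month average is an assumption.
price_per_gb_month = 0.17                               # USD per GB per month
hours_per_month = 730                                   # ~365 * 24 / 12

budget_per_hour = 20.0                                  # USD/hour alert threshold
budget_per_month = budget_per_hour * hours_per_month    # ≈ $14,600 / month
gb_at_budget = budget_per_month / price_per_gb_month    # ≈ 85,882 GB
print(f"{gb_at_budget / 1000:.0f} TB")                  # ≈ 86 TB, close to the quoted 87 TB
```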

daniel-goldstein (Author):

I'm all for making us feel confident in the change, so I'll put in whatever metric we need. Could you elaborate, though, on what could go wrong? I'd rather have an alert mean that something is wrong in our system, not just that our scale is large.

jigold:

What went wrong before is that we were not cleaning up disks, and they slowly accumulated until we were paying for exabytes of disks -- I think we had over 100k 10Gi disks and didn't realize it. I'm worried that if the compute client suddenly stops working because the client session is closed (or because of some unknown issue with this change), then we're going to slowly accumulate orphaned disks. There's a garbage collection loop on the driver that also checks for orphaned disks and deletes them, so there is redundancy here. Maybe we put an alert on how many orphaned disks we are cleaning up?

@daniel-goldstein (Author) commented Jan 24, 2023:

Maybe we put an alert on how many orphaned disks we are cleaning up?

This sounds great, I'll set up this alert.
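A rough illustration of what such an alert could be built on; the metric name, the GC helper, and the delete call are made up for the sketch and are not the actual Batch driver code:

```python
# Sketch only: assumes prometheus_client is available; the metric name and the
# GC loop shown here are illustrative, not the actual Batch driver code.
from prometheus_client import Counter

ORPHANED_DISKS_CLEANED = Counter(
    'batch_orphaned_disks_cleaned_total',
    'Number of orphaned disks deleted by the driver GC loop',
)

async def gc_orphaned_disks(compute_client, list_orphaned_disks):
    # list_orphaned_disks is a hypothetical helper returning the names of disks to delete
    for disk_name in await list_orphaned_disks():
        await compute_client.delete_disk(disk_name)  # hypothetical call for illustration
        ORPHANED_DISKS_CLEANED.inc()

# A Grafana/Prometheus alert could then fire on something like:
#   increase(batch_orphaned_disks_cleaned_total[1h]) > SOME_THRESHOLD
```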

jigold previously requested changes on Jan 24, 2023
@@ -23,21 +23,6 @@ def get_identity_client(credentials_file: Optional[str] = None):
return aiogoogle.GoogleIAmClient(project, credentials_file=credentials_file)


def get_compute_client(credentials_file: Optional[str] = None):
jigold:

What is your rationale for refactoring this code? I thought it would be good to have all of the cloud abstractions for getting a compute client in the same place as the storage client. Otherwise, I thought it would be prone to error to have this same code called in multiple places.

daniel-goldstein (Author):

I thought it would be good to have all of the cloud abstractions for getting a compute client in the same place as the storage client.

I personally don't see the benefit here. I'm not sure what you mean by "this same code", or why the storage and compute client getters need to be colocated. If there's some similarity between the two, I'd rather codify it in the types of the code rather than its position. get_compute_client is only called in the worker process, so I don't see a reason at the moment for this code to have a higher level of visibility than that.

However, my main motivations for moving this are the following two (a small sketch of the contrast follows this list):

  1. I personally find this pattern of repeatedly checking get_global_config()['cloud'] very noisy. It increases the surface area that this environment variable reaches, and in the case of the worker it mixes two design choices:
     • having an object-oriented WorkerAPI class that knows cloud-specific implementation details, and
     • a more function-based approach like this one that switches on global state to produce different implementations.
     I'd rather have only one of these approaches, and I like the former in the case of the worker because worker.py is already very object-oriented and because we only have to check the cloud once. When I trace the flow of the program from the worker main method through to some small detail, it is very noisy to switch between cloud-specific and cloud-agnostic methods in the same call stack.

  2. I wanted to share the cloud credentials object between the FS and the compute client. I think it would be awkward to introduce a CloudSession input to this method because it would involve an upcast followed by a downcast, which worsens my issue with multiple spots on the same codepath trying to figure out which cloud we're in. So at that point I didn't see much value in keeping the function, and I moved the appropriate parts of this code closer to where the corresponding inputs are.
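A self-contained sketch of the two styles contrasted above; the class and function names are illustrative stand-ins rather than the actual Hail code:

```python
# Sketch only: contrasts the two styles described in the comment above.
import os

GLOBAL_CONFIG = {'cloud': os.environ.get('CLOUD', 'gcp')}

class GCPComputeClient:
    pass

class AzureComputeClient:
    pass

# Style 1 (function-based): every helper re-checks global state to pick an implementation.
def get_compute_client():
    cloud = GLOBAL_CONFIG['cloud']       # the global switch, repeated per helper
    if cloud == 'gcp':
        return GCPComputeClient()
    assert cloud == 'azure'
    return AzureComputeClient()

# Style 2 (object-oriented): the cloud is resolved once; cloud-specific details live
# behind a WorkerAPI object that the rest of worker.py uses uniformly.
class GCPWorkerAPI:
    def __init__(self):
        self.compute_client = GCPComputeClient()    # created once, shared thereafter

class AzureWorkerAPI:
    def __init__(self):
        self.compute_client = AzureComputeClient()

worker_api = GCPWorkerAPI() if GLOBAL_CONFIG['cloud'] == 'gcp' else AzureWorkerAPI()
```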

jigold:

OK, if this makes more sense to you. I think if we have multiple applications (not just Batch) that need a compute client in the future, then we can revisit what the right abstraction is.

@@ -2104,15 +2104,6 @@ def __str__(self):
)


@contextmanager
jigold:

Was this not used at all?

daniel-goldstein (Author):

Not that I could find by grepping around.

assert instance_config.cores == CORES
assert instance_config.cloud == CLOUD

task_manager = aiotools.BackgroundTaskManager()
jigold:

I don't see where the task manager is closed on shutdown.

daniel-goldstein (Author):

Ah, it used to be owned by the worker, so it was shut down in Worker.shutdown, but now that I've moved it out I should have the worker borrow it instead and shut it down in the main method.

daniel-goldstein (Author):

OK, so the reason I had initially moved the task_manager out of the Worker was so that I could give it to the NetworkAllocator as well. However, that created a mess because the TaskManager shutdown was hard to separate cleanly from the Worker shutdown, since the ordering is so important. So I undid my change of moving the TaskManager out of the Worker and instead just created a separate task manager for the NetworkAllocator.

I'm going to follow this up with another PR that sorts out the whole shutdown situation and properly uses one task manager with context managers, but that got too big to also include in this PR.
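A self-contained sketch (plain asyncio rather than hailtop.aiotools, with made-up class names) of why that shutdown ordering matters:

```python
# Sketch only: background tasks that borrow the compute client must finish
# before the client is closed, otherwise they run against a dead client.
import asyncio

class FakeComputeClient:
    def __init__(self):
        self.closed = False

    async def delete_disk(self, name):
        assert not self.closed, 'client used after close'
        await asyncio.sleep(0.1)  # stand-in for a real API call

    async def close(self):
        self.closed = True

async def main():
    client = FakeComputeClient()
    background_tasks = [asyncio.create_task(client.delete_disk(f'disk-{i}')) for i in range(3)]

    # Correct order: wait for everything that borrows the client, then close it.
    await asyncio.gather(*background_tasks)
    await client.close()
    # Reversing those two steps is the failure mode discussed above:
    # the disk deletions would hit a closed client and the disks would leak.

asyncio.run(main())
```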

@danking merged commit 9f03cf4 into hail-is:main on Jan 27, 2023