Clarify documentation for GPU-based AWS jobs #13969
hey @DbCrWk
any updates to the existing guides + a specialized guide with more of your exact situation would be super appreciated! let us know if you need any help with the contribution process or have any questions!
How should I provide an update? Do you want documentation + a Terraform template?
@discdiver cool, I'll work on it over the next week or so!
@DbCrWk did you manage to get this written up somewhere? It's exceedingly tricky to find information about this. I would have thought running GPU workloads like this would be more prevalent!
@JamiePlace I wrote this up internally, but didn't close the loop and post it here. I need to remove the stuff that's specific to us, but I'll have bandwidth to post it this week! It's a bit tricky because we have to:
I also would've assumed more GPU workloads, too, though I'm not sure how many people launch GPUs via ECS. Maybe it's more common to do EC2 or custom workstations/clusters/machines?
First check
Describe the issue
We use Prefect to launch ECS-based jobs. Our CPU-only jobs work great on the push work pools. However, we also have several jobs that need GPUs. It has been very difficult to set this up properly, and there is no single source of documentation that covers the end-to-end flow. In particular, our expectation is that we should be able to easily spin up GPU-based jobs on ECS and correctly autoscale the number of EC2 instances, including winding down to 0 instances when there are no jobs.
Here's what we've figured out (please correct us if there's a better way). We're happy to contribute documentation, sample code, and Terraform templates for our solution:
1. GPU-based jobs cannot run on push work pools:
   a. AWS Fargate does not support GPU-based machines, see: AWS Fargate GPU Support: When is GPU support coming to Fargate? aws/containers-roadmap#88
   b. Prefect's ECS push work pools only support Fargate, see: https://docs.prefect.io/latest/concepts/work-pools/
2. You have to set up the following resources (a minimal boto3 sketch of these pieces follows this list):
   a. An ECS cluster for a Prefect worker. We recommend a dedicated ECS cluster for just this worker.
   b. An appropriate autoscaling group (ASG) that spins up very carefully configured EC2 instances. Because of the vagaries of ECS, this ASG has to be set up with exactly the right AMIs, see here, here, and here. The desired capacity should be set to 0.
   c. An ECS cluster for the GPU-based jobs, with the previous ASG attached as a capacity provider, also with a very specific configuration.
3. Most importantly, you have to set a capacity provider strategy and not a launch type. You can set this on the work pool or on a deployment itself. This fact is not documented directly; instead it is a logical consequence of the fact that the AWS RunTask API will, for whatever reason, ignore a capacity provider if a launch type is set, see: here and here.
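For concreteness, here is a minimal boto3 sketch of points 2 and 3 above. This is not our exact setup: the region, names, ARNs, task definition, and managed-scaling settings are placeholders, and the ASG itself (with the right AMIs and networking) still has to be created separately, e.g. via Terraform.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder region

# 1. Register the GPU autoscaling group as a capacity provider.
#    Managed scaling is what lets ECS grow the ASG from 0 when GPU tasks are
#    queued and shrink it back to 0 afterwards.
ecs.create_capacity_provider(
    name="gpu-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:...",  # placeholder
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 1,
        },
        "managedTerminationProtection": "DISABLED",
    },
)

# 2. Create the GPU cluster with that capacity provider as its default strategy.
ecs.create_cluster(
    clusterName="gpu-jobs",
    capacityProviders=["gpu-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "gpu-capacity-provider", "weight": 1, "base": 0}
    ],
)

# 3. When running a task, pass capacityProviderStrategy and do NOT pass
#    launchType; if launchType is set, RunTask will not use the capacity provider.
ecs.run_task(
    cluster="gpu-jobs",
    taskDefinition="gpu-task:1",  # placeholder task definition
    capacityProviderStrategy=[
        {"capacityProvider": "gpu-capacity-provider", "weight": 1}
    ],
)
```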
However, the relevant Prefect documentation does not make it clear that you cannot specify a launch type in this setup; if you do, submitting the flow run to the infrastructure fails with errors.
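On the Prefect side, this is a rough sketch of what a deployment against a (hybrid, non-push) ECS work pool could look like. The work pool, image, cluster, and capacity provider names are placeholders, and the job-variable keys (`cluster`, `capacity_provider_strategy`, `launch_type`) should be checked against your ECS work pool's base job template, since they vary across prefect-aws versions.

```python
from prefect import flow


@flow(log_prints=True)
def train_model():
    print("training on a GPU instance")


if __name__ == "__main__":
    # Placeholder names throughout; adjust to your own work pool, registry, and cluster.
    train_model.deploy(
        name="gpu-training",
        work_pool_name="ecs-gpu-pool",  # a hybrid ECS work pool served by a worker, not a push pool
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/gpu-flows:latest",
        job_variables={
            "cluster": "gpu-jobs",
            # Point the task at the capacity provider backed by the GPU ASG...
            "capacity_provider_strategy": [
                {"capacityProvider": "gpu-capacity-provider", "weight": 1, "base": 0}
            ],
            # ...and do not set a launch type; depending on the prefect-aws version
            # you may need to clear it explicitly so it does not default to FARGATE.
        },
    )
```

The idea is that the worker then submits RunTask with the capacity provider strategy, which triggers the ASG's managed scaling, so EC2 instances only exist while GPU flow runs are executing.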
The final fact above seems to have caused a lot of confusion:
Describe the proposed change
We would recommend that:
Additional context
No response