
Clarify documentation for GPU-based AWS jobs #13969

Open
2 tasks done
DbCrWk opened this issue Jun 12, 2024 · 6 comments
Labels: docs, great writeup, integrations

Comments

DbCrWk commented Jun 12, 2024

First check

  • I added a descriptive title to this issue.
  • I used GitHub search to find a similar request and didn't find it 😇

Describe the issue

We use Prefect to launch ECS-based jobs. Our CPU-only jobs work great on push work pools. However, we also have several jobs that need GPUs. It has been very difficult to set this up properly, and there is no single source of documentation that covers the end-to-end flow. In particular, our expectation is that we should be able to easily spin up GPU-based jobs on ECS and correctly autoscale the number of EC2 instances, including winding down to 0 instances when there are no jobs.

Here's what we've figured out (please correct us if there's a better way). We're happy to contribute documentation, sample code, and terraform templates for our solution:

  1. You cannot use a serverless push work pool; instead you must use the hybrid AWS ECS work pool. This fact is only gently hinted at, as the logical consequence of two things:
    a. AWS Fargate does not support GPU-based machines, see: AWS Fargate GPU Support: When is GPU support coming to fargate? aws/containers-roadmap#88
    b. AWS ECS push work pools only support Fargate, see: https://docs.prefect.io/latest/concepts/work-pools/

AWS Elastic Container Service - Push: Execute flow runs within containers on AWS ECS. Works with existing ECS clusters and serverless execution via AWS Fargate.

  2. You have to set up the following resources:
    a. An ECS cluster for a Prefect worker. We recommend setting up a dedicated ECS cluster for just this worker.
    b. An appropriate auto scaling group (ASG) that spins up very carefully configured EC2 instances. This ASG has to be set up exactly right, with the right AMIs, because of the vagaries of ECS, see here, here, and here. The desired capacity should be set to 0.
    c. An ECS cluster for GPU-based jobs, with the previous ASG attached as a capacity provider, also with a very specific configuration (a sketch of this wiring follows after the list).

  3. Most importantly, you have to set a capacity provider strategy and not a launch type. You can set this on the work pool or on a deployment itself. This fact is not documented directly; instead, it is a logical consequence of the fact that the AWS RunTask API will, for whatever reason, ignore a capacity provider if a launch type is set, see: here and here.

When you use cluster auto scaling, you must specify capacityProviderStrategy and not launchType.

However, it is unclear from the relevant Prefect documentation that you actually cannot specify a launch type at all; if you do, submitting the flow run to the infrastructure fails with errors (the deployment sketch further below shows which variables to set).
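To make point 2(c) concrete, here is a hedged boto3 sketch of the cluster/capacity-provider wiring (Terraform works just as well). The ASG ARN, capacity provider name, and cluster name are placeholders, and the managed-scaling settings are only one reasonable choice:

```python
# Sketch only: attach a pre-existing GPU auto scaling group (desired capacity 0)
# to the GPU job cluster as a capacity provider, so ECS can scale it up from
# zero when tasks arrive and back down when they finish.
import boto3

ecs = boto3.client("ecs")

# Capacity provider backed by the GPU ASG. The ARN below is a placeholder.
ecs.create_capacity_provider(
    name="gpu-jobs-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:...",
        "managedScaling": {
            "status": "ENABLED",          # let ECS drive the ASG between 0 and max
            "targetCapacity": 100,
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 1,
        },
        "managedTerminationProtection": "DISABLED",
    },
)

# Associate it with the GPU job cluster and make it the default strategy.
ecs.put_cluster_capacity_providers(
    cluster="gpu-jobs-cluster",           # placeholder cluster name
    capacityProviders=["gpu-jobs-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "gpu-jobs-capacity-provider", "weight": 1, "base": 0}
    ],
)
```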

The final fact seems to have caused a lot of confusion.
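To avoid that confusion, here is a minimal sketch of a deployment that sets a capacity provider strategy and deliberately leaves the launch type unset. It assumes the capacity_provider_strategy and launch_type job variables exposed by the prefect-aws ECS work pool's base job template; the work pool, cluster, capacity provider, and image names are placeholders:

```python
# Sketch only: deploy a flow to a hybrid ECS work pool so the worker submits
# RunTask with a capacity provider strategy and no launch type.
from prefect import flow


@flow(log_prints=True)
def train_on_gpu():
    print("running inside a GPU container instance")


if __name__ == "__main__":
    train_on_gpu.deploy(
        name="gpu-training",
        work_pool_name="ecs-gpu-pool",            # hybrid ECS work pool, not a push pool
        image="my-registry/gpu-image:latest",     # placeholder CUDA-enabled image
        job_variables={
            "cluster": "gpu-jobs-cluster",        # the cluster with the GPU capacity provider
            "capacity_provider_strategy": [
                {"capacityProvider": "gpu-jobs-capacity-provider", "weight": 1, "base": 0}
            ],
            # Important: do not set "launch_type" here or on the work pool;
            # RunTask ignores the capacity provider strategy when a launch type
            # is also supplied.
        },
    )
```

The same variables can also be set once on the work pool's base job template instead of per deployment.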

Describe the proposed change

We would recommend that:

  1. There should be a dedicated page for best practices with GPU-based jobs on AWS.
  2. The fact that EC2-based jobs need a hybrid work pool should be made more explicit.
  3. The fact that you need a capacity provider strategy and not a launch type should be made very clear in the relevant pages on work pools and the AWS integration.
  4. The sample terraform templates should be updated to include an end-to-end setup for GPU-based jobs.

Additional context

No response

zzstoatzz added the great writeup and integrations labels and removed the needs:triage label on Jun 12, 2024
zzstoatzz (Collaborator) commented

hey @DbCrWk

We're happy to contribute documentation

any updates to the existing guides + a specialized guide with more of your exact situation would be super appreciated!

let us know if you need any help with the contribution process or have any questions!

DbCrWk (Author) commented Jun 12, 2024

How should I provide an update? Do you want documentation + a terraform template?
@zzstoatzz

discdiver (Contributor) commented

Thank you @DbCrWk! Re: the docs, we're updating our contributing section, so the README here is probably most useful at the moment.

DbCrWk (Author) commented Jun 20, 2024

@discdiver cool, I'll work on it over the next week or so!

JamiePlace commented
@DbCrWk did you manage to get this written up somewhere? It's exceedingly tricky to find information about this. I would have thought running GPU workloads like this would be more prevalent!

DbCrWk (Author) commented Dec 2, 2024

@DbCrWk did you manage to get this written up somewhere? It's exceedingly tricky to find information about this. I would have thought running GPU workloads like this would be more prevalent!

@JamiePlace I wrote this up internally, but didn't close the loop and post it here. I need to remove the stuff that's specific to us, but I'll have bandwidth to post it this week! It's a bit tricky because we have to:

  1. Set up an ASG for GPU instances
  2. Use a custom ECS task definition (sketched below)

I also would've assumed there were more GPU workloads, though I'm not sure how many people launch GPUs via ECS. Maybe it's more common to use EC2 directly or custom workstations/clusters/machines?
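For anyone landing here in the meantime, here is a hedged boto3 sketch of the kind of custom task definition meant in point 2 above: an EC2-compatible task that reserves a GPU via resourceRequirements. The family, image, and role ARN are placeholders:

```python
# Sketch only: register an EC2-launch task definition that reserves one GPU.
# GPU tasks cannot run on Fargate, so requiresCompatibilities is ["EC2"].
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="prefect-gpu-flow",                        # placeholder family name
    requiresCompatibilities=["EC2"],
    networkMode="awsvpc",
    cpu="4096",
    memory="16384",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "prefect",
            "image": "my-registry/gpu-image:latest",  # placeholder CUDA-enabled image
            "resourceRequirements": [
                {"type": "GPU", "value": "1"}         # reserve one GPU on the instance
            ],
        }
    ],
)
```

I believe the resulting task definition ARN can then be referenced from the work pool or deployment via the ECS worker's task_definition_arn job variable, so the worker reuses it instead of registering its own.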
