
Clarify documentation for GPU-based AWS jobs #13969

Open
2 tasks done
DbCrWk opened this issue Jun 12, 2024 · 6 comments
Labels: docs, great writeup, integrations

Comments

DbCrWk commented Jun 12, 2024

First check

  • I added a descriptive title to this issue.
  • I used GitHub search to find a similar request and didn't find it 😇

Describe the issue

We use Prefect to launch ECS-based jobs. Our CPU-only jobs work great on push work pools. However, we also have several jobs that need GPUs. It has been very difficult to set this up properly, and there is no single source of documentation that covers the end-to-end flow. In particular, our expectation is that we should be able to easily spin up GPU-based jobs on ECS and correctly autoscale the number of EC2 instances, including winding down to 0 instances when there are no jobs.

Here's what we've figured out (please correct us if there's a better way). We're happy to contribute documentation, sample code, and terraform templates for our solution:

  1. You cannot use a serverless push work pool; instead you must use the hybrid AWS ECS work pool. This fact is only gently hinted at, as the logical consequence of two things:
    a. AWS Fargate does not support GPU-based machines, see: AWS Fargate GPU Support: When is GPU support coming to fargate? aws/containers-roadmap#88
    b. AWS ECS push work pools only support Fargate, see: https://docs.prefect.io/latest/concepts/work-pools/

AWS Elastic Container Service - Push: Execute flow runs within containers on AWS ECS. Works with existing ECS clusters and serverless execution via AWS Fargate.

  2. You have to set up the following resources:
    a. An ECS cluster for a Prefect worker. We recommend setting up a dedicated ECS cluster for just this worker.
    b. An appropriate auto scaling group (ASG) that spins up very carefully configured EC2 instances. This ASG has to be set up exactly right, with the right AMIs, because of the vagaries of ECS, see here, here, and here. The desired capacity should be set to 0.
    c. An ECS cluster for GPU-based jobs, with the previous ASG attached as a capacity provider, also with a very specific configuration (a sketch of this wiring follows after the list).

  3. Most importantly, you have to set a capacity provider strategy and not a launch type. You can set this on the work pool or on a deployment itself. This fact is not documented directly; instead, it is a logical consequence of the fact that the AWS RunTask API will, for whatever reason, ignore a capacity provider if a launch type is set, see: here and here.

When you use cluster auto scaling, you must specify capacityProviderStrategy and not launchType.

However, it is unclear from the relevant Prefect documentation that you actually cannot specify a launch type at all; if you do, submitting the flow run to the infrastructure fails with errors (the deployment sketch further below shows which variables to set).
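To make point 2(c) concrete, here is a hedged boto3 sketch of the cluster/capacity-provider wiring (Terraform works just as well). The ASG ARN, capacity provider name, and cluster name are placeholders, and the managed-scaling settings are only one reasonable choice:

```python
# Sketch only: attach a pre-existing GPU auto scaling group (desired capacity 0)
# to the GPU job cluster as a capacity provider, so ECS can scale it up from
# zero when tasks arrive and back down when they finish.
import boto3

ecs = boto3.client("ecs")

# Capacity provider backed by the GPU ASG. The ARN below is a placeholder.
ecs.create_capacity_provider(
    name="gpu-jobs-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:...",
        "managedScaling": {
            "status": "ENABLED",          # let ECS drive the ASG between 0 and max
            "targetCapacity": 100,
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 1,
        },
        "managedTerminationProtection": "DISABLED",
    },
)

# Associate it with the GPU job cluster and make it the default strategy.
ecs.put_cluster_capacity_providers(
    cluster="gpu-jobs-cluster",           # placeholder cluster name
    capacityProviders=["gpu-jobs-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "gpu-jobs-capacity-provider", "weight": 1, "base": 0}
    ],
)
```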

The final fact seems to have caused a lot of confusion.
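To avoid that confusion, here is a minimal sketch of a deployment that sets a capacity provider strategy and deliberately leaves the launch type unset. It assumes the capacity_provider_strategy and launch_type job variables exposed by the prefect-aws ECS work pool's base job template; the work pool, cluster, capacity provider, and image names are placeholders:

```python
# Sketch only: deploy a flow to a hybrid ECS work pool so the worker submits
# RunTask with a capacity provider strategy and no launch type.
from prefect import flow


@flow(log_prints=True)
def train_on_gpu():
    print("running inside a GPU container instance")


if __name__ == "__main__":
    train_on_gpu.deploy(
        name="gpu-training",
        work_pool_name="ecs-gpu-pool",            # hybrid ECS work pool, not a push pool
        image="my-registry/gpu-image:latest",     # placeholder CUDA-enabled image
        job_variables={
            "cluster": "gpu-jobs-cluster",        # the cluster with the GPU capacity provider
            "capacity_provider_strategy": [
                {"capacityProvider": "gpu-jobs-capacity-provider", "weight": 1, "base": 0}
            ],
            # Important: do not set "launch_type" here or on the work pool;
            # RunTask ignores the capacity provider strategy when a launch type
            # is also supplied.
        },
    )
```

The same variables can also be set once on the work pool's base job template instead of per deployment.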

Describe the proposed change

We would recommend that:

  1. There should be a dedicated page for best practices with GPU-based jobs on AWS.
  2. The fact that EC2-based jobs need a hybrid work pool should be made more explicit.
  3. The fact that you need a capacity provider strategy and not a launch type should be made very clear in the relevant pages on work pools and the AWS integration.
  4. The sample terraform templates should be updated to include an end-to-end setup for GPU-based jobs.

Additional context

No response

zzstoatzz added the great writeup and integrations labels and removed the needs:triage label on Jun 12, 2024
zzstoatzz (Collaborator) commented

hey @DbCrWk

We're happy to contribute documentation

any updates to the existing guides + a specialized guide with more of your exact situation would be super appreciated!

let us know if you need any help with the contribution process or have any questions!

DbCrWk (Author) commented Jun 12, 2024

How should I provide an update? Do you want documentation + a terraform template?
@zzstoatzz

discdiver (Contributor) commented

Thank you @DbCrWk! Re: the docs, we're updating our contributing section, so the README here is probably most useful at the moment.

DbCrWk (Author) commented Jun 20, 2024

@discdiver cool, I'll work on it over the next week or so!

JamiePlace commented
@DbCrWk did you manage to get this written up somewhere? It's exceedingly tricky to find information about this. I would have thought running GPU workloads like this would be more prevalent!

DbCrWk (Author) commented Dec 2, 2024

@DbCrWk did you manage to get this written up somewhere? It's exceedingly tricky to find information about this. I would have thought running GPU workloads like this would be more prevalent!

@JamiePlace I wrote this up internally, but didn't close the loop and post it here. I need to remove the stuff that's specific to us, but I'll have bandwidth to post it this week! It's a bit tricky because we have to:

  1. Set up an ASG for GPU instances
  2. Use a custom ECS task definition (sketched below)

I also would've assumed there were more GPU workloads, though I'm not sure how many people launch GPUs via ECS. Maybe it's more common to use EC2 directly or custom workstations/clusters/machines?
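For anyone landing here in the meantime, here is a hedged boto3 sketch of the kind of custom task definition meant in point 2 above: an EC2-compatible task that reserves a GPU via resourceRequirements. The family, image, and role ARN are placeholders:

```python
# Sketch only: register an EC2-launch task definition that reserves one GPU.
# GPU tasks cannot run on Fargate, so requiresCompatibilities is ["EC2"].
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="prefect-gpu-flow",                        # placeholder family name
    requiresCompatibilities=["EC2"],
    networkMode="awsvpc",
    cpu="4096",
    memory="16384",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "prefect",
            "image": "my-registry/gpu-image:latest",  # placeholder CUDA-enabled image
            "resourceRequirements": [
                {"type": "GPU", "value": "1"}         # reserve one GPU on the instance
            ],
        }
    ],
)
```

I believe the resulting task definition ARN can then be referenced from the work pool or deployment via the ECS worker's task_definition_arn job variable, so the worker reuses it instead of registering its own.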
