Standardize on container images instead of machine images #146

Open
0x2b3bfa0 opened this issue Jun 17, 2021 · 6 comments
Labels
gpu Inexplicably convoluted drivers machine-image resource-task iterative_task TF resource

Comments

@0x2b3bfa0
Member

0x2b3bfa0 commented Jun 17, 2021

Follow-up of #127 (comment)

It would be nice to offer a single, consistent environment on every platform, and we can ship default container images as part of the machine images to avoid pull delays and costs.

This proposal assumes that:

  • The user-provided code is intended to (or at least can) run on Linux.
  • Users who have on-premises GPU farms are able to install Docker.

I'm inclined to think that those assumptions are pretty reasonable, and a good compromise between impact and effort on our side.

@0x2b3bfa0
Member Author

If future versions of CRIU support loading/restoring the internal state of CUDA devices, standardizing on containers could have the additional advantage of allowing us to perform live migrations between spot instances. The advantages versus data-based checkpoints aren't especially obvious, but it looks like the next cool technology. 😄 See also #176 (comment)

@0x2b3bfa0
Member Author

0x2b3bfa0 commented Sep 30, 2021

Blockers for containerized cml runner

Of all the continuous integration systems we support,[1] GitHub Actions is the only one that doesn't play nicely with containerized self-hosted runners:

Footnotes

  1. Namely, GitHub Actions, GitLab CI/CD and Bitbucket Pipelines.

@0x2b3bfa0 0x2b3bfa0 added the resource-task iterative_task TF resource label Nov 24, 2021
@0x2b3bfa0
Member Author

0x2b3bfa0 commented Nov 24, 2021

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience.

Custom images are the only alternative to provisioning instances on the fly, but forcing users to run tasks in a fixed environment could be unwise, especially when it implies committing to build and maintain a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be using containers or lightweight virtual machines with user-specified images, and bundling some default general-purpose images with our custom machine images to reduce load times.

Moved from the experimental XPD library.

@casperdcl
Contributor

do you mean allow resource "iterative_task" { image = "docker://..." }?

@0x2b3bfa0
Member Author

This issue predates the iterative_task resource, but yes.

@0x2b3bfa0
Member Author

0x2b3bfa0 commented Apr 21, 2022

allow resource "iterative_task" { image = "docker://..." }

🪓

terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"

  script = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" alpine sh /file'
    cat /etc/alpine-release
  END
}
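
For anyone puzzled by the shebang in the script above: as far as I understand it, `env -S` splits the rest of the line into separate arguments, the kernel appends the script path, and `sh -c` therefore sees that path as `$0`. The wrapper command can then hand the file to any other interpreter. Here is a minimal sketch of the same mechanism that runs without Docker; the local `sh "$0"` is only a stand-in for `docker run ... alpine sh /file`, and it assumes an `env` that supports `-S` (GNU coreutils 8.30 or later) and a temporary directory that allows executing files.

```shell
#!/bin/sh
# Demonstrates the "shebang wrapper" mechanism from the iterative_task script.
# Assumption: a plain `sh "$0"` replaces the `docker run` call, so the body
# runs on the host instead of inside an Alpine container.
set -eu

# Write a throwaway script whose shebang wraps its own body.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
#!/usr/bin/env -S sh -c 'echo "wrapper received: $(basename "$0")"; exec sh "$0"'
echo "body executed"
EOF
chmod +x "$demo"

# Executing the file runs the wrapper first; the wrapper then re-runs the same
# file with `sh`, which treats the shebang line as an ordinary comment.
output="$("$demo")"
echo "$output"

rm -f "$demo"
```

In the original script the wrapper mounts the file into the container with `-v "$(realpath "$0"):/file"` instead of re-running it locally, which is why `cat /etc/alpine-release` succeeds even though the host image is not Alpine.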

@casperdcl casperdcl added the gpu Inexplicably convoluted drivers label Aug 3, 2022