Standardize on container images instead of machine images #146

0x2b3bfa0 · 2021-06-17T18:24:39Z

Follow-up to #127 (comment)

It would be nice to offer a single, consistent environment on every platform, and we can ship default container images as part of the machine images to avoid pull delays and costs.

This proposal assumes that:

- The user-provided code is intended to (or at least can) run on Linux.
- Users who have on-premises GPU farms are able to install Docker.

I'm inclined to think that those assumptions are pretty reasonable, and a good compromise between impact and effort on our side.

0x2b3bfa0 · 2021-08-16T19:57:36Z

If future versions of CRIU support loading/restoring the internal state of CUDA devices, standardizing on containers could have the additional advantage of allowing us to perform live migrations between spot instances. The advantages versus data-based checkpoints aren't especially obvious, but it looks like the next cool technology. 😄 See also #176 (comment)

0x2b3bfa0 · 2021-09-30T13:12:00Z

Blockers for containerized `cml runner`

Of all the continuous integration systems we support,¹ GitHub Actions is the only one that doesn't play nicely with containerized self-hosted runners:

¹ Namely, GitHub Actions, GitLab CI/CD and Bitbucket Pipelines.

0x2b3bfa0 · 2021-11-24T16:02:05Z

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience.

Custom images are the only alternative to provisioning instances on the fly, but forcing users to run tasks in a fixed environment could be unwise, especially when it implies committing to building and maintaining a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be using containers or lightweight virtual machines with user-specified images, and bundling some default general-purpose images with our custom machine images in order to reduce load times.

- Implement GPU passthrough functionality to provide hw-acceleration to inner-containers nestybox/sysbox#50
- https://github.com/kata-containers/documentation/blob/master/use-cases/Nvidia-GPU-passthrough-and-Kata.md
- [Devices] Offer support for hardware-accelerated inference in Firecracker firecracker-microvm/firecracker#1179
- https://blog.cloudkernels.net/posts/vaccel_v2/
- support for nvidia-docker GPU container sandboxing google/gvisor#14

Moved from the experimental XPD library.

casperdcl · 2022-04-21T04:23:26Z

do you mean allow `resource "iterative_task" { image = "docker://..." }`?

0x2b3bfa0 · 2022-04-21T04:30:44Z

This issue predates the `iterative_task` resource, but yes.

0x2b3bfa0 · 2022-04-21T06:52:30Z

> allow `resource "iterative_task" { image = "docker://..." }`

🪓

```hcl
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"

  script = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" alpine sh /file'
    cat /etc/alpine-release
  END
}
```
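
A note on how the shebang trick above appears to work: the kernel passes the rest of the shebang line, followed by the script path, to `env`; `-S` splits that string into `sh -c '…'`, so the script path ends up as `$0`. `docker run` then bind-mounts the script as `/file`, and the `sh` inside the Alpine container executes it, treating the shebang line as a comment. This relies on an `env` that supports `-S` and on Docker being available on the `nvidia` machine image, as the example implies. A more explicit sketch of the same idea could look like the following; the resource name `explicit` and the file name `payload.sh` are hypothetical, and the `terraform` and `provider` blocks from the example above are assumed to be present.

```hcl
resource "iterative_task" "explicit" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"

  script = <<-END
    #!/bin/sh
    # Assumption: Docker is preinstalled on the machine image, as in the example above.
    # Write the commands meant for the container to a file (payload.sh is a hypothetical name).
    cat > payload.sh <<'EOF'
    cat /etc/alpine-release
    EOF
    # Mount the file into an Alpine container and run it there.
    docker run --rm -i -v "$(realpath payload.sh):/file" alpine sh /file
  END
}
```

The shebang version is shorter; the explicit version avoids depending on `env -S` and keeps the host-side and container-side commands visibly separate.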

0x2b3bfa0 added the machine-image label Jun 17, 2021

0x2b3bfa0 mentioned this issue Aug 5, 2021

Upgrade CUDA on machine images to 11+ #174

Closed

0x2b3bfa0 mentioned this issue Aug 25, 2021

Readiness Mechanism #175

Closed

0x2b3bfa0 mentioned this issue Sep 17, 2021

📚 Epic: TPI basic scenario #208

Closed

0x2b3bfa0 added the resource-task iterative_task TF resource label Nov 24, 2021

0x2b3bfa0 mentioned this issue May 17, 2022

runner tempdir patch #582

Closed

casperdcl added the gpu Inexplicably convoluted drivers label Aug 3, 2022

0x2b3bfa0 mentioned this issue Dec 14, 2022

CML and Kubernetes iterative/cml#1285

Closed
