Machine Transparent Spot Instances #176

Closed
DavidGOrtega opened this issue Aug 9, 2021 · 2 comments
Labels
cloud-new (New cloud support request), discussion (Waiting for team decision), dvc-remote, enhancement (New feature or request), p1-important (High priority), resource-machine (iterative_machine TF resource)

Comments

@DavidGOrtega
Contributor

DVC executors will need the machine resource to support this feature.
This is a discussion thread.

Proposal

@DavidGOrtega added the dvc-remote, enhancement, resource-machine, p1-important, cloud-new and discussion labels Aug 9, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Aug 9, 2021

We can help with storage, either through DVC or independently, but we can't migrate machines without interrupting the training process, so jobs still need to be checkpoint-aware.
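To illustrate what "checkpoint-aware" could mean for a job, here is a minimal sketch in Python. It is not taken from DVC or from this project: the function names (`load_checkpoint`, `save_checkpoint`, `train`), the JSON state layout, and the `stop_after` parameter (used only to simulate a spot interruption) are all hypothetical. The idea is simply that the job persists its progress after each step and resumes from the last saved step when restarted on a new machine.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so an interruption
    in the middle of a write cannot leave a corrupt checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path, num_steps, stop_after=None):
    """Run (or resume) a toy job; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # hypothetical interruption point
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

Calling `train("ckpt.json", 10, stop_after=3)` and then `train("ckpt.json", 10)` finishes the remaining steps without repeating the first three. In a real setup the checkpoint file would live on storage that survives the instance, which is the part the comment above says can be handled through DVC or independently.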

Machine-level spot instance transparent migration doesn't seem like an easy feat. While we can use docker checkpoint for containers or vendor-specific solutions for machines, none of these methods support GPU state checkpointing:

See also

@0x2b3bfa0
Member

Closed with #237
