CRIU Cuda support #534
Comments
No. Also see #527. There is currently no way to extract the state of a program running on the GPU. If there were a way to extract that state, it would in theory be possible to implement something like this. With CRIU plugins it might be possible with a lot of help from the hardware vendor, but right now I would say it is not possible. |
Isn't cuda-gdb supposed to be able to access all the information on the GPU? |
Also, there's this: https://www.nvidia.com/en-us/design-visualization/solutions/vgpu-migration/ |
Interesting. Good to hear. I was not aware. No idea what information is available but maybe this would make a GPU plugin possible for CRIU. Is this something you are working on? |
I might need to start working on it in the near future, yes. |
@montekki I was wondering if you were ever able to get started on this? Do you have any other information that you can share that can hopefully make this possible? |
This is a complex project but there is some relevant work that has been done in the past: |
Hi, just wanted to ask if there have been any updates on this? |
No updates. We are still looking for volunteers who will implement this. |
At EuroSys '20, a paper titled "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" reports that the authors implemented this using CRIU. |
@AHEADer thank you for sharing this. |
Indeed, thanks for pointing out that paper. I just had a look, and they write that they do not checkpoint the GPU part, only the CPU part. It seems to be an unmodified CRIU without any GPU support. |
So does that mean that if we somehow copy the GPU memory and status back, we can resume by recomputing from some checkpointed status? |
You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process. As long as you are able to close the connection to the GPU before checkpointing it should be doable, but the application needs to be checkpoint aware. |
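To make "checkpoint aware" a bit more concrete, here is a minimal sketch of the idea described above: the application copies whatever it needs out of the GPU, closes the connection before the checkpoint, and reconnects after the restore. Everything in it is hypothetical. `FakeDevice` is a plain Python stand-in for a GPU connection, not a real CUDA or CRIU API, and the method names are made up for illustration. It is not the proxy design from the paper.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU connection (not a real CUDA API)."""

    def __init__(self):
        self.memory = {}   # pretend device memory
        self.open = True

    def close(self):
        # Closing the connection discards everything held on the "device".
        self.memory.clear()
        self.open = False


class CheckpointAwareWorker:
    """Hypothetical application that cooperates with a CPU-only checkpoint."""

    def __init__(self):
        self.device = FakeDevice()
        self.host_copy = {}

    def compute(self, key, value):
        # While running, results live only on the "device".
        self.device.memory[key] = value

    def prepare_for_checkpoint(self):
        # Copy device state into ordinary process memory, then drop the
        # connection so the process holds no GPU resources when it is dumped.
        self.host_copy = dict(self.device.memory)
        self.device.close()
        self.device = None

    def resume_after_restore(self):
        # Reconnect and push the saved state back to the "device".
        self.device = FakeDevice()
        self.device.memory.update(self.host_copy)
        self.host_copy = {}


worker = CheckpointAwareWorker()
worker.compute("weights", [1, 2, 3])
worker.prepare_for_checkpoint()   # a CPU-only checkpoint would happen here
worker.resume_after_restore()
print(worker.device.memory)
```

In a real application the copy-out and copy-in steps would be device-to-host and host-to-device transfers, and something would have to tell the application when a checkpoint is about to happen.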
Many thanks for your reply. Now I understand what they do. It would be better to try following their approach. |
A friendly reminder that this issue had no activity for 30 days. |
Do you have any updates? |
JFYI there is also some more recent research on this topic here: 2020 CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM. BTW, I don't think this issue should be closed. |
You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now. |
The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: |
Are there any open-source resources on this project, or do we have to implement it from scratch ourselves? |
I am not aware of open-source resources on this project. However, there are a few patents related to this work. |
|
It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU. |
Hey @adrianreber, I’m also interested in GPU checkpoint. Can you share any details on how device proxy can be used? Thanks |
|
Thank you for pointing to the link. I did read that Microsoft paper. |
Support for what? Nvidia GPUs can only be supported if Nvidia steps up and implements CRIU support as AMD did. |
Got it. Thank you |
Hello, in a paper from NVIDIA, [GPU snapshot: checkpoint offloading for GPU-dense systems](https://dl.acm.org/doi/pdf/10.1145/3330345.3330361), I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation in hardware. I'm not an expert on this, but I'm wondering whether this work can be integrated into CRIU :) |
Thanks for that information about the paper. As it works with AMD GPUs I am pretty confident that it can also work with Nvidia GPUs. I personally think it needs to be implemented by Nvidia. I am not aware of anybody around CRIU having enough expertise, so Nvidia has to step up. |
This paper proposes hardware and driver components for checkpointing, but these are not implemented in real hardware, only simulated. |
From @aidan-gibson as mentioned in #2369:
and my answer:
I am also adding this video from Microsoft to the list of existing work with Nvidia GPUs: It seems to mention CRIU and GPUs. |
any update? @edenbuaa |
NVIDIA has released a CUDA checkpoint and restore utility that can be combined with CRIU: |
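For anyone who wants to try combining the two tools, here is a rough sketch of the order of operations as I understand it: toggle the CUDA state of the process into host memory, dump with CRIU, and on the way back restore with CRIU and toggle again. The `cuda-checkpoint --toggle --pid` invocation and the CRIU options are what I recall from NVIDIA's published example, so treat them as assumptions and check `cuda-checkpoint --help` and `criu --help` first. The helper names, PID, and image directory are made up. By default the script only prints the commands.

```python
import subprocess


def checkpoint_commands(pid, images_dir):
    """Commands to suspend CUDA state, then dump the process tree with CRIU."""
    return [
        # Assumed invocation: moves the process's CUDA state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Commands to restore with CRIU, then resume CUDA state."""
    return [
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Assumed invocation: toggling again moves CUDA state back to the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical PID and image directory, dry run only.
run_all(checkpoint_commands(1234, "demo"))
run_all(restore_commands(1234, "demo"))
```

Running the real commands needs root privileges, a supported driver, and a live CUDA process, so the dry run is the safe default.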
Any plans for supporting C/R for CUDA applications?