CRIU CUDA support #534

Closed
montekki opened this issue Jul 31, 2018 · 37 comments
Labels: new feature, no-auto-close (Don't auto-close as a stale issue)

Comments

@montekki

Any plans for supporting C/R for CUDA applications?

@adrianreber
Member

No. Also see #527. There is no way to extract the state of a program running on the GPU. If there were a way to extract that state, then it would at least be theoretically possible to implement something like this.

With CRIU plugins it could be possible with a lot of help from the hardware vendor, but right now I would say it is not possible.

@montekki
Author

Isn't cuda-gdb supposed to be able to access all the information in the GPU?

@montekki
Author

@adrianreber
Member

Interesting. Good to hear. I was not aware. No idea what information is available but maybe this would make a GPU plugin possible for CRIU. Is this something you are working on?

@montekki
Author

montekki commented Aug 1, 2018

I might need to start working on it in the near future, yes.

@pavanagrawal123

@montekki I was wondering if you were ever able to get started on this? Do you have any other information you can share that could help make this possible?

@rst0git
Member

rst0git commented Jan 3, 2019

This is a complex project, but some relevant work has been done in the past:
(2018) CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory
(2013) A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
(2009) CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

https://github.com/tbrand/CRCUDA

@Muks14x

Muks14x commented Aug 7, 2020

Hi, just wanted to ask if there have been any updates on this?

@avagin
Member

avagin commented Aug 26, 2020

No updates. We are still looking for volunteers who will implement this.

@AHEADer

AHEADer commented Oct 17, 2020

At EuroSys '20, a paper named "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" says they implemented this using CRIU.

@rst0git
Member

rst0git commented Oct 17, 2020

@adrianreber
Member

Indeed, thanks for pointing out that paper.

I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.

Seems to be an unmodified CRIU without any GPU support.

@AHEADer

AHEADer commented Oct 17, 2020

> Indeed, thanks for pointing out that paper.
>
> I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
>
> Seems to be an unmodified CRIU without any GPU support.

So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

@adrianreber
Member

> > Indeed, thanks for pointing out that paper.
> >
> > I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
> >
> > Seems to be an unmodified CRIU without any GPU support.
>
> So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.

As long as you are able to close the connection to the GPU before checkpointing, it should be doable, but the application needs to be checkpoint-aware.
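
To make the device-proxy idea above more concrete, here is a minimal sketch. It is not code from the paper: the socket path, the JSON protocol, and the class names are all invented for illustration. The point is only the decoupling pattern: the client process never opens the GPU itself, all device state lives in a separate proxy process, and the client drops its only connection to that process before a `criu dump`.

```python
# Illustrative device-proxy sketch (hypothetical names and protocol).
import json
import multiprocessing as mp
import os
import socket
import time

SOCK_PATH = "/tmp/gpu-proxy.sock"  # hypothetical rendezvous point


def proxy_main():
    """Long-lived proxy that would own the CUDA context and device memory."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn, conn.makefile("rw") as stream:
            for line in stream:
                req = json.loads(line)
                # A real proxy would launch a CUDA kernel here; we fake it.
                stream.write(json.dumps({"result": sum(req["data"])}) + "\n")
                stream.flush()


class GpuClient:
    """Checkpoint-aware client: connects on demand, disconnects before a dump."""

    def __init__(self):
        self.conn = None
        self.stream = None

    def connect(self):
        self.conn = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.conn.connect(SOCK_PATH)
        self.stream = self.conn.makefile("rw")

    def run(self, data):
        self.stream.write(json.dumps({"data": data}) + "\n")
        self.stream.flush()
        return json.loads(self.stream.readline())["result"]

    def prepare_for_checkpoint(self):
        # Drop the only link to the GPU-owning process, so CRIU sees no
        # device state (and no external socket) in this process.
        self.stream.close()
        self.conn.close()
        self.conn = self.stream = None


if __name__ == "__main__":
    mp.Process(target=proxy_main, daemon=True).start()
    time.sleep(0.2)                  # crude wait for the proxy socket
    client = GpuClient()
    client.connect()
    print(client.run([1, 2, 3]))     # "GPU" work done in the proxy -> 6
    client.prepare_for_checkpoint()  # now `criu dump` on this PID is safe
    client.connect()                 # after restore: reconnect and continue
    print(client.run([4, 5, 6]))     # -> 15
```

This is why the application has to be checkpoint-aware: something has to call the equivalent of prepare_for_checkpoint() before the dump and reconnect (and re-upload any needed data to the proxy) after the restore.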

@AHEADer

AHEADer commented Oct 17, 2020

> > > Indeed, thanks for pointing out that paper.
> > > I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
> > > Seems to be an unmodified CRIU without any GPU support.
> >
> > So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?
>
> You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.
>
> As long as you are able to close the connection to the GPU before checkpointing, it should be doable, but the application needs to be checkpoint-aware.

Many thanks for your reply. Now I understand what they do. It would be worth trying their approach.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

adrianreber added the no-auto-close (Don't auto-close as a stale issue) label and removed the stale-issue label on Jan 15, 2021
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

github-actions bot commented Apr 2, 2021

A friendly reminder that this issue had no activity for 30 days.

@LuYilei

LuYilei commented Apr 16, 2022

Do you have any updates?

@andronat

JFYI there is also some more recent research on this topic here: 2020 CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM.

BTW, I don't think this issue should be closed.

adrianreber reopened this on Feb 21, 2023
@adrianreber
Member

> BTW, I don't think this issue should be closed.

You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now.

@rst0git
Member

rst0git commented Feb 21, 2023

The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above:
Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

@jsun-m

jsun-m commented Feb 21, 2023

> The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

@rst0git
Member

rst0git commented Feb 22, 2023

> Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

I am not aware of open-source resources on this project. However, there are a few patents related to this work.

@0x2b3bfa0

> […] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

@adrianreber
Member

> […] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU.

@Sharathmk99

Hey @adrianreber, I’m also interested in GPU checkpointing. Can you share any details on how a device proxy can be used? Thanks

@adrianreber
Member

> Hey @adrianreber, I’m also interested in GPU checkpointing. Can you share any details on how a device proxy can be used? Thanks

#534 (comment)

@Sharathmk99

Thank you for pointing to the link. I did read that Microsoft paper.
Any plans to add support to CRIU itself? Thanks

@adrianreber
Member

> Any plans to add support to CRIU itself?

Support for what? Nvidia GPUs can only be supported if Nvidia steps up and implements CRIU support, as AMD did.

@Sharathmk99

Got it. Thank you

@lllukehuang

Hello, in a paper from NVIDIA, [GPU snapshot: checkpoint offloading for GPU-dense systems](https://dl.acm.org/doi/pdf/10.1145/3330345.3330361), I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware. I'm not an expert on this, but I'm wondering whether this work can be integrated into CRIU :)

@adrianreber
Member

> Hello, in a paper from NVIDIA, [GPU snapshot: checkpoint offloading for GPU-dense systems](https://dl.acm.org/doi/pdf/10.1145/3330345.3330361), I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware. I'm not an expert on this, but I'm wondering whether this work can be integrated into CRIU :)

Thanks for that information about the paper.

As it works with AMD GPUs, I am pretty confident that it can also work with Nvidia GPUs. I personally think it needs to be implemented by Nvidia. I am not aware of anybody around CRIU having enough expertise, so Nvidia has to step up.

@rst0git
Member

rst0git commented Oct 25, 2023

> I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware.

This paper proposes hardware and driver components for checkpointing, but these are not implemented in real hardware, only simulated.

@adrianreber
Member

From @aidan-gibson as mentioned in #2369:

> https://arxiv.org/pdf/2103.04916.pdf
>
> They extended CRIU to OpenGL and suspended / restored Autodesk Maya. No code published tho ):

and my answer:

> Thanks. The following also seems to be from memverge. Unfortunately memverge does not really talk to upstream CRIU about their work. This sounds slightly related to:
>
> I am also adding this video from Microsoft to the list of existing work with nvidia GPUs:
>
> That seems to mention CRIU and GPUs.

@edenbuaa

Any update? @edenbuaa

@rst0git
Member

rst0git commented Apr 25, 2024

NVIDIA has released a CUDA checkpoint and restore utility that can be combined with CRIU:
https://github.com/NVIDIA/cuda-checkpoint
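
As a rough illustration of how the two tools fit together: the sketch below is based on the cuda-checkpoint README rather than on a tested setup, and the CRIU options (such as --shell-job and --restore-detached) and the way the target PID is obtained depend entirely on the workload, so treat it as an outline only.

```python
# Hedged sketch: combining NVIDIA's cuda-checkpoint with CRIU.
# Assumes both binaries are on PATH and the target process uses CUDA.
import subprocess


def checkpoint(pid: int, images_dir: str) -> None:
    # 1. Toggle the CUDA state of the target process: device memory is
    #    copied to host memory and GPU resources are released.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
    # 2. The process now looks like a CPU-only process, so CRIU can dump it.
    subprocess.run(["criu", "dump", "--tree", str(pid),
                    "--images-dir", images_dir, "--shell-job"], check=True)


def restore(pid: int, images_dir: str) -> None:
    # 3. Restore the process tree with CRIU (it keeps its original PID).
    subprocess.run(["criu", "restore", "--images-dir", images_dir,
                    "--restore-detached", "--shell-job"], check=True)
    # 4. Toggle again to move the CUDA state back onto the GPU.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
```

In other words, cuda-checkpoint moves the process's CUDA state into host memory so that CRIU only ever sees an ordinary CPU-side process at dump time; toggling again after the restore moves the state back onto the GPU.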

rst0git closed this as completed on Apr 25, 2024