CRIU CUDA support #534

Closed
montekki opened this issue Jul 31, 2018 · 37 comments
Labels: new feature, no-auto-close (Don't auto-close as a stale issue)

Comments

@montekki

Any plans for supporting C/R for CUDA applications?

@adrianreber
Member

No. Also see #527. There is no way to extract the state of a program running on the GPU. If there were a way to extract that state, then it would at least be theoretically possible to implement something like this.

With CRIU plugins it could be possible with a lot of help from the hardware vendor, but right now I would say it is not possible.

@montekki
Author

Isn't cuda-gdb supposed to be able to access all the information in the GPU?

@montekki
Author

@adrianreber
Member

Interesting. Good to hear. I was not aware. No idea what information is available but maybe this would make a GPU plugin possible for CRIU. Is this something you are working on?

@montekki
Author

montekki commented Aug 1, 2018

I might need to start working on it in the near future, yes.

@pavanagrawal123

@montekki I was wondering if you were ever able to get started on this? Do you have any other information you can share that could help make this possible?

@rst0git
Member

rst0git commented Jan 3, 2019

This is a complex project, but some relevant work has been done in the past:
(2018) CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory
(2013) A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
(2009) CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

https://github.com/tbrand/CRCUDA

@Muks14x

Muks14x commented Aug 7, 2020

Hi, just wanted to ask if there have been any updates on this?

@avagin
Member

avagin commented Aug 26, 2020

No updates. We are still looking for volunteers who will implement this.

@AHEADer

AHEADer commented Oct 17, 2020

At EuroSys '20, a paper named "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" says they implemented this using CRIU.

@rst0git
Member

rst0git commented Oct 17, 2020

@adrianreber
Member

Indeed, thanks for pointing out that paper.

I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.

Seems to be an unmodified CRIU without any GPU support.

@AHEADer

AHEADer commented Oct 17, 2020

> Indeed, thanks for pointing out that paper.
>
> I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
>
> Seems to be an unmodified CRIU without any GPU support.

So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

@adrianreber
Member

> > Indeed, thanks for pointing out that paper.
> >
> > I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
> >
> > Seems to be an unmodified CRIU without any GPU support.
>
> So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.

As long as you are able to close the connection to the GPU before checkpointing, it should be doable, but the application needs to be checkpoint-aware.
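
To make the device-proxy idea above more concrete, here is a minimal sketch. It is not code from the paper: the socket path, the JSON protocol, and the class names are all invented for illustration. The point is only the decoupling pattern: the client process never opens the GPU itself, all device state lives in a separate proxy process, and the client drops its only connection to that process before a `criu dump`.

```python
# Illustrative device-proxy sketch (hypothetical names and protocol).
import json
import multiprocessing as mp
import os
import socket
import time

SOCK_PATH = "/tmp/gpu-proxy.sock"  # hypothetical rendezvous point


def proxy_main():
    """Long-lived proxy that would own the CUDA context and device memory."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn, conn.makefile("rw") as stream:
            for line in stream:
                req = json.loads(line)
                # A real proxy would launch a CUDA kernel here; we fake it.
                stream.write(json.dumps({"result": sum(req["data"])}) + "\n")
                stream.flush()


class GpuClient:
    """Checkpoint-aware client: connects on demand, disconnects before a dump."""

    def __init__(self):
        self.conn = None
        self.stream = None

    def connect(self):
        self.conn = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.conn.connect(SOCK_PATH)
        self.stream = self.conn.makefile("rw")

    def run(self, data):
        self.stream.write(json.dumps({"data": data}) + "\n")
        self.stream.flush()
        return json.loads(self.stream.readline())["result"]

    def prepare_for_checkpoint(self):
        # Drop the only link to the GPU-owning process, so CRIU sees no
        # device state (and no external socket) in this process.
        self.stream.close()
        self.conn.close()
        self.conn = self.stream = None


if __name__ == "__main__":
    mp.Process(target=proxy_main, daemon=True).start()
    time.sleep(0.2)                  # crude wait for the proxy socket
    client = GpuClient()
    client.connect()
    print(client.run([1, 2, 3]))     # "GPU" work done in the proxy -> 6
    client.prepare_for_checkpoint()  # now `criu dump` on this PID is safe
    client.connect()                 # after restore: reconnect and continue
    print(client.run([4, 5, 6]))     # -> 15
```

This is why the application has to be checkpoint-aware: something has to call the equivalent of prepare_for_checkpoint() before the dump and reconnect (and re-upload any needed data to the proxy) after the restore.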

@AHEADer

AHEADer commented Oct 17, 2020

> > > Indeed, thanks for pointing out that paper.
> > > I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
> > > Seems to be an unmodified CRIU without any GPU support.
> >
> > So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?
>
> You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.
>
> As long as you are able to close the connection to the GPU before checkpointing, it should be doable, but the application needs to be checkpoint-aware.

Many thanks for your reply. Now I understand what they do. It would be worth trying their approach.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

adrianreber added the no-auto-close (Don't auto-close as a stale issue) label and removed the stale-issue label on Jan 15, 2021
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

github-actions bot commented Apr 2, 2021

A friendly reminder that this issue had no activity for 30 days.

@LuYilei

LuYilei commented Apr 16, 2022

Do you have any updates?

@andronat

JFYI there is also some more recent research on this topic here: 2020 CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM.

BTW, I don't think this issue should be closed.

adrianreber reopened this on Feb 21, 2023
@adrianreber
Member

> BTW, I don't think this issue should be closed.

You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now.

@rst0git
Member

rst0git commented Feb 21, 2023

The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above:
Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

@jsun-m

jsun-m commented Feb 21, 2023

> The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

@rst0git
Member

rst0git commented Feb 22, 2023

> Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

I am not aware of open-source resources on this project. However, there are a few patents related to this work.

@0x2b3bfa0

> […] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

@adrianreber
Member

> […] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU.

@Sharathmk99

Hey @adrianreber, I’m also interested in GPU checkpointing. Can you share any details on how a device proxy can be used? Thanks

@adrianreber
Member

> Hey @adrianreber, I’m also interested in GPU checkpointing. Can you share any details on how a device proxy can be used? Thanks

#534 (comment)

@Sharathmk99

Thank you for pointing to the link. I did read that Microsoft paper.
Any plans to add support to CRIU itself? Thanks

@adrianreber
Member

> Any plans to add support to CRIU itself?

Support for what? Nvidia GPUs can only be supported if Nvidia steps up and implements CRIU support, as AMD did.

@Sharathmk99

Got it. Thank you

@lllukehuang

Hello, in a paper from NVIDIA, [GPU snapshot: checkpoint offloading for GPU-dense systems](https://dl.acm.org/doi/pdf/10.1145/3330345.3330361), I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware. I'm not an expert on this, but I'm wondering whether this work can be integrated into CRIU :)

@adrianreber
Member

> Hello, in a paper from NVIDIA, [GPU snapshot: checkpoint offloading for GPU-dense systems](https://dl.acm.org/doi/pdf/10.1145/3330345.3330361), I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware. I'm not an expert on this, but I'm wondering whether this work can be integrated into CRIU :)

Thanks for that information about the paper.

As it works with AMD GPUs, I am pretty confident that it can also work with Nvidia GPUs. I personally think it needs to be implemented by Nvidia. I am not aware of anybody around CRIU having enough expertise, so Nvidia has to step up.

@rst0git
Member

rst0git commented Oct 25, 2023

> I noticed that NVIDIA GPUs appear to implement GPU snapshot preservation through hardware.

This paper proposes hardware and driver components for checkpointing, but these are not implemented in real hardware, only simulated.

@adrianreber
Member

From @aidan-gibson as mentioned in #2369:

> https://arxiv.org/pdf/2103.04916.pdf
>
> They extended CRIU to OpenGL and suspended / restored Autodesk Maya. No code published tho ):

and my answer:

> Thanks. The following also seems to be from memverge. Unfortunately memverge does not really talk to upstream CRIU about their work. This sounds slightly related to:
>
> I am also adding this video from Microsoft to the list of existing work with nvidia GPUs:
>
> That seems to mention CRIU and GPUs.

@edenbuaa

Any update? @edenbuaa

@rst0git
Member

rst0git commented Apr 25, 2024

NVIDIA has released a CUDA checkpoint and restore utility that can be combined with CRIU:
https://github.com/NVIDIA/cuda-checkpoint
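
As a rough illustration of how the two tools fit together: the sketch below is based on the cuda-checkpoint README rather than on a tested setup, and the CRIU options (such as --shell-job and --restore-detached) and the way the target PID is obtained depend entirely on the workload, so treat it as an outline only.

```python
# Hedged sketch: combining NVIDIA's cuda-checkpoint with CRIU.
# Assumes both binaries are on PATH and the target process uses CUDA.
import subprocess


def checkpoint(pid: int, images_dir: str) -> None:
    # 1. Toggle the CUDA state of the target process: device memory is
    #    copied to host memory and GPU resources are released.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
    # 2. The process now looks like a CPU-only process, so CRIU can dump it.
    subprocess.run(["criu", "dump", "--tree", str(pid),
                    "--images-dir", images_dir, "--shell-job"], check=True)


def restore(pid: int, images_dir: str) -> None:
    # 3. Restore the process tree with CRIU (it keeps its original PID).
    subprocess.run(["criu", "restore", "--images-dir", images_dir,
                    "--restore-detached", "--shell-job"], check=True)
    # 4. Toggle again to move the CUDA state back onto the GPU.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
```

In other words, cuda-checkpoint moves the process's CUDA state into host memory so that CRIU only ever sees an ordinary CPU-side process at dump time; toggling again after the restore moves the state back onto the GPU.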

rst0git closed this as completed on Apr 25, 2024