Are GPU workloads supported? #384

Closed

hamann opened this issue Sep 8, 2017 · 4 comments

Comments
@hamann

hamann commented Sep 8, 2017

Is CRIU supposed to work with GPU workloads? We're experimenting with TensorFlow and tried to checkpoint a task that uses it; this is what we get:

600000-0xffffffffff601000 (4K) prot 0x5 flags 0x22 fdflags 0 st 0x204 off 0 vsys ap  shmid: 0
(00.155972) Obtaining task auvx ...
(00.156110) Dumping path for -3 fd via self 14 [/]
(00.156132) Dumping task cwd id 0x7 root id 0x7
(00.156192) ========================================
(00.156194) Dumping task (pid: 5587)
(00.156196) ========================================
(00.156197) Obtaining task stat ... 
(00.156223) 
(00.156225) Collecting mappings (pid: 5587)
(00.156226) ----------------------------------------
(00.156598) Dumping path for -3 fd via self 11 [/opt/conda/bin/python2.7]
(00.156614) vma 600000 borrows vfi from previous 400000
(00.156654) Error (criu/proc_parse.c:553): Can't handle non-regular mapping on 5587's map 200000000
(00.156673) Error (criu/cr-dump.c:1221): Collect mappings (pid: 5587) failed with -1
(00.156730) Unlock network
(00.156744) Running network-unlock scripts
(00.156746) \tRPC
(00.174386) Unfreezing tasks into 1
(00.174519) Error (criu/cr-dump.c:1644): Dumping FAILED.
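One way to see what actually backs the mapping CRIU refuses at 0x200000000 is to look that address up in /proc/<pid>/maps while the task is still running — a minimal sketch (the pid and address come from the log above; the device paths named in the comment are only what a CUDA process would typically show):

```python
# Print the /proc/<pid>/maps entry that covers a given address.
# pid 5587 and address 0x200000000 come from the dump log above.
pid = 5587
addr = 0x200000000

with open("/proc/%d/maps" % pid) as maps:
    for line in maps:
        start, end = (int(x, 16) for x in line.split()[0].split("-"))
        if start <= addr < end:
            # For a CUDA process this is typically a device-backed mapping,
            # e.g. /dev/nvidiactl, /dev/nvidia0 or /dev/nvidia-uvm.
            print(line.rstrip())
            break
```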
@xemul
Member

xemul commented Sep 8, 2017

@hamann, GPU workloads always involve opening and mapping a GPU device file. Once that has happened, part of the running state may live in the device itself, so it has to be checkpointed and restored too. That part looks very driver-specific and cannot be solved in a generic manner. That's why we put an explicit check for an "unknown" device being mapped, and that is what CRIU complains about in this case.

If you can help us by explaining what the GPU state looks like and how to deal with it, the problem can be resolved.
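To illustrate what that check is looking at (a rough sketch only, not the actual code in criu/proc_parse.c): walk a task's /proc/<pid>/maps and flag every mapping whose backing file is not a regular file — for a CUDA process these are typically character devices such as /dev/nvidiactl, /dev/nvidia0 or /dev/nvidia-uvm:

```python
import os
import stat

def device_backed_vmas(pid):
    """List map entries whose backing file is not a regular file.

    Rough illustration of the kind of check CRIU performs; the real
    logic in criu/proc_parse.c is more involved.
    """
    hits = []
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            fields = line.split()
            if len(fields) < 6 or not fields[5].startswith("/"):
                continue  # anonymous mapping or [heap]/[stack]/[vsyscall]
            try:
                st = os.stat(fields[5])
            except OSError:
                continue  # backing file gone (e.g. deleted)
            if not stat.S_ISREG(st.st_mode):
                # e.g. /dev/nvidiactl, /dev/nvidia-uvm: the device holds
                # state that cannot be dumped generically, so CRIU bails out.
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for entry in device_backed_vmas(5587):
        print(entry)
```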

@mk

mk commented Oct 24, 2017

Hi @xemul, and sorry for only getting back to you on this now! We are using CRIU to do time-travel and session persistence on arbitrary language processes (e.g. Python, Julia and R). Users can install arbitrary libraries and compute on GPUs with e.g. TensorFlow, Keras, etc. At the moment we only use the NVIDIA K80 GPU. Do I understand you correctly that CRIU could work for these kinds of workloads if specific support for this device was added? Do you have any pointers on how we could attempt this? Thanks a lot for all your work on CRIU btw, it's amazing to me what's possible with it.

@adrianreber
Member

The complicated part is extracting the state of the device, in your case the NVIDIA GPU. As far as I know this is not possible, as NVIDIA offers no interface to read the state back out of the device. This is the point where every checkpoint/restore solution fails if you do not have help from the hardware vendor.

If you can get the state out of the device, then CRIU could be extended to handle it, but I am pretty sure that without support from the hardware vendor this is impossible.

@mk

mk commented Oct 25, 2017

@adrianreber alright, thanks a lot for the clarification. Feel free to close this issue.
