Are GPU workloads supported? #384

Closed

hamann opened this issue Sep 8, 2017 · 4 comments

Comments
@hamann

hamann commented Sep 8, 2017

Is CRIU supposed to work with GPU workloads? We're experimenting with TensorFlow and tried to checkpoint a task that uses it; this is what we get:

600000-0xffffffffff601000 (4K) prot 0x5 flags 0x22 fdflags 0 st 0x204 off 0 vsys ap  shmid: 0
(00.155972) Obtaining task auvx ...
(00.156110) Dumping path for -3 fd via self 14 [/]
(00.156132) Dumping task cwd id 0x7 root id 0x7
(00.156192) ========================================
(00.156194) Dumping task (pid: 5587)
(00.156196) ========================================
(00.156197) Obtaining task stat ... 
(00.156223) 
(00.156225) Collecting mappings (pid: 5587)
(00.156226) ----------------------------------------
(00.156598) Dumping path for -3 fd via self 11 [/opt/conda/bin/python2.7]
(00.156614) vma 600000 borrows vfi from previous 400000
(00.156654) Error (criu/proc_parse.c:553): Can't handle non-regular mapping on 5587's map 200000000
(00.156673) Error (criu/cr-dump.c:1221): Collect mappings (pid: 5587) failed with -1
(00.156730) Unlock network
(00.156744) Running network-unlock scripts
(00.156746) \tRPC
(00.174386) Unfreezing tasks into 1
(00.174519) Error (criu/cr-dump.c:1644): Dumping FAILED.
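One way to see what actually backs the mapping CRIU refuses at 0x200000000 is to look that address up in /proc/<pid>/maps while the task is still running — a minimal sketch (the pid and address come from the log above; the device paths named in the comment are only what a CUDA process would typically show):

```python
# Print the /proc/<pid>/maps entry that covers a given address.
# pid 5587 and address 0x200000000 come from the dump log above.
pid = 5587
addr = 0x200000000

with open("/proc/%d/maps" % pid) as maps:
    for line in maps:
        start, end = (int(x, 16) for x in line.split()[0].split("-"))
        if start <= addr < end:
            # For a CUDA process this is typically a device-backed mapping,
            # e.g. /dev/nvidiactl, /dev/nvidia0 or /dev/nvidia-uvm.
            print(line.rstrip())
            break
```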
@xemul
Member

xemul commented Sep 8, 2017

@hamann, GPU workloads always involve opening and mapping a GPU device file. Once that has happened, part of the running state may live in the device itself, so it has to be checkpointed and restored too. That part looks very driver-specific and cannot be solved in a generic manner. That's why we put an explicit check for an "unknown" device being mapped, and that is what CRIU complains about in this case.

If you can help us by explaining what the GPU state looks like and how to deal with it, the problem can be resolved.
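To illustrate what that check is looking at (a rough sketch only, not the actual code in criu/proc_parse.c): walk a task's /proc/<pid>/maps and flag every mapping whose backing file is not a regular file — for a CUDA process these are typically character devices such as /dev/nvidiactl, /dev/nvidia0 or /dev/nvidia-uvm:

```python
import os
import stat

def device_backed_vmas(pid):
    """List map entries whose backing file is not a regular file.

    Rough illustration of the kind of check CRIU performs; the real
    logic in criu/proc_parse.c is more involved.
    """
    hits = []
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            fields = line.split()
            if len(fields) < 6 or not fields[5].startswith("/"):
                continue  # anonymous mapping or [heap]/[stack]/[vsyscall]
            try:
                st = os.stat(fields[5])
            except OSError:
                continue  # backing file gone (e.g. deleted)
            if not stat.S_ISREG(st.st_mode):
                # e.g. /dev/nvidiactl, /dev/nvidia-uvm: the device holds
                # state that cannot be dumped generically, so CRIU bails out.
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for entry in device_backed_vmas(5587):
        print(entry)
```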

@mk

mk commented Oct 24, 2017

Hi @xemul, and sorry for only getting back to you on this now! We are using CRIU to do time-travel and session persistence on arbitrary language processes (e.g. Python, Julia and R). Users can install arbitrary libraries and compute on GPUs with e.g. TensorFlow, Keras, etc. At the moment we only use the NVIDIA K80 GPU. Do I understand you correctly that CRIU could work for these kinds of workloads if specific support for this device was added? Do you have any pointers on how we could attempt this? Thanks a lot for all your work on CRIU btw, it's amazing to me what's possible with it.

@adrianreber
Member

The complicated part is extracting the state of the device, in your case the NVIDIA GPU. As far as I know this is not possible, as NVIDIA offers no interface to read the state back out of the device. This is the point where every checkpoint/restore solution fails if you do not have help from the hardware vendor.

If you can get the state out of the device, then CRIU could be extended to handle it, but I am pretty sure that without support from the hardware vendor this is impossible.

@mk

mk commented Oct 25, 2017

@adrianreber alright, thanks a lot for the clarification. Feel free to close this issue.
