Are GPU workloads supported? #384
Comments
@hamann, GPU workloads always involve opening and mapping a GPU device file. After that, part of the running state may live in the device, so it needs to be checkpointed and restored too. That task looks very driver-specific and cannot be solved in a generic manner. That's why we've put in an explicit check for an "unknown" device being mapped, and this is what CRIU complains about in this case. If you can help us and explain what the GPU state looks like and how to deal with it, the problem can be resolved.
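To illustrate what "a device being mapped" means, here is a minimal sketch that scans text in the Linux /proc/<pid>/maps format for mappings backed by files under /dev. This is not CRIU's actual check; the function name device_mappings and the sample data (including the /dev/nvidiactl line) are hypothetical and only show the idea.

```python
def device_mappings(maps_text):
    """Return (address_range, path) pairs for mappings backed by /dev files.

    Expects text in the /proc/<pid>/maps format:
    address perms offset dev inode [pathname]
    """
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column
        path = fields[5].strip()
        if path.startswith("/dev/"):
            found.append((fields[0], path))
    return found


# Hypothetical sample; real contents depend on the process and driver.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 131 /usr/bin/python3
7f1c00000000-7f1c00021000 rw-p 00000000 00:00 0
7f1c10000000-7f1c10100000 rw-s 00000000 00:06 452 /dev/nvidiactl
7ffd5e3f0000-7ffd5e411000 rw-p 00000000 00:00 0 [stack]
"""

for addr, path in device_mappings(SAMPLE_MAPS):
    print(addr, path)
```

On a real system the input would come from reading /proc/<pid>/maps for the target process. Note that this simple prefix test also matches paths such as /dev/zero or /dev/shm, which are not GPU devices, so it is only a rough first filter.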
Hi @xemul, and sorry for only getting back to you on this now! We are using CRIU to do time-travel and session persistence on arbitrary language processes (e.g. Python, Julia and R). Users can install arbitrary libraries and compute on GPUs with e.g. TensorFlow, Keras, etc. At the moment we only use the NVIDIA K80 GPU. Do I understand you correctly that CRIU could work for these kinds of workloads if specific support for this device was added? Do you have any pointers on how we could attempt this? Thanks a lot for all your work on CRIU btw, it's amazing to me what's possible with it.
The complicated part is extracting the state of the device, in your case the NVIDIA GPU. As far as I know this is not possible, as NVIDIA offers no interface to get the state of the device. You are now at the point where every checkpoint/restore solution fails if you do not have help from the hardware vendor. If you can get the state out of the device, then CRIU should be able to be extended to handle it, but I am pretty sure that without support from the hardware vendor this is impossible.
@adrianreber alright, thanks a lot for the clarification. Feel free to close this issue.
Is CRIU supposed to work with GPU workloads? We're playing with TensorFlow and tried to checkpoint a task which uses tensorflow, and this is what we get then