Figure out how to deal with device loss #1624
Comments
I've done some work supporting device loss on Windows with DirectX. Here are a couple of notes about my experiences with that.

There were a couple of cases of device loss. In my experience, actual physical device loss/removal was rare (e.g. SurfaceBook GPU hot plug when running on the remote GPU and opted in to support it), but other things cause that error value as well (some cases of crashes, driver updates, etc.). We found that treating the different device-removed reasons differently wasn't a good idea, since they were often inaccurate and the desired response to all of them was the same. Recording the reason was still important for debugging (anything you can get for debugging non-deterministic driver crashes that your app triggers is high value).

The overall approach to device loss in DirectX was pretty nice: any API which effectively reads back from the GPU can fail with an error, which might be device loss. The produced error should be marked to indicate whether recreating the device will recover from it (e.g. it's a device-lost error).

I found having a way to enable random device loss in the hardware abstraction layer great for testing (e.g. set some random percentage of API calls to trigger it). It was also important to be able to trigger it randomly at the actual driver level.

One pattern I implemented was that disposing of a device which was in a failed state was an error unless you acknowledged/cleared the error. That made it hard to accidentally have a test ignore an error. In wgpu it might make sense to put this behind a feature flag, since panicking in drop (the robust way to do that) is pretty nasty, but nice for testing.

If possible, I'd love for Rust lifetimes to help app authors ensure they recreate all device resources: missing one was a common source of bugs. Also, watch for any place where the device interacts with other stuff, mainly the swap chain: it's not always clear whether the swap chain needs to be recreated or not. We should make it very clear at the wgpu layer if we can (ideally via lifetimes).

Also, I'm not clear on whether the wgpu::Adapter should be lost, or whether it should just fail to create devices. Either way it would just produce a device-lost error when used to create a device: the difference would be whether you need to recreate the adapter if the same device becomes available again.
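To make a few of those notes concrete, here is a minimal sketch in plain Rust. It is not wgpu's API: `MockDevice`, `GpuError`, `LostReason`, `read_back`, and `acknowledge_error` are all hypothetical names invented for illustration. It shows an error that is marked as recoverable by recreating the device, a configurable percentage of calls that inject device loss for testing, and a drop check that panics if a failed device was never acknowledged.

```rust
/// Why the device was reported lost. Kept for debugging only; the
/// recovery path is the same for every reason. (Hypothetical type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LostReason {
    Removed,
    DriverCrash,
    Injected,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GpuError {
    DeviceLost(LostReason),
    Other(String),
}

impl GpuError {
    /// True if recreating the device is expected to recover from this error.
    pub fn recoverable_by_recreate(&self) -> bool {
        matches!(self, GpuError::DeviceLost(_))
    }
}

/// Hypothetical stand-in for a device behind a hardware abstraction layer.
pub struct MockDevice {
    rng_state: u64,
    loss_percent: u64, // 0..=100: share of calls that inject device loss
    failed: Option<GpuError>,
    acknowledged: bool,
}

impl MockDevice {
    pub fn new(seed: u64, loss_percent: u64) -> Self {
        // xorshift must not start at zero, so force the low bit on.
        MockDevice { rng_state: seed | 1, loss_percent, failed: None, acknowledged: false }
    }

    /// Small xorshift64 generator so the sketch needs no external crates.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// A call that reads back from the GPU, so it may report device loss.
    pub fn read_back(&mut self) -> Result<u32, GpuError> {
        if let Some(err) = &self.failed {
            return Err(err.clone()); // a lost device stays lost
        }
        if self.next_u64() % 100 < self.loss_percent {
            let err = GpuError::DeviceLost(LostReason::Injected);
            self.failed = Some(err.clone());
            return Err(err);
        }
        Ok(42) // placeholder for real read-back data
    }

    /// Marks the failure as seen, so dropping the device will not panic.
    pub fn acknowledge_error(&mut self) -> Option<GpuError> {
        self.acknowledged = true;
        self.failed.clone()
    }
}

impl Drop for MockDevice {
    fn drop(&mut self) {
        // Skip the check while already unwinding to avoid a double panic.
        if self.failed.is_some() && !self.acknowledged && !std::thread::panicking() {
            panic!("device dropped in a failed state without acknowledging the error");
        }
    }
}
```

With `loss_percent` set to 0 the mock never fails, and with 100 every call fails, which keeps tests deterministic. In a real crate the `Drop` check would presumably sit behind a feature flag, as suggested above.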
A prerequisite for safely using a lot of APIs is that you (ideally) handle device loss somewhat gracefully. However, you obviously can't make sure a GPU doesn't crash between when you check it and when you submit your command (without high-precision timers or synchronous communication, there is a theorem about this!), so presumably what's needed is the ability to recover gracefully where possible, and abort if not. Apparently wgpu is not really doing this yet.
From @kvark
So let's get a clearer picture (and figure out what the challenges are on different platforms).
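As a starting point for discussion, "recover gracefully where possible, and abort if not" could look roughly like the sketch below. It is an assumption-laden illustration, not wgpu code: `SubmitError` and `submit_with_recovery` are hypothetical, and the device is an opaque type `D` that the caller knows how to recreate, together with every resource created from it.

```rust
/// Hypothetical error type: only `DeviceLost` is treated as recoverable.
#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError {
    DeviceLost,
    Fatal(String),
}

/// Tries to submit work. On device loss, recreates the device and retries,
/// up to `max_recreates` times. Any other error, or running out of retries,
/// is returned to the caller, which can then abort.
/// Returns how many times the device had to be recreated.
pub fn submit_with_recovery<D>(
    device: &mut D,
    mut recreate: impl FnMut() -> Result<D, SubmitError>,
    mut submit: impl FnMut(&mut D) -> Result<(), SubmitError>,
    max_recreates: u32,
) -> Result<u32, SubmitError> {
    let mut recreates = 0;
    loop {
        match submit(device) {
            Ok(()) => return Ok(recreates),
            Err(SubmitError::DeviceLost) if recreates < max_recreates => {
                // The old device is replaced wholesale. In a real app every
                // buffer, texture, and pipeline made from it must be rebuilt too.
                *device = recreate()?;
                recreates += 1;
            }
            // Not recoverable, or retry budget exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

The check-then-submit race mentioned above is why the loss is handled at the point of submission rather than checked for in advance. The retry cap keeps a persistently failing driver from causing an infinite loop.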