Figure out how to deal with device loss #1624

pythonesque · 2021-07-10T04:27:56Z

A prerequisite for safely using a lot of APIs is handling device loss somewhat gracefully. However, you obviously can't guarantee that a GPU won't crash between the moment you check it and the moment you submit your command (without high-precision timers or synchronous communication, there is a theorem about this!). So presumably what's needed is the ability to recover gracefully where possible, and to abort where not. Apparently wgpu is not really doing this yet.

From @kvark

I think it's a combination of things. Like at least:

if we previously got an error code for device lost, we no longer try to use the device

we set up the context parameters in a way that handles device loss gracefully, e.g. an OpenGL robust context

But I don't have a fully clear picture about this yet.
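The first point above (stop using a device once it has reported a device-lost error) could look roughly like the sketch below. `GpuError` and `GuardedDevice` are hypothetical names made up for illustration; they are not wgpu types, and the backend is left generic.

```rust
use std::cell::Cell;

/// Hypothetical error type for this sketch; not a wgpu type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuError {
    DeviceLost,
    OutOfMemory,
}

/// Hypothetical wrapper that remembers a device-lost error and refuses
/// to forward any later call to the backend.
pub struct GuardedDevice<B> {
    backend: B,
    lost: Cell<bool>,
}

impl<B> GuardedDevice<B> {
    pub fn new(backend: B) -> Self {
        GuardedDevice { backend, lost: Cell::new(false) }
    }

    /// True once any call has reported device loss.
    pub fn is_lost(&self) -> bool {
        self.lost.get()
    }

    /// Runs `op` against the backend unless the device is already known
    /// to be lost. Latches the lost flag if `op` reports device loss.
    pub fn call<T>(
        &self,
        op: impl FnOnce(&B) -> Result<T, GpuError>,
    ) -> Result<T, GpuError> {
        if self.lost.get() {
            // Short-circuit: never touch the backend again.
            return Err(GpuError::DeviceLost);
        }
        let result = op(&self.backend);
        if matches!(result, Err(GpuError::DeviceLost)) {
            self.lost.set(true);
        }
        result
    }
}
```

Other errors (such as the `OutOfMemory` variant here) pass through without latching, so only device loss makes the wrapper permanently unusable.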

So let's get a clearer picture (and figure out what the challenges are on different platforms).

Craig-Macomber · 2021-07-19T19:23:53Z

I've done some work supporting device loss on Windows with DirectX. Here are a couple of notes about my experiences with that:

There were a couple of cases of device loss:
https://docs.microsoft.com/en-us/windows/win32/api/d3d11/nf-d3d11-id3d11device-getdeviceremovedreason

In my experience, actual physical device loss/removal was rare (e.g. SurfaceBook GPU hot plug when running on the remote GPU and opted in to support it), but other things cause that error value as well (some cases of crashes, driver updates, etc.).
In practice, device removal mostly happened when our app hit driver bugs (e.g. some drivers crashed if you used too much address space, some crashed with specific shaders, etc.).

We found that treating the different device-removed reasons differently wasn't a good idea: they were often inaccurate, and the desired response to all of them was the same. Recording the reason was still important for debugging (anything you can get for debugging non-deterministic driver crashes that your app triggers is high value).
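As a minimal sketch of "same response, but always record the reason": the variant names below are loosely modeled on the kinds of reasons a driver can report and are assumptions for illustration, as are `LossLog` and `Response`.

```rust
/// Hypothetical set of removal reasons; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RemovedReason {
    Hung,
    Removed,
    Reset,
    DriverInternalError,
    Unknown(u32),
}

/// The single response used for every reason.
#[derive(Debug, PartialEq, Eq)]
pub enum Response {
    RecreateDevice,
}

/// Hypothetical in-memory log of device-loss events for later debugging.
#[derive(Default)]
pub struct LossLog {
    entries: Vec<String>,
}

impl LossLog {
    /// Records the reported reason and the operation that observed it,
    /// then returns the same response regardless of the reason.
    pub fn handle(&mut self, reason: RemovedReason, context: &str) -> Response {
        self.entries
            .push(format!("device lost: {:?} during {}", reason, context));
        Response::RecreateDevice
    }

    pub fn entries(&self) -> &[String] {
        &self.entries
    }
}
```

The reason never influences control flow here; it only ends up in the log.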

The overall approach for device loss in DirectX was pretty nice: any API which effectively reads back from the GPU can fail with an error, which might be device loss. The produced error should be marked to indicate whether recreating the device will recover from it (e.g. it's a device lost error).
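One way to picture an error that is "marked" as recoverable by recreation is sketched below. `ReadbackError` and `with_recovery` are hypothetical names, and the device type is left generic; this is an illustration of the idea, not how DirectX or wgpu expose it.

```rust
/// Hypothetical error returned by any call that reads back from the GPU.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReadbackError {
    DeviceLost { reason: String },
    Validation(String),
}

impl ReadbackError {
    /// True when recreating the device is expected to clear the error.
    pub fn recoverable_by_recreation(&self) -> bool {
        matches!(self, ReadbackError::DeviceLost { .. })
    }
}

/// Runs `op` on a device built by `create`. If `op` fails with an error
/// marked as recoverable, the device is recreated and `op` is retried,
/// at most `max_recreations` times. Any other error is returned as-is.
pub fn with_recovery<D, T>(
    mut create: impl FnMut() -> D,
    mut op: impl FnMut(&D) -> Result<T, ReadbackError>,
    max_recreations: usize,
) -> Result<T, ReadbackError> {
    let mut device = create();
    let mut recreations = 0;
    loop {
        match op(&device) {
            Ok(value) => return Ok(value),
            Err(err) if err.recoverable_by_recreation() && recreations < max_recreations => {
                recreations += 1;
                device = create();
            }
            Err(err) => return Err(err),
        }
    }
}
```

A caller only has to check `recoverable_by_recreation()`; the specific reason string is carried along purely for diagnostics.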

I found having a way to enable random device loss in the hardware abstraction layer great for testing (e.g. set some random percentage of API calls to trigger it). It was also important to be able to trigger it randomly at the actual driver level: DXCap.exe -forcetdr did it for DirectX.
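A rough sketch of random device-loss injection at the abstraction layer follows. `FaultInjector` and `DeviceLost` are hypothetical names, and a small xorshift generator stands in for a real random number crate so the example stays self-contained and reproducible from a seed.

```rust
/// Small xorshift64* generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The state must never be zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Hypothetical error used by this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceLost;

/// Hypothetical test helper: fails a configurable fraction of calls.
pub struct FaultInjector {
    rng: XorShift64,
    /// Chance of injected loss per call, in parts per million.
    loss_ppm: u32,
    /// Once loss has been injected, every later call fails too.
    tripped: bool,
}

impl FaultInjector {
    pub fn new(seed: u64, loss_ppm: u32) -> Self {
        FaultInjector { rng: XorShift64::new(seed), loss_ppm, tripped: false }
    }

    /// Intended to be called at the top of every abstraction-layer entry point.
    pub fn check(&mut self) -> Result<(), DeviceLost> {
        if self.tripped {
            return Err(DeviceLost);
        }
        if self.rng.next_u64() % 1_000_000 < u64::from(self.loss_ppm) {
            self.tripped = true;
            return Err(DeviceLost);
        }
        Ok(())
    }
}
```

Seeding the generator means a failing test run can be replayed with the same seed, which matters when the failure being chased is non-deterministic to begin with.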

One pattern I implemented: disposing of a device that was in a failed state was itself an error unless you had acknowledged/cleared the failure. That made it hard to accidentally have a test ignore an error. In wgpu it might make sense to put this behind a feature flag, since panicking in drop (the robust way to do that) is pretty nasty, but nice for testing.
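A minimal sketch of that pattern, assuming a hypothetical `CheckedDevice` type. A runtime `strict` flag stands in for the feature flag so the example is self-contained; in a real crate this could instead be gated at compile time with a Cargo feature.

```rust
/// Hypothetical device handle that tracks an unacknowledged error.
pub struct CheckedDevice {
    pending_error: Option<String>,
    /// Stand-in for a Cargo feature flag: when false, drop never panics.
    strict: bool,
}

impl CheckedDevice {
    pub fn new(strict: bool) -> Self {
        CheckedDevice { pending_error: None, strict }
    }

    /// Simulates the device entering a failed state.
    pub fn fail(&mut self, message: &str) {
        self.pending_error = Some(message.to_string());
    }

    /// Acknowledges and clears the error, returning it to the caller.
    pub fn take_error(&mut self) -> Option<String> {
        self.pending_error.take()
    }
}

impl Drop for CheckedDevice {
    fn drop(&mut self) {
        // Avoid a double panic (which aborts) if we are already unwinding.
        if self.strict && !std::thread::panicking() {
            if let Some(err) = &self.pending_error {
                panic!("device dropped with unacknowledged error: {}", err);
            }
        }
    }
}
```

In a test, a device that failed and was never inspected via `take_error` makes the test panic when the device goes out of scope, so the failure cannot be silently ignored.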

If possible, I'd love for Rust lifetimes to help app authors ensure they recreate all device resources: missing one was a common source of bugs.

Also, watch for any place where the device interacts with other stuff, mainly the swap chain: it's not always clear whether the swap chain needs to be recreated or not. We should make it very clear at the wgpu layer if we can (ideally via lifetimes).
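To illustrate what lifetime-based enforcement could mean here, the sketch below ties every resource, including the swap chain, to a borrow of the device that created it. All types are hypothetical stand-ins and are not meant to describe how wgpu's own types are shaped.

```rust
/// Hypothetical device; `generation` counts how often it was recreated.
pub struct Device {
    generation: u32,
}

/// Hypothetical resources that borrow the device they were created from.
pub struct Buffer<'d> {
    device: &'d Device,
    size: usize,
}

pub struct SwapChain<'d> {
    device: &'d Device,
}

impl Device {
    pub fn new(generation: u32) -> Self {
        Device { generation }
    }

    pub fn create_buffer<'d>(&'d self, size: usize) -> Buffer<'d> {
        Buffer { device: self, size }
    }

    pub fn create_swap_chain<'d>(&'d self) -> SwapChain<'d> {
        SwapChain { device: self }
    }
}

impl<'d> Buffer<'d> {
    pub fn size(&self) -> usize {
        self.size
    }

    pub fn device_generation(&self) -> u32 {
        self.device.generation
    }
}

impl<'d> SwapChain<'d> {
    pub fn device_generation(&self) -> u32 {
        self.device.generation
    }
}

/// Everything the app owns per device, bundled so it is rebuilt together.
pub struct Resources<'d> {
    pub vertices: Buffer<'d>,
    pub swap_chain: SwapChain<'d>,
}

impl<'d> Resources<'d> {
    pub fn create(device: &'d Device) -> Self {
        Resources {
            vertices: device.create_buffer(1024),
            swap_chain: device.create_swap_chain(),
        }
    }
}
```

Because each resource holds a borrow of its `Device`, the borrow checker rejects code that drops or replaces the old device while any old `Buffer` or `SwapChain` is still alive, so a forgotten resource becomes a compile error rather than a runtime bug. The cost is that resources can no longer outlive or be stored independently of the device borrow.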

Also, I'm not clear on whether the wgpu::Adapter should be lost, or whether it should just fail to create devices. Either way it would just produce a device lost error when used to create a device; the difference would be whether you need to recreate the adapter if the same device becomes available again.

pythonesque mentioned this issue Jul 10, 2021

start-capture (and probably stop-capture) are unsound #1625

Open

cwfitzgerald added the area: validation, help required, and type: enhancement labels Jul 10, 2021

teoxoy added this to the WebGPU Specification V1 milestone Feb 24, 2023

teoxoy mentioned this issue May 25, 2023

Issues running BERT on Windows webonnx/wonnx#166

Open

teoxoy added this to WebGPU for Firefox Dec 15, 2023

jimblandy self-assigned this May 20, 2024

jimblandy assigned teoxoy and unassigned jimblandy Jul 26, 2024

teoxoy moved this to Todo in WebGPU for Firefox Jul 27, 2024

teoxoy removed the status in WebGPU for Firefox Jul 27, 2024

teoxoy moved this to Todo in WebGPU for Firefox Jul 27, 2024

teoxoy mentioned this issue Sep 6, 2024

Invalidate the device when we encounter driver-induced device loss or on unexpected errors #6229

Merged

teoxoy closed this as completed in #6229 Sep 9, 2024

github-project-automation bot moved this from Todo to Done in WebGPU for Firefox Sep 9, 2024
