We finally figured out all the unknowns about Dota performance, and the simplified gist is that we need to get through recording faster, much faster than we do now. Deferred recording lets us create the illusion of recording very fast, but only by delaying the actual work until submission time. The trick here, as indicated in #2232, is that the driver and HW are able to do their work at the same time as we chew through the recording.
I was thinking... is there a way for us to both get to the submission point earlier AND benefit from online command buffer recording? What if we create deferred recording lists, but instead of having them wait for submission time, we start actually recording them into Metal command buffers on a hidden thread? Submission would then just wait for that thread to finish working on each submitted command buffer before calling `commit` on it.
What would that give us? It looks like a hacky solution, introducing implicit threading just to squeeze more performance out of Dota. It is indeed, but it would be interesting to see if it gives us a solid advantage ;)
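To make the idea concrete, here is a minimal sketch of the hidden-thread approach using only the Rust standard library. Every name in it (`SoftCommand`, `NativeCommandBuffer`, `RemoteRecorder`, `finish`, `submit`) is hypothetical and stands in for the real gfx-hal and Metal types; the "encoding" is simulated with strings, so it shows only the hand-off and the wait-before-commit, not actual Metal calls.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a deferred (software) command.
#[derive(Debug, Clone, PartialEq)]
enum SoftCommand {
    Draw(u32),
    Dispatch(u32),
}

// Hypothetical stand-in for a native Metal command buffer.
#[derive(Debug, Default)]
struct NativeCommandBuffer {
    encoded: Vec<String>,
    committed: bool,
}

// Handle to a command buffer that the hidden thread is still encoding.
struct PendingBuffer {
    done: mpsc::Receiver<NativeCommandBuffer>,
}

// A job is a finished deferred list plus a channel to send the result back on.
type Job = (Vec<SoftCommand>, mpsc::Sender<NativeCommandBuffer>);

struct RemoteRecorder {
    jobs: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl RemoteRecorder {
    fn new() -> Self {
        let (jobs, rx) = mpsc::channel::<Job>();
        // The hidden thread: replays deferred lists into "native" buffers.
        let worker = thread::spawn(move || {
            for (commands, reply) in rx {
                let mut native = NativeCommandBuffer::default();
                for command in &commands {
                    native.encoded.push(format!("{:?}", command));
                }
                // The receiver may be gone if the buffer was dropped unsubmitted.
                let _ = reply.send(native);
            }
        });
        RemoteRecorder { jobs: Some(jobs), worker: Some(worker) }
    }

    // Called when the user finishes recording: returns immediately,
    // while the actual encoding starts in the background.
    fn finish(&self, commands: Vec<SoftCommand>) -> PendingBuffer {
        let (reply, done) = mpsc::channel();
        self.jobs
            .as_ref()
            .expect("recorder is running")
            .send((commands, reply))
            .expect("worker thread is alive");
        PendingBuffer { done }
    }

    // Submission: block until the hidden thread is done, then "commit".
    fn submit(&self, pending: PendingBuffer) -> NativeCommandBuffer {
        let mut native = pending.done.recv().expect("worker finished the job");
        native.committed = true;
        native
    }
}

impl Drop for RemoteRecorder {
    fn drop(&mut self) {
        // Closing the job channel ends the worker loop; then join it.
        drop(self.jobs.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

In this sketch, `finish` returns right away and `submit` is the only place that blocks, which is the property the proposal is after. A real implementation would more likely hand work to a dispatch queue or thread pool than to a single dedicated thread.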
2260: Remote command sink in Metal r=grovesNL a=kvark
Fixes #2259
The results so far are not super promising: highly unstable (presumably because of the dispatch), with performance around the `Immediate` mark. We are still missing the most important follow-up here: avoiding any heap allocations when recording commands. Currently, it just starts with `Vec::new()` and grows it for each pass, which shows up in the profile quite a bit.
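One possible shape for that follow-up, sketched under assumptions: keep the per-pass command lists in a small pool and reuse them, relying on `Vec::clear` keeping the allocated capacity. `PassCommand` and `PassListPool` are hypothetical names for illustration and are not part of this PR or of gfx-hal.

```rust
// Hypothetical stand-in for a command recorded within a pass.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PassCommand {
    Draw { vertices: u32 },
}

// Pool of per-pass command lists that are recycled instead of reallocated.
struct PassListPool {
    free: Vec<Vec<PassCommand>>,
}

impl PassListPool {
    fn new() -> Self {
        PassListPool { free: Vec::new() }
    }

    // Take a recycled list if one is available, otherwise start empty.
    // `Vec::new()` itself does not allocate; growth on first use does.
    fn acquire(&mut self) -> Vec<PassCommand> {
        self.free.pop().unwrap_or_default()
    }

    // Return a list to the pool. `clear` drops the elements but keeps
    // the backing allocation, so the next pass can reuse it.
    fn release(&mut self, mut list: Vec<PassCommand>) {
        list.clear();
        self.free.push(list);
    }
}
```

After a few frames of warm-up, passes of similar size would stop hitting the allocator during recording. The pool's own `free` vector can still grow, but only when more lists are in flight than ever before.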
The PR also has a bunch of general optimizations:
- HAL change in the descriptor allocation API to avoid the heap
- lighten up the Metal descriptor binding path (a bit) by making sure there are enough state slots in advance
- simplification and refactoring of `CommandSink` implementations
PR checklist:
- [ ] `make` succeeds (on *nix)
- [ ] `make reftests` succeeds
- [x] tested examples with the following backends: Metal
- [ ] `rustfmt` run on changed code
Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>