Metal enqueue() advantage #2232

Open · kvark opened this issue Jul 13, 2018 · 3 comments
kvark (Member) commented Jul 13, 2018

Looking at a Metal System Trace capture, I realized the real value of enqueuing command buffers earlier than submitting them. It's not documented much, but the change in driver behavior is drastic.

Theory

When a command buffer is enqueued, each pass gets passed down to the driver as soon as we call endEncoding on it, since the driver knows it doesn't need to wait for anything and is expecting the work. Consequently, the GPU starts chewing on the work right away. Thus, commit() becomes a simple message saying "I'm done with this", leaving the submission queue free for other things to use.
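A minimal Swift sketch of the pattern, assuming `queue` and `passDescriptors` are already set up elsewhere (the helper function and names are made up for illustration):

```swift
import Metal

// Hypothetical setup: `queue` is an MTLCommandQueue, `passDescriptors` holds
// already-configured MTLRenderPassDescriptor values for this frame.
func encodeFrame(queue: MTLCommandQueue, passDescriptors: [MTLRenderPassDescriptor]) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    // Reserve a slot on the queue *before* encoding anything. From here on the
    // driver knows this buffer is next in line, so every endEncoding() below can
    // be handed down (and, per the observation above, picked up by the GPU)
    // without waiting for commit().
    commandBuffer.enqueue()

    for descriptor in passDescriptors {
        guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else { continue }
        // ... draw calls go here ...
        encoder.endEncoding() // the pass flows to the driver immediately
    }

    // With the buffer already enqueued, commit() is little more than
    // "I'm done with this" — the GPU may well be working already.
    commandBuffer.commit()
}
```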

Basically, I call this paragraph BS:

The enqueue method does not make the command buffer eligible for execution. To submit the command buffer to the device for execution, a subsequent commit method invocation is required.

Now, let's look at what happens if we don't enqueue anything explicitly:

If the command buffer has not previously been enqueued, it is enqueued implicitly.

Sounds pretty harmless, doesn't it? Well, what really happens is that the driver doesn't want to do anything with our encoded passes until the command buffer gets committed. The passes get stacked up inside the command buffer (much like our software commands) and are then dropped on the driver like a bomb upon the commit() call.
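For contrast, here is the same hypothetical loop without the explicit enqueue(); if the observation above holds, the encoded passes just pile up until commit() implicitly enqueues the buffer:

```swift
import Metal

// Same made-up setup as the sketch above, but without the early enqueue().
func encodeFrameWithoutEnqueue(queue: MTLCommandQueue, passDescriptors: [MTLRenderPassDescriptor]) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    for descriptor in passDescriptors {
        guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else { continue }
        // ... draw calls go here ...
        encoder.endEncoding() // encoded, but nothing reaches the GPU yet
    }

    // commit() implicitly enqueues the buffer, and only now does the whole
    // stack of passes land on the driver at once.
    commandBuffer.commit()
}
```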

Practice

Let's get more concrete. Suppose gfx-portability spends X amount of time recording an application's (one-time) command buffer. The driver does some work, but it's able to propagate it to the GPU gradually and doesn't take longer than the GPU itself, so we only need to account for its latency L. Finally, the GPU takes G amount of time to finish the work on those commands. Let's see how the work flows:

  1. The user records a command buffer, spending X time.
  2. The command buffer gets submitted and flows to the driver, which starts propagating the work to the GPU after L time.
  3. The GPU takes G time to execute the work.

Total time: X + L + G
Encoding thread time: X
Submission thread time: ~0

Now, let's look at MoltenVK:

  1. The user "records" a command buffer, spending 0.5X time. Molten just copies the commands over internally, not touching Metal yet.
  2. The command buffer gets submitted. This is when Molten starts actually recording those commands, spending an extra 0.75X time. The trick is that Molten enqueues the command buffer right before doing any work, so the driver is ready to receive the encoded passes.
  3. The GPU starts the work earlier, right at the first pass encoded by the submission. Let's say the GPU is the slowest link here, so it will try to keep up with the work and take G time in total, like in our previous case.

Total time: 0.5X + max(0.75X, L + G)
Encoding thread time: 0.5X
Submission thread time: 0.75X

See what happened here? There is more work in total, but it's spread over threads, and actually completes faster because the GPU gets stuff to work on earlier. Now X here can be logically extended to the total recording time (instead of a single command buffer), given that it's the submission cut-off that matters, and you can see how this can drastically affect performance in the end.
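Plugging in some made-up numbers (say X = 8 ms of recording, L = 1 ms of driver latency, G = 10 ms of GPU work) makes the difference concrete:

```swift
// Hypothetical timings in milliseconds, just to illustrate the two formulas above.
let x = 8.0   // X: time to record the command buffer
let l = 1.0   // L: driver latency before the GPU sees the work
let g = 10.0  // G: GPU execution time

// Eager recording, implicit enqueue at commit(): the GPU only starts after everything is committed.
let eagerTotal = x + l + g                           // 8 + 1 + 10 = 19 ms

// Deferred recording with an early enqueue(): re-encoding overlaps with the GPU work.
let deferredTotal = 0.5 * x + max(0.75 * x, l + g)   // 4 + max(6, 11) = 15 ms

print("eager: \(eagerTotal) ms, deferred: \(deferredTotal) ms")
```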

Solutions

In an ideal world, Vulkan would have some sort of API to tell the driver (earlier than at submission time) which order the one-time encoded command buffers are going to be submitted in. This isn't going to happen, though.

A more practical alternative would be to try forcing deferred command buffer recording on our side and see how this affects frame scheduling. This would technically zero out one of our major advantages, and it would become a race over whose software command buffers are lighter.
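A rough sketch of what such a deferred path could look like against Metal — commands are captured into a plain array during "recording" and only replayed into a real MTLCommandBuffer at submission time, right after enqueue(). This is only an illustration of the idea, not gfx-portability's or MoltenVK's actual implementation, and the SoftCommand type is made up:

```swift
import Metal

// Made-up software command representation; a real backend would cover the full API.
enum SoftCommand {
    case beginPass(MTLRenderPassDescriptor)
    case setPipeline(MTLRenderPipelineState)
    case draw(vertexStart: Int, vertexCount: Int)
    case endPass
}

// Replay the deferred commands at submit time. Enqueuing first lets the driver pick up
// each pass as soon as its endEncoding() is called, instead of waiting for commit().
func submit(deferred: [SoftCommand], on queue: MTLCommandQueue) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }
    commandBuffer.enqueue()

    var encoder: MTLRenderCommandEncoder?
    for command in deferred {
        switch command {
        case .beginPass(let descriptor):
            encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor)
        case .setPipeline(let pipeline):
            encoder?.setRenderPipelineState(pipeline)
        case .draw(let vertexStart, let vertexCount):
            encoder?.drawPrimitives(type: .triangle, vertexStart: vertexStart, vertexCount: vertexCount)
        case .endPass:
            encoder?.endEncoding()
            encoder = nil
        }
    }
    commandBuffer.commit()
}
```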

Finally, on the application side, we'd benefit from more granular submissions. Dota makes about a thousand command buffers, but only submits them in 2 chunks per frame, so we are being delayed by roughly a quarter of the frame time here.

zakarumych commented Jul 13, 2018

Shouldn't applications that use Vulkan submit command buffers as soon as they are recorded?
In that case the GPU has work to do as soon as the first command buffer is recorded.

I have no idea what Dota records in thousands of command buffers.
AFAIK a render pass can't start in one command buffer and end in another.
So either:

  • those are secondary buffers,
  • there are thousands of render passes (I doubt it), or
  • it re-enters the same render passes multiple times.

kvark (Member, Author) commented Jul 13, 2018

I looked a bit more at the Metal System Trace numbers, added more stats to our backend locally to estimate the total pipeline length, and the numbers now started to match up and make more sense to me.

In an average frame, Dota submits 34 immediate command buffers in two fenced batches (thus, 2 completion handlers and 2 extra temporary command buffers). The total number of active command buffers in our pool is around 880, from which we can conclude the pipeline length to be roughly 22 frames. This matches the frame IDs reported by the system trace for window presentation exactly. Edit: no, the pipeline length is still 2 frames, guarded by the fences. Dota probably just recorded a bunch of command buffers that it sits on without submitting.

Yes, it looks like we are able to record 22 frames in advance, which means we are not really CPU limited, and neither adding more threads nor deferring command buffer recording would help. What's stopping us from producing frames faster on screen is some sort of synchronization issue between the application/driver/GPU. I can see that the GPU is not nearly as busy as it could be, so there is a huge potential to run faster here ;)

kvark (Member, Author) commented Jul 14, 2018

As an experiment, I forced deferred command buffer recording in gfx-portability and got a steady 15% performance boost (from 72 to 85 FPS). The Metal System Trace shows the GPU work nicely overlapping with recording, as predicted in the subject of this issue. We might want to focus on optimizing this path a bit, given how beneficial it is for some cases.

Interestingly, there are short periods (~0.5 sec) where the FPS jumps up significantly (not just single frames). Perhaps what limits us here is some sort of contention on device access (e.g. for descriptor allocation and writing) that just goes away at times because Dota figures out how to re-use existing descriptors.
