Metal enqueue() advantage #2232

Open · kvark opened this issue Jul 13, 2018 · 3 comments
kvark (Member) commented Jul 13, 2018

Looking at a Metal System Trace capture, I realized the real value of enqueuing command buffers earlier than submitting them. It's not documented much, but the change in driver behavior is drastic.

Theory

When a command buffer is enqueued, each pass gets passed down to the driver as soon as we call endEncoding on it, since the driver knows it doesn't need to wait for anything and is expecting the work. Consequently, the GPU starts chewing on the work right away. Thus, commit() becomes a simple message saying "I'm done with this", leaving the submission queue free for other things to use.
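A minimal Swift sketch of the pattern, assuming `queue` and `passDescriptors` are already set up elsewhere (the helper function and names are made up for illustration):

```swift
import Metal

// Hypothetical setup: `queue` is an MTLCommandQueue, `passDescriptors` holds
// already-configured MTLRenderPassDescriptor values for this frame.
func encodeFrame(queue: MTLCommandQueue, passDescriptors: [MTLRenderPassDescriptor]) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    // Reserve a slot on the queue *before* encoding anything. From here on the
    // driver knows this buffer is next in line, so every endEncoding() below can
    // be handed down (and, per the observation above, picked up by the GPU)
    // without waiting for commit().
    commandBuffer.enqueue()

    for descriptor in passDescriptors {
        guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else { continue }
        // ... draw calls go here ...
        encoder.endEncoding() // the pass flows to the driver immediately
    }

    // With the buffer already enqueued, commit() is little more than
    // "I'm done with this" — the GPU may well be working already.
    commandBuffer.commit()
}
```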

Basically, I call this paragraph BS:

The enqueue method does not make the command buffer eligible for execution. To submit the command buffer to the device for execution, a subsequent commit method invocation is required.

Now, let's look at what happens if we don't enqueue anything explicitly:

If the command buffer has not previously been enqueued, it is enqueued implicitly.

Sounds pretty harmless, doesn't it? Well, what really happens is that the driver doesn't want to do anything with our encoded passes until the command buffer gets committed. The passes get stacked up inside the command buffer (much like our software commands) and are then dropped on the driver like a bomb upon the commit() call.
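For contrast, here is the same hypothetical loop without the explicit enqueue(); if the observation above holds, the encoded passes just pile up until commit() implicitly enqueues the buffer:

```swift
import Metal

// Same made-up setup as the sketch above, but without the early enqueue().
func encodeFrameWithoutEnqueue(queue: MTLCommandQueue, passDescriptors: [MTLRenderPassDescriptor]) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    for descriptor in passDescriptors {
        guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else { continue }
        // ... draw calls go here ...
        encoder.endEncoding() // encoded, but nothing reaches the GPU yet
    }

    // commit() implicitly enqueues the buffer, and only now does the whole
    // stack of passes land on the driver at once.
    commandBuffer.commit()
}
```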

Practice

Let's get more concrete. Suppose gfx-portability spends X amount of time recording an application's (one-time) command buffer. The driver does some work, but it's able to propagate it to the GPU gradually and doesn't take longer than the GPU itself, so we only need to account for its latency L. Finally, the GPU takes G amount of time to finish the work on those commands. Let's see how the work flows:

  1. The user records a command buffer, spending X time.
  2. The command buffer gets submitted and flows to the driver, which starts propagating the work to the GPU after L time.
  3. The GPU takes G time to execute the work.

Total time: X + L + G
Encoding thread time: X
Submission thread time: ~0

Now, let's look at MoltenVK:

  1. The user "records" a command buffer, spending 0.5X time. Molten just copies the commands over internally, not touching Metal yet.
  2. The command buffer gets submitted. This is when Molten starts actually recording those commands, spending an extra 0.75X time. The trick is that Molten enqueues the command buffer right before doing any work, so the driver is ready to receive the encoded passes.
  3. The GPU starts the work earlier, right at the first pass encoded by the submission. Let's say the GPU is the slowest link here, so it will try to keep up with the work and take G time in total, like in our previous case.

Total time: 0.5X + max(0.75X, L + G)
Encoding thread time: 0.5X
Submission thread time: 0.75X

See what happened here? There is more work in total, but it's spread over threads, and actually completes faster because the GPU gets stuff to work on earlier. Now X here can be logically extended to the total recording time (instead of a single command buffer), given that it's the submission cut-off that matters, and you can see how this can drastically affect performance in the end.
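Plugging in some made-up numbers (say X = 8 ms of recording, L = 1 ms of driver latency, G = 10 ms of GPU work) makes the difference concrete:

```swift
// Hypothetical timings in milliseconds, just to illustrate the two formulas above.
let x = 8.0   // X: time to record the command buffer
let l = 1.0   // L: driver latency before the GPU sees the work
let g = 10.0  // G: GPU execution time

// Eager recording, implicit enqueue at commit(): the GPU only starts after everything is committed.
let eagerTotal = x + l + g                           // 8 + 1 + 10 = 19 ms

// Deferred recording with an early enqueue(): re-encoding overlaps with the GPU work.
let deferredTotal = 0.5 * x + max(0.75 * x, l + g)   // 4 + max(6, 11) = 15 ms

print("eager: \(eagerTotal) ms, deferred: \(deferredTotal) ms")
```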

Solutions

In an ideal world, Vulkan would have some sort of API to tell the driver (earlier than at submission time) which order the one-time encoded command buffers are going to be submitted in. This isn't going to happen, though.

A more practical alternative would be to try forcing deferred command buffer recording on our side and see how this affects frame scheduling. This would technically zero out one of our major advantages, and it would become a race over whose software command buffers are lighter.
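A rough sketch of what such a deferred path could look like against Metal — commands are captured into a plain array during "recording" and only replayed into a real MTLCommandBuffer at submission time, right after enqueue(). This is only an illustration of the idea, not gfx-portability's or MoltenVK's actual implementation, and the SoftCommand type is made up:

```swift
import Metal

// Made-up software command representation; a real backend would cover the full API.
enum SoftCommand {
    case beginPass(MTLRenderPassDescriptor)
    case setPipeline(MTLRenderPipelineState)
    case draw(vertexStart: Int, vertexCount: Int)
    case endPass
}

// Replay the deferred commands at submit time. Enqueuing first lets the driver pick up
// each pass as soon as its endEncoding() is called, instead of waiting for commit().
func submit(deferred: [SoftCommand], on queue: MTLCommandQueue) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }
    commandBuffer.enqueue()

    var encoder: MTLRenderCommandEncoder?
    for command in deferred {
        switch command {
        case .beginPass(let descriptor):
            encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor)
        case .setPipeline(let pipeline):
            encoder?.setRenderPipelineState(pipeline)
        case .draw(let vertexStart, let vertexCount):
            encoder?.drawPrimitives(type: .triangle, vertexStart: vertexStart, vertexCount: vertexCount)
        case .endPass:
            encoder?.endEncoding()
            encoder = nil
        }
    }
    commandBuffer.commit()
}
```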

Finally, on the application side, we'd benefit from more granular submissions. Dota makes about a thousand command buffers, but only submits them in 2 chunks per frame, so we are being delayed by roughly a quarter of the frame time here.

zakarumych commented Jul 13, 2018

Shouldn't applications that use Vulkan submit command buffers as soon as they are recorded?
In that case the GPU has work to do as soon as the first command buffer is recorded.

I have no idea what Dota records in thousands of command buffers.
AFAIK a render pass can't start in one command buffer and end in another.
So either:

  • those are secondary buffers,
  • there are thousands of render passes (I doubt it), or
  • it re-enters the same render passes multiple times.

kvark (Member, Author) commented Jul 13, 2018

I looked a bit more at the Metal System Trace numbers, added more stats to our backend locally to estimate the total pipeline length, and the numbers now started to match up and make more sense to me.

In an average frame, Dota submits 34 immediate command buffers in two fenced batches (thus, 2 completion handlers and 2 extra temporary command buffers). The total number of active command buffers in our pool is around 880, from which we can conclude the pipeline length to be roughly 22 frames. This matches the frame IDs reported by the system trace for window presentation exactly. Edit: no, the pipeline length is still 2 frames, guarded by the fences. Dota probably just recorded a bunch of command buffers that it sits on without submitting.

Yes, it looks like we are able to record 22 frames in advance, which means we are not really CPU limited, and neither adding more threads nor deferring command buffer recording would help. What's stopping us from producing frames faster on screen is some sort of synchronization issue between the application/driver/GPU. I can see that the GPU is not nearly as busy as it could be, so there is a huge potential to run faster here ;)

kvark (Member, Author) commented Jul 14, 2018

As an experiment, I forced deferred command buffer recording in gfx-portability and got a steady 15% performance boost (from 72 to 85 FPS). The Metal System Trace shows the GPU work nicely overlapping with recording, as predicted in the subject of this issue. We might want to focus on optimizing this path a bit, given how beneficial it is for some cases.

Interestingly, there are short periods (~0.5 sec) where the FPS jumps up significantly (not just single frames). Perhaps what limits us here is some sort of contention on device access (e.g. for descriptor allocation and writing) that just goes away at times because Dota figures out how to re-use existing descriptors.
