Reducing inference time by almost 2 seconds #1592

CapsAdmin · 2023-07-06T18:59:15Z

Most of the performance optimization efforts I've seen are focused on the actual inference, i.e. "Time taken: **s" or "*.**it/s", but I feel this is a little misleading, as a lot more can contribute to performance problems.

So in this post I will change the metric to "time between clicking generate and seeing the final image", or click to image time.

I'm on Linux using Firefox. My GPU is an AMD 6900XT. I start with the default settings and no additional extensions, then change to the euler sampler with 1 step to make the GPU do as little work as possible.

To measure the click to image time, run the following JavaScript code in the browser console:

{
  // get the generate button
  let button = document.getElementById("txt2img_generate");

  // remove the callback if already exists so we don't have
  // to refresh the page and get multiple callbacks running
  if (window.perf_callback) {
    button.removeEventListener("mousedown", window.perf_callback);
  }

  let mouseDown = () => {
    console.log("start!");
    let startTime = performance.now();

    // the mutation observer watches for changes to the text inside the button
    let mo;
    mo = new MutationObserver((d) => {
      // wait for the button text to change back to "Generate"
      // (guard against mutations that add no nodes or non-text nodes)
      let node = d[0].addedNodes[0];
      let buttonText = (node && node.nodeValue) || "";
      if (buttonText.includes("Generate")) {
        let diff = performance.now() - startTime;
        console.log("seconds: " + diff / 1000);

        mo.disconnect();
        mo = undefined;
      }
    });

    mo.observe(button, {
      childList: true,
    });
  };

  // store the callback so it can be caught next time we run this script with modifications
  window.perf_callback = mouseDown;

  // use mousedown as it's instantaneous
  button.addEventListener("mousedown", mouseDown);
}

Right off the bat, we see that the click to image time seems to be around 2.2 seconds. However "Time taken" reports that it took 1.22 seconds. So there's a whole 1 second overhead somewhere that seems to be doing something outside of inference. Moreover the 1.22 seconds is also a bit suspicious given that we only used 1 step.

gc.collect

So, ignoring the mysterious 1 second for now and focusing on the 1.22 seconds: my investigation with the profiler revealed that most of the time is spent calling gc.collect().

If we add the following code at the top of launch.py, we can see where this happens:

import gc
import sys
import time

oldGCCollect = gc.collect

def newCollect(*args, **kwargs):
    start = time.time()
    result = oldGCCollect(*args, **kwargs)
    end = time.time()
    # frame 1 is the function that does gc.collect and torch gc,
    # so step back one more frame to report who called that function
    caller = sys._getframe(1).f_back
    print(f"gc.collect() took {end - start}s called from {caller.f_code.co_filename}:{caller.f_lineno}")
    return result

gc.collect = newCollect

Which results in:

gc.collect() took 0.34053707122802734s called from /home/caps/projects/stable-diffusion/automatic/modules/shared.py:154
gc.collect() took 0.3344266414642334s called from /home/caps/projects/stable-diffusion/automatic/modules/processing.py:754
gc.collect() took 0.3325066566467285s called from /home/caps/projects/stable-diffusion/automatic/modules/shared.py:160

These all originate from devices.torch_gc in modules/shared.py

But that's a total of 1 second (0.33*3)! This can somewhat be solved by checking "Disable Torch memory garbage collection" in settings, because the function does nothing when that option is enabled. But this function does 2 things: it collects garbage on the Python side and on the GPU side. Perhaps we want the GPU to collect garbage to avoid OOM issues, but not the CPU?
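To illustrate the question above, here is a minimal sketch of doing the two cleanups independently. It is not the project's actual torch_gc; `collect_garbage` and its parameters are hypothetical names, and torch is treated as optional so the sketch runs without it.

```python
import gc

try:
    import torch  # optional; the sketch still runs without it
except ImportError:
    torch = None


def collect_garbage(python_side: bool = True, gpu_side: bool = True) -> dict:
    """Hypothetical helper: run the Python and GPU cleanups independently."""
    done = {"python": False, "gpu": False}
    if python_side:
        gc.collect()  # the slow part measured above (~0.33s per call)
        done["python"] = True
    if gpu_side and torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached, unused GPU memory blocks
        done["gpu"] = True
    return done


# Skip the Python-side pass during generation but still free cached GPU memory.
print(collect_garbage(python_side=False, gpu_side=True))
```

Whether skipping only the Python side is enough to avoid OOM on low-end GPUs is an open question here, not something this sketch demonstrates.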

Turning this option on won't stop other extensions from calling gc.collect() manually. So it's possible to just override gc.collect entirely, as I've shown here, and comment out the oldGCCollect call. One such extension is controlnet, which calls gc.collect directly when controlnet is enabled. I created an issue about this here: Mikubill/sd-webui-controlnet#1462

So with Python garbage collection disabled completely, we're down to "Time taken: 0.22s" as opposed to 1.22s. However, the click to image time is still 1.5 seconds.
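A rough sketch of that override, written so the real collector is always restored afterwards. `suppress_gc_collect` is a hypothetical helper, not part of the webui, and it only catches callers that look up `gc.collect` at call time (code that did `from gc import collect` earlier keeps the original).

```python
import contextlib
import gc


@contextlib.contextmanager
def suppress_gc_collect():
    """Hypothetical helper: turn gc.collect into a counted no-op inside the block."""
    original = gc.collect
    stats = {"suppressed": 0}

    def fake_collect(*args, **kwargs):
        stats["suppressed"] += 1
        return 0  # gc.collect normally returns the number of unreachable objects

    gc.collect = fake_collect
    try:
        yield stats
    finally:
        gc.collect = original  # always restore the real collector


with suppress_gc_collect() as stats:
    gc.collect()  # stands in for an extension calling gc.collect directly
    gc.collect()

print(stats["suppressed"])
```

Counting the suppressed calls also gives a quick way to see how often extensions ask for a collection during one generation.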

ControlNet units

There's something about controlnet that affects the click to image time in proportion to the number of units it adds, even when they are not in use. In SDNext, 3 units are added by default.

The cause seems to be that controlnet adds a .click callback to the generate button to fix some obscure bug. I've made an issue about it here: Mikubill/sd-webui-controlnet#1461

As a workaround (which may have side effects, but I have yet to find them) you can remove the click call entirely:

https://github.com/Mikubill/sd-webui-controlnet/blob/dd766de8629ee6035a734217e08c26cd1b08b2ab/scripts/controlnet_ui/controlnet_ui_group.py#L921-L930

After removing the code, the click to image time is now down to 1 second as opposed to 1.5 seconds. If you have more units enabled, the "before" time should be even higher.

live preview polling

The next thing I found is that the "Progressbar/preview update period, in milliseconds" setting affects click to image time in some way. I believe in a1111 this is set a little bit high by default, but basically it seems that on average the update period will be added to the click to image time. This makes sense because completion is only checked on this interval.

Setting it to 1 as opposed to 250 does seem to reduce the time a little bit; now I'm down to 0.8-0.9 seconds.
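The effect of the polling interval can be sketched with a small simulation. This is a simplified model, not the webui's polling code: it assumes the frontend polls at fixed multiples of the interval and that generation finishes at a uniformly random moment (the 0.2s to 2.2s range is an arbitrary assumption). Under that model the extra wait averages about half the interval, with a full interval as the worst case.

```python
import math
import random


def added_delay(finish_time: float, poll_interval: float) -> float:
    """Delay between the job finishing and the next poll noticing it."""
    next_poll = math.ceil(finish_time / poll_interval) * poll_interval
    return next_poll - finish_time


def mean_added_delay(poll_interval: float, trials: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    # Assumption: finish times are spread uniformly between 0.2s and 2.2s.
    total = sum(added_delay(rng.uniform(0.2, 2.2), poll_interval) for _ in range(trials))
    return total / trials


for interval in (0.25, 0.001):
    print(f"poll every {interval}s -> mean added delay {mean_added_delay(interval):.4f}s")
```

An average of roughly an eighth of a second at 250 ms is in the same range as the 0.1 to 0.2 second improvement measured above, though the real frontend may behave differently.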

--disable-queue launch parameter

Disabling gradio queues also seems to improve the click to image time a lot. Disabling the queue brings me down to 0.6 seconds.

I suspect there's just overhead in how the queuing system in gradio works?

Chrome vs Firefox

I'm a Firefox user, but I was curious to see how this performed in Chrome to see if the frontend code could be the cause. Using Chrome, the click to image time is actually down to 0.42 seconds, whereas in Firefox it's 0.6 seconds. That's roughly 0.2 seconds of overhead in Chrome and 0.4 seconds in Firefox (after subtracting the "Time taken: 0.2s").
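The subtraction behind those overhead figures, spelled out as a tiny sketch using only the numbers quoted above (the variable names are just for illustration):

```python
TIME_TAKEN = 0.2  # seconds reported by the UI as "Time taken"

# measured click to image times per browser, in seconds
click_to_image = {"chrome": 0.42, "firefox": 0.6}

# overhead = everything that is not counted in "Time taken"
overhead = {
    browser: round(total - TIME_TAKEN, 2)
    for browser, total in click_to_image.items()
}
print(overhead)  # {'chrome': 0.22, 'firefox': 0.4}
```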

Remaining 0.2 / 0.4 seconds?

After doing all of this, generating images feels a lot snappier. However, I'm not sure what the remaining time is. The UI reports 0.2 seconds but the click to image time is 0.2 - 0.6 seconds, so I can only assume it must be something in gradio or how gradio is set up.
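For reference, the click to image times reported at each step of this post can be tallied like this. It is only a recap of the numbers above; the one assumption is that 0.9 stands in for the "0.8-0.9" range measured after lowering the update period.

```python
# Click to image time (seconds) after each change, as reported in this post.
# Assumption: 0.9 is used for the "0.8-0.9" range after lowering the poll period.
stages = [
    ("baseline", 2.2),
    ("python gc disabled", 1.5),
    ("controlnet click callback removed", 1.0),
    ("preview update period set to 1 ms", 0.9),
    ("--disable-queue", 0.6),
    ("chrome instead of firefox", 0.42),
]

previous = stages[0][1]
for name, seconds in stages[1:]:
    print(f"{name}: {seconds:.2f}s (saved {previous - seconds:.2f}s)")
    previous = seconds

total_saved = round(stages[0][1] - stages[-1][1], 2)
print(f"total saved: {total_saved}s")  # total saved: 1.78s
```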

vladmandic · 2023-07-06T20:45:04Z

Maintainer

yup, all valid and known.

re: gc
btw, it's safe to disable gc in settings as it still runs before/after model load, it's just skipped during image generation. i personally run with gc disabled pretty much always.
it's enabled by default as there are a lot of users with low-end gpus, but i'm thinking of switching it to disabled by default.

re: live preview
well, it's because to get the preview image it needs to be interpolated from raw data. the type of live preview has a big impact. full > taesd > approx nn > simple.

re: queue - it shouldn't

re: controlnet js click handler - thanks for pointing that out

btw, you forgot the simplest, built-in method to instrument where time is being spent - start webui with --debug and it will tell you.
for example:

16:32:53-927564 DEBUG Script process: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']
16:32:53-928517 DEBUG Script before-process-batch: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']
16:32:53-929193 DEBUG Script process-batch: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']
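If you want to total these per-script timings, a small parser along these lines could work. It is a sketch that assumes the log lines look exactly like the three samples above; `parse_debug_line` is a hypothetical helper, not something the webui provides.

```python
import ast
import re

# Assumed line shape: "<timestamp> DEBUG Script <stage>: ['<name>:<seconds>s', ...]"
LINE_RE = re.compile(r"^(?P<ts>\S+) DEBUG Script (?P<stage>[\w-]+): (?P<items>\[.*\])$")


def parse_debug_line(line: str) -> tuple[str, dict[str, float]]:
    """Return (stage, {script name: seconds}) for one 'DEBUG Script ...' line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    timings = {}
    for item in ast.literal_eval(match.group("items")):
        name, _, seconds = item.rpartition(":")  # split on the last colon only
        timings[name] = float(seconds.rstrip("s"))
    return match.group("stage"), timings


sample = "16:32:53-928517 DEBUG Script before-process-batch: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']"
stage, timings = parse_debug_line(sample)
print(stage, timings, sum(timings.values()))
```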

CapsAdmin · 2023-07-06T20:52:56Z

Author

re: gc btw, it's safe to disable gc in settings as it still runs before/after model load, it's just skipped during image generation. i personally run with gc disabled pretty much always. it's enabled by default as there are a lot of users with low-end gpus, but i'm thinking of switching it to disabled by default.

I'm not very familiar with Python, but I've experienced something similar with Lua in Garry's Mod. In this game we only had 2 GB of system RAM available, and many of the "extensions" to the game created a lot of Lua objects every frame, causing garbage to build up and then a big spike when it was collected. Out of fear, some people started doing full manual garbage collection when they really shouldn't have been creating so much garbage in the first place.

One strategy was to just override collectgarbage so no extensions could call it, and instead collect smaller amounts of garbage every other second. Perhaps something similar could be done here by just running gc.collect() after a generation so you won't notice.

However, I kinda feel this is done because some users run out of system memory when swapping models. Perhaps it could be done only with low/medvram instead (however, you can't control what extensions do).
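A rough Python sketch of the throttling idea described above: replace gc.collect with a wrapper that only performs a real collection if enough time has passed since the last one. `ThrottledCollector` and the 2 second interval are hypothetical choices for illustration, not an existing webui feature.

```python
import gc
import time


class ThrottledCollector:
    """Hypothetical replacement for gc.collect that limits how often it really runs."""

    def __init__(self, min_interval: float = 2.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.real_collect = gc.collect
        self.last_run = None
        self.skipped = 0

    def __call__(self, *args, **kwargs) -> int:
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.min_interval:
            self.skipped += 1  # too soon since the last real collection
            return 0
        self.last_run = now
        return self.real_collect(*args, **kwargs)


collector = ThrottledCollector(min_interval=2.0)
gc.collect = collector  # extensions calling gc.collect() now go through the throttle

gc.collect()  # runs for real
gc.collect()  # skipped: less than 2 seconds since the previous run
print(collector.skipped)

gc.collect = collector.real_collect  # restore the original
```

The injectable clock is there only to make the behaviour easy to check without sleeping; Python's automatic generational collection keeps running either way.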

1 reply

vladmandic Jul 6, 2023
Maintainer

gc is always done before/after model load regardless of gc being disabled in settings. disabling gc in settings disables it running during each generate. i'll edit the settings description to make it clearer.
and with that, i really don't see the need to have it run always, i'll change the default.
but true, can't control what extensions do.

CapsAdmin · 2023-07-06T21:05:43Z

Author

btw, you forgot the simplest, built-in method to instrument where time is being spent - start webui with --debug and it will tell you. for example:

16:32:53-927564 DEBUG Script process: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']
16:32:53-928517 DEBUG Script before-process-batch: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']
16:32:53-929193 DEBUG Script process-batch: ['Dynamic Thresholding (CFG Scale Fix):0.0s', 'Agent Scheduler:0.0s']

This doesn't really show anything useful in this case; everything takes 0.0s. If I enable controlnet with canny it says 0.04s, but in this post I'm just testing with euler a, 1 sample, no prompt and nothing else enabled. Moreover, what I'm optimizing here is the actual time the whole round trip takes, from frontend to backend and back to the frontend.

CapsAdmin · 2023-07-06T21:08:31Z

Author

re: live preview well, it's because to get the preview image it needs to be interpolated from raw data. the type of live preview has a big impact. full > taesd > approx nn > simple.

I've realized this has more to do with the Generate button than the image you get. You can disable live previews entirely, set the polling rate to 5000 and it won't let you generate a new image unless 5 seconds have passed. If you set it to 250, you can sometimes see the final image a little bit before you can click on generate again.

CapsAdmin · 2023-07-06T21:13:50Z

Author

re: queue - it shouldn't

But it does. I've double checked now, on and off, and it consistently reduces the time. Again, not the time you measure in the backend when doing the heavy lifting, but the whole round trip excluding that.

2 replies

CapsAdmin Jul 6, 2023
Author

I've confirmed that with this simple gradio app that generates random noise on the GPU. With the gradio queue enabled, clicking generate in Chrome and looking at the network tab shows around 140ms; with the queue disabled it takes around 40ms.

There are obvious benefits to having a queue, and you could argue that in this case the added delay is worth it because no one is generating images with 1 sample.

vladmandic Jul 6, 2023
Maintainer

thanks for confirming - and having a test app is great!
now i'd like to ask for a follow-up: could you create an issue with gradio? i'll chime in there as well once created.

YukiSakuma · 2023-08-26T19:33:30Z

Disabling the gc.collect worked wonders, I can generate 3 batches of images and the overhead before showing the images is less than a second, BUT after I updated to the current commit 594f033, the overhead before showing the images has increased to 5 seconds. Why? I double checked that the garbage collect is still disabled.

1 reply

YukiSakuma Aug 26, 2023

I reverted back to commit 8ef4aa7 and it's back to near instant image show

vladmandic · 2023-08-27T08:01:58Z

Maintainer

should be fixed now.
