Replies: 7 comments 4 replies
-
Yup, all valid and known.
re: gc
re: live preview
re: queue - it shouldn't
re: controlnet js click handler - thanks for pointing that out
Btw, you forgot the simplest built-in method to instrument where time is being spent - start
-
I'm not very familiar with Python, but I've experienced something similar with Lua in Garry's Mod. In that game we only had 2 GB of system RAM available, and many of the "extensions" to the game created a lot of Lua objects every frame, causing garbage to build up and then a spike whenever a full collection ran. Out of fear, some people started doing full manual garbage collection when they really shouldn't have been creating so much garbage in the first place. One strategy was to override collectgarbage so that no extension could call it, and instead collect smaller amounts of garbage every other second. Perhaps something similar could be done here by running gc.collect() after a generation, so you won't notice it. However, I kinda feel this is done because some users run out of system memory when swapping models. Perhaps it could be done only on low/medvram instead. (However, you can't control what extensions do.)
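A rough Python sketch of that "block direct calls, collect later" idea, assuming the patch is installed before extensions are loaded. Only `gc.collect` is a real API here; `run_pending_collect` and the other names are hypothetical, and where the webui would call it (for example after the image has been sent back) is an assumption.

```python
import gc

_real_collect = gc.collect  # the genuine collector, kept for later use
_pending = False            # set when someone asked for a collection


def _deferred_collect(*args, **kwargs):
    """Replacement for gc.collect: remember the request, do no work now."""
    global _pending
    _pending = True
    return 0  # gc.collect normally returns the number of unreachable objects


def run_pending_collect():
    """Run one real collection if any was requested since the last run."""
    global _pending
    if not _pending:
        return None
    _pending = False
    return _real_collect()


gc.collect = _deferred_collect
```

Code that did `from gc import collect` before the patch keeps the original function, so this only catches callers that look up `gc.collect` at call time.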
-
This doesn't really show anything useful in this case; everything takes 0.0s. If I enable controlnet with canny it says 0.04s, but in this post I'm just testing with euler a, 1 sample, and no prompt or anything else enabled. Moreover, here I'm optimizing the actual time it takes, all the way from frontend to backend to frontend.
-
I've realized this has more to do with the Generate button than with the image you get. You can disable live previews entirely and set the polling rate to 5000, and it won't let you generate a new image until 5 seconds have passed. If you set it to 250, you can sometimes see the final image a little bit before you can click on Generate again.
-
But it does. I've double-checked now, on and off, and it consistently reduced the time. Again, not the time you measure in the backend when doing the heavy lifting, but the whole roundtrip excluding that.
-
Disabling the gc.collect worked wonders: I can generate 3 batches of images and the overhead before showing the images is less than a second. BUT after I updated to the current commit 594f033, the overhead before showing the images has increased to 5 seconds. Why? I double-checked that garbage collection is still disabled.
-
Should be fixed now.
-
Most of the performance optimization efforts I've seen are focused on the actual inference, i.e. "Time taken: **s" or "*.**it/s", but I feel this is a little misleading, as there's a lot more that can contribute to performance problems.
So in this post I will change the metrics to "time between clicking generate and seeing the final image" or click to image time.
I'm on Linux using Firefox, and my GPU is an AMD 6900XT. I start with the default settings and no additional extensions, then change to the Euler sampler with 1 step to make the GPU do as little work as possible.
To measure the click to image time, we run the following JavaScript code in the browser console:
Right off the bat, we see that the click to image time seems to be around 2.2 seconds. However "Time taken" reports that it took 1.22 seconds. So there's a whole 1 second overhead somewhere that seems to be doing something outside of inference. Moreover the 1.22 seconds is also a bit suspicious given that we only used 1 step.
gc.collect
So, ignoring the mysterious 1 second for now and focusing on the 1.22 seconds: my investigation with the profiler revealed that most of the time is spent calling gc.collect().
So if we add the following code at the top of launch.py we can see where this happens:
Which results in:
Which all originates from devices.torch_gc in modules/shared.py.
But that's a total of 1 second (0.33*3)! This can somewhat be solved by checking "Disable Torch memory garbage collection" in settings, because the function will do nothing if that's enabled. But this function does 2 things: it does garbage collection on the Python side and on the GPU side. Perhaps we want the GPU to collect garbage to avoid OOM issues, but not the CPU?
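A minimal sketch of what a "GPU side only" cleanup could look like. `torch.cuda.empty_cache()` and `torch.cuda.ipc_collect()` are real PyTorch calls, but the function name `gpu_only_gc` is hypothetical, and whether this matches what devices.torch_gc does internally is an assumption.

```python
def gpu_only_gc():
    """Release cached GPU memory without running Python's gc.collect().

    Returns True if the GPU cleanup ran, False if it was skipped.
    """
    try:
        import torch
    except ImportError:
        return False  # no PyTorch installed, nothing to release
    if not torch.cuda.is_available():
        return False  # no usable GPU device
    torch.cuda.empty_cache()   # hand cached, unused blocks back to the driver
    torch.cuda.ipc_collect()   # clean up memory held for inter-process sharing
    return True
```

Note that skipping the Python-side collection means tensors that are only kept alive by reference cycles stay allocated until the interpreter's automatic collector gets to them, so this trades some memory headroom for latency.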
Turning on "Disable Torch memory garbage collection" won't stop extensions from calling gc.collect() manually, though. So it's possible to override gc.collect entirely, as I've shown here, and just comment out the call to oldGCCollect. One such extension is controlnet, which calls gc.collect directly when it's enabled. I created an issue about this here: Mikubill/sd-webui-controlnet#1462
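The override described above can be sketched roughly like this. Only `gc.collect` and the name `oldGCCollect` come from the post; `SKIP_COLLECT`, `timed_collect` and the log format are made up for illustration, and the sketch assumes it runs before other modules are imported.

```python
import gc
import time
import traceback

oldGCCollect = gc.collect  # keep a reference to the real collector
SKIP_COLLECT = False       # hypothetical switch: True turns every call into a no-op


def timed_collect(*args, **kwargs):
    """Stand-in for gc.collect that reports the duration and the caller."""
    if SKIP_COLLECT:
        return 0  # pretend nothing was collected
    start = time.perf_counter()
    result = oldGCCollect(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # extract_stack(limit=2) yields the caller's frame, then this wrapper's frame.
    caller = traceback.extract_stack(limit=2)[0]
    print(f"gc.collect took {elapsed:.3f}s, called from "
          f"{caller.filename}:{caller.lineno} in {caller.name}")
    return result


gc.collect = timed_collect
```

Modules that imported the function directly (`from gc import collect`) before the patch was applied will still reach the original collector, which is why the patch has to go in as early as possible.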
So with Python garbage collection disabled completely, we're down to "Time taken: 0.22s" as opposed to 1.22s. However, the click to image time is still 1.5 seconds.
ControlNet units
ControlNet seems to affect the click to image time depending on the number of units it adds, even when they're not in use. In SDNext, 3 units are added by default.
The cause seems to be that ControlNet adds a .click callback to the Generate button to fix some obscure bug. I've made an issue about it here: Mikubill/sd-webui-controlnet#1461
As a workaround (which may have side effects, but I have yet to find them) you can remove the click call entirely.
https://github.com/Mikubill/sd-webui-controlnet/blob/dd766de8629ee6035a734217e08c26cd1b08b2ab/scripts/controlnet_ui/controlnet_ui_group.py#L921-L930
After removing the code, the click to image time is down to 1 second as opposed to 1.5 seconds. If you have more units enabled, the time before the change should be even higher.
live preview polling
The next thing I found is that the "Progressbar/preview update period, in milliseconds" affects click to image time in some way. I believe in a1111 this is set a little bit high by default, but basically it seems that on average the update period will be added to the click to image time. This makes sense because it only checks if it's finished based on this interval.
Setting it to 1 as opposed to 250 does seem to reduce the time a little bit, now I'm down to 0.8-0.9 seconds.
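To make the effect of the update period concrete, here is a toy model. It assumes the frontend polls at a fixed interval starting one period after the click, and only notices that the backend has finished on the next poll. This is a simplification for illustration, not a description of how the webui's progress polling is actually implemented, and both function names are hypothetical.

```python
def detection_delay_ms(finish_ms: int, period_ms: int) -> int:
    """Extra wait between the backend finishing and the next poll noticing it."""
    polls_needed = -(-finish_ms // period_ms)  # ceiling division on integers
    polls_needed = max(polls_needed, 1)        # the first poll happens after one period
    return polls_needed * period_ms - finish_ms


def average_delay_ms(period_ms: int, max_finish_ms: int = 5000) -> float:
    """Mean extra wait over every finish time from 1 ms up to max_finish_ms."""
    delays = [detection_delay_ms(t, period_ms) for t in range(1, max_finish_ms + 1)]
    return sum(delays) / len(delays)


# Backend time of 220 ms, checked against three polling periods.
for period in (1, 250, 5000):
    print(period, detection_delay_ms(220, period), average_delay_ms(period))
```

Under this model the added wait is anywhere between zero and one full period, and roughly half a period when averaged over finish times, so shortening the period shrinks the overhead directly.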
--disable-queue launch parameter
Disabling gradio queues also seems to improve the click to image time a lot. Disabling the queue brings me down to 0.6 seconds.
I suspect there's just overhead in how the queuing system in gradio works?
Chrome vs Firefox
I'm a Firefox user, but I was curious to see how this performed in Chrome, to check whether the frontend code could be the cause. In Chrome the click to image time is actually down to 0.42 seconds, whereas in Firefox it's 0.6 seconds. That's about 0.2 seconds of overhead in Chrome and 0.4 seconds in Firefox (subtracting the "Time taken: 0.2s").
Remaining 0.2 / 0.4 seconds?
After doing all of this, generating images feels a lot snappier. However, I'm not sure what the remaining time is. The UI reports 0.2 seconds but the click to image time is 0.42 to 0.6 seconds depending on the browser, so I can only assume it must be something in gradio or in how gradio is set up.