[Bug]: Upscaling very large images with SD Ultimate Upscale has VERY SLOW preprocessing (75% of total execution time) #1648
I don't think the tile resample preprocessor takes that long. Neither does the model loading. From the log timestamps you can see these tasks are completed relatively fast. However, we can add more debug logs to pinpoint the issue.
@huchenlei most are definitely very short -- especially the loading preprocessor. But these two lines:

2023-06-15 20:33:24,653 - ControlNet - INFO - Loading model from cache: control_v11f1e_sd15_tile [a371b31b]

This indicates 8 seconds to load the model from the cache -- am I reading that correctly? Thanks.
Sorry, I misread the timestamps. There can be other things between these 2 log statements that are taking long. Loading the model from cache is simply accessing an item in a dict, which shouldn't be the culprit. I am going to add more debug logs and try to reproduce the issue.
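For context, the "load from cache" path described here amounts to something like the following minimal sketch -- not the extension's actual code; `model_cache` and `load_model_cached` are illustrative names:

```python
# Minimal sketch of a model cache: the "Loading model from cache" log line
# corresponds to a dict lookup, which cannot by itself account for seconds.
model_cache = {}

def load_model_cached(name, loader):
    if name not in model_cache:
        model_cache[name] = loader(name)  # slow path: hits disk once
    return model_cache[name]              # fast path: plain dict access
```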
After adding some timing logs:
The issue seems to be in the A1111 mask handling code:

```python
if 'reference' not in unit.module and issubclass(type(p), StableDiffusionProcessingImg2Img) \
        and p.inpaint_full_res and a1111_mask_image is not None:
    logger.debug("A1111 inpaint mask START")
    input_image = [input_image[:, :, i] for i in range(input_image.shape[2])]
    input_image = [Image.fromarray(x) for x in input_image]
    mask = prepare_mask(a1111_mask_image, p)
    crop_region = masking.get_crop_region(np.array(mask), p.inpaint_full_res_padding)
    crop_region = masking.expand_crop_region(crop_region, p.width, p.height, mask.width, mask.height)
    input_image = [
        images.resize_image(resize_mode.int_value(), i, mask.width, mask.height)
        for i in input_image
    ]
    input_image = [x.crop(crop_region) for x in input_image]
    input_image = [
        images.resize_image(external_code.ResizeMode.OUTER_FIT.int_value(), x, p.width, p.height)
        for x in input_image
    ]
    input_image = [np.asarray(x)[:, :, 0] for x in input_image]
    input_image = np.stack(input_image, axis=2)
    logger.debug("A1111 inpaint mask END")
```

I am not sure why running the tile preprocessor in img2img (not inpaint) with Ultimate SD Upscaler triggers this logic. I think you can add more logs to pinpoint which line is causing the problem. I cannot reproduce the 8s cost on my local setup, though. Let's assign this to @lllyasviel as I am really not familiar with this masking code.
Interesting -- thanks for looking into this. It looks like two processes take up a large portion of the execution time in your example: the inpaint mask and the detectmap_proc both take around 2 seconds. Since you didn't mention the second, I'm guessing it is unavoidable, essential code. I am surprised, like you, that the inpaint mask code is running -- it seems p.inpaint_full_res and a1111_mask_image should both be false/None when doing an img2img upscale with the CN Tile model, but I don't know the code well enough to know whether CN or SDUS relies on this logic for some reason. I might do a test where I force-skip this logic to see if it has any adverse effects while speeding up the upscaling...
OK, I think I have an answer to the masking question. Here is the code from Ultimate Upscaler -- it is using the mask functionality of A1111 to crop the tiles:

```python
def init_draw(self, p, width, height):
```
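The body of that method is omitted above, but the gist of the mechanism is roughly the following -- a hedged sketch, not the extension's verbatim code; `make_tile_mask` is an illustrative name:

```python
from PIL import Image, ImageDraw

def make_tile_mask(image_w, image_h, x, y, tile_w, tile_h):
    # Paint a white rectangle (the current tile) onto an otherwise black mask.
    # A1111's "inpaint at full resolution" path then crops the image to this
    # region -- which is exactly the mask-handling code timed above.
    mask = Image.new("L", (image_w, image_h), 0)
    draw = ImageDraw.Draw(mask)
    draw.rectangle((x, y, x + tile_w, y + tile_h), fill=255)
    return mask
```

In other words, p.inpaint_full_res and the A1111 mask are set deliberately: the extension repurposes inpaint cropping to isolate each tile, which is why that mask code runs once per tile.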
I think we can definitely make some improvements in the tiling/cropping process, as currently, if the image is big, the cropping can take a significant amount of time. That should also be why the mask code only runs for 2s in my reproduction: I am using a much smaller input image (2048 x 3072). To further improve efficiency, I think all crops should be done in a single pass, instead of cropping a single tile off the input image, processing it, and repeating for the number of tiles -- see the sketch below.
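As a rough illustration of the single-pass idea (a sketch under stated assumptions, not a patch; `iter_tiles` is a hypothetical helper): numpy slices are views rather than copies, so producing tiles this way does no per-tile full-image work, unlike the timed code above, which resizes and crops the whole image once for every tile.

```python
import numpy as np

def iter_tiles(image: np.ndarray, tile_w: int, tile_h: int):
    """Yield (x, y, tile) for every tile of an HxWxC array in one pass."""
    h, w = image.shape[:2]
    for y in range(0, h, tile_h):
        for x in range(0, w, tile_w):
            # Slicing returns a view: no full-image copy or resize per tile.
            yield x, y, image[y:y + tile_h, x:x + tile_w]
```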
Exactly what I was thinking -- but I was concerned that this is an optimization that needs to happen on the SD Upscaler side... I would be curious to hear how you think this optimization could happen. It would save a lot of rendering time at all stages. Thanks!
@huchenlei How do I enable debug logging so I can investigate a few things? Thanks!
You need to add […]. If you want the debug logging of how long each function takes in […]:
If you just want to understand how long a specific part of the code takes, you can just add some log messages to the part you are interested in -- no need to do any of the things mentioned above. For example:
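A minimal sketch of such a timing log (the logger name, placement, and message are illustrative):

```python
import logging
import time

logger = logging.getLogger("ControlNet")

start = time.perf_counter()
# ... the section under suspicion, e.g. the A1111 inpaint mask handling ...
logger.debug("suspect section took %.3fs", time.perf_counter() - start)
```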
@lllyasviel nope -- I've not upgraded my nvidia drivers, still on 528. This is definitely related to cropping/masking code inefficiencies, and probably other things as well, but that would require investigation.
I see you're using a VERY old version of Torch. If you do these upgrades, please post your new speeds. It would be interesting to see the difference. -V
@Vendaciousness -- thanks for the comment. Interesting that you got such a great performance improvement. Back when A1111 updated the torch version, I did indeed update to torch 2 and had a variety of nasty side effects. One was the forever-hanging generations that required noodling with live preview settings, and the other was that my performance actually got slower with torch 2. Perhaps I didn't have the correct pairing of nVidia drivers and cuDNN binaries, but honestly, when I encounter issues like that in the middle of an important production project (which I have been working on for 9 weeks now), I hesitate to go all in on an upgrade process that may end in tears and frustration and the need to go all the way back to square one to retrieve an old configuration.

As such, I forked the A1111 repo and have my own custom configuration, which I am willing to abandon pretty soon, as I am reaching the end of a pretty intense production run and can afford to get off track for a bit. I will give it a shot (this time backing up my venv folder) and will post results here!

Incidentally, do you have any recommendations on specifics around updating the cuDNN binaries? I haven't done that in the A1111 environment yet, so I don't know if there are any idiosyncrasies... For instance, do I need the latest nvidia drivers? And if so, weren't there some issues with those recently that were forcing users to roll back?

Marc
Incidentally, here is where I am right now. See any red flags?

[nvidia-smi output]
Yeah, so I just upgraded to torch 2 and I'm getting pretty similar or slightly worse performance: Canva size: 13824x18432. Image gen denoising is trending more toward 13-14s instead of 12-13s. I noticed the cuDNN binaries in torch/lib are pretty old, and it's hard to tell, but they seem like they were built on CUDA 11. My system has CUDA 12 installed. Could this be a problem? If I replace the dlls in the lib folder with the ones from nVidia's website, should I use the bin files from v11 or v12? Are there any other files that need to be replaced? TIA
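As a quick sanity check (not from the thread, just a standard way to inspect this), torch exposes the CUDA and cuDNN versions the wheel was built against, which should tell you whether it is a CUDA 11 build before swapping any DLLs:

```python
import torch

print(torch.version.cuda)              # CUDA version the wheel was built against, e.g. "11.7"
print(torch.backends.cudnn.version())  # bundled cuDNN version as an int, e.g. 8700
print(torch.cuda.get_device_name(0))   # confirms the GPU is visible
```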
So after updating my binaries, denoising and the problematic tile-cropping code perform similarly to my previous results on torch 1 -- however, some of the other intermediary processes seem to have been sped up, and my total time has gone down from 97m to about 89m. So that's positive. Plus, now I'm up to date with A1111 and torch 2, so it's a success in general, if not a success related to the issue posted here. There is still a great opportunity to optimize the tile cropping process, it seems. @huchenlei -- any thoughts on how we might get this to proceed? I know there are a lot of priorities right now, but this could have a pretty great impact on performance system-wide, even for large batches of smaller operations. I'm headed to Europe for a few weeks, but I'm happy to help out where I can when I get back. Thanks all. Marc
I am working on expanding test coverage right now. Without enough test coverage, I don't think I have enough confidence to tackle the convoluted mask code in […].
Not sure if this is also related to the issue, but there was a time when I updated the checkpoint cache number and ControlNet model cache number, and the time CN used to spend loading the model and preprocessors between tiles was just gone -- the tiles were processed immediately after one another, even with large images like 10000x10000. But then, after a while, it went back to the same behaviour as before: waiting quite a few seconds loading models between tiles. I can't reproduce that behaviour, so there could be some information missing; just thought this info could be helpful. Also looking for solutions to this as well. Cheers. I am using a 4090 and torch 2+, CUDA 12.1.
@marcsyp Sorry for the delay, I've been in the middle of a crazy project crunch time. Anyway, regarding the cuDNN binaries, I would install the latest CUDA Toolkit, here: https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64

Automatic1111 already ships a fairly new version of the cuDNN binaries nowadays, but if you want the latest ones, you need to create a developer account on Nvidia to download them, here: https://developer.nvidia.com/rdp/cudnn-download

Then you copy the files located in the bin folder and overwrite the ones in your venv folder, in [Auto1111 folder]\venv\Lib\site-packages\torch\lib\

I would in general recommend you do what I do and create a second install of Auto1111; perhaps you might try Vlad's automatic, which is kind of like a souped-up race-car version of Auto1111. It's usually faster than my install of Auto1111, but bear in mind it has more bugs as a result. The second install uses the models/checkpoint folders of the first, so there are only a few extra gigs of files on the drive. I also copy over my ui-config.json, config.json, params.txt and webui-user.bat files from my main install, so both copies are set up the same way, unless I need something set up special for a project.

Anyway, I'll usually run the second version unless it keeps crashing; then I move to the backup primary, and when the second, newer version has worked great for a month or more, I copy it over and make it my new primary. It's basically a fork, but one done for redundancy, so that I'll always have a production copy that works. Lately, I've had 3 versions I run: an LTS, an LTS candidate, and a bleeding-edge version (usually Vlad's automatic). Hope this helps!
@huchenlei -- I am returning to the original art project that surfaced this issue, having upgraded my webui to 1.6.0 and ControlNet to 1.1.443. The preprocessor performance has gotten significantly WORSE for the same exact 3x upscale described in the original issue: Canva size: 13824x18432. CN hook times vary wildly from 9s to 26s. Any idea what is going on here? Have there been any updates in A1111 past 1.6.0 that may help this issue? I'm reluctant to update in the middle of this project, but I would consider it.
Can't blame you there. If I wanted to process images to those specs, I wouldn't even use A1111 anymore, but rather Stable Forge, an optimized version of A1111 made by the creator of ControlNet. Just be careful to create a new install for SF. Don't ruin your copy of A1111 by 'upgrading' it with the Stable Forge update instructions they have there -- it just broke my existing install, and all the extensions were different anyway. Stable Forge: https://github.com/lllyasviel/stable-diffusion-webui-forge Or, if I needed the images to be extremely detailed (quality over speed), maybe SUPIR, a new upscale method that gives the best results I've seen.
Thanks @Vendaciousness -- a couple of people have pointed me to Stable Forge, and I will definitely consider it, particularly as a completely fresh install. The reason I haven't switched so far is that my upscale workflow relies on a heavily developed fork of Ultimate Upscale, and given my cursory reading of the Forge documentation, I was concerned that some modification of my plugin would be required to get it working in SF, with no guarantee of success. I'm also stuck on torch 1.13 for this project because image reproduction changed significantly moving to torch 2 (which would destroy my upscale workflow, which relies on reproducible seeds), so I'm not sure I'm ready to go down that experimentation route just yet. But I will keep it in my back pocket if I get frustrated. Thanks for the heads-up on SUPIR; this is the first I'm seeing of it.
That's rough. I've had my own custom workflow for seamless-looping and seamless-tiling AI videos broken for many months as a result of version mismatches, so I fully relate. The most frustrating thing is that it definitely worked in the past, until it broke one day and I was too busy to trace down the cause at the time. I'm trying to use SF for some super-high-res (20k+ per side) upscales right now, so if you want to add me on Discord, I'll share what I learn. Maybe I can adjust my workflow to replicate your requirements. I've used Ultimate SD Upscaler in the past, but had better results using the 'Multidiffusion and Tiled VAE' extension, which I think uses the same tiling method to slice the job into bite-sized chunks. lllyasviel has built this into Stable Forge, along with HyperTile, Kohya's HR (high-res) Fix, and some other potentially useful tools, so it could be a good fit -- but bear in mind it's a stale repo. The dev is brilliant, but super flaky. I can confirm extension compatibility is hit and miss, as you've read, though older (1.6-1.8) compatible versions may be fine. That is, I don't believe there is an inbuilt incompatibility. Add me if you want: digitalhitman@contractor.net -v
Yeah, I made the mistake of not cloning my venv and keeping detailed records of plugins/settings early on in the project, so I have no way to return completely to my original state. Frustrating, but then it was all back in the early days before configs were even properly implemented, and we were all just experimenting. I just didn't realize the project would have legs and that I would want to keep working on it for so long; as a result, I'm on a pseudo-workable but not ideal setup right now. The biggest part of the workflow I built in my private fork is custom seed, CFG, and prompt extraction and manipulation as part of the upscale process, and I can't really live without it for at least parts of the process when doing batch operations -- but for final upscales on individual pieces, if I can find a good workflow for ultra-large upscales, I'm open to it. So yeah, I'd love to hear about what you learn. I'll add you on Discord in the off hours. Cheers
Here's a more recent ComfyUI-based upscale method that combines SUPIR and some others, so if you'd rather learn ComfyUI (it's a steep learning curve, but much more capable), check it out: |
Is there an existing issue for this?
What happened?
I am doing very large upscales using SD Ultimate Upscale and the ControlNet tile model. Upscaling 4608 x 6144 images 3x to 13824 x 18432 works quite well with my workflow, but it is VERRRRRY slow, particularly on the model loading between each tile. The actual tile rendering is quite fast at 0.24 denoising (roughly 12 seconds on a 3080Ti), but the preprocessing step between each tile is close to 29 seconds, which is roughly 70% of the total execution time.
From the time stamps, it looks like the loading of the tile model from cache may account for 9 of the 29 seconds, but I'm not sure what accounts for the remaining 20 seconds, perhaps some are consumed by the tiling process of SDUS itself (comparable upscale without CN enabled has about 8s of denoising, 17s of prep), so CN could be responsible for up to 12s.
These renders take about 1h40m each, but if there were some way to optimize the model loading, I feel like there is a LOT of opportunity for gains here, drastically reducing the time required for very large upscales. 8-12 seconds times 144 tiles is 19-29 minutes of time just spent on model loading/preprocessing.
(NOTE: Smaller upscales also have a decently long model-loading wait, but not unbearable. For a 2x of a 2304x3072 with 2304x768 tiles and 0.24 denoising (with more steps), I'm getting 17s of rendering and 9s of loading/preprocessing, with a tile size that is slightly larger -- that's roughly 34% of the total execution time. Not sure why the model loading is faster with a smaller upscale when the tile size is actually larger. Total execution per render here is roughly 6.75m, so it's less of an issue.)
Any thoughts here appreciated!
Steps to reproduce the problem
What should have happened?
Would be great if the preprocessing step were roughly equivalent to that of a single-tile-size image (1-5s instead of 30s). I don't know if there are technical limitations that prevent this, or whether it's simply an inefficient algorithm that works fine at most smaller resolutions and that nobody has questioned at larger ones.
Commit where the problem happens
webui:
python: 3.10.6 • torch: 1.13.1+cu117 • xformers: 0.0.16rc425 • gradio: 3.23.0 • commit: 22bcc7be • checkpoint: 9aba26abdf
controlnet:
1.1.224
What browsers do you use to access the UI?
No response
Command Line Arguments
List of enabled extensions
Console logs
Additional information
This behaviour is not new; it has been this way since the first time I started doing these large upscales, probably a month ago.