Support more Upscalers #14794
base: dev
Conversation
Wow, some nice improvements, thank you!
Great work on this! This fixes the black squares/all-black images with non-ESRGAN upscalers. There are a couple issues to iron out though, independent of model type:
Another note about the second issue: it is caused by the VAE, as upscaling works as expected when VAE encode is set to TAESD.
Cool – I have a WIP branch that'd refactor the entire upscaling infrastructure into something a little saner. That won't block merging this, but just a heads-up that there's a bit of duplicate work between here and there.
As I mentioned in the other comment, I've been working on reworking the upscaler architecture to something saner, but that doesn't necessarily block this – I'll just take care to take all of this into account.
Beyond the inline comments, I think all of the new Spandrel upscalers would benefit from a common superclass or mixin; they're mostly copy-pasted now anyway.
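For illustration, a rough sketch of what such a shared mixin could look like, loading through Spandrel directly; the class and attribute names (and the default values) here are hypothetical, not code from this PR:

import spandrel
import torch


class SpandrelUpscalerMixin:
    # Per-architecture subclasses override these; the values are illustrative.
    expected_architecture = "ESRGAN"
    default_tile_size = 192
    default_tile_padding = 8

    def load_spandrel_model(self, path: str, device: torch.device) -> spandrel.ImageModelDescriptor:
        descriptor = spandrel.ModelLoader(device=device).load_from_file(path)
        # descriptor.architecture / descriptor.scale could be logged or checked
        # against expected_architecture here if stricter validation is wanted.
        descriptor.model.eval()
        return descriptor


class UpscalerHATLike(SpandrelUpscalerMixin):
    expected_architecture = "HAT"
    default_tile_size = 256
    default_tile_padding = 32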
modules/upscaler_utils.py
Outdated
import numpy as np
import torch
from PIL import Image


def pil_image_to_torch_bgr(img: Image.Image) -> torch.Tensor:
    img = np.array(img.convert("RGB"))
    img = img[:, :, ::-1]  # flip RGB to BGR
    img = np.transpose(img, (2, 0, 1))  # HWC to CHW
    img = np.ascontiguousarray(img) / 255  # Rescale to [0, 1]
    return torch.from_numpy(img)


def torch_bgr_to_pil_image(tensor: torch.Tensor) -> Image.Image:
    if tensor.ndim == 4:
        # If we're given a tensor with a batch dimension, squeeze it out
        # (but only if it's a batch of size 1).
        if tensor.shape[0] != 1:
            raise ValueError(f"{tensor.shape} does not describe a BCHW tensor")
        tensor = tensor.squeeze(0)
    assert tensor.ndim == 3, f"{tensor.shape} does not describe a CHW tensor"
    # TODO: is `tensor.float().cpu()...numpy()` the most efficient idiom?
    arr = tensor.float().cpu().clamp_(0, 1).numpy()  # clamp
    arr = 255.0 * np.moveaxis(arr, 0, 2)  # CHW to HWC, rescale
    arr = arr.round().astype(np.uint8)
    arr = arr[:, :, ::-1]  # flip BGR to RGB
    return Image.fromarray(arr, "RGB")
I'd like it if these function names were retained, even if they did just call the Torchvision transforms. The assertions and comments are there for a reason too :)
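For instance, a rough sketch (not the PR's code) of keeping the names, assertions and comments while delegating the conversion to torchvision's functional transforms, with the BGR flip and the clamp still explicit:

import torch
import torchvision.transforms.functional as TF
from PIL import Image


def pil_image_to_torch_bgr(img: Image.Image) -> torch.Tensor:
    tensor = TF.pil_to_tensor(img.convert("RGB"))  # uint8 CHW, RGB order
    tensor = tensor.flip(0)                        # flip RGB to BGR
    return tensor.to(torch.float32) / 255.0        # rescale to [0, 1]


def torch_bgr_to_pil_image(tensor: torch.Tensor) -> Image.Image:
    if tensor.ndim == 4:
        # Squeeze out a batch dimension, but only if it's a batch of size 1.
        if tensor.shape[0] != 1:
            raise ValueError(f"{tensor.shape} does not describe a BCHW tensor")
        tensor = tensor.squeeze(0)
    assert tensor.ndim == 3, f"{tensor.shape} does not describe a CHW tensor"
    tensor = tensor.flip(0).float().cpu().clamp_(0, 1)  # flip BGR to RGB, clamp to [0, 1]
    return TF.to_pil_image(tensor)                      # float CHW in [0, 1] -> PIL "RGB"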
For the time being I've just removed the Torchvision transforms. I noticed that when I was testing a DAT model, there was a large color shift. I'm unclear whether this is just because I had added FP32 conversions and clamps beyond your original code, or something else entirely. I should probably check against chaiNNer later to see which is the closest match.
Compared to your code, the Torchvision functions I used did the following internally; I then clamped every time after converting to float32 tensors, before converting to a PIL Image:
ToPILImage
...
pic = pic.permute((1, 2, 0))
pic = pic.numpy(force=True)
npimg = pic
if np.issubdtype(npimg.dtype, np.floating) and mode != "F":
    npimg = (npimg * 255).astype(np.uint8)
if mode is None and npimg.dtype == np.uint8:
    mode = "RGB"
return Image.fromarray(npimg, mode=mode)
...

PILToTensor
...
img = torch.as_tensor(np.array(pic, copy=True))
img = img.view(pic.size[1], pic.size[0], F_pil.get_image_num_channels(pic))
img = img.permute((2, 0, 1))
return img
...

ToDtype
...
if float_input:
    # float to float
    if float_output:
        return image.to(dtype)

    # float to int
    eps = 1e-3
    max_value = float(_max_value(dtype))
    return image.mul(max_value + 1.0 - eps).to(dtype)
else:
    # int to float
    if float_output:
        return image.to(dtype).mul_(1.0 / _max_value(image.dtype))

    # int to int
    num_value_bits_input = _num_value_bits(image.dtype)
    num_value_bits_output = _num_value_bits(dtype)

    if num_value_bits_input > num_value_bits_output:
        return image.bitwise_right_shift(num_value_bits_input - num_value_bits_output).to(dtype)
    else:
        return image.to(dtype).bitwise_left_shift_(num_value_bits_output - num_value_bits_input)
...
modules/upscaler.py
Outdated
if img.width >= dest_w and img.height >= dest_h:
if img.width > dest_w and img.height > dest_h:
    break
Can you explain why this (and the other if at the end of the loop) are necessary? 🤔 As I see it, this implementation will do an extra upscale, just to downscale back down with Lanczos outside the loop?
The line at the bottom was part of 30b1bcc (1x scale models work with this). The newer commit 4a66638 just moved the line to the top, which breaks out of the loop before the upscale has a chance to run under the same conditions as the prior commit. This had the side effect of breaking 1x scale models, which would previously still be run, at which point the line at the bottom would just return the output immediately.
I can only assume the Lanczos downscale is intentional? Maybe it is what allows users to get custom output sizes out of 2x/3x/4x/8x upscaling models?
AUTOMATIC1111 would likely need to chime in here if you believe the entire loop is wrong.
Yes, as far as I can tell the Lanczos (down)scale is intentional, exactly to get fractional scales out of models that are generally just 2x/3x/4x.
I think at least a code comment is needed to explain the logic to support 1x models – and in the case of a 1x model, it might be better to do the Lanczos upscaling first, then refine the scaled output with the 1x model, right?
I think I get what you are saying about Lanczos, but do people actually willingly scale the image with a 1x model selected, rather than just using them in Extras, considering how they are implemented currently?
This really comes back to your upcoming refactor of the scaling code: consider adding something to parse and record the model scale somewhere like cache.json after the model is loaded for the first time. Or maybe another option would be to dump them in a separate sub-folder where all models are implicitly considered 1x post-processing models and can be handled differently.
Overall I think trying to further adapt this section to 1x models, without actually knowing whether a 1x model is being used, is impractical. I'll add a comment there though, explaining what is happening.
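For reference, a commented sketch of what that loop is doing (condensed, with illustrative names, not a verbatim copy of upscaler.py):

from typing import Callable
from PIL import Image


def upscale(img: Image.Image, scale: float, do_upscale: Callable[[Image.Image], Image.Image]) -> Image.Image:
    # Target size, rounded down to a multiple of 8.
    dest_w = int((img.width * scale) // 8 * 8)
    dest_h = int((img.height * scale) // 8 * 8)

    for _ in range(3):
        # Stop as soon as the image is already at or above the target size.
        # Note: for a 1x model this check can be hit before the model ever
        # runs, which is the behavior discussed above.
        if img.width >= dest_w and img.height >= dest_h:
            break

        shape = (img.width, img.height)
        img = do_upscale(img)

        # If the model did not change the size (e.g. a 1x model), another pass
        # will not change it either, so stop instead of looping again.
        if shape == (img.width, img.height):
            break

    # Fractional factors (e.g. 1.3x requested from a 4x model) are reached by
    # letting the model overshoot, then downscaling to the exact target with Lanczos.
    if img.width != dest_w or img.height != dest_h:
        img = img.resize((dest_w, dest_h), resample=Image.Resampling.LANCZOS)

    return img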
Pretty nice PR! Finally being able to use DAT and HAT upscalers without a black image with hires fix. I can confirm at least what @gel-crabs said: using certain upscale factors (like 1.3x) seems to produce some noise or blurriness (using base 896x1088, for example). Setting it to 1.25x, 1.5x or 2x seems to work fine (with any of the ESRGAN, HAT, SwinIR or DAT upscalers I tried).
Oh, and by the way @Cyberbeing, if you can, it would be cool if the fixes for black output etc. could be split off into a separate, easier-to-review-and-merge PR, and the new upscalers could stay in here :)
@gel-crabs I believe the first issue is expected: previously, Float32 models whose architecture Spandrel has marked as not supporting Float16 were being autocast to Float16 anyway during HiresFix (but not Extras), occasionally causing NaNs because of it. Disabling autocast for upscaling now makes Float32-only models upscale entirely in Float32, so a speed difference in HiresFix should be expected, and it should now match the speed in Extras. Architectures which Spandrel allows to run in Float16, such as ESRGAN, shouldn't see any significant speed difference before and after.
Spandrel architectures supporting Float32/BFloat16 only: CodeFormer, DAT, GFPGAN, GRL, HAT, SRFormer, SwinIR
Spandrel architectures supporting Float32/Float16/BFloat16: Compact, ESRGAN/RealESRGAN, OmniSR, SPAN
As for the second issue, I've tried reproducing it but I'm not sure I see it exactly. My local setup is non-standard though, since I am running Torch 2.3.0 nightly and keep most dependencies up-to-date rather than using the pinned versions. Backing through the commits, I do see that the nearest-exact and torchvision commits each changed the output somewhat, so I've force-pushed the branch without those two commits. @gel-crabs @Panchovix can you still reproduce the issue?
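To make the autocast point concrete, a simplified sketch of running a Spandrel model with autocast kept out of the call and the input matched to the model's own dtype (illustrative only, assuming a CUDA device, not the exact patch):

import spandrel
import torch


def run_upscale_model(descriptor: spandrel.ImageModelDescriptor, image_bchw: torch.Tensor) -> torch.Tensor:
    # Spandrel records whether an architecture can run in float16 at all
    # (descriptor.supports_half). Instead of letting autocast force everything
    # to float16 (NaNs / black images for float32-only archs), autocast is
    # disabled and the input follows the dtype the model was loaded in.
    param = next(descriptor.model.parameters())
    image_bchw = image_bchw.to(device=param.device, dtype=param.dtype)
    with torch.autocast(device_type="cuda", enabled=False), torch.no_grad():
        return descriptor(image_bchw).to(torch.float32)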
The new branch without torchvision/nearest-exact works with the full VAE! For whatever reason though, the tiled upscale is still only about half as fast with ESRGAN upscalers despite being in float16. Another thing to consider for the future is support for ONNX upscalers, as all the HAT models are also distributed with an ONNX version in float16. That probably lies with Spandrel though.
@Cyberbeing Just tested with odd resolutions and it now works fine! It doesn't seem to give me artifacts or blurriness. I'm using torch nightly as well, but from the start of the month, so maybe it doesn't have some changes you were using.
I'm unsure, since for the most part it just looked like the output was different in some spots. If anything, I suspect it must have been changing the following line to nearest-exact, since it is related to supporting multiple-of-8 dimensions, but I'm unsure why it would result in broken output. @Panchovix Can you double-check if changing only that line back to nearest-exact re-triggers your issue?
Testing it, changing to nearest-exact effectively triggers the issue again. My torchvision version is . Using just "nearest" works perfectly.
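For anyone who wants to compare the two modes in isolation, a minimal, self-contained snippet (made-up latent-sized tensor, not webui code):

import torch
import torch.nn.functional as F

# Latents are BCHW; 896x1088 corresponds to 112x136 at the 8x latent scale.
latent = torch.randn(1, 4, 112, 136)

up_nearest = F.interpolate(latent, scale_factor=1.3, mode="nearest")
up_nearest_exact = F.interpolate(latent, scale_factor=1.3, mode="nearest-exact")

# Both modes produce the same shape; they differ only in which source pixels
# get sampled, which is the behavioral difference being discussed here.
print(up_nearest.shape, up_nearest_exact.shape)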
@gel-crabs I pushed a new commit now with autocast disabled in a more limited fashion, as close to the source of the NaN as possible. Does that improve performance at all for you? I unfortunately can't reproduce the performance impact you are seeing with my RTX A4000; when I test before and after, the numbers with float16 models are 100% identical. Maybe my system is too slow, so the overhead is hidden.
Yes, ESRGAN's performance is back to normal now.
@akx Now that the performance impact is fixed, I've split off the NaNs fix into a separate pull request.
26fe815 to 1e8d756
I've now adjusted the default tile size and the allowed step size in the UI based on the model architecture. If you've applied this pull request before, you'll need to delete ui-config.json for the new UI behavior to apply.
What prompted this was that the HAT test examples contain a comment that Tile Size and Tile Padding must be a multiple of the Window Size, which is 16 in their case. This makes me wonder if the input image size is also expected to be a multiple of 16 when not tiling, or whether quality otherwise degrades. Because of that, the step size for Tile Size & Tile Padding has now been changed to either the Window Size, Split Size, or Block Size of each architecture, following the recommendation of HAT. I'm unsure whether those last two matter or not, but thought I'd play it safe. ESRGAN I was unsure about, since I couldn't find that info, so I set the Tile Padding step to 8 and reverted the Tile Size step back to the original value of 16. I feel like both should be set to either 8 or 16, but I'm unsure which.
General tile default for most architectures: changed to 256px Tile, 32px Padding
SRFormer tile default: changed to 176px Tile, 32px Padding
COMPACT: changed to Tiling Disabled
This is up for discussion whether it makes sense or not for models other than HAT, which explicitly states to do this.
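To make the intent concrete, a simplified sketch of the kind of per-architecture lookup described above; the names are placeholders, and any value not stated in the comment is a stand-in rather than a claim about the architecture:

# Illustrative mapping from a Spandrel architecture name to the UI defaults:
# (tile size, tile padding, slider step).
ARCH_TILE_SETTINGS = {
    # arch:     (tile, padding, step)
    "HAT":      (256, 32, 16),  # step 16 = HAT window size, per its test examples
    "SRFormer": (176, 32, 16),
    "Compact":  (0,   0,  16),  # tile size 0 = tiling disabled
}
DEFAULT_TILE_SETTINGS = (256, 32, 16)


def tile_settings_for(architecture: str) -> tuple[int, int, int]:
    """Return (tile_size, tile_padding, ui_step) for a Spandrel architecture name."""
    return ARCH_TILE_SETTINGS.get(architecture, DEFAULT_TILE_SETTINGS)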
For some reason,
@akx @w-e-w Do either of you have a clue as to the cause? I didn't have this error on my local branch where I had changed interrogate.py to use the clip_interrogate package, so it must somehow be related to the imports there. As a temporary measure, to allow the new option to be tested in the meantime, I've pushed my local interrogate.py change, which isn't really intended as part of this pull request, until the cause of this startup error is figured out.
The webui is very finicky about import order, unfortunately. That means shared.opts hasn't been initialized yet - you'll probably need to turn that into a later import (or otherwise untangle the import graph).
@akx Do you have a suggestion on the best location to move the sd_models import? Should I just place
Still a mystery to me how importing sd_models at the top of upscaler.py was causing shared.opts to not be loaded for sd_samplers.py, but I guess it doesn't matter.
After failing to find a simple way to untangle the import mess between what seemed to be sd_models, upscaler, interrogate, shared_options, and modelloader, I ultimately decided to just refrain from importing sd_models at all, and offload/reload the weights manually:
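In spirit, something like the following illustrative sketch, using the webui's shared and devices helpers (not the literal change):

from modules import devices, shared


def run_with_sd_model_offloaded(upscale_fn, *args, **kwargs):
    # Move the loaded SD checkpoint to system RAM for the duration of the
    # upscale so the upscaler can use the freed VRAM for larger tiles.
    model = shared.sd_model
    model.to(devices.cpu)
    devices.torch_gc()
    try:
        return upscale_fn(*args, **kwargs)
    finally:
        # Restore the checkpoint so the next sampling step works as before.
        model.to(devices.device)
        devices.torch_gc()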
Refactor DAT, HAT, ESRGAN model support
Match allowed Tile/Padding size to architecture Window Size (if it exists)
add bfloat16 placeholder
A few more things: I manually re-added the Torchvision replacement and it works fine either way. Perhaps there should be an option in the UI, or maybe automatic detection for the nearest-exact changes, because when the hires fix is at a proper multiplier it definitely seems to give higher quality images. SPAN models are about as large as COMPACT models, and run just fine without tiling as well.
You could import it locally just before you call a function from it?
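i.e. something along these lines (an illustrative sketch; whether sd_models.unload_model_weights is the exact call being made here is an assumption):

def unload_sd_weights_for_upscale():
    # Deferred import: importing sd_models at the top of upscaler.py pulls in
    # further modules before shared.opts is initialized, so import it only at
    # call time instead.
    from modules import sd_models
    sd_models.unload_model_weights()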
Anyway, this looks fine by me – I haven't had the time to work on my new-upscalers stuff for sd-webui in a while so I'd need to rebase & refactor it anyway :)
I wonder if you are interested in deploying our new Anime model from CVPR2024 (https://github.com/Kiteretsu77/APISR). |
I have downloaded the 4xHFA2kLUDVAESRFormer_light upscaler and want to use it in the T2I hires process.
That's fine! It's just a "well, that's strange!" error in the logs – one side effect of us using Spandrel is that you can actually use any Spandrel-supported model; they just might pick up the wrong configuration (e.g. a model discovered in the ESRGAN folder would use the ESRGAN tile size) but should work.
Description
DRAFT PULL REQUEST (for testing and discussion only)
Now that the webui supports Spandrel, we should consider whether we want to support any other model types with community models available on https://openmodeldb.info/. This draft pull request contains the commits I've been testing locally for the past week; I thought others may be interested in testing these other model formats with the webui as well.
Overall my favorite models thus far are Nomos8kHAT-L_otf for realistic content (imho superior to FaceUpDAT/FaceUpDatSharper) and 4xHFA2kLUDVAESRFormer_light (sharp texture detail, strong blur removal) / 4xHFA2kLUDVAEGRL_small (higher saturation, smoother texture detail, fewer artifacts) or Real_HAT_GAN_Sharper for anime. The problem with the HAT-L models (originally added in c756133) is that, despite their quality, they are slow and use more VRAM than any other model type. Keep this in mind when setting your Tile Sizes.
Add the upscaler models into folders with these names within \stable-diffusion-webui\models
The template format is vaguely based on merging the best parts of hat_model.py with esrgan_model.py and eliminating URL loading, which I see as unneeded.
bfloat16 is non-functional as-is, since passing a bfloat16 model to spandrel results in float/bfloat16 mismatch errors
if desired, it can be made to function by adding bfloat16 autocasts/casts within spandrel at the point of failure
* Switched upscaler_utils.py to use Torchvision transforms instead of Numpy directly. What prompted this was that, when I was comparing chaiNNer output to webui, the webui output had strange noise for unknown reasons. Upon switching to Torchvision transforms in webui this noise went away. Likely requires more testing.
Adapt the Tile/Padding Size Step in the UI for each model architecture's recommendations to maximize quality.
I have not yet tested this
* Switched the default latent scale mode to Latent (nearest-exact). I remember hearing a legacy argument that the bugged "nearest" mode needed to continue being used since older models were trained against the incorrect behavior. Does this really matter? I also changed the default latent scale mode for I2I, since there is currently no way to configure it there, and nearest-exact is my personal preference.
When not using --medvram or --lowvram, the Stable Diffusion model can take up a large amount of VRAM, especially with SDXL and --no-half, which can frequently result in OOM with a constant tile size setting during upscale. By enabling this option in settings, the SD model will be unloaded prior to upscaling to free up VRAM, and reloaded after upscaling completes. This allows the user to set the highest upscaling tile size their VRAM allows, without needing to change it when using an SD model that requires more VRAM for itself.
[Bug]: DAT upscaler is not working in hires fix. #14710 (comment)
Split off and merged in Fix potential autocast NaNs in image upscale #14809
[Bug]: Scale by 1x in 1.7.0 does nothing on "Extras tab" and "SD Upscale" script #14738 (comment)
Checklist: