
Attempt to resolve NaN issue with unstable VAEs while utilizing full precision (--no-half-vae) #12624

Closed
wants to merge 1 commit

Conversation


@catboxanon catboxanon commented Aug 17, 2023

Update: I believe #12630 fixes this properly. I will close this PR once that one (or another fix) is merged.

Description

Attempts to solve a regression introduced in cc53db6 (the previous commit, a64fbe8, does not have this issue). I think this is also related to #12611. PR #12599 also still exhibits this issue.

To preface: this only ever seems to happen with animevae.pt, and only for certain prompts, so it's difficult to find an easily reproducible scenario. The one below is consistent for me, and I've verified it also reproduces on somebody else's system. Also, this is absolutely not the correct way to fix this, because it now potentially wastes time decoding the latent twice, but I'm trying to wrap my head around what's going wrong here, and hopefully opening this PR brings it up for discussion.
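To illustrate the "decode twice" workaround described above, here is a minimal pure-Python sketch. This is not the webui's actual code; `decode_with_retry`, `decode`, and `full_precision` are hypothetical names standing in for the real VAE decode path, which operates on torch tensors.

```python
import math

def decode_with_retry(decode, latent):
    """Decode a latent; if the result contains NaNs, retry at full precision.

    `decode` is a callable taking (latent, full_precision=bool) and
    returning a list of floats. This mirrors the PR's approach: a failed
    (NaN-producing) decode is simply re-run, which means the latent may
    be decoded twice.
    """
    result = decode(latent, full_precision=False)
    if any(math.isnan(x) for x in result):
        # Wasteful fallback: decode the same latent a second time,
        # this time forcing full precision.
        result = decode(latent, full_precision=True)
    return result
```

The obvious cost is the double decode on the failure path, which is exactly why the PR text calls this "absolutely not the correct way to fix this".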

How to reproduce
  1. Check out cc53db6 or later and launch with --no-half-vae.
    Get this VAE, model, and these LoRAs:
    VAE: https://huggingface.co/a1079602570/animefull-final-pruned/blob/main/animevae.pt
    Model: https://huggingface.co/AnonymousM/Based-mixes/blob/main/Based64mix-V3.safetensors
    LoRAs:
    (removed)

  2. Download and use the metadata from this image to set up the params.
    (image attachment)

  3. (optionally) verify the image can be generated without hires fix.

  4. Attempt to generate the image with hires fix enabled. These are the settings I usually use, but I've tested several other configurations, and the only factor that seems to matter is that the Upscale by value must be 1.15 or more.
    (screenshot of settings)

  5. (optionally) take the above image and use the exact same parameters to upscale it in the img2img tab. This will not produce the NaN exception.


Side note: I got into the weeds and did some debugging, which is why I'm also suspicious that this is related to the issue I linked above. This is the part of the code that produces the NaNs: https://github.com/Stability-AI/stablediffusion/blob/cf1d67a6fd5ea1aa600c4df58e5b47da45f6bdbf/ldm/modules/diffusionmodules/model.py#L634-L641
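For context on how that code region can produce NaNs: the linked lines compute attention inside the VAE decoder, and half precision has a very small finite range (max ~65504), so a large attention score can overflow to infinity, and dividing infinities during normalization yields NaN. The following toy sketch (not the webui's or ldm's code; `fp16` is a crude hypothetical simulation that only models the overflow, not the precision loss) shows the mechanism:

```python
import math

FP16_MAX = 65504.0  # largest finite value representable in IEEE half precision

def fp16(x: float) -> float:
    """Crude fp16 simulation: magnitudes beyond the representable
    range overflow to signed infinity. Rounding/precision loss of
    real half-precision arithmetic is deliberately ignored."""
    return math.copysign(float("inf"), x) if abs(x) > FP16_MAX else x

# An attention score that a half-precision matmul can produce when
# activations are large: the true value 1e5 exceeds FP16_MAX.
score = fp16(1e5)        # overflows to inf
weight = score / score   # softmax-style normalization: inf / inf -> nan
print(math.isnan(weight))
```

This is only a plausible mechanism for fp16 overflow in attention; it does not by itself explain why the failure persists with --no-half-vae, which is what makes this bug confusing.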

If I upscale the image in img2img and store the initial value of z (the upscaled latent, before this line executes), then run hires fix in txt2img but substitute that stored z, it still produces NaNs in that function. However, if I do it the other way around, storing the upscaled latent from txt2img and using it in img2img, I instead get the error below:

RuntimeError: Input type (struct c10::Half) and bias type (float) should be the same
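That RuntimeError suggests the txt2img latent is still half precision when it reaches the float-precision VAE. A minimal sketch of the kind of guard that would avoid it (pure Python with a stand-in `Tensor` class; the real fix would cast a torch tensor with `.to()` before decoding, and `decode_guard` is a hypothetical name):

```python
class Tensor:
    """Minimal stand-in for a torch tensor, carrying only a dtype tag."""
    def __init__(self, dtype: str):
        self.dtype = dtype

    def to(self, dtype: str) -> "Tensor":
        # Mirrors torch's Tensor.to(dtype): returns a converted copy.
        return Tensor(dtype)

def decode_guard(vae_dtype: str, z: Tensor) -> Tensor:
    """Make the latent's dtype match the VAE's parameters before decoding.

    With --no-half-vae the VAE runs in float32, so a half-precision
    latent coming from the sampler must be upcast first.
    """
    if z.dtype != vae_dtype:
        z = z.to(vae_dtype)  # e.g. half -> float
    return z
```

This only addresses the dtype-mismatch symptom, not the underlying NaN issue; it is included to show why the error appears in one direction (txt2img latent into img2img) but not the other.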

Just noting my specs here as well, in case this is somehow a PyTorch bug.

python: 3.10.6  •  torch: 2.0.1+cu118  •  xformers: 0.0.20


@catboxanon catboxanon marked this pull request as draft August 17, 2023 19:04
@catboxanon catboxanon marked this pull request as ready for review August 17, 2023 19:56
@catboxanon catboxanon changed the title Attempt to resolve NaN issue with unstable VAEs in full precision Attempt to resolve NaN issue with unstable VAEs while utilizing full precision (--no-half-vae) Aug 17, 2023
@catboxanon catboxanon marked this pull request as draft August 17, 2023 22:13
@catboxanon (Collaborator, Author)

The PR that supersedes this has been merged (#12630) -- closing.

@catboxanon catboxanon closed this Aug 19, 2023
@catboxanon catboxanon deleted the fix/nans-mk1 branch March 4, 2024 23:28