
[Community] Testing Stable Diffusion is hard 🥵 #937

Closed
patrickvonplaten opened this issue Oct 21, 2022 · 15 comments

@patrickvonplaten
Contributor

patrickvonplaten commented Oct 21, 2022

It's really difficult to test stable diffusion due to the following:

    1. Continuous output: Diffusion models take float values as input and output float values. This is different from NLP models, which tend to take int64 values as inputs and produce int64 values as outputs.
    2. Output dimensions are huge: If an image has an output size of (1, 512, 512, 3), there are 512 * 512 * 3 ~ 800,000 values that need to be within a given range. Say you want to test for a max difference of (pred - ref).abs() < 1e-3: then we have roughly a million values for which this has to hold true. This is quite different from NLP, where we rather test things like text generation or final logit layers, which usually aren't bigger than a dozen or so tensors of size 768 or 1024.
    3. Error propagation: We cannot simply test one forward pass for Stable Diffusion, because in practice people use 50 forward passes. Error propagation becomes a real problem in this case. This again is different from, say, generation in NLP, because there errors can be somewhat "smoothed" out at every generation step, since an "argmax" or "softmax" operation is applied after each step.
    4. Composite systems: Stable Diffusion has three main components for inference: a UNet, a scheduler, and a VAE decoder. The UNet and scheduler are tightly entangled during the forward pass. Just because we know the forward passes of the scheduler and the UNet work independently doesn't mean that using them together works.

=> Therefore, we need to do full integration tests, meaning we need to make sure that the output of a full denoising process stays within a given error range. At the moment, though, we're having quite a few problems getting fully reproducible results across different GPUs, CUDA versions, etc. (especially for FP16).
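As a rough illustration, a full integration test over all 50 denoising steps could look like the following minimal sketch (the checkpoint name, prompt, reference file, and tolerance are illustrative placeholders, not the actual test suite):

import numpy as np
import torch
from diffusers import StableDiffusionPipeline

def test_full_denoising_loop():
    # Fixed seed so the 50-step loop is as deterministic as the setup allows.
    generator = torch.Generator("cuda").manual_seed(0)

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        "a photograph of an astronaut riding a horse",
        num_inference_steps=50,
        generator=generator,
        output_type="np",
    ).images[0]

    # Reference generated once on a pinned GPU / CUDA setup and stored as .npy.
    expected = np.load("expected_astronaut.npy")

    # All ~800,000 values have to stay within the tolerance; FP16 typically needs a looser bound.
    assert np.abs(image - expected).max() < 1e-2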

That being said, it is extremely important to test Stable Diffusion to avoid issues like #902 in the future, while we should still be able to improve speed with PRs like #371.

At the moment, we're running multiple integration tests for all 50 diffusion steps every time a PR is merged to master, see:

Nevertheless, the tests weren't sufficient to detect: #902

Testing Puzzle 🧩: How can we find the best trade-off between fast & inexpensive tests and the best possible test coverage, taking into account the above points?

We already looked quite a bit into: https://pytorch.org/docs/stable/notes/randomness.html

patrickvonplaten added the help wanted (Extra attention is needed) label Oct 21, 2022
@patrickvonplaten
Contributor Author

Some ideas for improvement:

  1. At the moment, we're not comparing against numpy float outputs but against the generated images:

    expected_image = np.array(expected_image, dtype=np.float32) / 255.0
    => here we automatically lose precision relative to our reference image. We should instead compare directly against stored numpy arrays (see the sketch after this list).

  2. We should probably set up a better "layered" testing structure:

    • a.) Have a test just for the UNet forward pass with a very low tolerance (maybe 1e-5 or 1e-4), same for VAE
    • b.) Run full integration tests on CPU, as CPU has much less randomness than GPU. The problem here is obviously that the test will take a very long time - should we maybe set up nightly tests for this?
    • c.) (expensive - not sure if it's worth it) Test on both A100 and V100 and not just V100
    • d.) Just test more samples instead of just one. Let's maybe test Stable Diffusion on 5 different prompts (of different lengths), each with 3 different seeds. Those tests should then probably run only once per day.
    • e.) Have at least one full integration test running at every push to master
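Regarding idea 1, a minimal sketch of comparing directly against a stored float32 numpy reference (file names and tolerance are placeholders):

import numpy as np

def assert_matches_reference(image: np.ndarray, reference_path: str, atol: float = 1e-3):
    # Compare the pipeline output (float32, range [0, 1]) directly against a stored
    # float32 .npy reference instead of a PNG, avoiding the uint8 round-trip that
    # quantizes values to a 1/255 grid.
    expected = np.load(reference_path)
    assert image.shape == expected.shape, f"shape mismatch: {image.shape} vs {expected.shape}"
    max_diff = np.abs(image.astype(np.float32) - expected).max()
    assert max_diff < atol, f"max abs diff {max_diff:.2e} exceeds {atol:.0e}"

# The reference is generated once (on a pinned setup) with something like:
# np.save("astronaut_reference.npy", pipe(prompt, output_type="np").images[0])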

@patrickvonplaten
Contributor Author

Pinging some people that might have good ideas (maybe big NLP models have similar problems) - cc @stas00 @sgugger @ydshieh @thomasw21 @anton-l

@anton-l
Member

anton-l commented Oct 21, 2022

Unit-testing individual parts of the pipeline + nightly CPU integration tests seems like a solid path to me at the moment.
For now we can check how spotty our line coverage is when disabling the slow integration tests: lots of room for improvement there.

patrickvonplaten pinned this issue Oct 21, 2022
patrickvonplaten self-assigned this Oct 21, 2022
@ydshieh
Contributor

ydshieh commented Oct 21, 2022

Regarding 2): I see that in transformers we test just a slice of the outputs in some cases, something like

self.assertTrue(torch.allclose(logits[0, :3, :3, :3], expected_slice, atol=1e-4))

(well, this approach is probably not what diffusers want to take)

  • Question: in diffusers, is it possible to test against smaller input images (i.e. creating smaller models that accept smaller image inputs)?
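On the question of smaller inputs, a minimal sketch of what a deliberately tiny test model could look like (the constructor arguments follow the UNet2DConditionModel API, but the exact values are illustrative assumptions):

import torch
from diffusers import UNet2DConditionModel

# A tiny UNet so that a full forward pass runs in milliseconds on CPU.
tiny_unet = UNet2DConditionModel(
    sample_size=32,  # small latent resolution instead of 64x64
    in_channels=4,
    out_channels=4,
    layers_per_block=1,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=32,
)

sample = torch.randn(1, 4, 32, 32)
timestep = torch.tensor([1])
encoder_hidden_states = torch.randn(1, 77, 32)  # dummy text-encoder states
out = tiny_unet(sample, timestep, encoder_hidden_states).sample
assert out.shape == sample.shape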

Regarding different GPUs: I think we can focus on one type of GPU at the moment. In transformers, we only run on T4 so far. Once the testing becomes better/more complete (i.e. when we are happy enough), we can expand it.

@ydshieh
Contributor

ydshieh commented Oct 21, 2022

If the pipeline used in that issue is tested in diffusers, it's probably a good idea to run the related test against the two diffusers versions mentioned there and see if both pass. If so, we can try to figure out what the tests lack in order to detect the regression in #902.

@thomasw21

Continuous output:

Maybe I'm naive, but in theory everything is discretized to bits. So if you can guarantee determinism (probably a bunch of flags in PyTorch + blacklisting some operations), outputs can be compared bitwise, i.e. with perfect equality. That does mean that your tests become somewhat setup-specific (torch version, type of hardware, etc.). We have things like that in some unit tests in Megatron-DeepSpeed: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/4255845d0685dd59cb70c0026db45f92b7ae4848/tests/test_activations.py#L30
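For reference, a minimal sketch of the PyTorch determinism settings being alluded to (whether bitwise equality actually holds still depends on the hardware, driver, and library versions being fixed):

import os
import torch

# Needed for deterministic cuBLAS matmuls; must be set before CUDA is initialized.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)   # raise an error on known nondeterministic ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False     # disable autotuning that may pick different kernels

# With a fully pinned setup, two runs of the same forward pass can then be compared bitwise:
# torch.testing.assert_close(out_a, out_b, rtol=0, atol=0)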

Output dimensions are huge

If you have a very strict test (typically torch_assert_equal), then unless you have some inductive bias that only certain values are affected, you could randomly sample elements (0.1%, for example) and run the assertions on those. Though I'm not sure I understand what the issue with comparing 1M values is; those operations are pretty fast on GPUs, for example.
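A minimal sketch of that subsampling idea (the fraction, seed, and tolerance are arbitrary placeholders; with a fixed seed the same indices are checked on every run):

import torch

def assert_close_on_subsample(pred: torch.Tensor, ref: torch.Tensor, fraction: float = 0.001, seed: int = 0):
    # Compare only a random subsample of the flattened outputs; mainly useful for
    # keeping stored reference tensors small rather than for speeding up the comparison.
    assert pred.shape == ref.shape
    g = torch.Generator().manual_seed(seed)
    n = pred.numel()
    idx = torch.randperm(n, generator=g)[: max(1, int(n * fraction))]
    torch.testing.assert_close(pred.flatten()[idx], ref.flatten()[idx], rtol=0, atol=1e-3)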

Error propagation

I think the better comparison with NLP is model depth, and indeed NLP models are usually much shallower in that sense. However, concerning the argument of 50 passes instead of 1, I'd say it's just a matter of making your test stricter. If you expect that after 50 forward passes the result is EXACTLY the same, as explained in #902, then you should probably make sure that a single pass is EXACTLY the same. Concerning the trade-off with speed, I guess it's fine to update the test if you're willing to accept the breaking change that the increased speed implies. In fact, it's usually what you would do manually:

  • Make an optimization to speed up the process
  • Check if the output changes
  • If the output changes, figure out if the changes are acceptable

Composite systems

Is the issue that, when one of your tests fails, it's hard to figure out where the bug is?

Concerning the problems you're seeing with FP16: maybe the issue is that you're trying to ensure the model returns the same output regardless of setup (framework, hardware, etc.). I don't see how that's easily doable (how can you guarantee that LayerNorm implementations are the same across frameworks?). My personal opinion is that you should rely on the APIs you're using to compute the same thing?

@msaroufim

msaroufim commented Oct 22, 2022

My 2c on this: ideally you can have some fast unit tests that cover baseline correctness issues, but for the vast majority of the issues you're highlighting you need time-consuming runs, so there are two things that help dramatically in my experience.

  1. You can make benchmarks triggerable on some keyword in a PR if the PR's focus is performance improvements (see "Add nightly benchmark and trigger in keyword" pytorch/data#740), which removes the need for community members to show you logs of perf improvements. The idea is that if a change keeps your benchmark suite mostly green, it should be safe to merge. The alternative is that nuanced performance PRs stay open for a long time, which will discourage repeat contributions.
  2. The other problem, though, is that some tests may be too time-consuming to run on every PR, so you can run them at most once a day without spending way too much money. These kinds of tests are essential, but they also change expectations: issues are detected late and are often nuanced to debug, so reverts become a much bigger part of the workflow. That's not fun for anyone, but it's unfortunately necessary, which means you need to actively encourage all but the most dedicated contributors to find something to do besides improving performance.

Finally, a lot of these problems become easier to handle if the models are faster, either by changing the models or by using training compilers, so a focus on speed and test time will be crucial to keep this process sane.

@patrickvonplaten
Contributor Author

Thanks for the great feedback here!

@keturn
Contributor

keturn commented Nov 14, 2022

And testing is logistically difficult for another reason that I'm rediscovering with this PR for InvokeAI: we have to do secret token management in order for tests to be able to run with the only known useful diffusion model.

and secrets management + public PRs really don't mix well on GitHub.

I suspect you've worked around that problem here by self-hosting your test runners, but are there any options for other projects that would like to avoid that burden?

Are there any releases of text-to-image models under a creative commons or open source license, so we can pass them around without worrying about redistribution clauses?

They wouldn't need to have the breadth of subjects of the big Stable Diffusion models, or be as refined. Just something that works well enough that we can tell whether schedulers and guidance functions and whatnot are behaving.

@keturn
Contributor

keturn commented Nov 14, 2022

Ah, crowsonkb pointed out that earlier latent diffusion models were released under MIT or Apache licenses, e.g. https://huggingface.co/CompVis/ldm-text2im-large-256

That might be close enough for some purposes. Though I see LDMTextToImagePipeline uses an entirely different text encoder, which our app is sensitive to, so it'd still really help to have a release compatible with the StableDiffusionPipeline.

@patrickvonplaten
Contributor Author

(Quoting @keturn's comment above about secret token management, self-hosted runners, and openly licensed text-to-image models.)

That's a very good question - we also ran into this problem with @anton-l 😅

In short, for public PRs we don't test any models that require auth_token verification. The reason is that public PRs cannot have access to the GitHub secrets of our repo, which means those tests would fail (please correct me if I'm wrong @anton-l).

When we merge to "main", we always have access to our GitHub secret token and can then run the tests on those models.
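As a rough sketch, token-gated tests can be skipped automatically when the secret isn't injected (the environment variable name and the exact token-passing keyword are assumptions and may differ between versions):

import os
import pytest

# Skip token-gated tests when the secret isn't available (e.g. on public PRs);
# on merges to main, the repository secret is injected and the tests run.
requires_hf_token = pytest.mark.skipif(
    os.environ.get("HF_TOKEN") is None,
    reason="requires an authenticated Hugging Face token (repository secret)",
)

@requires_hf_token
def test_gated_checkpoint_loads():
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", use_auth_token=os.environ["HF_TOKEN"]
    )
    assert pipe is not None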

@patrickvonplaten
Contributor Author

Honestly, what I would do in your case is clone the weights of Stable Diffusion into your namespace on the Hub, hf.co/invokeAI (you can do this easily with https://huggingface.co/spaces/osanseviero/repo_duplicator), and then not add a request-access requirement there. Note that the CreativeML license doesn't force you to add request access, and 90% of the Stable Diffusion models on the Hub don't have the request-access feature: https://huggingface.co/models?library=diffusers

If the repo has no access feature, no need to pass a token :-)

@keturn
Contributor

keturn commented Nov 15, 2022

Note that the CreativeML license doesn't force you to add a request access [requirement].

Well, this is what it says about redistribution:

Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION

  1. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g. software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the following conditions:
    Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model or Derivatives of the Model are subject to paragraph 5.

I Am Not A Lawyer, but I'm not sure how I can claim I include an "enforceable provision" on anything I make available by a publicly-advertised unauthenticated URL.

But hey, if 🤗 is saying that's fine, and it's huggingface.co doing the redistributing, then I'd be happy for you to take that liability!

Are you willing to do that repository clone so I don't have to put my name on it? 😇

@CarlosMFerr

Thanks for the comment, @keturn!

The license requires neither the licensor nor the licensee to set up gated access/tokens as a compulsory mechanism to comply with it (that's a complementary tool on top of the license to ensure users will read it and/or accept it). The license works like an open-source or CC license: by using/distributing the model you accept the license, and the license applies to you.

patrickvonplaten unpinned this issue Nov 28, 2022
github-actions bot added the stale (Issues that haven't received updates) label Dec 15, 2022
huggingface deleted a comment from the github-actions bot Dec 20, 2022
patrickvonplaten removed the help wanted (Extra attention is needed) and stale (Issues that haven't received updates) labels Dec 20, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label Jan 13, 2023
PhaneeshB pushed a commit to nod-ai/diffusers that referenced this issue Mar 1, 2023
Co-authored-by: dan <dan@nod-labs.com>