[Community] Testing Stable Diffusion is hard 🥵 #937
Some ideas for improvement:
Pinging some people that might have good ideas (maybe big NLP models have similar problems) - cc @stas00 @sgugger @ydshieh @thomasw21 @anton-l
Unit-testing individual parts of the pipeline + nightly CPU integration tests seems like a solid path to me at the moment.
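As a rough sketch of what such a component-level unit test could look like (this assumes the `DDIMScheduler` API from diffusers; the test itself is illustrative, not an existing test in the repo):

```python
import torch
from diffusers import DDIMScheduler


def test_scheduler_step_is_well_behaved():
    # Fast, CPU-only check of a single pipeline component, using tiny
    # random tensors instead of real UNet outputs.
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_inference_steps=10)

    sample = torch.randn(1, 4, 8, 8)        # stand-in for a latent
    model_output = torch.randn(1, 4, 8, 8)  # stand-in for a UNet prediction
    timestep = scheduler.timesteps[0]

    prev_sample = scheduler.step(model_output, timestep, sample).prev_sample

    # Cheap invariants: shape is preserved and no NaNs/Infs are produced.
    assert prev_sample.shape == sample.shape
    assert torch.isfinite(prev_sample).all()
```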
Regarding 2), I see in `self.assertTrue(torch.allclose(logits[0, :3, :3, :3], expected_slice, atol=1e-4))` (well, this approach is probably not what …
Regarding different GPUs: I think we can focus on one type of GPU at this moment. In …
If the pipeline used in that issue is tested in …
Maybe I'm naive, but in theory everything is discretized to bits. So if you can guarantee determinism (probably a bunch of flags in PyTorch + blacklisting some operations), the output can be compared bitwise, i.e. you get perfect equality. That does mean that your tests become somewhat setup-specific (torch version, type of hardware, etc.). We have things like that in some unit tests in Megatron-DeepSpeed: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/4255845d0685dd59cb70c0026db45f92b7ae4848/tests/test_activations.py#L30
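For reference, a sketch of the kind of determinism setup this implies in PyTorch (the helper name is made up; whether bitwise equality then holds still depends on pinning the torch version and hardware):

```python
import os
import torch


def enable_full_determinism(seed: int = 0) -> None:
    # Seed the RNGs that the pipeline touches.
    torch.manual_seed(seed)

    # Error out on any op that has no deterministic implementation --
    # this is the "blacklisting some operations" part.
    torch.use_deterministic_algorithms(True)

    # cuDNN: no autotuning, deterministic kernels only.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Needed for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

With a setup like this, two runs on the same machine and software stack can be compared with `torch.equal` instead of `torch.allclose`, but the recorded reference values become specific to that exact setup.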
If you have a very strict test (…
I think the best comparison with NLP is rather the depth of a model, and indeed in this case NLP models are usually much shallower. However, concerning the argument of 50 passes instead of 1, I'd say it's just a matter of making your test stricter. If you expect that after 50 forward passes the result is EXACTLY the same, as explained in #902, then you should probably make sure that a single pass is EXACTLY the same. Concerning the trade-off with speed, I guess it's fine to update the test if you're willing to accept the breaking change that the increased speed implies. In fact it's usually the thing you would do manually:
Is the issue that, in the case where one of your tests fails, it's hard to figure out where the bug is? Concerning the issues you're seeing with FP16: maybe the problem is that you're trying to ensure the model returns the same output regardless of setup (framework, hardware, etc.). I don't see how that's easily doable (how can you guarantee that …
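To illustrate the "make a single pass exactly reproducible" argument above in code, a minimal sketch could look like the following (everything here is a placeholder, not an existing test; `model` and `reference_output` stand in for a real UNet and a stored reference tensor):

```python
import torch


def assert_single_pass_unchanged(model: torch.nn.Module, reference_output: torch.Tensor) -> None:
    # If one forward pass on a fixed input is bitwise identical to the
    # stored reference, 50 chained passes cannot silently drift either,
    # so the strict single-step check is both stronger and ~50x faster.
    torch.manual_seed(0)
    sample = torch.randn(1, 4, 64, 64)
    with torch.no_grad():
        output = model(sample)
    assert torch.equal(output, reference_output), "single-pass output changed bitwise"
```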
My 2c on this: ideally you can have some fast unit tests that will cover some baseline correctness issues, but for the vast majority of issues you're highlighting you need time-consuming runs, so there are two things that help dramatically in my experience.
Finally, a lot of these problems become easier to handle if the models are faster, either by changing the models or by using training compilers, so focusing on speed and test time will be crucial to make this process sane.
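A minimal sketch of the fast/slow test split this suggests (the `RUN_SLOW` environment variable and the test names are illustrative; diffusers' actual test utilities may differ):

```python
import os

import pytest

# Cheap unit tests always run on PRs; expensive 50-step integration tests
# only run when RUN_SLOW=1 is set, e.g. in a nightly GPU job.
slow = pytest.mark.skipif(
    os.environ.get("RUN_SLOW", "0") != "1",
    reason="set RUN_SLOW=1 to run nightly integration tests",
)


def test_scheduler_unit():
    ...  # fast, CPU-only component test


@slow
def test_full_pipeline_integration():
    ...  # full 50-step end-to-end run on GPU
```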
Thanks for the great feedback here!
And testing is logistically difficult for another reason that I'm rediscovering with this PR for InvokeAI: we have to do secret-token management in order for tests to be able to run with the only known useful diffusion model, and secrets management + public PRs really don't mix well on GitHub. I suspect you've worked around that problem here by self-hosting your test runners, but are there any options for other projects that would like to avoid that burden? Are there any releases of text-to-image models under a Creative Commons or open source license, so we can pass them around without worrying about redistribution clauses? They wouldn't need to have the breadth of subjects of the big Stable Diffusion models, or be as refined. Just something that works well enough that we can tell if schedulers and guidance functions and whatnot are behaving.
Ah, crowsonkb pointed out that earlier latent diffusion models were released under MIT or Apache licenses, e.g. https://huggingface.co/CompVis/ldm-text2im-large-256. That might be close enough for some purposes. Though I see LDMTextToImagePipeline uses an entirely different text encoder, which our app is sensitive to, so it'd still really help to have a release compatible with the StableDiffusionPipeline.
That's a very good question - we also ran into this problem with @anton-l 😅 In short, for public PRs we don't test any models that require auth_token verification. The reason is that when we merge to "main", we always have access to our secret GitHub token and can then run the tests on those models.
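A sketch of how gated-model tests can be skipped on public PRs (the `HF_TOKEN` secret name and the decorator are illustrative, not diffusers' actual setup):

```python
import os

import pytest

# Fork PRs don't receive repository secrets, so token-gated tests are
# skipped there and only run once the change is merged and the secret exists.
requires_auth_token = pytest.mark.skipif(
    os.environ.get("HF_TOKEN") is None,
    reason="requires an HF auth token for gated model weights",
)


@requires_auth_token
def test_gated_stable_diffusion_checkpoint():
    token = os.environ["HF_TOKEN"]
    ...  # load the gated pipeline with use_auth_token=token and run checks
```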
Honestly, what I would do in your case is clone the weights of Stable Diffusion under your repository on the Hub, hf.co/invokeAI (you can do this easily with https://huggingface.co/spaces/osanseviero/repo_duplicator), and then not add a request-access requirement there. Note that the CreativeML license doesn't force you to add request access, and 90% of Stable Diffusion models on the Hub don't have the request-access feature: https://huggingface.co/models?library=diffusers If the repo has no access feature, there's no need to pass a token :-)
Well, this is what it says about redistribution:
I Am Not A Lawyer, but I'm not sure how I can claim I include an "enforceable provision" on anything I make available by a publicly-advertised unauthenticated URL. But hey, if 🤗 is saying that's fine, and it's … Are you willing to do that repository clone so I don't have to put my name on it? 😇
Thanks for the comment @keturn! The license requires neither the licensor nor the licensee to set gated access/tokens as a compulsory mechanism to comply with it (that's a complementary tool on top of the license to ensure users will read it and/or accept it). The license works like an open source or CC license: by using/distributing the model you accept the license, and the license applies to you.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
It's really difficult to test Stable Diffusion due to the following:
The output of Stable Diffusion is an image tensor of shape (1, 512, 512, 3), which means there are 512 * 512 * 3 ≈ 800,000 values that need to be within a given range. Say you want to test for a max difference of `(pred - ref).abs() < 1e-3`: we then have roughly a million values for which this has to hold true. This is quite different from NLP, where we rather test things like text generation or final logit layers, which usually aren't bigger than a dozen or so tensors of size 768 or 1024.

=> Therefore, we need to do full integration tests, meaning we need to make sure that the output of a full denoising process stays within a given error range. At the moment, though, we're having quite some problems getting fully reproducible results on different GPUs, CUDA versions, etc. (especially for FP16).
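To make the scale of the problem concrete, here is a sketch of a comparison helper that reports how the ~800,000 values differ instead of returning a bare boolean (names and tolerance are illustrative):

```python
import torch


def report_image_diff(pred: torch.Tensor, ref: torch.Tensor, atol: float = 1e-3) -> None:
    # For a (1, 512, 512, 3) output, an element-wise tolerance has to hold
    # for ~786k values at once; a handful of noisy pixels is enough to fail.
    diff = (pred - ref).abs()
    num_over = (diff > atol).sum().item()
    print(f"max abs diff:  {diff.max().item():.6f}")
    print(f"mean abs diff: {diff.mean().item():.6f}")
    print(f"values over atol: {num_over} / {diff.numel()}")
    assert num_over == 0, f"{num_over} values exceed atol={atol}"
```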
That being said, it is extremely important to test Stable Diffusion to avoid issues like #902 in the future, while we should still be able to improve speed with PRs like #371.
At the moment, we're running multiple integration tests for all 50 diffusion steps every time a PR is merged to master, see:
- diffusers/tests/pipelines/stable_diffusion/test_stable_diffusion.py (line 518 at 31af4d1)
- diffusers/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py (line 465 at 31af4d1)
- diffusers/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py (line 262 at 31af4d1)
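For illustration, one of these integration tests boils down to something like the sketch below (the checkpoint name, prompt, and the zeroed reference slice are placeholders; the real tests pin values recorded from a trusted run):

```python
import numpy as np
import torch
from diffusers import StableDiffusionPipeline


def test_stable_diffusion_full_run():
    # Full 50-step run with a fixed prompt and seed, compared against a
    # small slice of reference pixels recorded from a known-good run.
    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe = pipe.to("cuda")

    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        "A painting of a squirrel eating a burger",
        num_inference_steps=50,
        guidance_scale=7.5,
        generator=generator,
        output_type="np",
    ).images[0]

    assert image.shape == (512, 512, 3)

    image_slice = image[-3:, -3:, -1].flatten()
    expected_slice = np.zeros(9)  # placeholder: real values come from a reference run
    assert np.abs(image_slice - expected_slice).max() < 1e-3
```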
Nevertheless, the tests weren't sufficient to detect #902.
Testing Puzzle 🧩: How can we find the best trade-off between fast & inexpensive tests and the best possible test coverage, taking into account the above points?
We already looked quite a bit into: https://pytorch.org/docs/stable/notes/randomness.html