Optimize Stable Diffusion #371
Conversation
The documentation is not available anymore as the PR was closed or merged.
src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py (outdated review comments, resolved)
I added a custom scheduler and the `memory_format` kwarg:

```python
import torch
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3",
    scheduler=scheduler,
    use_auth_token=True,
    memory_format=torch.channels_last,
    # revision="fp16",
).to("cuda")
```
I think the channels last format deserves a PR of its own, so I'm only keeping the changes to Stable Diffusion for now.
…stable_diff_opti
Hey @NouamaneTazi! Not sure if we can have `memory_format` in `from_pretrained`. But the speed-ups are interesting, so rather than supporting it in the pipeline directly, maybe we could add a doc page here https://github.com/huggingface/diffusers/tree/main/docs/source/optimization that shows how to put the unet in `channels_last` format.
Hello @patil-suraj. I agree that it's probably too early to add support for it in `from_pretrained`. Channels last memory format is aimed to work for any model that handles 4D NCHW tensors. Since it's still a beta feature, not all operators are supported yet (you can find the list here). So in the case of a model that has some operators that don't support this memory format, the switching between channels last and the default format could result in worse performance, as explained here.
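If we go the docs route, a minimal sketch of the conversion (illustrative names, standard pipeline load assumed; not the final doc page) might look like:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline as usual...
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3", use_auth_token=True
).to("cuda")

# ...then switch the UNet's memory layout in place. 4D conv weights are
# reordered to NHWC, which cuDNN can exploit on recent GPUs.
pipe.unet.to(memory_format=torch.channels_last)
```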
Yes, agree! For now we could definitely add this to the docs. Feel free to open a PR for that :) Thanks for the information.
… stable_diff_opti
Hello, I tested your branch with the code here https://github.com/NouamaneTazi/diffusers/blob/stable_diff_opti/docs/source/optimization/fp16.mdx#tracing, but I got an error like this:
```python
self.sigmas = torch.from_numpy(sigmas).to(device=device)
self.timesteps = torch.from_numpy(timesteps).to(device=device)
```
Why is this only required for the LMS scheduler?
Also note that with LMS we don't pass timesteps to the model but rather the index of the timestep, cf.

```python
latents = self.scheduler.step(noise_pred, i, latents, **extra_step_kwargs).prev_sample
```
> Why is this only required for the LMS scheduler?

Because it's the only scheduler that has `self.timesteps` as a torch tensor.
> Also note that with LMS we don't pass timesteps to the model but rather the index of the timestep, cf.

The issue I was tackling comes from the UNet call:

```python
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
```
Ahh I see, my bad! Should we maybe add a `.to()` method to schedulers, which will put all the scheduler state on the device, rather than manually handling it in multiple places? This would also avoid creating unused `kwargs` arguments.

wdyt @patrickvonplaten @anton-l @pcuenca
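A minimal sketch of what such a method could look like (purely illustrative of the proposal, not an actual diffusers API; a real implementation would live in the scheduler base class):

```python
import torch

class SchedulerDeviceMixin:
    """Hypothetical mixin sketching the proposed scheduler `.to()` method."""

    def to(self, device):
        # Move every tensor attribute of the scheduler to the target device.
        for name, value in vars(self).items():
            if torch.is_tensor(value):
                setattr(self, name, value.to(device))
        return self
```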
@anton-l also suggested that, but since there were other tensor `.to()` operations in the same scheduler we postponed the decision: #534 (comment). I think most schedulers don't need `.to()`, but if it saves us from a `kwargs` argument then I think that tips the scale.
Let's try to get this in. Three final things:
…pr/NouamaneTazi/371
…lers that accept it" This reverts commit 00d5a51.
- it shouldn't affect performance if timesteps are already on the correct device
- it does slow down performance if they're on the wrong device
All tests but …
@yuananf can you retry, it should be working now :-)
Left some comments about other things we could optimize. Also lmk if I should add tests to run the SD pipelines in fp16.
```python
exponent = -math.log(max_period) * torch.arange(
    start=0, end=half_dim, dtype=torch.float32, device=timesteps.device
)
```
Do we need this operation to run in fp32 even when the pipeline runs in fp16?
Happy to try it out - let's maybe do it in a follow-up PR? :-)
```diff
 exponent = exponent / (half_dim - downscale_freq_shift)

-emb = torch.exp(exponent).to(device=timesteps.device)
+emb = torch.exp(exponent)
 emb = timesteps[:, None].float() * emb[None, :]
```
Same as previous comment
In the UNet's `forward`:

```python
# 1. time
timesteps = timestep
if not torch.is_tensor(timesteps):
    # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
    timesteps = torch.tensor([timesteps], dtype=torch.long, device=sample.device)
```
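To sidestep that sync, the caller can build the timestep tensor on the target device up front. A hedged sketch (`unet`, `latent_model_input`, and `text_embeddings` are assumed to exist, as in the pipeline code above):

```python
import torch

# The timestep is already a tensor on the GPU, so no host-device copy is
# triggered inside the UNet's forward pass.
t = torch.tensor([999], dtype=torch.long, device="cuda")
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
```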
Do we need int64 tensors for timesteps, no matter the pipeline's precision?
At least for all the Stable Diffusion applications I've seen so far, timesteps are ints in the range 0..1000. Even if other diffusion models do several orders of magnitude more than that, you'd think `torch.int` would be plenty. Unless there's some byte-alignment optimization reason to specifically make them 64-bit?
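For reference, a quick sanity check of the headroom (a throwaway snippet, not from the PR):

```python
import torch

# Diffusion timesteps are typically in 0..1000; int32 tops out far beyond that.
t = torch.arange(0, 1000, dtype=torch.int32)
print(torch.iinfo(torch.int32).max)  # 2147483647
```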
I think we can leave timesteps as `int32`
```python
if torch.is_tensor(self.scheduler.timesteps):
    timesteps_tensor = self.scheduler.timesteps.to(self.device)
else:
    timesteps_tensor = torch.tensor(self.scheduler.timesteps.copy(), device=self.device)
```
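One plausible reading of the `.copy()` here (matching the "fix pipeline when timesteps are arrays with strides" commit below): schedulers often build timesteps as a reversed `arange`, which gives a numpy array with a negative stride, and zero-copy conversion via `torch.from_numpy` rejects negative strides. A standalone illustration with made-up values:

```python
import numpy as np
import torch

# Reversed slicing produces a view with a negative stride.
timesteps = np.arange(0, 1000, 20)[::-1]

# torch.from_numpy(timesteps) would raise here, since negative strides
# are unsupported; a contiguous copy converts cleanly instead.
t = torch.tensor(timesteps.copy())
```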
Should we make the dtype for timesteps `int64`?
Unfortunately it failed with the same error.
Cool, merging! Let's monitor the slow tests for this one.
* initial commit
* make UNet stream capturable
* try to fix noise_pred value
* remove cuda graph and keep NB
* non blocking unet with PNDMScheduler
* make timesteps np arrays for pndm scheduler because lists don't get formatted to tensors in `self.set_format`
* make max async in pndm
* use channel last format in unet
* avoid moving timesteps device in each unet call
* avoid memcpy op in `get_timestep_embedding`
* add `channels_last` kwarg to `DiffusionPipeline.from_pretrained`
* update TODO
* replace `channels_last` kwarg with `memory_format` for more generality
* revert the channels_last changes to leave it for another PR
* remove non_blocking when moving input ids to device
* remove blocking from all .to() operations at beginning of pipeline
* fix merging
* fix merging
* model can run in other precisions without autocast
* attn refactoring
* Revert "attn refactoring" This reverts commit 0c70c0e.
* remove restriction to run conv_norm in fp32
* use `baddbmm` instead of `matmul` for better in attention for better perf
* removing all reshapes to test perf
* Revert "removing all reshapes to test perf" This reverts commit 006ccb8.
* add shapes comments
* hardcore whats needed for jitting
* Revert "hardcore whats needed for jitting" This reverts commit 2fa9c69.
* Revert "remove restriction to run conv_norm in fp32" This reverts commit cec5928.
* revert using baddmm in attention's forward
* cleanup comment
* remove restriction to run conv_norm in fp32. no quality loss was noticed This reverts commit cc9bc13.
* add more optimizations techniques to docs
* Revert "add shapes comments" This reverts commit 31c58ea.
* apply suggestions
* make quality
* apply suggestions
* styling
* `scheduler.timesteps` are now arrays so we dont need .to()
* remove useless .type()
* use mean instead of max in `test_stable_diffusion_inpaint_pipeline_k_lms`
* move scheduler timestamps to correct device if tensors
* add device to `set_timesteps` in LMSD scheduler
* `self.scheduler.set_timesteps` now uses device arg for schedulers that accept it
* quick fix
* styling
* remove kwargs from schedulers `set_timesteps`
* revert to using max in K-LMS inpaint pipeline test
* Revert "`self.scheduler.set_timesteps` now uses device arg for schedulers that accept it" This reverts commit 00d5a51.
* move timesteps to correct device before loop in SD pipeline
* apply previous fix to other SD pipelines
* UNet now accepts tensor timesteps even on wrong device, to avoid errors - it shouldnt affect performance if timesteps are alrdy on correct device - it does slow down performance if they're on the wrong device
* fix pipeline when timesteps are arrays with strides
This PR aims to optimize the stable diffusion pipeline by using the following:

- `baddbmm` instead of `bmm` + `matmul`, resulting in a 10% speedup
- Remove unnecessary synchronizations
Before this PR:
After this PR:
Full fp16 vs autocast
Results were obtained using the following script:
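The original script isn't captured above; a minimal benchmark sketch along the same lines (assumed model id and prompt, not the author's exact script) could be:

```python
import time
import torch
from diffusers import StableDiffusionPipeline

# Assumed setup: full-fp16 weights. Swap torch_dtype/revision out and wrap the
# call in torch.autocast("cuda") to measure the autocast variant instead.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
torch.cuda.synchronize()
start = time.time()
image = pipe(prompt).images[0]
torch.cuda.synchronize()
print(f"latency: {time.time() - start:.2f}s")
```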