[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516
bf16 can roughly double CPU-mode speed, which is to be expected. But how many people would benefit from this change is an open question. So far, WebUI's support for bf16 is basically nonexistent.
I get this error when loading a random SD 2.1 model. After doing some searching, I guess it comes from this bug:
There is a workaround: edit the `attention` method so that the original cast of `attn_mask` to `q_x.dtype` is replaced by a branch that uses the CPU autocast dtype when CPU autocast is enabled (a `None` check is kept so a missing mask is not cast):

```python
def attention(
    self,
    q_x: torch.Tensor,
    k_x: Optional[torch.Tensor] = None,
    v_x: Optional[torch.Tensor] = None,
    attn_mask: Optional[torch.Tensor] = None,
):
    k_x = k_x if k_x is not None else q_x
    v_x = v_x if v_x is not None else q_x

    # Cast the mask so its dtype matches q/k/v inside self.attn.
    if attn_mask is not None:
        if torch.is_autocast_cpu_enabled():
            attn_mask = attn_mask.to(torch.get_autocast_cpu_dtype())
        else:
            attn_mask = attn_mask.to(q_x.dtype)

    return self.attn(
        q_x, k_x, v_x, need_weights=False, attn_mask=attn_mask
    )[0]
```

Other effects have not been tested.
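For context, here is a minimal sketch of the kind of dtype mismatch this works around (an assumed standalone repro, not the actual code path in the model): under CPU autocast with bfloat16, `nn.MultiheadAttention` computes in bfloat16 while a float32 `attn_mask` keeps its dtype.

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)
mask = torch.zeros(4, 4)  # float32 additive attention mask

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Without casting `mask` to torch.get_autocast_cpu_dtype(), this call can fail
    # with a dtype-mismatch error on some PyTorch versions.
    out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
```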
I tried to reproduce it on macOS, and for me it was stuck at 0% when I started generating an image. Now it has started, but at 630 s/it instead of 15 s/it XD
It's just a dirty hack to make sure other code keeps working.
I guess it depends on your hardware support and PyTorch support.
Intel i7-10710U
Sadly, it looks like your hardware doesn't support AVX-512 and bfloat16. References:
- AVX-512 BFloat16 Instructions (BF16) - x86
- Automatic Mixed Precision package
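For reference, one rough way to check whether a CPU exposes bfloat16 instructions (a sketch assuming Linux, where `/proc/cpuinfo` lists the kernel-reported feature flags):

```python
import pathlib

def cpu_supports_bf16() -> bool:
    # avx512_bf16 (e.g. Cooper Lake, Zen 4) and amx_bf16 (Sapphire Rapids) are the
    # bfloat16-related CPU flags reported by the Linux kernel.
    flags = pathlib.Path("/proc/cpuinfo").read_text()
    return "avx512_bf16" in flags or "amx_bf16" in flags

print(cpu_supports_bf16())
```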
So I moved to OpenVINO and now it's three times faster (5 s/it).
Is there an existing issue for this?
What would your feature do?
Many modern processors have bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.
By using autocast with bfloat16 I doubled the performance.
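A rough way to sanity-check the claimed speedup on your own CPU (a hypothetical micro-benchmark unrelated to the webui code; matrix sizes and step count are arbitrary):

```python
import time
import torch

def bench(use_bf16: bool, steps: int = 10) -> float:
    torch.manual_seed(0)
    x = torch.randn(64, 512, 512)
    w = torch.randn(64, 512, 512)
    start = time.perf_counter()
    # Batched matmuls run in bfloat16 when CPU autocast is enabled.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_bf16):
        for _ in range(steps):
            x = torch.bmm(x, w).tanh()
    return time.perf_counter() - start

print(f"fp32: {bench(False):.2f}s, bf16 autocast: {bench(True):.2f}s")
```

On CPUs without AVX-512 BF16 or AMX support, the bf16 run may actually be slower, which matches the experience reported in the comments above.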
Proposed workflow
- Add `return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)` in the autocast functions.
- Add `if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16)` in `single_sample_to_image`, because NumPy doesn't work with bfloat16 yet (see the sketch below).
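A minimal sketch of how those two changes might look (the function names `autocast` and `single_sample_to_image` are taken from the description above; this is not the actual webui implementation):

```python
import numpy as np
import torch

def autocast():
    # Proposed: run CPU inference under a bfloat16 autocast context.
    return torch.autocast(device_type="cpu", dtype=torch.bfloat16,
                          enabled=True, cache_enabled=True)

def single_sample_to_image(x_sample: torch.Tensor) -> np.ndarray:
    # NumPy has no bfloat16 dtype, so cast to float16 before converting to an array.
    if x_sample.dtype == torch.bfloat16:
        x_sample = x_sample.to(torch.float16)
    return x_sample.cpu().numpy()
```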
Additional information
Other system information:
COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
python: 3.10.6 • torch: 2.1.0.dev20230506+cpu • xformers: N/A • gradio: 3.28.1 • commit: 5ab7f213 • checkpoint: b4391b7978
OS: Ubuntu 22.04
P.S. Since I'm still just a beginner programmer, the changes were made only as a proof of concept.
I was able to test the main functionality in practice; I got a generation error only with the Stable Diffusion 2.1 model, and the rest of the functionality worked at a 2× increase in speed.