[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516
bf16 can roughly double CPU-mode speed, which is to be expected. But how many people would benefit from this change is an open question. So far, WebUI's support for bf16 is basically nonexistent.
I get this error when loading a random SD 2.1 model. After doing some searching, I guess it comes from this bug:
There is a workaround: edit the `attention` method so that the original cast of `attn_mask` to `q_x.dtype` is replaced by a branch that uses the CPU autocast dtype when CPU autocast is enabled (a `None` check is kept so a missing mask is not cast):

```python
def attention(
    self,
    q_x: torch.Tensor,
    k_x: Optional[torch.Tensor] = None,
    v_x: Optional[torch.Tensor] = None,
    attn_mask: Optional[torch.Tensor] = None,
):
    k_x = k_x if k_x is not None else q_x
    v_x = v_x if v_x is not None else q_x

    # Cast the mask so its dtype matches q/k/v inside self.attn.
    if attn_mask is not None:
        if torch.is_autocast_cpu_enabled():
            attn_mask = attn_mask.to(torch.get_autocast_cpu_dtype())
        else:
            attn_mask = attn_mask.to(q_x.dtype)

    return self.attn(
        q_x, k_x, v_x, need_weights=False, attn_mask=attn_mask
    )[0]
```

Other effects have not been tested.
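For context, here is a minimal sketch of the kind of dtype mismatch this works around (an assumed standalone repro, not the actual code path in the model): under CPU autocast with bfloat16, `nn.MultiheadAttention` computes in bfloat16 while a float32 `attn_mask` keeps its dtype.

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)
mask = torch.zeros(4, 4)  # float32 additive attention mask

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Without casting `mask` to torch.get_autocast_cpu_dtype(), this call can fail
    # with a dtype-mismatch error on some PyTorch versions.
    out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
```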
I tried to reproduce it on macOS, and for me it was stuck at 0% when I started generating an image. Now it has started, but at 630 s/it instead of 15 s/it XD
It's just a dirty hack to make sure other code keeps working.
I guess it depends on your hardware support and PyTorch support.
Intel i7-10710U
Sadly, it looks like your hardware doesn't support AVX-512 and bfloat16. References:
- AVX-512 BFloat16 Instructions (BF16) - x86
- Automatic Mixed Precision package
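For reference, one rough way to check whether a CPU exposes bfloat16 instructions (a sketch assuming Linux, where `/proc/cpuinfo` lists the kernel-reported feature flags):

```python
import pathlib

def cpu_supports_bf16() -> bool:
    # avx512_bf16 (e.g. Cooper Lake, Zen 4) and amx_bf16 (Sapphire Rapids) are the
    # bfloat16-related CPU flags reported by the Linux kernel.
    flags = pathlib.Path("/proc/cpuinfo").read_text()
    return "avx512_bf16" in flags or "amx_bf16" in flags

print(cpu_supports_bf16())
```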
So I moved to OpenVINO and now it's three times faster (5 s/it).
Is there an existing issue for this?
What would your feature do?
Many modern processors have bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.
By using autocast with bfloat16 I doubled the performance.
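A rough way to sanity-check the claimed speedup on your own CPU (a hypothetical micro-benchmark unrelated to the webui code; matrix sizes and step count are arbitrary):

```python
import time
import torch

def bench(use_bf16: bool, steps: int = 10) -> float:
    torch.manual_seed(0)
    x = torch.randn(64, 512, 512)
    w = torch.randn(64, 512, 512)
    start = time.perf_counter()
    # Batched matmuls run in bfloat16 when CPU autocast is enabled.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_bf16):
        for _ in range(steps):
            x = torch.bmm(x, w).tanh()
    return time.perf_counter() - start

print(f"fp32: {bench(False):.2f}s, bf16 autocast: {bench(True):.2f}s")
```

On CPUs without AVX-512 BF16 or AMX support, the bf16 run may actually be slower, which matches the experience reported in the comments above.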
Proposed workflow
- Add `return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)` in the autocast functions.
- Add `if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16)` in `single_sample_to_image`, because NumPy doesn't work with bfloat16 yet (see the sketch below).
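A minimal sketch of how those two changes might look (the function names `autocast` and `single_sample_to_image` are taken from the description above; this is not the actual webui implementation):

```python
import numpy as np
import torch

def autocast():
    # Proposed: run CPU inference under a bfloat16 autocast context.
    return torch.autocast(device_type="cpu", dtype=torch.bfloat16,
                          enabled=True, cache_enabled=True)

def single_sample_to_image(x_sample: torch.Tensor) -> np.ndarray:
    # NumPy has no bfloat16 dtype, so cast to float16 before converting to an array.
    if x_sample.dtype == torch.bfloat16:
        x_sample = x_sample.to(torch.float16)
    return x_sample.cpu().numpy()
```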
Additional information
Other system information:
COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
python: 3.10.6 • torch: 2.1.0.dev20230506+cpu • xformers: N/A • gradio: 3.28.1 • commit: 5ab7f213 • checkpoint: b4391b7978
OS: Ubuntu 22.04
P.S. Since I'm still just a beginner programmer, the changes were made only as a proof of concept.
I was able to test the main functionality in practice; I got a generation error only with the Stable Diffusion 2.1 model, and the rest of the functionality worked at a 2× increase in speed.