
[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516

Open
1 task done
LynxPDA opened this issue May 18, 2023 · 8 comments
Labels
enhancement New feature or request

Comments


LynxPDA commented May 18, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Many modern processors support bfloat16, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.

By using bfloat16 autocast I doubled generation performance (a minimal standalone sketch follows the benchmark below).

  • Ryzen 9 7950X (32 GB): speedup from 0.625 it/s to 1.3 it/s
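For reference, a minimal standalone sketch (not webui code; sizes and iteration counts are arbitrary) of the mechanism being proposed: wrap matmul-heavy work in a CPU bfloat16 autocast context and compare timings. The actual speedup depends on whether the CPU has native bf16 support (e.g. AVX512_BF16 or AMX).

import time
import torch

def bench(use_bf16: bool, n: int = 2048, iters: int = 20) -> float:
    # time n x n matmuls with or without CPU bfloat16 autocast
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    ctx = torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_bf16)
    start = time.perf_counter()
    with ctx:
        for _ in range(iters):
            c = a @ b
    return time.perf_counter() - start

print(f"fp32:          {bench(False):.2f} s")
print(f"bf16 autocast: {bench(True):.2f} s")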

Proposed workflow

  1. Change in ./modules/devices.py

Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) to the autocast function:

def autocast(disable=False):
    from modules import shared

    if disable:
        return contextlib.nullcontext()

    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        # previously: return contextlib.nullcontext()
        # run CPU ops under bfloat16 autocast instead of skipping autocast entirely
        return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)

    return torch.autocast("cuda")
  2. Change in ./modules/sd_samplers_common.py
    Add if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16) in single_sample_to_image, because NumPy does not support bfloat16 yet (see the short check after the function below).
def single_sample_to_image(sample, approximation=None):
    if approximation is None:
        approximation = approximation_indexes.get(opts.show_progress_type, 0)

    if approximation == 2:
        x_sample = sd_vae_approx.cheap_approximation(sample)
    elif approximation == 1:
        x_sample = sd_vae_approx.model()(sample.to(devices.device, devices.dtype).unsqueeze(0))[0].detach()
    else:
        x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]

    x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
    if x_sample.dtype == torch.bfloat16:
        x_sample = x_sample.to(torch.float16)
    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    return Image.fromarray(x_sample)
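For reference, a quick standalone check (not webui code) of why the extra cast is needed: NumPy has no bfloat16 dtype, so converting a bf16 tensor directly raises, while float16 converts fine.

import torch

t = torch.randn(2, 2, dtype=torch.bfloat16)
try:
    t.numpy()  # NumPy has no bfloat16 dtype, so this raises
except TypeError as e:
    print("bfloat16 -> numpy fails:", e)

print(t.to(torch.float16).numpy().dtype)  # float16 converts without error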

Additional information

Other system information:

COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"

python: 3.10.6  •  torch: 2.1.0.dev20230506+cpu  •  xformers: N/A  •  gradio: 3.28.1  •  commit: 5ab7f213  •  checkpoint: b4391b7978

OS: Ubuntu 22.04

P.S. Since I'm still just a beginner programmer, the changes were made only as a proof of concept.

I was able to test the main functionality in practice and got a generation error only with the Stable Diffusion 2.1 model; everything else worked with a 2x increase in speed.

@LynxPDA LynxPDA added the enhancement New feature or request label May 18, 2023
@Sakura-Luna Sakura-Luna changed the title [Feature Request]: Add support for autocast bfloat16 for inference on the latest CPUs [Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs May 19, 2023
@Sakura-Luna (Collaborator)

bf16 roughly doubling speed in CPU mode is predictable. But how many people will benefit from this change is an open question. So far, WebUI's support for bf16 is basically nonexistent.

@ClashSAN ClashSAN added the documentation Improvements or additions to documentation label Jul 4, 2023
@catboxanon catboxanon removed the documentation Improvements or additions to documentation label Aug 3, 2023

CatEricka commented Oct 8, 2023

I get this error when loading a random SD 2.1 model and the SD XL 1.0 base model:

version: v1.6.0  •  python: 3.11.2  •  torch: 2.1.0+cpu  •  xformers: N/A  •  gradio: 3.41.2

Loading weights [e6bb9ea85b] from /pool/dev/sd-web/stable-diffusion-webui/models/Stable-diffusion/SD XL1.0/sdXL_v10VAEFix.safetensors
Creating model from config: /pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Applying attention optimization: InvokeAI... done.
changing setting sd_model_checkpoint to SD XL1.0/sdXL_v10VAEFix.safetensors [e6bb9ea85b]: RuntimeError
Traceback (most recent call last):
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/options.py", line 140, in set
    option.onchange()
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/initialize_util.py", line 170, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 752, in reload_model_weights
    load_model(checkpoint_info, already_loaded_state_dict=state_dict)
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 650, in load_model
    sd_model.cond_stage_model_empty_prompt = get_empty_cond(sd_model)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 535, in get_empty_cond
    d = sd_model.get_learned_conditioning([""])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models_xl.py", line 31, in get_learned_conditioning
    c = self.conditioner(sdxl_conds, force_zero_embeddings=['txt'] if force_zero_negative_prompt else [])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 141, in forward
    emb_out = embedder(batch[embedder.input_key])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 234, in forward
    z = self.process_tokens(tokens, multipliers)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 273, in process_tokens
    z = self.encode_with_transformers(tokens)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_open_clip.py", line 57, in encode_with_transformers
    d = self.wrapped.encode_with_transformer(tokens)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 470, in encode_with_transformer
    x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 502, in text_transformer_forward
    x = r(x, attn_mask=attn_mask)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 242, in forward
    x = q_x + self.ls_1(self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 228, in attention
    return self.attn(
           ^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 486, in network_MultiheadAttention_forward
    return originals.MultiheadAttention_forward(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 1241, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/functional.py", line 5440, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and  query.dtype: c10::BFloat16 instead.

After doing some searching, I guess it comes from this bug:

pytorch/pytorch#99012
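For context, a minimal standalone sketch (assuming torch 2.1.0+cpu; shapes are arbitrary) that should hit the same mismatch: under CPU bf16 autocast the attention projections produce bfloat16 queries, but a float32 additive mask is passed through unchanged into scaled_dot_product_attention.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 8, 64)
attn_mask = torch.zeros(8, 8)  # float additive mask, like open_clip's causal mask

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # the in-projection runs in bfloat16 under autocast, while the float32
    # mask is forwarded as-is, so SDPA rejects the dtype combination
    out, _ = mha(x, x, x, attn_mask=attn_mask, need_weights=False)
# expected: RuntimeError about attn_mask dtype not matching the bf16 query dtype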


CatEricka commented Oct 8, 2023

There is a workaround. Edit the file $(your-stable-diffusion-webui-repo-path)/venv/lib/$(your-python-version)/site-packages/open_clip/transformer.py in the open_clip library:

    def attention(
            self,
            q_x: torch.Tensor,
            k_x: Optional[torch.Tensor] = None,
            v_x: Optional[torch.Tensor] = None,
            attn_mask: Optional[torch.Tensor] = None,
    ):
        k_x = k_x if k_x is not None else q_x
        v_x = v_x if v_x is not None else q_x

-        attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
+        if torch.is_autocast_cpu_enabled():
+            # match the mask to the autocast dtype (bfloat16) that q/k/v will be cast to
+            attn_mask = attn_mask.to(torch.get_autocast_cpu_dtype()) if attn_mask is not None else None
+        else:
+            attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
        return self.attn(
            q_x, k_x, v_x, need_weights=False, attn_mask=attn_mask
        )[0]

This fixes the error reported above:

I was able to test the main functionality in practice and got a generation error only with the Stable Diffusion 2.1 model

Other effects have not been tested.


Also, I noticed that memory usage doubled, which is strange: shouldn't it be halved, since bfloat16 is half the width of float32?


sebaxakerhtc commented Jan 8, 2024

I tried to reproduce it on macOS, and for me it gets stuck at 0% when I start generating an image.
Why do we still use --no-half if we want half precision?

Now it runs at 630 s/it instead of 15 s/it XD


CatEricka commented Jan 8, 2024

Why do we still use --no-half if we want half precision?

It's just a dirty hack to make sure other code keeps working.

I tried to reproduce it on macOS, and for me it gets stuck at 0% when I start generating an image.

Now it runs at 630 s/it instead of 15 s/it XD

I guess it depends on your hardware and PyTorch support.

@sebaxakerhtc

I guess it depends on your hardware and PyTorch support.

Intel i7-10710U


CatEricka commented Jan 8, 2024

I guess it depends on your hardware and PyTorch support.

Intel i7-10710U

Sadly, it looks like your hardware doesn't support AVX-512 or bfloat16.

References:

AVX-512 BFloat16 Instructions (BF16) - x86

AVX-512 BFloat16 Instructions (AVX512_BF16) is an x86 extension, part of AVX-512, designed to accelerate neural network-based algorithms by performing dot-product on bfloat16.

Automatic Mixed Precision package

For CPU, only lower precision floating point datatype of torch.bfloat16 is supported for now.
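For anyone unsure about their CPU, here is a rough Linux-only check (a sketch, not an official API) for the relevant CPU flags and the dtype PyTorch will use for CPU autocast. Without avx512_bf16 (or amx_bf16 on newer Xeons), bf16 autocast still runs but may be emulated and slow.

import torch

flags = ""
try:
    with open("/proc/cpuinfo") as f:
        flags = f.read()
except OSError:
    print("could not read /proc/cpuinfo (non-Linux system)")

print("avx512_bf16:", "avx512_bf16" in flags)
print("amx_bf16:   ", "amx_bf16" in flags)
print("torch CPU autocast dtype:", torch.get_autocast_cpu_dtype())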


sebaxakerhtc commented Jan 8, 2024

So I moved to OpenVINO and now the speed is triple (5 s/it).
Maybe this will be helpful for other Intel CPU/GPU users.
