
Guidance on Integrating xformers with PyTorch 2.5.1 + CUDA 12.6.3 + cuDNN 9.5.1 + Flash Attention 2.7.0.post2 #1163

Closed
kksspoi opened this issue Nov 26, 2024 · 17 comments

@kksspoi

kksspoi commented Nov 26, 2024

❓ Questions and Help

Currently, I have successfully built PyTorch 2.5.1 from source (with CUDA 12.6.3 and cuDNN 9.5.1) and installed it into a Miniconda3 Python 3.11.10 virtual environment. I also built Flash Attention 2.7.0.post2 with CUDA 12.6.3 and installed it into the same environment as a wheel file.

Next, I built and installed xformers from source into this environment. However, upon running python -m xformers.info, I see the following output indicating that fa2F@v2.5.7-pt and fa2B@v2.5.7-pt are being used. This appears to be a version mismatch, as I expected my Flash Attention 2.7.0.post2 build to be picked up instead.

Previously, I used the following environment: PyTorch 2.5.1 (with CUDA 12.4.1 and cuDNN 9.5.1) and Flash Attention 2.6.3. In this case, xformers successfully integrated with the Flash Attention 2.6.3 that was installed in the virtual environment. However, in the current environment with CUDA 12.6.3, xformers appears unable to integrate with Flash Attention 2.7.0.post2, even though it was also built from source.

Is this a compatibility issue specific to CUDA 12.6.3? Or is there a known method to integrate xformers with Flash Attention 2.7.0.post2 in this setup? Any guidance or suggestions for enabling the integration would be greatly appreciated.

Below is the relevant output from python -m xformers.info:

xFormers 0.0.29+6e10bd21.d20241126
memory_efficient_attention.ckF: unavailable
memory_efficient_attention.ckB: unavailable
memory_efficient_attention.ck_decoderF: unavailable
memory_efficient_attention.ck_splitKF: unavailable
memory_efficient_attention.cutlassF-pt: available
memory_efficient_attention.cutlassB-pt: available
memory_efficient_attention.fa2F@v2.5.7-pt: available
memory_efficient_attention.fa2B@v2.5.7-pt: available
memory_efficient_attention.fa3F@0.0.0: unavailable
memory_efficient_attention.fa3B@0.0.0: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sequence_parallel_fused.write_values: available
sequence_parallel_fused.wait_values: available
sequence_parallel_fused.cuda_memset_32b_async: available
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
sp24._cslt_sparse_mm_search@0.0.0: available
sp24._cslt_sparse_mm@0.0.0: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.5.1
pytorch.cuda: available
gpu.compute_capability: 8.9
gpu.name: NVIDIA GeForce RTX 4060 Ti
dcgm_profiler: unavailable
build.info: available
build.cuda_version: 1206
build.hip_version: None
build.python_version: 3.11.10
build.torch_version: 2.5.1
build.env.TORCH_CUDA_ARCH_LIST: 8.0;8.6;8.9
build.env.PYTORCH_ROCM_ARCH: None
build.env.XFORMERS_BUILD_TYPE: None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: None
build.nvcc_version: 12.6.85
source.privacy: open source
    

@lw
Contributor

lw commented Nov 26, 2024

Apparently, we only support up to FlashAttention 2.6.3 (inclusive). If that's not met, we fall back onto the FlashAttention provided by PyTorch, which is what is happening for you.

FLASH_VER_LAST = (2, 6, 3) # last supported, inclusive

I'm not sure why this is the case, it's possible that it's just that we need to test compatibility before increasing that value. Would you be able to modify it manually and check if everything works (including running all the xFormers tests)? If so, we could consider raising that value.
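
As a quick local check of that gate, something along these lines (illustrative only, not xFormers' actual dispatch code, and assuming flash_attn exposes __version__) compares the installed flash-attn against the inclusive bound quoted above:

# Minimal sketch (not xFormers' dispatch code): compare the installed flash-attn
# version against the inclusive upper bound quoted above. Assumes flash_attn
# exposes __version__, which recent releases do.
import flash_attn

FLASH_VER_LAST = (2, 6, 3)  # last supported, inclusive (value quoted above)

installed = tuple(int(p) for p in flash_attn.__version__.split(".")[:3] if p.isdigit())
if installed > FLASH_VER_LAST:
    print(f"flash-attn {flash_attn.__version__} is newer than the supported bound; "
          "xFormers falls back to the FlashAttention bundled with PyTorch (the '-pt' ops).")
else:
    print(f"flash-attn {flash_attn.__version__} is within the supported range.")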

@kksspoi
Author

kksspoi commented Nov 27, 2024

Dear Luca Wehrstedt,

Thank you for your response yesterday. I tried the method you suggested, modifying line 66 of xformers/xformers/ops/fmha/flash.py to change FLASH_VER_LAST from 2.6.3 to 2.7.0. After applying this change, I checked with python -m xformers.info and saw that the output had been updated to:

memory_efficient_attention.fa2F@v2.7.0.post2-pt: available
memory_efficient_attention.fa2B@v2.7.0.post2-pt: available

However, when I attempted to generate images in ComfyUI using PyTorch 2.5.1 (built from source with CUDA 12.6.3 and cuDNN 9.5.1), along with xFormers 0.0.29 and FlashAttention 2.7.0.post2, a ValueError occurred and the generation failed.

To investigate further, I tested various library configurations and made the following observations:

Using PyTorch 2.5.1 + FlashAttention 2.7.0.post2 alone, image generation worked without errors.
Combining PyTorch 2.5.1 + FlashAttention 2.6.3 + xFormers 0.0.29 also allowed successful image generation.
If I reverted line 66 in xformers/xformers/ops/fmha/flash.py back to 2.6.3, I could successfully generate images even with FlashAttention 2.7.0 installed in the virtual environment, as xFormers fell back to the FlashAttention 2.5.7 bundled with PyTorch.

From a performance perspective—although I understand this is just my personal observation and might not be statistically significant—using PyTorch 2.5.1 + FlashAttention 2.7.0.post2 alone reduced image generation time by approximately 0.15 seconds compared to integrating FlashAttention 2.6.3 with xFormers.
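
(For anyone who wants to quantify this kind of difference outside ComfyUI, a rough micro-benchmark of the two attention calls can be put together along the lines below; this is illustrative only and assumes fp16 CUDA tensors in the layout each API expects.)

# Rough timing sketch (illustration only, not the ComfyUI measurement):
# compare PyTorch's scaled_dot_product_attention against xFormers'
# memory_efficient_attention on the same inputs, timed with CUDA events.
import torch
import xformers.ops as xops

q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)  # (B, H, M, K) for SDPA
k, v = torch.randn_like(q), torch.randn_like(q)

def time_cuda(fn, iters=50):
    # Warm up, then time with CUDA events so asynchronous kernel launches are measured.
    for _ in range(5):
        fn()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

sdpa_ms = time_cuda(lambda: torch.nn.functional.scaled_dot_product_attention(q, k, v))
# xFormers expects the (B, M, H, K) layout, hence the transpose.
qx, kx, vx = (t.transpose(1, 2).contiguous() for t in (q, k, v))
xf_ms = time_cuda(lambda: xops.memory_efficient_attention(qx, kx, vx))
print(f"SDPA: {sdpa_ms:.3f} ms/call, xFormers: {xf_ms:.3f} ms/call")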

Based on these results, I'm inclined to think there may be a compatibility issue between xFormers and FlashAttention 2.7.0.post2. Do you think that is the case? If so, does it mean that, for now, xFormers cannot be integrated with FlashAttention 2.7.0.post2? I find the setup that uses xFormers more versatile, but it seems this integration may not be feasible at the moment. I'd greatly appreciate any further advice you can provide.

Best regards, 
kkspoi                                                                   

@lw
Contributor

lw commented Nov 29, 2024

You'll need to tell us exactly what error you got with FlashAttention 2.7.0 if you want us to help

@kksspoi
Author

kksspoi commented Nov 29, 2024

!!! Exception during processing !!! not enough values to unpack (expected 8, got 4)
Traceback (most recent call last):
File "/home/sk/src/ComfyUI/execution.py", line 323, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/execution.py", line 198, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/execution.py", line 169, in _map_node_over_list
process_inputs(input_dict, i)
File "/home/sk/src/ComfyUI/execution.py", line 158, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/nodes.py", line 1457, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/nodes.py", line 1424, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/sample.py", line 43, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 855, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 753, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 740, in sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 719, in inner_sample
samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 624, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/k_diffusion/sampling.py", line 155, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 299, in call
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 706, in call
return self.predict_noise(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 709, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 279, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/model_base.py", line 145, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 857, in forward
h = forward_timestep_embed(module, h, emb, context, transformer_options, time_context=time_context, num_video_frames=num_video_frames, image_only_indicator=image_only_indicator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 44, in forward_timestep_embed
x = layer(x, context, transformer_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/attention.py", line 709, in forward
x = block(x, context=context[i], transformer_options=transformer_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/attention.py", line 596, in forward
n = self.attn1(n, context=context_attn1, value=value_attn1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/attention.py", line 490, in forward
out = optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/ComfyUI/comfy/ldm/modules/attention.py", line 383, in attention_xformers
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/xformers/xformers/ops/fmha/init.py", line 306, in memory_efficient_attention
return _memory_efficient_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/xformers/xformers/ops/fmha/init.py", line 467, in _memory_efficient_attention
return _memory_efficient_attention_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/xformers/xformers/ops/fmha/init.py", line 490, in memory_efficient_attention_forward
out, *
= op.apply(inp, needs_gradient=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/xformers/xformers/ops/fmha/flash.py", line 677, in apply
out, softmax_lse, rng_state = cls.OPERATOR(
^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 324, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sk/miniconda3/envs/tb/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/sk/src/xformers/xformers/ops/fmha/flash.py", line 139, in _flash_fwd
(
ValueError: not enough values to unpack (expected 8, got 4)

The above error occurs during image generation in ComfyUI when integrating PyTorch 2.5.1 (CUDA 12.6.3 + cuDNN 9.5.1) with FlashAttention 2.7.0.post2 and xFormers 0.0.29. However, the issue does not occur with a different combination: PyTorch 2.5.1 with FlashAttention 2.7.0.post2 alone. Since FlashAttention itself has been successfully built and tested with CUDA 12.6.3, the remaining potential causes seem to be either a failure in building xformers or that simply modifying line 66 of /home/sk/src/xformers/xformers/ops/fmha/flash.py is insufficient for xformers to function properly.

Could you advise on specific steps I should take to resolve this issue?
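
One way to narrow this down outside ComfyUI is a minimal standalone call that forces the FlashAttention backend in xFormers. The sketch below assumes xformers.ops exposes MemoryEfficientAttentionFlashAttentionOp for pinning the op; if the problem is in the glue between xFormers and FlashAttention, this should raise the same ValueError without involving ComfyUI.

# Standalone repro sketch: run memory_efficient_attention with the FlashAttention
# backend forced, outside ComfyUI. If the flash.py unpacking bug is present, this
# should raise the same "not enough values to unpack" ValueError.
# Assumption: xformers.ops.MemoryEfficientAttentionFlashAttentionOp is exposed by
# this build; if not, drop op= and let the default dispatcher choose.
import torch
import xformers.ops as xops

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)  # (B, M, H, K)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp
)
print("OK:", tuple(out.shape), out.dtype)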

For reference, here are the commands and environment variables used during the build process:
  Environment Variables 
export MAX_JOBS=6
export USE_NINJA=1
export TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"
export USE_CUDA=1
export CUDA_HOME=/usr/local/cuda-12.6
export CUDA_NVCC_EXECUTABLE=$CUDA_HOME/bin/nvcc
export CUDNN_LIB_DIR=/lib/x86_64-linux-gnu
export CUDNN_INCLUDE_DIR=/usr/include
export CUDNN_LIBRARY=$CUDNN_LIB_DIR/libcudnn.so
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CUDAHOSTCXX=/usr/bin/g++-12                 
Build Commands

First, I run:

python setup.py build_ext              
If the build completes successfully, I proceed with the following command to install the development version in the Miniconda3 (Python 3.11.10) virtual environment:

python setup.py develop

Any guidance or recommendations for resolving this issue would be greatly appreciated.

@lw
Contributor

lw commented Dec 2, 2024

The error you reported shows that FlashAttention v2.7.0 did indeed change the API of some of its functions (specifically, mha_fwd reduced the number of returned values from 8 to 4, see Dao-AILab/flash-attention#1139). This means that we'll need to change the xFormers code that calls into FlashAttention, and this will require some time and effort. I can't guarantee an ETA.
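
Schematically, the kind of change needed on the calling side looks roughly like the branch below. This is a sketch only, not the actual xFormers patch, and the exact tuple layouts are assumptions for illustration:

def unpack_flash_fwd_outputs(ret):
    # Sketch only (not the actual xFormers patch): tolerate both the pre-2.7.0
    # 8-tuple and the 2.7.0+ 4-tuple return layouts of the flash forward call.
    # The index positions below are assumptions for illustration.
    if len(ret) == 8:
        out, softmax_lse, rng_state = ret[0], ret[5], ret[7]   # assumed pre-2.7.0 positions
    elif len(ret) == 4:
        out, softmax_lse, rng_state = ret[0], ret[1], ret[3]   # assumed (out, lse, S_dmask, rng_state)
    else:
        raise RuntimeError(f"unexpected number of flash forward outputs: {len(ret)}")
    return out, softmax_lse, rng_state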

@lw
Contributor

lw commented Dec 11, 2024

We updated FlashAttention to 2.7.2: 839c4ec

@ACGNnsj

ACGNnsj commented Dec 27, 2024

We updated FlashAttention to 2.7.2: 839c4ec

Excuse me, will this resolve the problem?
I've tried the latest commit, but the error seems to be the same.

File "G:\pycharm\ComfyUI\comfy\ldm\modules\attention.py", line 396, in attention_xformers
    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=mask)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\xformers\ops\fmha\__init__.py", line 306, in memory_efficient_attention
    return _memory_efficient_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\xformers\ops\fmha\__init__.py", line 467, in _memory_efficient_attention
    return _memory_efficient_attention_forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\xformers\ops\fmha\__init__.py", line 490, in _memory_efficient_attention_forward
    out, *_ = op.apply(inp, needs_gradient=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\xformers\ops\fmha\flash.py", line 677, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
                                  ^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\torch\_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\torch\_library\custom_ops.py", line 324, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\torch\_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\torch\_dynamo\eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\torch\_library\custom_ops.py", line 367, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "G:\packages\poetry\virtualenvs\stable-diffusion-webui-_SfT44pY-py3.12\Lib\site-packages\xformers\ops\fmha\flash.py", line 139, in _flash_fwd
    (
ValueError: not enough values to unpack (expected 8, got 4)

@lw
Contributor

lw commented Dec 27, 2024

Please make sure you reinstall all of Flash + xFormers from scratch, with full compilation. It looks like a version mismatch to me.
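
A quick way to confirm whether the rebuilt xFormers actually picked up the local flash-attn is to cross-check the two versions and then re-read the fa2 line in python -m xformers.info (an entry ending in "-pt" means the PyTorch-bundled FlashAttention is still being used). A minimal sketch, assuming both packages expose __version__:

# Minimal version cross-check, assuming both packages expose __version__.
import flash_attn
import xformers

print("xformers build:", xformers.__version__)
print("flash-attn in env:", flash_attn.__version__)
# Then re-run `python -m xformers.info`: an fa2F entry without the "-pt" suffix
# means the local flash-attn build is picked up; fa2F@v2.5.7-pt means the
# PyTorch-bundled fallback is still being used.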

@ACGNnsj

ACGNnsj commented Dec 27, 2024

Well, mha_fwd here returns 4 elements, while fwd here is expected to return 8 elements.

@abrahamezzeddine

abrahamezzeddine commented Dec 30, 2024

Hello,

I am getting the exact same issue.
Error during memory_efficient_attention: not enough values to unpack (expected 8, got 4), with release 0.0.29 after the update.

Running this on Windows, using the installation instructions provided by this library:

xFormers 0.0.29
memory_efficient_attention.ckF: unavailable
memory_efficient_attention.ckB: unavailable
memory_efficient_attention.ck_decoderF: unavailable
memory_efficient_attention.ck_splitKF: unavailable
memory_efficient_attention.cutlassF-pt: available
memory_efficient_attention.cutlassB-pt: available
memory_efficient_attention.fa2F@v2.7.2.post1: available
memory_efficient_attention.fa2B@v2.7.2.post1: available
memory_efficient_attention.fa3F@0.0.0: unavailable
memory_efficient_attention.fa3B@0.0.0: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sequence_parallel_fused.write_values: available
sequence_parallel_fused.wait_values: available
sequence_parallel_fused.cuda_memset_32b_async: available
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
sp24._cslt_sparse_mm_search@0.0.0: available
sp24._cslt_sparse_mm@0.0.0: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.5.1
pytorch.cuda: available
gpu.compute_capability: 8.6
gpu.name: NVIDIA RTX A6000
dcgm_profiler: unavailable
build.info: available
build.cuda_version: 1204
build.hip_version: None
build.python_version: 3.10.11
build.torch_version: 2.5.1+cu124
build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a
build.env.PYTORCH_ROCM_ARCH: None
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: -allow-unsupported-compiler
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.29
build.nvcc_version: 12.4.131
source.privacy: open source

Edit: I downgraded to 0.0.28 and it worked again, so I guess something in 0.0.29 has become broken? @lw

@danthe3rd
Contributor

Hi,
Thanks for reporting it, this is indeed a bug in the new release. Let me get a fix out there for you

@danthe3rd danthe3rd reopened this Dec 30, 2024
@abrahamezzeddine

Wonderful to hear a fix is on the way! This library is gold!

@danthe3rd
Contributor

The fix is out: 46a02df
Can you verify it works for you?
I'll create a new version tomorrow (0.0.29.post1) to fix it for everyone
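
Once 0.0.29.post1 is installed, a short smoke test like the one below should confirm the fix end to end before going back to ComfyUI. As with the earlier sketch, forcing the op via MemoryEfficientAttentionFlashAttentionOp is an assumption about the public API; drop op= to use the default dispatcher.

# Post-upgrade smoke test: confirm the installed version and that a
# FlashAttention-backed call no longer raises the unpacking ValueError.
import torch
import xformers
import xformers.ops as xops

print("xformers:", xformers.__version__)  # expect 0.0.29.post1
q = k = v = torch.randn(1, 256, 8, 64, device="cuda", dtype=torch.float16)
out = xops.memory_efficient_attention(q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp)
print("memory_efficient_attention OK:", tuple(out.shape))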

@abrahamezzeddine

I'm currently running training, but I will apply the fix as soon as it is done. I'll hopefully report back to you early tomorrow.

@abrahamezzeddine

@danthe3rd
Managed to try it out now. I modified the files directly to incorporate the changes, because I am getting ninja errors from the 260-character path length limit on Windows (even with the latest ninja 1.12, which was supposed to handle similar length issues).

But I can confirm that it is working now, with release 0.29 with the changes you implemented!

Many thanks for taking such quick action on this matter. Happy new year!

@danthe3rd
Contributor

The build is in progress for 0.0.29.post1, thanks for the confirmation :)
Happy new year to you too!

@rltgjqmcpgjadyd

I had the same ValueError issue but it was fixed in 0.0.29.post1
