Might be a solution to get built/compiles Flash Attention 2 on Windows #595

Open
Akatsuki030 opened this issue Oct 8, 2023 · 53 comments

@Akatsuki030

Akatsuki030 commented Oct 8, 2023

As a Windows user, I tried to compile this and found the problem was in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". When the template tried to reference the variable "Headdim", it caused error C2975. I think this might be the reason why we always get compile errors on Windows. Below is how I solved the problem:

First, in the file "flash_bwd_launch_template.h", you can find many functions like "run_mha_bwd_hdimXX", each with a constant declaration "Headdim = XX" and templates like this: run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did was replace every "Headdim" inside those templates with the literal value. For example, if the function is called run_mha_bwd_hdim128 and declares
"Headdim = 128", you change Headdim to 128 in the templates, giving run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing to the "run_mha_fwd_hdimXX" functions and their templates.

Second, another error comes from "flash_fwd_launch_template.h", line 107, again caused by referencing the constant "kBlockM". I rewrote the if-else statement there as:

		if constexpr(Kernel_traits::kHeadDim % 128 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 4 - 1) / 4);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else if constexpr(Kernel_traits::kHeadDim % 64 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 8 - 1) / 8);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else{
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 16 - 1) / 16);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}

Third, for the function "run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to change "kBlockM" in the template to 64. Then you can try to compile it.
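
The pattern of the edit is the same as above; roughly (the trait parameter list here is illustrative, not copied from the file):

    // before: kBlockM is referenced inside the template argument list
    run_flash_splitkv_fwd<Flash_fwd_kernel_traits<Headdim, kBlockM, kBlockN, 4, false, false, T>>(params, stream);
    // after: the literal 64 is substituted for kBlockM
    run_flash_splitkv_fwd<Flash_fwd_kernel_traits<Headdim, 64, kBlockN, 4, false, false, T>>(params, stream);
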
These fixes look crude, but they really solved my problem: I successfully compiled flash_attn_2 on Windows. I still need to take some time to test it on other computers.
I put the files I rewrote: link.
I think there might be a better solution, but for me, it at least works.
Oh, I didn't use Ninja; I compiled it from source directly. Maybe someone can try compiling it with Ninja?
EDIT: I used

  • python 3.11
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Anaconda
  • Windows 11 22H2
@Akatsuki030 Akatsuki030 changed the title Migbe be a solution to get built/compiles Flash Attention 2 on Windows Might be a solution to get built/compiles Flash Attention 2 on Windows Oct 8, 2023
@Panchovix

Panchovix commented Oct 8, 2023

I did try replacing your .h files in my venv, with

  • Python 3.10
  • Pytorch 2.2 Nightly
  • CUDA 12.1
  • Visual Studio 2022
  • Ninja

And the build failed fairly quickly. I have uninstalled ninja, but it seems to be importing it anyway? How did you manage to not use ninja?

Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it.

EDIT: Tried with CUDA 12.2, no luck either.

EDIT2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D and 64GB RAM.

It seems that for some reason Windows tries to use/import those variables even when they're not declared. But, at the same time, if they're used a few lines below, it doesn't work.


EDIT3: I can confirm it works for exllamav2 + FA v2

Without FA

-- Measuring token speed...
 ** Position     1 + 127 tokens:   13.5848 t/s
 ** Position   128 + 128 tokens:   13.8594 t/s
 ** Position   256 + 128 tokens:   14.1394 t/s
 ** Position   384 + 128 tokens:   13.8138 t/s
 ** Position   512 + 128 tokens:   13.4949 t/s
 ** Position   640 + 128 tokens:   13.6474 t/s
 ** Position   768 + 128 tokens:   13.7073 t/s
 ** Position   896 + 128 tokens:   12.3254 t/s
 ** Position  1024 + 128 tokens:   13.8960 t/s
 ** Position  1152 + 128 tokens:   13.7677 t/s
 ** Position  1280 + 128 tokens:   12.9869 t/s
 ** Position  1408 + 128 tokens:   12.1336 t/s
 ** Position  1536 + 128 tokens:   13.0463 t/s
 ** Position  1664 + 128 tokens:   13.2463 t/s
 ** Position  1792 + 128 tokens:   12.6211 t/s
 ** Position  1920 + 128 tokens:   13.1429 t/s
 ** Position  2048 + 128 tokens:   12.5674 t/s
 ** Position  2176 + 128 tokens:   12.5847 t/s
 ** Position  2304 + 128 tokens:   13.3471 t/s
 ** Position  2432 + 128 tokens:   12.9135 t/s
 ** Position  2560 + 128 tokens:   12.2195 t/s
 ** Position  2688 + 128 tokens:   11.6120 t/s
 ** Position  2816 + 128 tokens:   11.2545 t/s
 ** Position  2944 + 128 tokens:   11.5304 t/s
 ** Position  3072 + 128 tokens:   11.7982 t/s
 ** Position  3200 + 128 tokens:   11.8041 t/s
 ** Position  3328 + 128 tokens:   12.8038 t/s
 ** Position  3456 + 128 tokens:   12.7324 t/s
 ** Position  3584 + 128 tokens:   11.7733 t/s
 ** Position  3712 + 128 tokens:   10.7961 t/s
 ** Position  3840 + 128 tokens:   11.1014 t/s
 ** Position  3968 + 128 tokens:   10.8474 t/s

With FA

-- Measuring token speed...
** Position     1 + 127 tokens:   22.6606 t/s
** Position   128 + 128 tokens:   22.5140 t/s
** Position   256 + 128 tokens:   22.6111 t/s
** Position   384 + 128 tokens:   22.6027 t/s
** Position   512 + 128 tokens:   22.3392 t/s
** Position   640 + 128 tokens:   22.0570 t/s
** Position   768 + 128 tokens:   22.3728 t/s
** Position   896 + 128 tokens:   22.4983 t/s
** Position  1024 + 128 tokens:   21.9384 t/s
** Position  1152 + 128 tokens:   22.3509 t/s
** Position  1280 + 128 tokens:   22.3189 t/s
** Position  1408 + 128 tokens:   22.2739 t/s
** Position  1536 + 128 tokens:   22.4145 t/s
** Position  1664 + 128 tokens:   21.9608 t/s
** Position  1792 + 128 tokens:   21.7645 t/s
** Position  1920 + 128 tokens:   22.1468 t/s
** Position  2048 + 128 tokens:   22.3400 t/s
** Position  2176 + 128 tokens:   21.9830 t/s
** Position  2304 + 128 tokens:   21.8387 t/s
** Position  2432 + 128 tokens:   20.2306 t/s
** Position  2560 + 128 tokens:   21.0056 t/s
** Position  2688 + 128 tokens:   22.2157 t/s
** Position  2816 + 128 tokens:   22.1912 t/s
** Position  2944 + 128 tokens:   22.1835 t/s
** Position  3072 + 128 tokens:   22.1393 t/s
** Position  3200 + 128 tokens:   22.1182 t/s
** Position  3328 + 128 tokens:   22.0821 t/s
** Position  3456 + 128 tokens:   22.0308 t/s
** Position  3584 + 128 tokens:   22.0060 t/s
** Position  3712 + 128 tokens:   21.9909 t/s
** Position  3840 + 128 tokens:   21.9816 t/s
** Position  3968 + 128 tokens:   21.9757 t/s

@tridao
Contributor

tridao commented Oct 8, 2023

This is very helpful, thanks @Akatsuki030 and @Panchovix.
@Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with constexpr static int instead of constexpr int? I've just pushed a commit that does it. Can you check if that compiles on Windows?
A while back someone (I think it was Daniel Haziza from the xformers team) told me that they need constexpr static int for Windows compilation.
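
For anyone following along, the change is just the storage class on the local constants. A stripped-down sketch (the names below are simplified stand-ins, not the real kernel traits):

    #include <cstdio>

    template <int kHeadDim, int kBlockM>
    struct Kernel_traits {
        static constexpr int kSize = kHeadDim * kBlockM;
    };

    template <typename Traits>
    void run_flash_fwd() { std::printf("kSize = %d\n", Traits::kSize); }

    template <int kHeadDim>
    void run_mha_fwd_hdim() {
        // A plain `constexpr int Headdim = kHeadDim;` is what reportedly trips MSVC
        // error C2975 once Headdim is used as a template argument below; declaring it
        // `constexpr static` is the change in the commit mentioned above.
        constexpr static int Headdim = kHeadDim;
        run_flash_fwd<Kernel_traits<Headdim, 64>>();
    }

    int main() { run_mha_fwd_hdim<128>(); }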

@Panchovix

Panchovix commented Oct 9, 2023

@tridao just tested the compilation with your latest push, and now it works.

I did use

  • Python 3.10
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Visual Studio 2022
  • Ninja

@tridao
Contributor

tridao commented Oct 9, 2023

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

@Panchovix

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

Great! I built a whl with python setup.py bdist_wheel, but it seems some people have issues with it; it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step somewhere for now.

@Panchovix

Panchovix commented Oct 9, 2023

@tridao based on some tests, it seems you need at least CUDA 12.x and a torch build that matches it to build flash attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well.

We have to be aware that the ooba webui ships with torch+cu118 by default, so on Windows with that CUDA version it won't compile.

@tridao
Contributor

tridao commented Oct 9, 2023

I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows.

@bdashore3

bdashore3 commented Oct 9, 2023

Just built on cuda 12.1 and tested with exllama_v2 on oobabooga's webui. And can confirm what @Panchovix said above, cuda 12.x is required for Cutlass (12.1 if you want pytorch v2.1).

https://github.com/bdashore3/flash-attention/releases/tag/2.3.2

@bdashore3

Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.

@tridao
Contributor

tridao commented Oct 9, 2023

Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.

Right now GitHub Actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel built for 12.2 and use that (they're compatible).

If you (or anyone) have experience with setting up github actions for Windows I'd love to get help there.

@dunbin

dunbin commented Oct 9, 2023

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

Great! I built a whl with python setup.py bdist_wheel, but it seems some people have issues with it; it's here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step somewhere for now.

You truly are a god among men!

@mattiamazzari

mattiamazzari commented Oct 11, 2023

Works like a charm. I used:

  • CUDA 12.2
  • PyTorch 2.2.0.dev20231011+cu121 (installed with the command pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121). Be sure you install this CUDA version and not the CPU version.

I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (previously I had set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took roughly 3 hours to compile everything with 16GB of RAM.

If you get a "ninja: build stopped: subcommand failed" error, do this:
git clean -xdf
python setup.py clean
git submodule sync
git submodule deinit -f .
git submodule update --init --recursive
python setup.py install

@YuehChuan

GOOD🎶
RTX 4090 24GB VRAM, AMD 7950X, 64GB RAM
python3.8 python3.10 both work

python3.10
https://www.python.org/downloads/release/python-3100/
win11

python -m venv venv

cd venv/Scripts
activate
-----------------------

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention

pip install packaging 
pip install wheel

set MAX_JOBS=4
python setup.py install
flashattention2

@Nicoolodion2

Hey, I finally got the wheels built (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install Flash-attention... Anyone got a solution?

@bdashore3

@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires Cuda 12.1 while ooba is still stuck on 11.8.

@neocao123

neocao123 commented Oct 18, 2023

I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. Flash attention and rotary have now been built from @bdashore3's branch, while layer_norm still errors out.

I used py3.10, VS 2019, CUDA 12.1.

@tridao
Contributor

tridao commented Oct 18, 2023

You don't have to use layer_norm.

@neocao123

neocao123 commented Oct 18, 2023

You don't have to use layer_norm.

However, I made it work.

The trouble is in ln_bwd_kernels.cuh line 54

For some unknown reason, BOOL_SWITCH did not turn the bool has_colscale into the constexpr bool HasColscaleConst, which caused error C2975. I just rewrote it as:

if (HasColscaleConst) {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               true,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;

    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
} else {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               false,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;

    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
}

That's a crude way to do it, but it works, and it's compiling now.
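
For context, BOOL_SWITCH is roughly the following pattern (a simplified sketch, not the exact macro from the repo), which is why the body normally sees the flag as a compile-time constant:

    // Dispatch a runtime bool to a compile-time constant by instantiating the
    // lambda body once per branch.
    #define BOOL_SWITCH(COND, CONST_NAME, ...)             \
        [&] {                                              \
            if (COND) {                                    \
                constexpr static bool CONST_NAME = true;   \
                return __VA_ARGS__();                      \
            } else {                                       \
                constexpr static bool CONST_NAME = false;  \
                return __VA_ARGS__();                      \
            }                                              \
        }()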

@havietisov

Does that mean I can use FA2 on Windows if I build it from source?

@dunbin

dunbin commented Dec 14, 2023 via email

@Piscabo

Piscabo commented Jan 10, 2024

Is there any compiled wheel for Windows 11, Python 3.11, CUDA 12.2, Torch 2.1.2?

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Could not build wheels for flash_attn, which is required to install pyproject.toml-based projects

@dunbin

dunbin commented Jan 10, 2024 via email

@C0D3-BR3AK3R

C0D3-BR3AK3R commented Jun 12, 2024

I am trying to install Flash Attention 2 on Windows 11, with Python 3.12.3, and here is my setup -
RTX 3050 Laptop
16 GB RAM
Core i7 12650H.

So I have set up MSVC Build Tools 2022 alongside MS VS Community 2022. Once I cloned the Flash Attention git repo, I ran python setup.py install and it gives the error below:

running build_ext
D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py:384: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'flash_attn_2_cuda' extension
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn
creating D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src      
Emitting ninja build file D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\build.ninja...
Compiling objects...
Using envvar MAX_JOBS (1) as the number of workers...
[1/49] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
FAILED: D:/Github/Deep-Learning-Basics/LLM Testing/MultiModalAI/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/flash_api.obj
cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\src" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\cutlass\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\torch\csrc\api\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\TH" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\include\THC" "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" "-ID:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\include" -IC:\Python312\include -IC:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" -c "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\csrc\flash_attn\flash_api.cpp" /Fo"D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc/flash_attn/flash_api.obj" -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++17'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "C:\Python312\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\flash-attention\setup.py", line 311, in <module>
    setup(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 184, in setup     
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\core.py", line 200, in run_commands
    dist.run_commands()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 87, in run        
    self.do_egg_install()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install.py", line 139, in do_egg_install
    self.run_command('bdist_egg')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 167, in run     
    cmd = self.call_command('install_lib', warn_dir=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\install_lib.py", line 11, in run    
    self.build()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\install_lib.py", line 110, in build
    self.run_command('build_ext')
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\dist.py", line 968, in run_command
    super().run_command(command)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 91, in run      
    _build_ext.run(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 359, in run
    self.build_extensions()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 870, in build_extensions
    build_ext.build_extensions(self)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 479, in build_extensions
    self._build_extensions_serial()
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 505, in _build_extensions_serial
    self.build_extension(ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\command\build_ext.py", line 252, in build_extension
    _build_ext.build_extension(self, ext)
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 560, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 842, in win_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 1783, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "D:\Github\Deep-Learning-Basics\LLM Testing\MultiModalAI\Flash-env\Lib\site-packages\torch\utils\cpp_extension.py", line 2123, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

I'm pretty new to this, so I was hoping someone could point me in the right direction. I couldn't find any way to fix this issue elsewhere online. Any help would be appreciated. Thanks!

@dunbin

dunbin commented Jun 12, 2024 via email

@dicksondickson

dicksondickson commented Jun 12, 2024

Seems like you are missing Cuda Toolkit

Download it from Nvidia's website
cuda

I recently recompiled mine with the following:
Windows 11
Python 3.12.4
pyTorch Nightly 2.4.0.dev20240606+cu124
Cuda 12.5.0_555.85
Nvidia v555.99 Drivers

If you want to use my batch file, it's hosted here:
batch file

@C0D3-BR3AK3R

Seems like you are missing Cuda Toolkit

Download it from Nvidia's website cuda

I recently recompiled mine with the following: Windows 11 Python 3.12.4 pyTorch Nightly 2.4.0.dev20240606+cu124 Cuda 12.5.0_555.85 Nvidia v555.99 Drivers

If you want to use my batch file, it's hosted here: batch file

Oh sorry, I forgot to mention, I do have Cuda toolkit installed. Below is my nvcc -V

 nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

And below is my nvidia-smi

nvidia-smi
Wed Jun 12 13:05:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8              3W /   72W |      32MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     26140    C+G   ...8bbwe\SnippingTool\SnippingTool.exe      N/A      |
+-----------------------------------------------------------------------------------------+

@dicksondickson

"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?

@C0D3-BR3AK3R

C0D3-BR3AK3R commented Jun 12, 2024

"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.40.33807\include\cstddef(11): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
ninja: build stopped: subcommand failed."

Have you tried installing Visual Studio 2022?

Yes, I had installed Visual Studio 2022 along with the Build Tools 2022. But the issue seemed to be stemming from Visual Studio itself, since I managed to build Flash Attention 2 after modifying the Visual Studio Community 2022 installation and adding the Windows 11 SDK (available under Desktop Development with C++ >> Optional).

Thanks!

@konan009

konan009 commented Jun 12, 2024

Just sharing: I was able to build this repo on Windows, without needing the changes above, with these settings:

  1. Python 3.11
  2. VS 2022 C++ (v14.38-17.9)
  3. CUDA 12.2

@d-kleine

Seems like CUDA 12.4 and 12.5 are not yet supported?

@fangyizhu

I was able to compile and build from the source repository on Windows 11 with:

CUDA 12.5
Python 3.12

I have a Visual Studio 2019 that came with Windows and I've never used it.

pip install never worked for me.

@abgulati

abgulati commented Jun 26, 2024

Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):

pip install flash-attn --no-build-isolation

Python 3.11.5 & PIP 24.1.1
CUDA 12.4
PyTorch installed via:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

PIP dependencies:

pip install wheel==0.43.0
pip install ninja==1.11.1
pip install packaging==23.2

System Specs:

Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)

@d-kleine

took about an hour

Windows: roughly an hour. Ubuntu (Linux): a few seconds to a few minutes...

@NovaYear

NovaYear commented Jul 4, 2024

Successfully installed on Windows 11 23H2 (OS Build 22631.3737) via pip install (took about an hour; system specs at the end):

pip install flash-attn --no-build-isolation

Python 3.11.5 & PIP 24.1.1 CUDA 12.4 PyTorch installed via:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

PIP dependencies:

pip install wheel==0.43.0
pip install ninja==1.11.1
pip install packaging==23.2

System Specs:

Intel Core i9 13900KF Nvidia RTX 3090FE 32GB DDR5 5600MT/s (16x2)

Thanks for the information. I compiled it as you said and it was successful. I set MAX_JOBS=8; the other parameters are the same as yours. Compilation information:
winver: W11 24H2 26100.836
ram: 32GB DDR4 4000MHz
cpu: 5700G
gpu: RTX 3090 24GB
running: 8 compile threads
cpu usage: ~70%
ram usage: ~31GB
time: ~50 mins

@dicksondickson

I've been installing flash attention on multiple systems and made some batch files to clone and compile for convenience.
You can get them here: https://github.com/dicksondickson/ComfyUI-Clean-Install

@Julianvaldesv

Julianvaldesv commented Jul 9, 2024

I have tried all kinds of things, but still cannot get Flash Attention to compile on my Windows laptop. These are my settings; I don't know if I have to upgrade CUDA to 12.x. Any advice?
C:\Users\15023>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Python 3.10.8
Intel(R) Core(TM) i9-14900HX 2.20 GHz
64-bit operating system, x64-based processor
Windows 11 Pro
Nvidia RTX 4080
Package Version


ninja 1.11.1
numpy 1.26.4
packaging 23.2
pillow 10.4.0
pip 24.1.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
requests 2.32.3
safetensors 0.4.3
setuptools 70.2.0
tokenizers 0.19.1
torch 2.3.1+cu118
torchaudio 2.3.1+cu118
torchvision 0.18.1+cu118
tqdm 4.66.4
urllib3 2.2.2
wheel 0.43.0

@Boubou78000

Boubou78000 commented Jul 10, 2024

I ran

set MAX_JOBS=4

And restarted my computer.
Then I ran the pip command and it worked

@jhj0517

jhj0517 commented Jul 10, 2024

set MAX_JOBS=1
pip install flash-attn

It worked, but it took hours to install on Windows.
(Stuck at "Building wheel for flash-attn (setup.py) ..."; building the wheel was super slow.)

@Julianvaldesv

It does not work in my case, :(
PC Specs:
Intel(R) Core(TM) i9-14900HX 2.20 GHz
64-bit operating system, x64-based processor
Windows 11 Pro
Nvidia RTX 4080

Settings:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Package Version


python 3.10.8
ninja 1.11.1
numpy 1.26.4
packaging 23.2
pillow 10.4.0
pip 24.1.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
requests 2.32.3
setuptools 70.2.0
tokenizers 0.19.1
torch 2.3.1+cu118
torchaudio 2.3.1+cu118
torchvision 0.18.1+cu118
tqdm 4.66.4
urllib3 2.2.2
wheel 0.43.0

VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\

Commands :
set MAX_JOBS=1
pip install flash-attn --no-build-isolation

Errors:

Building wheels for collected packages: flash-attn
Building wheel for flash-attn (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [271 lines of output]
fatal: not a git repository (or any of the parent directories): .git

  torch.__version__  = 2.3.1+cu118

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

FAILED: C:/Users/15023/AppData/Local/Temp/pip-install-dfkun1cn/flash-attn_b24e1ea8cfd04a7980b436f7faaf577f/build/temp.win-amd64-cpython-310/Release/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.obj

RuntimeError: Error compiling objects for extension
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)

@abgulati

@Julianvaldesv

The key line is: fatal: not a git repository (or any of the parent directories): .git

This occurs because the setup.py script for flash-attention is trying to run a Git command to update submodules.

Clone the flash-attn git repo and run the pip install command from within it. If you encounter errors stating no flash-attn or something, try running pip install . --no-build-isolation

@Julianvaldesv

pip install . --no-build-isolation

I did that before, with no good results. I am not sure if I need to upgrade CUDA from 11.8 to 12.4.
Run from the git repo:

PS C:\Users\15023\Documents\Models\Tiny> cd flash-attention

set MAX_JOBS=4

PS C:\Users\15023\Documents\Models\Tiny\flash-attention> pip install . --no-build-isolation
Processing c:\users\15023\documents\models\tiny\flash-attention
Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash_attn
Building wheel for flash_attn (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [274 lines of output]

  torch.__version__  = 2.3.1+cu118


  C:\Users\15023\Documents\Models\Tiny\.venv\lib\site-packages\setuptools\__init__.py:80: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  !!

          ********************************************************************************
          Requirements should be satisfied by a PEP 517 installer.
          If you are using pip, you can try `pip install --use-pep517`.
          ********************************************************************************

  !!
    dist.fetch_build_eggs(dist.setup_requires)
  running bdist_wheel
  Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu118torch2.3cxx11abiFALSE-cp310-cp310-win_amd64.whl

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

File "C:\Users\15023\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

RuntimeError: Error compiling objects for extension
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash_attn)

@abgulati

abgulati commented Jul 10, 2024

@Julianvaldesv mate you need to start reading those error messages!

The git issue has been resolved and the error has changed so there's progress. It's screaming at you to upgrade PIP:

********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
********************************************************************************

It's even giving you the command to use there and if that doesn't work, simply Google how to upgrade PIP!

It's also telling you your version of MSVS is unsupported: fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

Upgrade pip, then refer to the instructions in my repo to install VisualStudio Build Tools and try again: https://github.com/abgulati/LARS?tab=readme-ov-file#1-build-tools

@Julianvaldesv

@Julianvaldesv mate you need to start reading those error messages!

The git issue has been resolved and the error has changed so there's progress. It's screaming at you to upgrade PIP:

********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
********************************************************************************

It's even giving you the command to use there and if that doesn't work, simply Google how to upgrade PIP!

It's also telling you your version of MSVS is unsupported: fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

Upgrade pip, then refer to the instructions in my repo to install VisualStudio Build Tools and try again: https://github.com/abgulati/LARS?tab=readme-ov-file#1-build-tools

@abgulati my friend, thanks for your help. Something else is going on. I upgraded PIP days ago.

PS C:\Users\15023\Documents\Models\Tiny\flash-attention> python -m pip install --upgrade pip

Requirement already satisfied: pip in c:\users\15023\documents\models\tiny.venv\lib\site-packages (24.1.2)

Also I have installed the VisualStudio Build Tools 2022.
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\Bin\MSBuild

@abgulati

abgulati commented Jul 10, 2024

@Julianvaldesv In that case, try pasting this error in GPT-4/o or any other good LLM you have access to, describe the problem and background and see what it says

@dicksondickson

@Julianvaldesv You are upgrading pip in that tiny.venv. Seems like your system is a mess. Much easier and faster to nuke your system from orbit and start from scratch. Sometimes that's the only way.

@Julianvaldesv

I was able to compile and build from the source repository on Windows 11 with:

CUDA 12.5 Python 3.12

I have a Visual Studio 2019 that came with Windows and I've never used it.

pip install never worked for me.

What Torch version did you install that's compatible with CUDA 12.5? According to the PyTorch site, only 12.1 is fully supported (or 12.4 from source).

@i486

i486 commented Jul 18, 2024

Looks like oobabooga has Windows wheels for cu122, but sadly, no CU118 wheels.

https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"

https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10")

@pwillia7

If pip isn't working for you, you may need more RAM. I was not able to compile in any way on 16GB of RAM; pip worked fine after upgrading to 64GB. It took a few hours.

@SGrebenkin

SGrebenkin commented Sep 9, 2024

Windows 10 Pro x64
cuda 12.5
torch 2.4.1
RTX 4070 12GB VRAM, Core i5 14400F, 16GB RAM
python3.9 works

@dunbin

dunbin commented Sep 9, 2024 via email

@kairin

kairin commented Sep 14, 2024

Screenshot 2024-09-14 212805

Screenshot 2024-09-14 213505

It took me an hour and 15 minutes or so.

Initially I had an issue where the installation couldn't figure out where lcuda is located.

I installed PyTorch nightly 12.4
CUDA 12.6
Windows 11 - but using Ubuntu 24.04 in WSL2
NVIDIA 4080 16GB
