Might be a solution to get Flash Attention 2 built/compiled on Windows #595
Comments
I did try replacing your .h files in my venv, with
And the build failed fairly quickly. I have uninstalled ninja but it seems to be getting imported anyway. How did you manage to not use ninja? Also, I can't install your build since I'm on Python 3.10. Going to see if I manage to compile it.
EDIT: Tried with CUDA 12.2, no luck either.
EDIT 2: I managed to build it. I took your .h files and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D with 64 GB of RAM. It seems that for some reason Windows tries to use/import those variables even when they are not declared; but at the same time, if they are used some lines below, it doesn't work.
EDIT 3: I can confirm it works for exllamav2 + FA v2.
Without FA
With FA
|
This is very helpful, thanks @Akatsuki030 and @Panchovix. |
@tridao just tested the compilation with your latest push, and now it works. I did use
|
Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later. |
Great! I built a .whl with |
@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch version to build flash-attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. exllamav2 needs to be built with torch+cu121 as well. We have to be aware that ooba's webui ships by default with torch+cu118, so on Windows with that CUDA version it won't compile. |
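For anyone checking their setup against the requirements above, a minimal sketch (the CUDA 12.x threshold is what this thread reports for Windows builds, not an official support matrix):

```python
# Sketch: check that the local PyTorch build targets CUDA 12.x, which is what
# this thread reports as the practical minimum for building flash-attn on Windows.
import torch

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())

if torch.version.cuda is None or int(torch.version.cuda.split(".")[0]) < 12:
    print("This torch build is CPU-only or targets CUDA < 12; "
          "per the reports above, the Windows build will likely fail.")
```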
I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows. |
Just built on cuda 12.1 and tested with exllama_v2 on oobabooga's webui. And can confirm what @Panchovix said above, cuda 12.x is required for Cutlass (12.1 if you want pytorch v2.1). https://github.com/bdashore3/flash-attention/releases/tag/2.3.2 |
Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version. |
Right now GitHub Actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible). If you (or anyone) have experience with setting up GitHub Actions for Windows I'd love to get help there. |
You truly are a miracle worker! |
Works like a charm. I used:
I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (I previously set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took roughly 3 hours to compile everything with 16 GB of RAM. If you get a "ninja: build stopped: subcommand failed" error, do this: |
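For reference, a small sketch of driving the same install from Python with MAX_JOBS capped, so the setting is applied regardless of which shell you are in (the value 4 mirrors the comment above; tune it to your cores and RAM):

```python
# Sketch: install flash-attn with the number of parallel compile jobs capped.
# MAX_JOBS is read by PyTorch's C++/CUDA extension builder; fewer jobs means
# lower peak RAM while nvcc compiles the kernels.
import os
import subprocess
import sys

env = dict(os.environ, MAX_JOBS="4")
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
    env=env,
)
```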
Hey, I finally got the wheels built (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install flash-attention... Anyone got a solution? |
@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires Cuda 12.1 while ooba is still stuck on 11.8. |
I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. For flash attention I used Python 3.10, VS2019, and CUDA 12.1. |
You don't have to use layer_norm. |
However, I made it work. The trouble is in ln_bwd_kernels.cuh, line 54. For some unknown reason, BOOL_SWITCH did not work when turning bool has_colscale into constexpr bool HasColscaleConst, which caused error C2975. I just made it as
It's a stupid way to do it, but it works, and it's compiling now. |
Does it mean I can use FA2 on Windows if I build it from source? |
Any compiled wheel for Windows 11? My build fails with: note: This error originates from a subprocess, and is likely not a problem with pip. |
@Julianvaldesv You are upgrading pip in that tiny.venv. Seems like your system is a mess. Much easier and faster to nuke your system from orbit and start from scratch. Sometimes that's the only way. |
What Torch version did you install that is compatible with CUDA 12.5? According to the PyTorch site, only 12.1 is fully supported (or 12.4 from source). |
Looks like oobabooga has Windows wheels for cu122, but sadly, no cu118 wheels.
https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10" |
If pip isn't working for you, you may need more RAM. I was not able to compile at all with 16 GB of RAM; pip worked fine after upgrading to 64 GB. It took a few hours. |
Windows 10 Pro x64
(cs224u) E:\>pip show torch
Works fine. |
windows 10, python 3.11
(b) C:\Users\m\Desktop\2>pip show torch |
(b) C:\Users\m\Desktop\2>pip install filelock fsspec jinja2 networkx sympy typing-extensions |
(b) C:\Users\m\Desktop\2>pip list
einops 0.8.0
(b) C:\Users\m\Desktop\2> |
(b) C:\Users\m\Desktop\2>pip show flash_attn
(b) C:\Users\m\Desktop\2> |
|
(b) C:\Users\m\Desktop\2>pip show flash_attn
(b) C:\Users\m\Desktop\2>pip show torch |
How do I use flash_attn in Python code?
import torch
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") |
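For the question above, a minimal sketch of one way to run a model with flash-attn through transformers, assuming a transformers version new enough to accept attn_implementation, a model with FlashAttention-2 support, and a CUDA GPU (flash-attn requires fp16/bf16):

```python
# Sketch: load a model with FlashAttention-2 as the attention backend.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                 # flash-attn needs fp16 or bf16
    attn_implementation="flash_attention_2",   # falls back with an error if flash-attn is missing
).to("cuda")

inputs = tokenizer("Flash attention on Windows is", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```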
(b) C:\Users\m\Desktop\2>pip install wheel
(b) C:\Users\m\Desktop\2>pip install flash-attn --no-build-isolation --force-reinstall --no-cache-dir
× python setup.py bdist_wheel did not run successfully.
note: This error originates from a subprocess, and is likely not a problem with pip. |
(my10) C:\Users\TARGET STORE\Downloads>pip install flash_attn-2.3.2-cp310-cp310-win_amd64.whl
(my10) C:\Users\TARGET STORE\Downloads>python
|
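After installing a wheel like the one above, a quick smoke test inside that python session (a sketch; flash_attn_func expects fp16/bf16 CUDA tensors shaped (batch, seqlen, heads, headdim)):

```python
# Sketch: smoke-test a freshly installed flash-attn wheel.
import torch
import flash_attn
from flash_attn import flash_attn_func

print("flash_attn version:", flash_attn.__version__)

# Small random q/k/v in half precision on the GPU.
q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)
print("output shape:", tuple(out.shape))  # expected: (1, 128, 8, 64)
```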
Installed in Python 3.10 and in wsl |
win10 |
(my10) C:\Users\TARGET STORE\Downloads>pip list
Brotli 1.0.9
(my10) C:\Users\TARGET STORE\Downloads> |
Can someone who installed flash-attn put the libraries, their versions, and the steps in a text file, and show the Python code to import and use flash-attn? If possible, an executable file containing the libraries at the versions that worked for you. |
https://huggingface.co/lldacing/flash-attention-windows-wheel It never worked. |
(base) C:\Windows\system32>cd C:\Users\TARGET STORE\Desktop\1\flash-attention
(base) C:\Users\TARGET STORE\Desktop\1\flash-attention>conda activate my10
(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>pip innstall flash_attn-2.3.2-cp310-cp310-win_amd64.whl
(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>
(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>cd C:\Users\TARGET STORE\Downloads
(my10) C:\Users\TARGET STORE\Downloads>pip install flash_attn-2.3.2-cp310-cp310-win_amd64.whl
(my10) C:\Users\TARGET STORE\Downloads>python
(my10) C:\Users\TARGET STORE\Downloads>pip list
Brotli 1.0.9
(my10) C:\Users\TARGET STORE\Downloads> |
|
@werruww spamming this issue does not help, but to try and help: I note that I could not compile with CUDA 11.8. Upgrade to 12.1 or higher before rerunning "pip install flash-attn --no-build-isolation". |
cu128 torch and torchvision are out for Windows. Has anyone compiled for cu128 for the 5000-series GPUs? |
As a Windows user, I tried to compile this and found the problem was in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". When a template tried to reference the variable "Headdim", it caused error C2975. I think this might be the reason why we always get compile errors on Windows. Below is how I solved the problem.
First, in "flash_bwd_launch_template.h" you can find many functions like "run_mha_bwd_hdimXX", each with a constant declaration "Headdim == XX" and templates like run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did is replace every "Headdim" inside these templates with the literal value. For example, if the function is called run_mha_bwd_hdim128 and has the constant declaration "Headdim == 128", you change Headdim to 128 in the templates, which gives run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing for the "run_mha_fwd_hdimXX" functions and their templates.
Second, another error comes from "flash_fwd_launch_template.h", line 107: the same problem of referencing the constant "kBlockM" in the if-else statement below it, and I rewrote it to
Third, for the function "run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to change "kBlockM" in the template to 64. Then you can try to compile it.
These solutions look stupid but they really solved my problem: I successfully compiled flash_attn_2 on Windows, and I still need some time to test it on other computers.
I put the files I rewrote here: link.
I think there might be a better solution, but for me this at least works.
Oh, I didn't use Ninja and compiled it from source code; maybe someone can try compiling it with Ninja?
EDIT: I used