Might be a solution to get built/compiles Flash Attention 2 on Windows #595

Open
Akatsuki030 opened this issue Oct 8, 2023 · 76 comments

Akatsuki030 commented Oct 8, 2023

As a Windows user, I tried to compile this and found that the problem lies in two files, "flash_fwd_launch_template.h" and "flash_bwd_launch_template.h", under "./flash-attention/csrc/flash_attn/src". When a template tries to reference the variable "Headdim", it causes error C2975. I think this might be the reason we always get compile errors on Windows. Here is how I solved the problem:

First, in the file "flash_bwd_launch_template.h" you can find many functions like "run_mha_bwd_hdimXX", each with a constant declaration such as "Headdim = XX" and templates like run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 4, 2, 2, false, false, T>, Is_dropout>(params, stream, configure). What I did was replace every "Headdim" in those templates with its literal value. For example, if the function is called run_mha_bwd_hdim128 and declares "Headdim = 128", you change Headdim to 128 in the templates, giving run_flash_bwd<Flash_bwd_kernel_traits<128, 64, 128, 8, 2, 4, 2, false, false, T>, Is_dropout>(params, stream, configure). I did the same thing to the "run_mha_fwd_hdimXX" functions and their templates.

Second, another error comes from "flash_fwd_launch_template.h", line 107, again caused by referencing the constant "kBlockM" in the if-else statement below, and I rewrote it as:

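		// Note: the literals 4, 8, and 16 below are the kBlockM values the original code derived
		// from Kernel_traits::kHeadDim, and the 1..7 template argument is effectively log2 of the
		// num_splits bound handled by each branch.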
		if constexpr(Kernel_traits::kHeadDim % 128 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 4 - 1) / 4);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 4, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else if constexpr(Kernel_traits::kHeadDim % 64 == 0){
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 8 - 1) / 8);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 8, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}else{
			dim3 grid_combine((params.b * params.h * params.seqlen_q + 16 - 1) / 16);
			BOOL_SWITCH(is_even_K, IsEvenKConst, [&] {
				if (params.num_splits <= 2) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 1, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 4) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 2, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 8) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 3, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 16) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 4, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 32) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 5, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 64) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 6, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				} else if (params.num_splits <= 128) {
					flash_fwd_splitkv_combine_kernel<Kernel_traits, 16, 7, IsEvenKConst><<<grid_combine, Kernel_traits::kNThreads, 0, stream>>>(params);
				}
				C10_CUDA_KERNEL_LAUNCH_CHECK();
			});
		}

Third, for the function "run_mha_fwd_splitkv_dispatch" in "flash_fwd_launch_template.h", line 194, you also have to change "kBlockM" in the template to 64. After that you can try to compile it.
These solutions look stupid but really solved my problem: I successfully compiled flash_attn_2 on Windows. I still need to take some time to test it on other computers.
I put the files I rewrote: link.
I think there might be a better solution, but for me, it at least works.
Oh, I didn't use Ninja and compiled it from source; maybe someone can try to compile it with Ninja?
EDIT: I used

  • python 3.11
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Anaconda
  • Windows 11 22H2
Akatsuki030 changed the title from "Migbe be a solution to get built/compiles Flash Attention 2 on Windows" to "Might be a solution to get built/compiles Flash Attention 2 on Windows" on Oct 8, 2023
Panchovix commented Oct 8, 2023

I did try replacing your .h files in my venv, with

  • Python 3.10
  • Pytorch 2.2 Nightly
  • CUDA 12.1
  • Visual Studio 2022
  • Ninja

And the build failed fairly quickly. I have uninstalled Ninja but it seems to be importing it anyway? How did you manage to not use Ninja?

Also, I can't install your build since I'm on Python 3.10. Gonna see if I manage to compile it.

EDIT: Tried with CUDA 12.2, no luck either.

EDIT2: I managed to build it. I took your .h codes and uncommented the variable declarations, and then it worked. It took ~30 minutes on a 7800X3D and 64GB RAM.

It seems that for some reason Windows tries to use/import those variables even when they're not declared. But, at the same time, if they're used a few lines below, it doesn't work.


EDIT3: I can confirm it works for exllamav2 + FA v2

Without FA

-- Measuring token speed...
 ** Position     1 + 127 tokens:   13.5848 t/s
 ** Position   128 + 128 tokens:   13.8594 t/s
 ** Position   256 + 128 tokens:   14.1394 t/s
 ** Position   384 + 128 tokens:   13.8138 t/s
 ** Position   512 + 128 tokens:   13.4949 t/s
 ** Position   640 + 128 tokens:   13.6474 t/s
 ** Position   768 + 128 tokens:   13.7073 t/s
 ** Position   896 + 128 tokens:   12.3254 t/s
 ** Position  1024 + 128 tokens:   13.8960 t/s
 ** Position  1152 + 128 tokens:   13.7677 t/s
 ** Position  1280 + 128 tokens:   12.9869 t/s
 ** Position  1408 + 128 tokens:   12.1336 t/s
 ** Position  1536 + 128 tokens:   13.0463 t/s
 ** Position  1664 + 128 tokens:   13.2463 t/s
 ** Position  1792 + 128 tokens:   12.6211 t/s
 ** Position  1920 + 128 tokens:   13.1429 t/s
 ** Position  2048 + 128 tokens:   12.5674 t/s
 ** Position  2176 + 128 tokens:   12.5847 t/s
 ** Position  2304 + 128 tokens:   13.3471 t/s
 ** Position  2432 + 128 tokens:   12.9135 t/s
 ** Position  2560 + 128 tokens:   12.2195 t/s
 ** Position  2688 + 128 tokens:   11.6120 t/s
 ** Position  2816 + 128 tokens:   11.2545 t/s
 ** Position  2944 + 128 tokens:   11.5304 t/s
 ** Position  3072 + 128 tokens:   11.7982 t/s
 ** Position  3200 + 128 tokens:   11.8041 t/s
 ** Position  3328 + 128 tokens:   12.8038 t/s
 ** Position  3456 + 128 tokens:   12.7324 t/s
 ** Position  3584 + 128 tokens:   11.7733 t/s
 ** Position  3712 + 128 tokens:   10.7961 t/s
 ** Position  3840 + 128 tokens:   11.1014 t/s
 ** Position  3968 + 128 tokens:   10.8474 t/s

With FA

-- Measuring token speed...
** Position     1 + 127 tokens:   22.6606 t/s
** Position   128 + 128 tokens:   22.5140 t/s
** Position   256 + 128 tokens:   22.6111 t/s
** Position   384 + 128 tokens:   22.6027 t/s
** Position   512 + 128 tokens:   22.3392 t/s
** Position   640 + 128 tokens:   22.0570 t/s
** Position   768 + 128 tokens:   22.3728 t/s
** Position   896 + 128 tokens:   22.4983 t/s
** Position  1024 + 128 tokens:   21.9384 t/s
** Position  1152 + 128 tokens:   22.3509 t/s
** Position  1280 + 128 tokens:   22.3189 t/s
** Position  1408 + 128 tokens:   22.2739 t/s
** Position  1536 + 128 tokens:   22.4145 t/s
** Position  1664 + 128 tokens:   21.9608 t/s
** Position  1792 + 128 tokens:   21.7645 t/s
** Position  1920 + 128 tokens:   22.1468 t/s
** Position  2048 + 128 tokens:   22.3400 t/s
** Position  2176 + 128 tokens:   21.9830 t/s
** Position  2304 + 128 tokens:   21.8387 t/s
** Position  2432 + 128 tokens:   20.2306 t/s
** Position  2560 + 128 tokens:   21.0056 t/s
** Position  2688 + 128 tokens:   22.2157 t/s
** Position  2816 + 128 tokens:   22.1912 t/s
** Position  2944 + 128 tokens:   22.1835 t/s
** Position  3072 + 128 tokens:   22.1393 t/s
** Position  3200 + 128 tokens:   22.1182 t/s
** Position  3328 + 128 tokens:   22.0821 t/s
** Position  3456 + 128 tokens:   22.0308 t/s
** Position  3584 + 128 tokens:   22.0060 t/s
** Position  3712 + 128 tokens:   21.9909 t/s
** Position  3840 + 128 tokens:   21.9816 t/s
** Position  3968 + 128 tokens:   21.9757 t/s

tridao (Member) commented Oct 8, 2023

This is very helpful, thanks @Akatsuki030 and @Panchovix.
@Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with constexpr static int instead of constexpr int? I've just pushed a commit that does that. Can you check if that compiles on Windows?
A while back someone (I think it was Daniel Haziza from the xformers team) told me that they need constexpr static int for Windows compilation.
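For anyone following along, here's a stand-alone sketch of the pattern (KernelTraits / bool_switch / launch are stand-ins for illustration, not the actual launch-template code):

#include <cstdio>

template <int kHeadDim>
struct KernelTraits { static constexpr int head_dim = kHeadDim; };

template <typename Traits>
void launch() { std::printf("head_dim=%d\n", Traits::head_dim); }

template <typename F>
void bool_switch(bool, F f) { f(); }  // stand-in for the BOOL_SWITCH macro

void run_mha_fwd_hdim128(bool is_even_k) {
    // 'constexpr int Headdim = 128;' is the form that reportedly trips error C2975 on MSVC
    // when the constant is referenced inside the lambda below; making it static means the
    // lambda doesn't need to capture it, and that's the form the commit switches to.
    constexpr static int Headdim = 128;
    bool_switch(is_even_k, [&] {
        launch<KernelTraits<Headdim>>();  // template argument names the local constant
    });
}

int main() { run_mha_fwd_hdim128(true); }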

Panchovix commented Oct 9, 2023

@tridao just tested the compilation with your latest push, and now it works.

I did use

  • Python 3.10
  • Pytorch 2.2+cu121 Nightly
  • CUDA 12.2
  • Visual Studio 2022
  • Ninja

tridao (Member) commented Oct 9, 2023

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

@Panchovix

> Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

Great! I did build a whl with python setup.py bdist_wheel, but it seems some people have issues; it is here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.

Panchovix commented Oct 9, 2023

@tridao based on some tests, it seems you need at least CUDA 12.x and a matching torch version to build flash-attn 2 on Windows, or even to use the wheel. CUDA 11.8 fails to build. Exllamav2 needs to be built with torch+cu121 as well.

We have to be aware that ooba webui comes by default with torch+cu118, so on Windows with that CUDA version it won't compile.

tridao (Member) commented Oct 9, 2023

I see, thanks for the confirmation. I guess we rely on Cutlass and Cutlass requires CUDA 12.x to build on Windows.

kingbri1 commented Oct 9, 2023

Just built on cuda 12.1 and tested with exllama_v2 on oobabooga's webui. And can confirm what @Panchovix said above, cuda 12.x is required for Cutlass (12.1 if you want pytorch v2.1).

https://github.com/bdashore3/flash-attention/releases/tag/2.3.2

kingbri1 commented Oct 9, 2023

Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.

tridao (Member) commented Oct 9, 2023

> Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version.

Right now github actions only builds for Linux. We intentionally don't build with CUDA 12.1 (due to some segfault with nvcc), but when installing on CUDA 12.1, setup.py will download the wheel for 12.2 and use that (they're compatible).

If you (or anyone) have experience with setting up github actions for Windows I'd love to get help there.

dunbin commented Oct 9, 2023

> Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead of having to compile locally, but that can wait till later.

> Great! I did build a whl with python setup.py bdist_wheel, but it seems some people have issues; it is here in any case: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. Probably a missing step for now.

You truly are a miracle worker!

mattiamazzari commented Oct 11, 2023

Works like a charm. I used:

  • CUDA 12.2
  • PyTorch 2.2.0.dev20231011+cu121 (installed with the command pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121). Be sure you install this CUDA version and not the CPU version.

I have a CPU with 6 cores, so I set the environment variable MAX_JOBS to 4 (previously I had set it to 6 but got an out-of-memory error); remember to restart your computer after you set it. It took about 3 hours to compile everything with 16GB of RAM.

If you get a "ninja: build stopped: subcommand failed" error, do this:
git clean -xdf
python setup.py clean
git submodule sync
git submodule deinit -f .
git submodule update --init --recursive
python setup.py install

@YuehChuan

GOOD🎶
RTX4090 24GB VRAM, AMD 7950X, 64GB RAM
python3.8 and python3.10 both work

python3.10
https://www.python.org/downloads/release/python-3100/
win11

python -m venv venv

cd venv/Scripts
activate
-----------------------

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention

pip install packaging 
pip install wheel

set MAX_JOBS=4
python setup.py install
flashattention2

@Nicoolodion2

Hey, got it to build the wheels finally (on Windows), but oobabooga's webui still doesn't detect it... It still gives me the message to install Flash-attention... Anyone got a solution?

@kingbri1

@Nicoolodion2 Use my PR until ooba merges it. FA2 on Windows requires Cuda 12.1 while ooba is still stuck on 11.8.

neocao123 commented Oct 18, 2023

I'm trying to use flash attention in modelscope-agent, which needs layer_norm and rotary. Now flash attention and rotary have been built from @bdashore3's branch, while layer_norm still fails with an error.

I used py3.10, VS2019, CUDA 12.1.

tridao (Member) commented Oct 18, 2023

You don't have to use layer_norm.

neocao123 commented Oct 18, 2023

> You don't have to use layer_norm.

However, I made it work.

The trouble is in ln_bwd_kernels.cuh, line 54.

For some unknown reason, BOOL_SWITCH did not work when turning bool has_colscale into constexpr bool HasColscaleConst, which caused error C2975. I just rewrote it as:

if (HasColscaleConst) {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               true,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;

    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
} else {
    using Kernel_traits_f = layer_norm::Kernel_traits_finalize<HIDDEN_SIZE,
                                                               weight_t,
                                                               input_t,
                                                               residual_t,
                                                               output_t,
                                                               compute_t,
                                                               index_t,
                                                               false,
                                                               32 * 32,  // THREADS_PER_CTA
                                                               BYTES_PER_LDG_FINAL>;

    auto kernel_f = &layer_norm::ln_bwd_finalize_kernel<Kernel_traits_f, HasColscaleConst, IsEvenColsConst>;
    kernel_f<<<Kernel_traits_f::CTAS, Kernel_traits_f::THREADS_PER_CTA, 0, stream>>>(launch_params.params);
}
That's a stupid way, but it works, and it's compiling now.

@havietisov

Does this mean I can use FA2 on Windows if I build it from source?

dunbin commented Dec 14, 2023 via email

Piscabo commented Jan 10, 2024

Any compiled wheel for Windows 11,
Python 3.11
Cuda 12.2
Torch 2.1.2

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Could not build wheels for flash_attn, which is required to install pyproject.toml-based projects

dunbin commented Jan 10, 2024 via email

@dicksondickson

@Julianvaldesv You are upgrading pip in that tiny.venv. Seems like your system is a mess. Much easier and faster to nuke your system from orbit and start from scratch. Sometimes that's the only way.

@Julianvaldesv

I was able to compile and build from the source repository on Windows 11 with:

CUDA 12.5 Python 3.12

I have a Visual Studio 2019 that came with Windows and I've never used it.

pip install never not worked for me.

What Torch version did you install that is compatible with CUDA 12.5? According to the Pytorch site, only 12.1 is fully supported (or 12.4 from source).

i486 commented Jul 18, 2024

Looks like oobabooga has Windows wheels for cu122, but sadly, no CU118 wheels.

https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"

https://github.com/oobabooga/flash-attention/releases/download/v2.6.1/flash_attn-2.6.1+cu122torch2.2.2cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10")

@pwillia7

If pip isn't working for you, you may need more RAM. I was not able to compile in any way on 16GB of RAM; pip worked fine after upgrading to 64GB -- took a few hours.

SGrebenkin commented Sep 9, 2024

Windows 10 Pro x64
cuda 12.5
torch 2.4.1 + cu124
RTX4070 12GB VRAM, Core i5 14400F, 16GB RAM
python3.9 / jupyter

(cs224u) E:>pip show torch
Name: torch
Version: 2.4.1+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: c:\users\sgrebenkin.conda\envs\cs224u\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, flash_attn, torchaudio, torchmetrics, torchvision

Works fine.

dunbin commented Sep 9, 2024 via email

kairin commented Sep 14, 2024

[screenshots]

It took me an hour 15 minutes or so.

Initially I had an issue whereby the installation couldn't figure out where lcuda is located.

I installed:
pytorch nightly (CUDA 12.4)
CUDA 12.6
Windows 11, but using Ubuntu 24.04 in WSL2
NVIDIA 4080 16GB

werruww commented Jan 16, 2025

[two screenshots]

dunbin commented Jan 16, 2025 via email

werruww commented Jan 16, 2025

windows 10

python 3.11
rtx4060ti 16gb
cuda 11.8
(b) C:\Users\m\Desktop\2>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

(b) C:\Users\m\Desktop\2>pip show torch
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: C:\Users\m\Desktop\2\b\Lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: flash_attn

werruww commented Jan 16, 2025

(b) C:\Users\m\Desktop\2>pip install filelock fsspec jinja2 networkx sympy typing-extensions
Requirement already satisfied: filelock in c:\users\m\desktop\2\b\lib\site-packages (3.16.1)
Requirement already satisfied: fsspec in c:\users\m\desktop\2\b\lib\site-packages (2024.12.0)
Requirement already satisfied: jinja2 in c:\users\m\desktop\2\b\lib\site-packages (3.1.5)
Requirement already satisfied: networkx in c:\users\m\desktop\2\b\lib\site-packages (3.4.2)
Requirement already satisfied: sympy in c:\users\m\desktop\2\b\lib\site-packages (1.13.1)
Requirement already satisfied: typing-extensions in c:\users\m\desktop\2\b\lib\site-packages (4.12.2)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\m\desktop\2\b\lib\site-packages (from jinja2) (3.0.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\users\m\desktop\2\b\lib\site-packages (from sympy) (1.3.0)

werruww commented Jan 16, 2025

(b) C:\Users\m\Desktop\2>pip list
Package Version


einops 0.8.0
filelock 3.16.1
flash_attn 2.6.1
fsspec 2024.12.0
Jinja2 3.1.5
MarkupSafe 3.0.2
mpmath 1.3.0
networkx 3.4.2
ninja 1.11.1.3
packaging 24.2
pip 24.3.1
setuptools 65.5.0
sympy 1.13.1
torch 2.5.1
typing_extensions 4.12.2

(b) C:\Users\m\Desktop\2>

werruww commented Jan 16, 2025

(b) C:\Users\m\Desktop\2>pip show flash_attn
Name: flash_attn
Version: 2.6.1
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: C:\Users\m\Desktop\2\b\Lib\site-packages
Requires: einops, torch
Required-by:

(b) C:\Users\m\Desktop\2>

werruww commented Jan 16, 2025

werruww commented Jan 17, 2025

from flash_attn import flash_attn_qkvpacked_func, flash_attn_func
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\m\Desktop\2\b\Lib\site-packages\flash_attn\__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "C:\Users\m\Desktop\2\b\Lib\site-packages\flash_attn\flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: DLL load failed while importing flash_attn_2_cuda: The specified module could not be found.

werruww commented Jan 17, 2025

(b) C:\Users\m\Desktop\2>pip show flash_attn
Name: flash_attn
Version: 2.6.1
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: C:\Users\m\Desktop\2\b\Lib\site-packages
Requires: einops, torch
Required-by:

(b) C:\Users\m\Desktop\2>pip show torch
Name: torch
Version: 2.2.2+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: C:\Users\m\Desktop\2\b\Lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: flash_attn, torchaudio, torchvision

werruww commented Jan 17, 2025

How to use flash_attn in Python code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", attn_implementation="flash_attention_2").to('cuda')

werruww commented Jan 17, 2025

(b) C:\Users\m\Desktop\2>pip install wheel
Collecting wheel
Downloading wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
Downloading wheel-0.45.1-py3-none-any.whl (72 kB)
Installing collected packages: wheel
Successfully installed wheel-0.45.1

(b) C:\Users\m\Desktop\2>pip install flash-attn --no-build-isolation --force-reinstall --no-cache-dir
Collecting flash-attn
Downloading flash_attn-2.7.3.tar.gz (3.2 MB)
---------------------------------------- 3.2/3.2 MB 3.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting torch (from flash-attn)
Downloading torch-2.5.1-cp311-cp311-win_amd64.whl.metadata (28 kB)
Collecting einops (from flash-attn)
Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Collecting filelock (from torch->flash-attn)
Downloading filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions>=4.8.0 (from torch->flash-attn)
Downloading typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting networkx (from torch->flash-attn)
Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch->flash-attn)
Downloading jinja2-3.1.5-py3-none-any.whl.metadata (2.6 kB)
Collecting fsspec (from torch->flash-attn)
Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting sympy==1.13.1 (from torch->flash-attn)
Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch->flash-attn)
Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch->flash-attn)
Downloading MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl.metadata (4.1 kB)
Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Downloading torch-2.5.1-cp311-cp311-win_amd64.whl (203.1 MB)
---------------------------------------- 203.1/203.1 MB 3.6 MB/s eta 0:00:00
Downloading sympy-1.13.1-py3-none-any.whl (6.2 MB)
---------------------------------------- 6.2/6.2 MB 3.6 MB/s eta 0:00:00
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Downloading filelock-3.16.1-py3-none-any.whl (16 kB)
Downloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
Downloading jinja2-3.1.5-py3-none-any.whl (134 kB)
Downloading networkx-3.4.2-py3-none-any.whl (1.7 MB)
---------------------------------------- 1.7/1.7 MB 3.7 MB/s eta 0:00:00
Downloading MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl (15 kB)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
---------------------------------------- 536.2/536.2 kB 4.4 MB/s eta 0:00:00
Building wheels for collected packages: flash-attn
Building wheel for flash-attn (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [126 lines of output]

  A module that was compiled using NumPy 1.x cannot be run in
  NumPy 2.2.1 as it may crash. To support both 1.x and 2.x
  versions of NumPy, modules must be compiled with NumPy 2.0.
  Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

  If you are a user of the module, the easiest solution will be to
  downgrade to 'numpy<2' or try to upgrade the affected module.
  We expect that some modules will need time to support NumPy 2.

  Traceback (most recent call last):  File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "C:\Users\m\AppData\Local\Temp\pip-install-gxcdk1m4\flash-attn_87a632e88ffe4203993fcedb5d7a2d41\setup.py", line 22, in <module>
      import torch
    File "C:\Users\m\Desktop\2\b\Lib\site-packages\torch\__init__.py", line 1477, in <module>
      from .functional import *  # noqa: F403
    File "C:\Users\m\Desktop\2\b\Lib\site-packages\torch\functional.py", line 9, in <module>
      import torch.nn.functional as F
    File "C:\Users\m\Desktop\2\b\Lib\site-packages\torch\nn\__init__.py", line 1, in <module>
      from .modules import *  # noqa: F403
    File "C:\Users\m\Desktop\2\b\Lib\site-packages\torch\nn\modules\__init__.py", line 35, in <module>
      from .transformer import TransformerEncoder, TransformerDecoder, \
    File "C:\Users\m\Desktop\2\b\Lib\site-packages\torch\nn\modules\transformer.py", line 20, in <module>
      device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
  C:\Users\m\Desktop\2\b\Lib\site-packages\torch\nn\modules\transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:84.)
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
  fatal: not a git repository (or any of the parent directories): .git


  torch.__version__  = 2.2.2+cu118


  C:\Users\m\Desktop\2\b\Lib\site-packages\setuptools\installer.py:27: SetuptoolsDeprecationWarning: setuptools.installer is deprecated. Requirements should be satisfied by a PEP 517 installer.
    warnings.warn(
  running bdist_wheel
  Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu11torch2.2cxx11abiFALSE-cp311-cp311-win_amd64.whl
  Precompiled wheel not found. Building from source...
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\bert_padding.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\flash_attn_interface.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\flash_attn_triton.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\flash_attn_triton_og.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\flash_blocksparse_attention.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\flash_blocksparse_attn_interface.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\fused_softmax.py -> build\lib.win-amd64-cpython-311\flash_attn
  copying flash_attn\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn
  creating build\lib.win-amd64-cpython-311\hopper
  copying hopper\benchmark_attn.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\benchmark_flash_attention_fp8.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\benchmark_split_kv.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\flash_attn_interface.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\generate_kernels.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\padding.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\setup.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\test_attn_kvcache.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\test_flash_attn.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\test_kvcache.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\test_util.py -> build\lib.win-amd64-cpython-311\hopper
  copying hopper\__init__.py -> build\lib.win-amd64-cpython-311\hopper
  creating build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\bench.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\bwd_prefill.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\bwd_ref.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\fwd_decode.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\fwd_prefill.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\fwd_ref.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\interface_fa.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\interface_torch.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\test.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\utils.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  copying flash_attn\flash_attn_triton_amd\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\flash_attn_triton_amd
  creating build\lib.win-amd64-cpython-311\flash_attn\layers
  copying flash_attn\layers\patch_embed.py -> build\lib.win-amd64-cpython-311\flash_attn\layers
  copying flash_attn\layers\rotary.py -> build\lib.win-amd64-cpython-311\flash_attn\layers
  copying flash_attn\layers\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\layers
  creating build\lib.win-amd64-cpython-311\flash_attn\losses
  copying flash_attn\losses\cross_entropy.py -> build\lib.win-amd64-cpython-311\flash_attn\losses
  copying flash_attn\losses\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\losses
  creating build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\baichuan.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\bert.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\bigcode.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\btlm.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\falcon.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\gpt.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\gptj.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\gpt_neox.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\llama.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\opt.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\vit.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  copying flash_attn\models\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\models
  creating build\lib.win-amd64-cpython-311\flash_attn\modules
  copying flash_attn\modules\block.py -> build\lib.win-amd64-cpython-311\flash_attn\modules
  copying flash_attn\modules\embedding.py -> build\lib.win-amd64-cpython-311\flash_attn\modules
  copying flash_attn\modules\mha.py -> build\lib.win-amd64-cpython-311\flash_attn\modules
  copying flash_attn\modules\mlp.py -> build\lib.win-amd64-cpython-311\flash_attn\modules
  copying flash_attn\modules\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\modules
  creating build\lib.win-amd64-cpython-311\flash_attn\ops
  copying flash_attn\ops\activations.py -> build\lib.win-amd64-cpython-311\flash_attn\ops
  copying flash_attn\ops\fused_dense.py -> build\lib.win-amd64-cpython-311\flash_attn\ops
  copying flash_attn\ops\layer_norm.py -> build\lib.win-amd64-cpython-311\flash_attn\ops
  copying flash_attn\ops\rms_norm.py -> build\lib.win-amd64-cpython-311\flash_attn\ops
  copying flash_attn\ops\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\ops
  creating build\lib.win-amd64-cpython-311\flash_attn\utils
  copying flash_attn\utils\benchmark.py -> build\lib.win-amd64-cpython-311\flash_attn\utils
  copying flash_attn\utils\distributed.py -> build\lib.win-amd64-cpython-311\flash_attn\utils
  copying flash_attn\utils\generation.py -> build\lib.win-amd64-cpython-311\flash_attn\utils
  copying flash_attn\utils\pretrained.py -> build\lib.win-amd64-cpython-311\flash_attn\utils
  copying flash_attn\utils\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\utils
  creating build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\cross_entropy.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\k_activations.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\layer_norm.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\linear.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\mlp.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\rotary.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  copying flash_attn\ops\triton\__init__.py -> build\lib.win-amd64-cpython-311\flash_attn\ops\triton
  running build_ext
  C:\Users\m\Desktop\2\b\Lib\site-packages\torch\utils\cpp_extension.py:381: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
    warnings.warn(f'Error checking compiler version for {compiler}: {error}')
  building 'flash_attn_2_cuda' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)

werruww commented Jan 31, 2025

(my10) C:\Users\TARGET STORE\Downloads>pip install flash_attn-2.3.2-cp310-cp310-win_amd64.whl
Processing c:\users\target store\downloads\flash_attn-2.3.2-cp310-cp310-win_amd64.whl
Requirement already satisfied: torch in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (2.1.0+cu121)
Requirement already satisfied: einops in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (0.8.0)
Requirement already satisfied: packaging in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (24.2)
Requirement already satisfied: ninja in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (1.11.1.3)
Requirement already satisfied: filelock in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.13.1)
Requirement already satisfied: typing-extensions in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (4.12.2)
Requirement already satisfied: sympy in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (1.13.1)
Requirement already satisfied: networkx in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.3)
Requirement already satisfied: jinja2 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.1.4)
Requirement already satisfied: fsspec in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (2024.6.1)
Requirement already satisfied: MarkupSafe>=2.0 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from jinja2->torch->flash-attn==2.3.2) (2.1.3)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from sympy->torch->flash-attn==2.3.2) (1.3.0)
Installing collected packages: flash-attn
Attempting uninstall: flash-attn
Found existing installation: flash_attn 2.7.0.post2
Uninstalling flash_attn-2.7.0.post2:
Successfully uninstalled flash_attn-2.7.0.post2
Successfully installed flash-attn-2.3.2

(my10) C:\Users\TARGET STORE\Downloads>python
Python 3.10.16 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:19:12) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import flash_attn_interface
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'flash_attn_interface'

dunbin commented Jan 31, 2025 via email

werruww commented Jan 31, 2025

Installed it in Python 3.10 and in WSL.
It does not import.

werruww commented Jan 31, 2025

win10
ubuntu wsl
python 3.10
Visual Studio 2022

werruww commented Jan 31, 2025

(my10) C:\Users\TARGET STORE\Downloads>pip list
Package Version


Brotli 1.0.9
certifi 2022.12.7
charset-normalizer 2.1.1
einops 0.8.0
filelock 3.13.1
flash-attn 2.3.2
fsspec 2024.6.1
gmpy2 2.2.1
idna 3.4
Jinja2 3.1.4
MarkupSafe 2.1.3
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.3
numpy 2.0.0
packaging 24.2
pillow 11.0.0
pip 25.0
psutil 6.1.1
pybind11 2.12.0
PySocks 1.7.1
PyYAML 6.0.2
requests 2.28.1
setuptools 75.8.0
sympy 1.13.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
typing_extensions 4.12.2
urllib3 1.26.13
wheel 0.45.1
win-inet-pton 1.1.0

(my10) C:\Users\TARGET STORE\Downloads>

werruww commented Jan 31, 2025

Can someone who installed flash-attn put the libraries, their versions, and the steps in a text file, and show the Python code to import and use it?

If possible, also an archive containing the libraries in the versions you worked with.

werruww commented Jan 31, 2025

werruww commented Jan 31, 2025

(base) C:\Windows\system32>cd C:\Users\TARGET STORE\Desktop\1\flash-attention

(base) C:\Users\TARGET STORE\Desktop\1\flash-attention>conda activate my10

(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>pip innstall flash_attn-2.3.2-cp310-cp310-win_amd64.whl
ERROR: unknown command "innstall" - maybe you meant "install"

(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>
(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>pip install flash_attn-2.3.2-cp310-cp310-win_amd64.whl
WARNING: Requirement 'flash_attn-2.3.2-cp310-cp310-win_amd64.whl' looks like a filename, but the file does not exist
Processing c:\users\target store\desktop\1\flash-attention\flash_attn-2.3.2-cp310-cp310-win_amd64.whl
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'C:\Users\TARGET STORE\Desktop\1\flash-attention\flash_attn-2.3.2-cp310-cp310-win_amd64.whl'

(my10) C:\Users\TARGET STORE\Desktop\1\flash-attention>cd C:\Users\TARGET STORE\Downloads

(my10) C:\Users\TARGET STORE\Downloads>pip install flash_attn-2.3.2-cp310-cp310-win_amd64.whl
Processing c:\users\target store\downloads\flash_attn-2.3.2-cp310-cp310-win_amd64.whl
Requirement already satisfied: torch in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (2.1.0+cu121)
Requirement already satisfied: einops in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (0.8.0)
Requirement already satisfied: packaging in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (24.2)
Requirement already satisfied: ninja in c:\programdata\anaconda3\envs\my10\lib\site-packages (from flash-attn==2.3.2) (1.11.1.3)
Requirement already satisfied: filelock in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.13.1)
Requirement already satisfied: typing-extensions in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (4.12.2)
Requirement already satisfied: sympy in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (1.13.1)
Requirement already satisfied: networkx in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.3)
Requirement already satisfied: jinja2 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (3.1.4)
Requirement already satisfied: fsspec in c:\programdata\anaconda3\envs\my10\lib\site-packages (from torch->flash-attn==2.3.2) (2024.6.1)
Requirement already satisfied: MarkupSafe>=2.0 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from jinja2->torch->flash-attn==2.3.2) (2.1.3)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\programdata\anaconda3\envs\my10\lib\site-packages (from sympy->torch->flash-attn==2.3.2) (1.3.0)
Installing collected packages: flash-attn
Attempting uninstall: flash-attn
Found existing installation: flash_attn 2.7.0.post2
Uninstalling flash_attn-2.7.0.post2:
Successfully uninstalled flash_attn-2.7.0.post2
Successfully installed flash-attn-2.3.2

(my10) C:\Users\TARGET STORE\Downloads>python
Python 3.10.16 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:19:12) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import flash_attn_interface
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'flash_attn_interface'
exit()

(my10) C:\Users\TARGET STORE\Downloads>pip list
Package Version


Brotli 1.0.9
certifi 2022.12.7
charset-normalizer 2.1.1
einops 0.8.0
filelock 3.13.1
flash-attn 2.3.2
fsspec 2024.6.1
gmpy2 2.2.1
idna 3.4
Jinja2 3.1.4
MarkupSafe 2.1.3
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.3
numpy 2.0.0
packaging 24.2
pillow 11.0.0
pip 25.0
psutil 6.1.1
pybind11 2.12.0
PySocks 1.7.1
PyYAML 6.0.2
requests 2.28.1
setuptools 75.8.0
sympy 1.13.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
typing_extensions 4.12.2
urllib3 1.26.13
wheel 0.45.1
win-inet-pton 1.1.0

(my10) C:\Users\TARGET STORE\Downloads>

werruww commented Jan 31, 2025

import flash_attn
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\anaconda3\envs\my10\lib\site-packages\flash_attn\__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "C:\ProgramData\anaconda3\envs\my10\lib\site-packages\flash_attn\flash_attn_interface.py", line 3, in <module>
    import torch
  File "C:\ProgramData\anaconda3\envs\my10\lib\site-packages\torch\__init__.py", line 137, in <module>
    raise err
OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\ProgramData\anaconda3\envs\my10\lib\site-packages\torch\lib\nvfuser_codegen.dll" or one of its dependencies.
import flash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'flash'

@LucisVivae

@werruww spamming this issue does not help, but to try and help: I note that I could not compile with CUDA 11.8. Upgrade to 12.1 or higher before rerunning "pip install flash-attn --no-build-isolation".

dunbin commented Feb 21, 2025 via email

@FurkanGozukara

cu128 torchvision and torch are out for Windows.

Has anyone compiled for cu128 for 5000-series GPUs?
