Accelerate SDXL VAE using NATTEN local neighbourhood attention #3
Conversation
```
…vs/p311-sdxl/bin/python /home/birch/.vscode-server/extensions/ms-python.python-2023.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 58517 -- -m scripts.vae_roundtrip
```
okay so installing this again was a wild ride. maybe it got harder in the latest version of cmake (v3.30.2)? maybe specifically because FindCUDA got removed?

Anyway, cmake was failing the build on pretty much the very first line. To get a look at the true error, I modified setup.py:

```diff
  f"-DNATTEN_CUDA_ARCH_LIST={cuda_arch_list_str}",
+ f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch_list_str}",
```

This got it to stop complaining about an empty `CMAKE_CUDA_ARCHITECTURES`.
nvcc compilation failed because my gcc and g++ didn't point anywhere. So I used:

```bash
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
```

Each of these commands told me that it'd been broken until now.
On the next build attempt, it complained about a missing `CUDA_CUDART_LIBRARY`.
So I made a different change to setup.py (I didn't need to keep the `CMAKE_CUDA_ARCHITECTURES` line from before):

```diff
  f"-DNATTEN_CUDA_ARCH_LIST={cuda_arch_list_str}",
+ f"-DCUDA_CUDART_LIBRARY=/usr/local/cuda/lib64/libcudart.so",
```

This finally got it compiling!
=====

To get a better look at the commands that `python setup.py` runs, I built it the way they build official distributions. First I needed wheel:

```bash
pip install wheel
```

Then I could use:

```bash
NATTEN_CUDA_ARCH=8.9 NATTEN_VERBOSE=0 NATTEN_IS_BUILDING_DIST=1 NATTEN_WITH_CUDA=1 NATTEN_N_WORKERS=4 python setup.py bdist_wheel -d out/wheels/cu121/torch/240
```

The CUDA and torch versions were determined like so:

```python
from torch import version, __version__
from sys import version_info  # could supply a python tag too; unused below

# https://github.com/SHI-Labs/NATTEN/blob/7bc099de2a307e23903bb4f8ca1ca36c9df54cef/setup.py#L121
cuda_tag = "".join(version.cuda.split(".")[:2])
torch_tag = "".join(__version__.split("+", maxsplit=1)[0].split("."))
```
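For instance, with the (hypothetical) version strings `12.1` and `2.4.0+cu121`, those expressions evaluate to the `cu121`/`240` tags used in the wheel output path above:

```python
# Worked example with hypothetical version strings:
cuda = "12.1"              # what torch.version.cuda might report
torch_ver = "2.4.0+cu121"  # what torch.__version__ might report

cuda_tag = "".join(cuda.split(".")[:2])                              # "121"
torch_tag = "".join(torch_ver.split("+", maxsplit=1)[0].split("."))  # "240"
assert (cuda_tag, torch_tag) == ("121", "240")
```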
Accelerate SDXL VAE using NATTEN local neighbourhood attention
Thanks @crowsonkb for the idea!
Install natten from source like so:
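For example (a sketch, not necessarily the exact commands used for this PR; assumes a working CUDA toolchain and the same arch/worker env vars as the build notes above):

```bash
git clone https://github.com/SHI-Labs/NATTEN
cd NATTEN
NATTEN_CUDA_ARCH=8.9 NATTEN_WITH_CUDA=1 NATTEN_N_WORKERS=4 pip install -e .
```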
or get latest stable from pip:
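```bash
pip install natten
```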
Input image:
Output image after VAE round-trip (global self-attention):
Output image after VAE round-trip (local neighbourhood attention, kernel size 17):
This looks identical to global self-attention, whilst requiring far less memory and compute.
Output image after VAE round-trip (local neighbourhood attention, kernel size 3):
This looks nearly identical to global self-attention, requiring even less memory and compute.
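For reference, a minimal sketch of the kind of substitution involved, assuming NATTEN's `NeighborhoodAttention2D` module (channels-last input; the shapes below are hypothetical stand-ins for the SDXL VAE's single-head, 512-channel attention block, not this PR's exact wiring):

```python
import torch
from natten import NeighborhoodAttention2D

# Local neighbourhood attention: each latent "pixel" attends only to a
# kernel_size × kernel_size window around itself, instead of to every token.
attn = NeighborhoodAttention2D(dim=512, num_heads=1, kernel_size=17)

x = torch.randn(1, 128, 128, 512)  # (batch, height, width, channels) latent features
out = attn(x)                      # same shape; cost scales with kernel area, not token count²
print(out.shape)                   # torch.Size([1, 128, 128, 512])
```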
Null attention
It looks like there's not actually much similarity-based mixing of information between tokens. So what happens if we just drop scaled dot-product attention altogether?
so instead of:

```
softmax(Q•K.mT*scale)•V•O
```

we just do:

```
V•O
```
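As a concrete sketch (hypothetical weight names; the real VAE attention block also wraps this in a norm and a residual connection):

```python
import torch

def null_attention(x: torch.Tensor, w_v: torch.Tensor, w_o: torch.Tensor) -> torch.Tensor:
    """'Null attention': drop softmax(Q @ K.mT * scale) entirely and keep only the
    value and output projections, so each token is transformed independently."""
    # x: (batch, tokens, channels); w_v, w_o: (channels, channels) projection weights
    return x @ w_v @ w_o

b, t, c = 1, 128 * 128, 512
x = torch.randn(b, t, c)
w_v = torch.randn(c, c) / c ** 0.5  # hypothetical stand-ins for the VAE's V and O weights
w_o = torch.randn(c, c) / c ** 0.5
print(null_attention(x, w_v, w_o).shape)  # torch.Size([1, 16384, 512])
```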
Output image after VAE round-trip (null attention):
It's not identical to global self-attention, but it's pretty close, and far cheaper and more scalable (in compute and in memory): the `Q•K.mT` similarity matrix is quadratic in token count, whereas `V•O` is linear.