An autotuner for the flash version of attention and RetNet

This project provides an autotuner for the flash versions of attention and RetNet. The tuned kernels can be called as ordinary PyTorch functions.
```python
# Attention on A100
import torch

from arch import A100
from ops.attention_interface import flash_attn_func

device_type = A100()
dtype = torch.float16
device = torch.device("cuda")

# example problem sizes (illustrative values)
batch, heads, seqlen_q, seqlen_kv, dim_qk, dim_v = 4, 8, 2048, 2048, 128, 128

q = torch.randn(batch, heads, seqlen_q, dim_qk, device=device, dtype=dtype)
k = torch.randn(batch, heads, seqlen_kv, dim_qk, device=device, dtype=dtype)
v = torch.randn(batch, heads, seqlen_kv, dim_v, device=device, dtype=dtype)
o = flash_attn_func(q, k, v, device_type)
```
```python
# RetNet on RTX4090
import torch

from arch import RTX4090  # assumed to live in the same arch module as A100 above
from ops.retnet_interface import RetNetAttnFunc

device_type = RTX4090()
dtype = torch.float16
device = torch.device("cuda")

# example problem sizes (illustrative values)
batch, heads, seqlen_q, seqlen_kv, dim_qk, dim_v = 4, 8, 2048, 2048, 128, 128

# requires_grad=True so the backward call below can propagate gradients
q = torch.randn(batch, heads, seqlen_q, dim_qk, device=device, dtype=dtype, requires_grad=True)
k = torch.randn(batch, heads, seqlen_kv, dim_qk, device=device, dtype=dtype, requires_grad=True)
v = torch.randn(batch, heads, seqlen_kv, dim_v, device=device, dtype=dtype, requires_grad=True)
mask = torch.randn(heads, seqlen_q, seqlen_kv, device=device, dtype=dtype)
o = RetNetAttnFunc(q, k, v, mask, device_type)

do = torch.randn(batch, heads, seqlen_q, dim_v, device=device, dtype=dtype)
o.backward(do)
```
Requirements:
- CUDA 12.3
- CMake 3.24

Installation:
- Clone this repo and its submodule cutlass:

```bash
git clone --recursive https://github.com/smallscientist1/attention_autotuner.git
```

- Add the Python package to PYTHONPATH:

```bash
export PYTHONPATH=$PYTHONPATH:/path/to/attention_autotuner/python
```

- Build the C++ benchmarks for an NVIDIA Ampere GPU (e.g. A100):

```bash
cd benchmarks
mkdir build
cd build
cmake -DPROJECT_CUDA_ARCH="80" ..
```
Flash attention forward pass, per key/value tile (online softmax); a reference sketch follows the list:
- q @ k
- reduce_max(qk)
- scale = exp(m_old - m_new)
- lse * scale
- acco * scale
- accs = exp(accs - m_new)
- lse += reduce_sum(accs)
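For reference, here is a minimal PyTorch sketch of this online-softmax recurrence over key/value tiles. It is plain tensor code, not the tuned CUDA kernel; the function name flash_attn_reference, the tile parameter, and the use of float32 accumulators are illustrative assumptions, and the usual 1/sqrt(dim_qk) softmax scaling is omitted to mirror the steps above.

```python
import torch

def flash_attn_reference(q, k, v, tile=128):
    """Online-softmax attention over key/value tiles, mirroring the steps above."""
    batch, heads, seqlen_q, _ = q.shape
    dim_v = v.shape[-1]
    acco = torch.zeros(batch, heads, seqlen_q, dim_v, dtype=torch.float32, device=q.device)
    lse = torch.zeros(batch, heads, seqlen_q, dtype=torch.float32, device=q.device)  # running row sum
    m = torch.full((batch, heads, seqlen_q), float("-inf"), dtype=torch.float32, device=q.device)  # running row max
    for start in range(0, k.shape[2], tile):
        k_t = k[:, :, start:start + tile].float()
        v_t = v[:, :, start:start + tile].float()
        accs = torch.einsum("bhqd,bhkd->bhqk", q.float(), k_t)    # q @ k
        m_new = torch.maximum(m, accs.amax(dim=-1))               # reduce_max(qk)
        scale = torch.exp(m - m_new)                              # scale = exp(m_old - m_new)
        lse = lse * scale                                         # lse * scale
        acco = acco * scale.unsqueeze(-1)                         # acco * scale
        accs = torch.exp(accs - m_new.unsqueeze(-1))              # accs = exp(accs - m_new)
        lse = lse + accs.sum(dim=-1)                              # lse += reduce_sum(accs)
        acco = acco + torch.einsum("bhqk,bhkd->bhqd", accs, v_t)  # accumulate accs @ v
        m = m_new
    return (acco / lse.unsqueeze(-1)).to(q.dtype)
```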
RetNet forward pass, per key/value tile (running abs-sum normalization); a reference sketch follows the list:
- q @ k
- qk * mask
- reduce_abs(qk)
- clamp(r)
- scale = r_old / r_new
- acco * scale
- accs / r_new
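Analogously, a minimal PyTorch sketch of the tiled retention recurrence. Again this is illustrative only, not the tuned kernel; retnet_attn_reference, tile, and the running variables r / r_wo_clamp are assumed names.

```python
import torch

def retnet_attn_reference(q, k, v, mask, tile=128):
    """Tiled RetNet retention with running abs-sum normalization, mirroring the steps above."""
    batch, heads, seqlen_q, _ = q.shape
    dim_v = v.shape[-1]
    acco = torch.zeros(batch, heads, seqlen_q, dim_v, dtype=torch.float32, device=q.device)
    r_wo_clamp = torch.zeros(batch, heads, seqlen_q, dtype=torch.float32, device=q.device)  # running abs-sum
    r = torch.ones_like(r_wo_clamp)                                                          # clamped normalizer
    for start in range(0, k.shape[2], tile):
        k_t = k[:, :, start:start + tile].float()
        v_t = v[:, :, start:start + tile].float()
        mask_t = mask[:, :, start:start + tile].float()            # decay mask slice (heads, seqlen_q, tile)
        accs = torch.einsum("bhqd,bhkd->bhqk", q.float(), k_t)     # q @ k
        accs = accs * mask_t                                       # qk * mask
        r_wo_clamp = r_wo_clamp + accs.abs().sum(dim=-1)           # reduce_abs(qk)
        r_new = r_wo_clamp.clamp(min=1.0)                          # clamp(r)
        scale = r / r_new                                          # scale = r_old / r_new
        acco = acco * scale.unsqueeze(-1)                          # acco * scale
        accs = accs / r_new.unsqueeze(-1)                          # accs / r_new
        acco = acco + torch.einsum("bhqk,bhkd->bhqd", accs, v_t)   # accumulate accs @ v
        r = r_new
    return acco.to(q.dtype)
```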
TODO:
- chunkwise retnet
- cost model
- autotuner (more general policy for retnet)
- elementwise op
- attention backward
- retnet performance issue (added load q once, mask stage 2?)
- causal config
- retnet parallel scan version for seqlen_q != seqlen_kv
- retnet parallel scan template
- retnet bwd load_q_once, causal
- performance of the python interface?
- retnet backward num_stage_qk=2 bug