Model | Compression | Fuzzy-In-Context | In-Context | Memorization | Noisy-In-Context | Selective-Copying | Average | Model Size | Implementation |
---|---|---|---|---|---|---|---|---|---|
PaTHAttention | 0.419 | 0.647 | 0.999 | 0.784 | 0.999 | 0.999 | 0.808 | ~450K | fla |
Transformer | 0.432 | 0.565 | 0.905 | 0.788 | 0.897 | 0.998 | 0.764 | ~400K | fla |
MesaNet | 0.347 | 0.515 | 0.999 | 0.750 | 0.999 | 0.894 | 0.751 | ~400K | fla |
mLSTM | 0.478 | 0.237 | 1.000 | 0.896 | 1.000 | 0.870 | 0.747 | ~400K | fla |
RWKV 7 | 0.462 | 0.180 | 0.999 | 0.886 | 0.999 | 0.949 | 0.746 | ~550K | fla |
Mamba | 0.491 | 0.123 | 0.993 | 0.896 | 0.997 | 0.887 | 0.731 | ~400K | fla |
GatedDeltaProduct | 0.416 | 0.277 | 0.999 | 0.648 | 0.999 | 0.999 | 0.723 | ~750K | fla |
Gated DeltaNet | 0.435 | 0.286 | 0.999 | 0.552 | 0.999 | 0.997 | 0.712 | ~450K | fla |
mLSTM | 0.387 | 0.268 | 0.999 | 0.843 | 0.998 | 0.690 | 0.698 | ~500K | |
DeltaNet | 0.396 | 0.393 | 0.999 | 0.394 | 0.999 | 0.997 | 0.697 | ~450K | fla |
Gated Linear Attention | 0.408 | 0.155 | 0.918 | 0.771 | 0.931 | 0.891 | 0.679 | ~425K | fla |
Gated Slot Attention | 0.397 | 0.212 | 0.769 | 0.831 | 0.821 | 0.852 | 0.648 | ~450K | fla |
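The Average column appears to be the unweighted mean of the six task scores (an inference from the numbers, not stated explicitly); e.g. checking the PaTHAttention row:

```python
# Sanity check: average of PaTHAttention's six task scores from the table.
scores = [0.419, 0.647, 0.999, 0.784, 0.999, 0.999]
avg = sum(scores) / len(scores)
print(round(avg, 3))  # 0.808, matching the Average column
```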
You will need a Linux machine equipped with NVIDIA GPUs. To set up the environment for running these benchmarks, do the following:

- Install PyTorch version `2.5.x` or `2.6.x`. This is important: we use ray for the multi-GPU scripts, and ray isn't compatible with `triton > 3.1`, which comes with any PyTorch version newer than `2.6`, while fla requires `pytorch >= 2.5`.
- Install the dependencies in requirements.txt. This may take a while because of `causal-conv1d`.
- Install flash-attention.
- Clone the MAD repo and install its requirements.
- Rename `mad-lab` to `mad_lab`.
- Move `mad_lab/configs/tasks` to `./configs/tasks`.
- Download the benchmarking data from https://zenodo.org/records/10843663 and place it in `./benchmark/data/`.
- Run a script from `./evals/mad/` using the following command:

```shell
PYTHONPATH=.:"$(pwd)/mad_lab" python -m evals.mad.mad_eval_gsa --data-path ./benchmark/data --num-workers X --n-gpu X --n-tasks-gpu X
```
The default configs, as specified in each eval script and corresponding adapter, are the ones used to produce the above results. They can be modified by defining a different configs dictionary and passing it to the adapter. In general, we used the following config:
```python
{
    "dim": 128,
    "head_dim": 128,
    "num_heads": 1
}
```

with 4 layers: `attn → SwiGLU → attn → SwiGLU`.
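A minimal sketch of such an override (the keys come from the default config above; the invariant `dim == head_dim * num_heads` and how the dictionary reaches the model are assumptions — check the eval script and adapter you are actually running):

```python
# Hypothetical override of the default config; pass this dict to the
# adapter in place of the defaults. Values here are purely illustrative.
configs = {
    "dim": 128,      # model width
    "head_dim": 32,  # per-head dimension
    "num_heads": 4,  # assumed: dim == head_dim * num_heads
}
assert configs["head_dim"] * configs["num_heads"] == configs["dim"]
```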