Model | Compression | Fuzzy-In-Context | In-Context | Memorization | Noisy-In-Context | Selective-Copying | Average | Model Size | Implementation |
---|---|---|---|---|---|---|---|---|---|
PaTHAttention | 0.419 | 0.647 | 0.999 | 0.784 | 0.999 | 0.999 | 0.808 | ~450K | fla |
Transformer | 0.432 | 0.565 | 0.905 | 0.788 | 0.897 | 0.998 | 0.764 | ~400K | fla |
MesaNet | 0.347 | 0.515 | 0.999 | 0.750 | 0.999 | 0.894 | 0.751 | ~400K | fla |
mLSTM | 0.478 | 0.237 | 1.000 | 0.896 | 1.000 | 0.870 | 0.747 | ~400K | fla |
RWKV 7 | 0.462 | 0.180 | 0.999 | 0.886 | 0.999 | 0.949 | 0.746 | ~550K | fla |
Mamba | 0.491 | 0.123 | 0.993 | 0.896 | 0.997 | 0.887 | 0.731 | ~400K | fla |
GatedDeltaProduct | 0.416 | 0.277 | 0.999 | 0.648 | 0.999 | 0.999 | 0.723 | ~750K | fla |
Gated DeltaNet | 0.435 | 0.286 | 0.999 | 0.552 | 0.999 | 0.997 | 0.712 | ~450K | fla |
mLSTM | 0.387 | 0.268 | 0.999 | 0.843 | 0.998 | 0.690 | 0.698 | ~500K | |
DeltaNet | 0.396 | 0.393 | 0.999 | 0.394 | 0.999 | 0.997 | 0.697 | ~450K | fla |
Gated Linear Attention | 0.408 | 0.155 | 0.918 | 0.771 | 0.931 | 0.891 | 0.679 | ~425K | fla |
Gated Slot Attention | 0.397 | 0.212 | 0.769 | 0.831 | 0.821 | 0.852 | 0.648 | ~450K | fla |
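The Average column appears to be the unweighted mean of the six task scores (an inference from the numbers, not stated explicitly); e.g. checking the PaTHAttention row:

```python
# Sanity check: average of PaTHAttention's six task scores from the table.
scores = [0.419, 0.647, 0.999, 0.784, 0.999, 0.999]
avg = sum(scores) / len(scores)
print(round(avg, 3))  # 0.808, matching the Average column
```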
You will need a Linux machine equipped with NVIDIA GPUs. To set up the environment for running these benchmarks, do the following:

- Install PyTorch version `2.5.x` or `2.6.x`. This is important: we use ray for the multi-GPU scripts, and ray isn't compatible with `triton > 3.1`, which comes with any PyTorch version newer than `2.6`, while fla requires `pytorch >= 2.5`.
- Install the dependencies in requirements.txt. This may take a while because of `causal-conv1d`.
- Install flash-attention.
- Clone the MAD repo and install its requirements.
- Rename `mad-lab` to `mad_lab`.
- Move `mad_lab/configs/tasks` to `./configs/tasks`.
- Download the benchmarking data from https://zenodo.org/records/10843663 and place it in `./benchmark/data/`.
- Run a script from `./evals/mad/` using the following command:

```shell
PYTHONPATH=.:"$(pwd)/mad_lab" python -m evals.mad.mad_eval_gsa --data-path ./benchmark/data --num-workers X --n-gpu X --n-tasks-gpu X
```
The default configs, as specified in each eval script and corresponding adapter, are the ones used to produce the above results. They can be modified by defining a different configs dictionary and passing it to the adapter. In general, we used the following config:
```python
{
    "dim": 128,
    "head_dim": 128,
    "num_heads": 1
}
```

with 4 layers: `attn → SwiGLU → attn → SwiGLU`.
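A minimal sketch of such an override (the keys come from the default config above; the invariant `dim == head_dim * num_heads` and how the dictionary reaches the model are assumptions — check the eval script and adapter you are actually running):

```python
# Hypothetical override of the default config; pass this dict to the
# adapter in place of the defaults. Values here are purely illustrative.
configs = {
    "dim": 128,      # model width
    "head_dim": 32,  # per-head dimension
    "num_heads": 4,  # assumed: dim == head_dim * num_heads
}
assert configs["head_dim"] * configs["num_heads"] == configs["dim"]
```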