Implementation of a Transformer using ReLA (Rectified Linear Attention), which replaces the softmax in attention with a ReLU. It will also contain an attempt to fold the feedforward into the ReLA layer as memory keys / values, as proposed in All Attention, following a suggestion by Charles Foster.
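To make the idea concrete, here is a minimal sketch of a ReLA layer with optional memory keys / values. It is not the code shipped in `rela-transformer`: the class name `ReLASketch`, the `num_memory_kv` argument, and the use of `nn.LayerNorm` on the attended values (as a stand-in for the normalization the paper uses for stability) are all assumptions made for illustration. Causal masking is left out to keep it short.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ReLASketch(nn.Module):
    """Illustrative ReLA layer. Names and defaults are assumptions, not the package API."""

    def __init__(self, dim=512, heads=8, dim_head=64, num_memory_kv=0):
        super().__init__()
        inner = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner * 3, bias=False)
        # Learned memory keys / values shared by every position (all-attention style).
        # With num_memory_kv = 0 this is an empty tensor and has no effect.
        self.mem_kv = nn.Parameter(torch.randn(2, heads, num_memory_kv, dim_head))
        # ReLU weights are not normalized, so the attended values are normalized instead.
        self.norm = nn.LayerNorm(dim_head)
        self.to_out = nn.Linear(inner, dim)

    def forward(self, x, mask=None):
        b, n, _ = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # (b, n, h * d) -> (b, h, n, d)
        q, k, v = (t.view(b, n, h, -1).transpose(1, 2) for t in (q, k, v))

        # Prepend the memory keys / values to the ones computed from the input.
        mem_k, mem_v = (m.unsqueeze(0).expand(b, -1, -1, -1) for m in self.mem_kv)
        k = torch.cat((mem_k, k), dim=2)
        v = torch.cat((mem_v, v), dim=2)

        sim = torch.matmul(q, k.transpose(-1, -2)) * self.scale  # (b, h, n, m + n)
        attn = F.relu(sim)  # ReLU in place of softmax: weights are sparse and >= 0

        if mask is not None:
            # Memory slots are always visible; padded keys get weight 0.
            mem_mask = mask.new_ones(b, self.mem_kv.shape[2])
            key_mask = torch.cat((mem_mask, mask), dim=1)[:, None, None, :]
            attn = attn.masked_fill(~key_mask, 0.0)

        out = self.norm(torch.matmul(attn, v))        # (b, h, n, d)
        out = out.transpose(1, 2).reshape(b, n, -1)   # (b, n, h * d)
        return self.to_out(out)
```

Because the weights come from a ReLU rather than a softmax, masking a key means setting its weight to zero instead of filling the score with a large negative value.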
$ pip install rela-transformer
import torch
from rela_transformer import ReLATransformer

model = ReLATransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    max_seq_len = 1024,
    dim_head = 64,
    heads = 8
)

x = torch.randint(0, 20000, (1, 1024))
mask = torch.ones(1, 1024).bool()

logits = model(x, mask = mask) # (1, 1024, 20000)
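The all-ones mask above treats every position as a real token. For batches with padding, a mask can be built from the sequence lengths. This is a sketch that assumes `True` marks positions to attend to, consistent with the example above; `lengths_to_mask` is a hypothetical helper, not part of the package.

```python
import torch


def lengths_to_mask(lengths, max_seq_len):
    """Return a (batch, max_seq_len) bool mask that is True for real tokens."""
    positions = torch.arange(max_seq_len)
    # Broadcast (1, max_seq_len) against (batch, 1).
    return positions[None, :] < lengths[:, None]


lengths = torch.tensor([1024, 700])     # unpadded length of each sequence
mask = lengths_to_mask(lengths, 1024)   # (2, 1024)
```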
$ python train.py
@misc{zhang2021sparse,
    title         = {Sparse Attention with Linear Units},
    author        = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year          = {2021},
    eprint        = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}