Add flash-attn #41

RaymondLi0 · 2023-03-23T14:36:22Z

Adds flash-attention, based on NVIDIA#267, with support for multi-query attention (MQA).

RaymondLi0 · 2023-03-23T16:01:32Z

Some tests with a 1B MQA model (santacoder's config: num_layers 24, num_heads 16, hidden_size 2048), in bf16, on one A100 GPU.

With flash-attn, this model can be trained on sequences of length up to 8192 without activation recomputation, and up to 32768 with full recomputation.
With normal-attn, we only reach 2048 without recomputation, or 8192 with selective or full recomputation.

Flash-attn is faster, especially for longer sequences:
Time-per-iteration for seq-length 2048: flash-attn: 19794.1 VS normal-attn: 22679.3
Time-per-iteration for seq-length 4096: flash-attn: 44040.8 VS normal-attn: 71740 (selective-recomputation)
Time-per-iteration for seq-length 8192: flash-attn: 113715.5 VS normal-attn: 256122 (selective-recomputation)
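
For reference, the speedups implied by the three timings above can be computed directly. This is a small sketch that uses only the numbers quoted in this comment; the `timings` layout and the `speedup` helper are illustrative names, not part of the PR. Note that the normal-attn figures at 4096 and 8192 were measured with selective recomputation, while the flash-attn figures use none.

```python
# Iteration times copied from the lines above, as (flash-attn, normal-attn).
timings = {
    2048: (19794.1, 22679.3),
    4096: (44040.8, 71740.0),   # normal-attn with selective recomputation
    8192: (113715.5, 256122.0),  # normal-attn with selective recomputation
}


def speedup(flash_time, normal_time):
    """Ratio of normal-attn to flash-attn iteration time (>1 means flash is faster)."""
    return normal_time / flash_time


speedups = {seq_len: round(speedup(f, n), 2) for seq_len, (f, n) in timings.items()}

for seq_len, ratio in speedups.items():
    print(f"seq_len={seq_len}: flash-attn is {ratio:.2f}x faster per iteration")
```

The gap widens with sequence length, which is consistent with the TFLOPs column in the table below.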

use_flash_attn	seq_len	mbs	gbs	Activation-recomputation	mem_reserved (GB)	iteration_time	TFLOPs
TRUE	512	2	192	None	22.45	6323.2	109.18
TRUE	1024	2	192	None	24.73	9888.1	145.64
TRUE	2048	2	192	None	29.27	19794.1	157.51
TRUE	4096	2	192	None	38.36	44040.8	163.15
TRUE	8192	2	192	None	56.63	113715.5	159.75
TRUE	16384	2	192	None	OOM	OOM	OOM
TRUE	512	2	192	Full	20.8	8606.4	104.65
TRUE	1024	2	192	Full	21.4	12837.4	146.48
TRUE	2048	2	192	Full	22.58	26172.6	155.8
TRUE	4096	2	192	Full	25	58775.6	160.3
TRUE	8192	2	192	Full	29.85	151869.1	157.44
TRUE	16384	2	192	Full	36.95	442532.3	153.86
TRUE	32768	2	192	Full	50.72	1440645.7	150.79
FALSE	512	2	192	None	23.18	6138.9	110.4
FALSE	1024	2	192	None	28.24	10404.8	137.88
FALSE	2048	2	192	None	44.46	22679.3	137.47
FALSE	4096	2	192	None	OOM	OOM	OOM
FALSE	512	2	192	Selective	22.18	6840.5	100.92
FALSE	1024	2	192	Selective	24.18	11216.9	128.39
FALSE	2048	2	192	Selective	28.67	25446.1	122.52
FALSE	4096	2	192	Selective	38.95	71740	100.16
FALSE	8192	2	192	Selective	65.74	256122	70.95
FALSE	16384	2	192	Selective	OOM	OOM	OOM
FALSE	512	2	192	Full	20.8	8109.4	111.06
FALSE	1024	2	192	Full	21.44	13569.5	138.58
FALSE	2048	2	192	Full	23.3	29930.3	136.24
FALSE	4096	2	192	Full	28.06	81514.7	115.58
FALSE	8192	2	192	Full	46.4	276012.9	86.63
FALSE	16384	2	192	Full	OOM	OOM	OOM
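
The sequence-length limits quoted at the top of this comment can be read off the table mechanically. The sketch below re-enters only the `use_flash_attn`, `seq_len`, `Activation-recomputation` and `mem_reserved (GB)` columns and reports the longest sequence that did not OOM for each setting. `longest_fitting_seq_len` is a hypothetical helper written for this illustration, not code from the PR.

```python
# Columns: use_flash_attn, seq_len, activation recomputation, mem_reserved (GB)
ROWS = """
TRUE 512 None 22.45
TRUE 1024 None 24.73
TRUE 2048 None 29.27
TRUE 4096 None 38.36
TRUE 8192 None 56.63
TRUE 16384 None OOM
TRUE 512 Full 20.8
TRUE 1024 Full 21.4
TRUE 2048 Full 22.58
TRUE 4096 Full 25
TRUE 8192 Full 29.85
TRUE 16384 Full 36.95
TRUE 32768 Full 50.72
FALSE 512 None 23.18
FALSE 1024 None 28.24
FALSE 2048 None 44.46
FALSE 4096 None OOM
FALSE 512 Selective 22.18
FALSE 1024 Selective 24.18
FALSE 2048 Selective 28.67
FALSE 4096 Selective 38.95
FALSE 8192 Selective 65.74
FALSE 16384 Selective OOM
FALSE 512 Full 20.8
FALSE 1024 Full 21.44
FALSE 2048 Full 23.3
FALSE 4096 Full 28.06
FALSE 8192 Full 46.4
FALSE 16384 Full OOM
"""


def longest_fitting_seq_len(rows_text):
    """Map (use_flash_attn, recomputation) -> longest seq_len that did not OOM."""
    best = {}
    for line in rows_text.strip().splitlines():
        use_flash, seq_len, recompute, mem = line.split()
        if mem == "OOM":
            continue  # this run did not fit in memory
        key = (use_flash == "TRUE", recompute)
        best[key] = max(best.get(key, 0), int(seq_len))
    return best


limits = longest_fitting_seq_len(ROWS)
for (use_flash, recompute), seq_len in sorted(limits.items()):
    print(f"flash={use_flash!s:5} recompute={recompute:9} max seq_len={seq_len}")
```

For flash-attn with full recomputation, 32768 is the largest length in the table, so it is a lower bound on the limit, not a measured ceiling.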

RaymondLi0 · 2023-03-23T17:58:55Z

Additional test currently running: 5k-step training runs should give the same loss for
the normal-attn model, flash-attn, and flash-attn with tensor parallelism (TP) and sequence parallelism (SP).

jlamypoirier · 2023-03-23T23:41:26Z

megatron/model/transformer.py

+                # [sq, b, 1, hn] -> [sq, b, np, hn]
+                key_layer = key_layer.expand((sq, b, np, hn))
+                value_layer = value_layer.expand((sq, b, np, hn))
+            q, k, v = [rearrange(x, 's b ... -> b s ...').contiguous()


That looks very bad. Megatron uses the s b format precisely to avoid this kind of reshape. If FlashAttention uses b s, we should use that format instead. It should be OK to just comment out the two conversions, at least without sequence parallelism (SP would need extra changes, but we probably won't use it anyway): https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/model/language_model.py#L240 https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/model/gpt_model.py#L43

RaymondLi0 · 2023-03-24

Are you suggesting using b s throughout the whole transformer model?
I think that would require a big chunk of refactoring work, plus testing to make sure we are not breaking anything.
Given the nice performance improvements that flash-attn brings, I wouldn't take the risk of breaking everything else just to avoid a transpose here.

jlamypoirier · 2023-03-24

Actually, the order only matters for attention (and sequence parallelism), so it should just be a matter of bypassing these two lines.

jlamypoirier · 2023-03-24

The transposes have a big impact on memory usage and (I think) a moderate one on speed, so it's quite important.
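
To put a rough number on the memory concern, the sketch below estimates the size of the contiguous copies produced by the `rearrange(...).contiguous()` call for the largest non-recomputed flash-attn run in the benchmark (seq_len 8192, mbs 2, num_heads 16, hidden_size 2048, bf16). It assumes the head size is hidden_size / num_heads = 128 and that q, k and v are each materialized at `[b, s, np, hn]`. Whether those copies stay alive for the backward pass depends on the recomputation setting, so this is a per-layer estimate under those assumptions, not a measurement.

```python
def tensor_bytes(shape, bytes_per_element=2):
    """Size in bytes of a dense tensor; bf16 uses 2 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Benchmark configuration from the PR comment.
seq_len, micro_batch, num_heads, hidden_size = 8192, 2, 16, 2048
head_dim = hidden_size // num_heads  # assumption: 128 channels per head

MIB = 2 ** 20

# One contiguous [b, s, np, hn] copy, as produced for each of q, k and v.
per_tensor = tensor_bytes((micro_batch, seq_len, num_heads, head_dim))
qkv_copies = 3 * per_tensor

print(f"per tensor: {per_tensor / MIB:.0f} MiB")
print(f"q + k + v copies per layer: {qkv_copies / MIB:.0f} MiB")
```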

jlamypoirier · 2023-03-23T23:53:53Z

megatron/model/transformer.py

+                sq, b, np, hn = query_layer.size()
+                # Expand kv to be compatible with flash-attn implementation
+                # [sq, b, 1, hn] -> [sq, b, np, hn]
+                key_layer = key_layer.expand((sq, b, np, hn))


I'm wondering whether FlashAttention would work with just expand, which doesn't allocate new memory. If it did, we would get the full benefits of FlashAttention for MQA. (I would expect it to enforce contiguous tensors, but it's worth checking.)

RaymondLi0 · 2023-03-24

Are you asking whether it would still work if we removed the call to .contiguous() on the next line?

jlamypoirier · 2023-03-24

That would almost certainly not work (transposed tensors are much harder to deal with), but it might work if we do the expand after the transpose, or skip the transpose altogether.
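
The view-versus-copy behaviour discussed in this thread can be checked on CPU with plain PyTorch. This is a toy sketch, not the PR's code: it uses `transpose(0, 1)` in place of the einops `rearrange(x, 's b ... -> b s ...')` call, small made-up sizes, and `n_heads` for what the diff calls `np`. It does not call flash-attn, so whether the kernel accepts a stride-0 tensor remains untested here.

```python
import torch

sq, b, n_heads, hn = 8, 2, 4, 16  # toy sizes, chosen only for illustration

# MQA key with a single shared head: [sq, b, 1, hn]
key_layer = torch.randn(sq, b, 1, hn)

# expand() returns a view: same storage, stride 0 along the head dimension.
expanded = key_layer.expand(sq, b, n_heads, hn)
shares_memory = expanded.data_ptr() == key_layer.data_ptr()
head_stride = expanded.stride(2)

# Swapping the first two dims ('s b ... -> b s ...') is still only a view.
transposed = expanded.transpose(0, 1)
contiguous_before_copy = transposed.is_contiguous()

# .contiguous() writes out every repeated head into freshly allocated memory.
materialized = transposed.contiguous()
was_copied = materialized.data_ptr() != key_layer.data_ptr()
growth = materialized.numel() // key_layer.numel()

# Alternative order from the thread: transpose the compact tensor, then expand.
# Only the single-head tensor is copied; the result is again a stride-0 view.
alt = key_layer.transpose(0, 1).contiguous().expand(b, sq, n_heads, hn)
alt_head_stride = alt.stride(2)

print(shares_memory, head_stride, contiguous_before_copy, was_copied, growth)
print(alt.shape, alt_head_stride, alt.is_contiguous())
```

With these sizes the materialized copy holds `n_heads` times as many elements as the original key, which is the overhead the reviewer is hoping to avoid.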

jlamypoirier

The remaining comments on eliminating unnecessary ops are not essential and can be looked into later.

Great job!

RaymondLi0 added 4 commits March 22, 2023 14:43

add flash-attn

7d5154f

flash-attn: assert that alibi is not used

118f0a8

fix import

d50a89b

update readme

61fe86d

RaymondLi0 changed the title ~~WIP: add flash-attn~~ Add flash-attn Mar 23, 2023

jlamypoirier reviewed Mar 23, 2023

raise if using flash-attn with selective recomputation, swap if/else

f5019c8

jlamypoirier approved these changes Mar 24, 2023

change back to warning

0ff5746

RaymondLi0 merged commit e0b644b into multi-query-attention Mar 24, 2023

jlamypoirier deleted the flash-attention branch March 25, 2023 01:58
