Add fused kernel for Hawk forward and backward #1

fattorib · 2024-10-26T20:09:44Z

This PR adds a fused scan kernel for the forward and backward passes. It is around 2x faster than the Triton kernel I originally wrote. It should also use less memory, because the backward pass recomputes activations instead of storing them (benchmarks pending).
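For readers unfamiliar with activation recomputation in a scan, here is a minimal pure-Python sketch of the idea. It is not the kernel from this PR: the recurrence form `h[t] = a[t] * h[t-1] + x[t]`, the sigmoid gate, and the names `scan_forward` and `scan_backward` are all assumptions chosen for illustration. The point is only that the backward pass can rebuild the gate from its raw input rather than keep it in memory.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scan_forward(x, r):
    """Sequential scan h[t] = sigmoid(r[t]) * h[t-1] + x[t], with h[-1] = 0.

    Only the outputs h are returned; the gate values sigmoid(r[t]) are
    deliberately not stored (illustrative assumption, not the PR's kernel).
    """
    h, prev = [], 0.0
    for x_t, r_t in zip(x, r):
        prev = sigmoid(r_t) * prev + x_t
        h.append(prev)
    return h


def scan_backward(r, h, grad_h):
    """Gradients of a scalar loss w.r.t. x and r, given grad_h[t] = dL/dh[t].

    The gate a[t] = sigmoid(r[t]) is recomputed here from r instead of
    being read back from saved activations.
    """
    T = len(h)
    dx, dr = [0.0] * T, [0.0] * T
    carry = 0.0  # gradient flowing back from h[t+1] into h[t]
    for t in reversed(range(T)):
        dh = grad_h[t] + carry
        a_t = sigmoid(r[t])  # recomputed activation
        h_prev = h[t - 1] if t > 0 else 0.0
        dx[t] = dh
        dr[t] = dh * h_prev * a_t * (1.0 - a_t)  # chain rule through sigmoid
        carry = dh * a_t
    return dx, dr
```

The trade-off is one extra sigmoid per step in the backward pass in exchange for not holding a gate tensor the size of the input between the two passes.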

Benchmarks

Also benchmarked against torch.nn.functional.scaled_dot_product_attention with hd=64

Forward Pass

bs=8,d_model=1024

Forward + Backward

bs=8,d_model=1024

svladusic · 2024-10-30T01:39:26Z

LGTM 👍

fattorib · 2024-10-30T01:39:48Z

LGTM 👍

Thanks Stefan, appreciate your detailed review. Merging now :)

fattorib added 5 commits October 26, 2024 16:04

add fused kernel for hawk forward and backward

e05e423

update fused kernel to parallelize across channel dimension for faster perf at low BS, don't recompute log(2*exp) 3 times per iteration

25a9783

remove masking condition, fix pyright errors

0cebe6f

bump version

14c8835

rename header + add comment

1fe8f17

fattorib assigned fattorib and unassigned fattorib Oct 30, 2024

fattorib merged commit 206a3aa into main Oct 30, 2024
2 checks passed

fattorib deleted the fused-kernel branch November 7, 2024 13:37
