
Performance benchmarks? #9

Open · imoneoi opened this issue Dec 7, 2022 · 21 comments

@imoneoi commented Dec 7, 2022

Are there any benchmark results yet? I'm looking forward to performance comparisons with vanilla attention and with the official PyTorch+CUDA implementation.

@jakubMitura14

I am also curious. Additionally, maybe it is possible to use CUDA code with JAX?

https://github.com/dfm/extending-jax

@jakubMitura14 commented Feb 27, 2023

Fantastic! Have you run the same experiment with the same data on the original flash attention?

@OhadRubin

Not yet

@jon-chuang

Hello, could I ask if this works with TPUs?

@evanatyourservice commented Oct 21, 2023

Here's an updated notebook for anyone interested; it precompiles the jitted functions and blocks on results until they are ready:

https://colab.research.google.com/drive/11QKRdgMtcivrJNmjTrf2bXTE5yXkXl_Z?usp=sharing

It looks like JAX compiles vanilla attention to be faster than this JAX flash attention implementation, so there is no need to switch to flash attention if you use JAX.
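For anyone who wants to reproduce that kind of comparison outside the notebook, here is a minimal benchmarking sketch along the same lines (the shapes and the vanilla_attention helper are illustrative placeholders, not taken from the notebook):

```python
import time

import jax
import jax.numpy as jnp


def vanilla_attention(q, k, v):
    # Plain softmax attention; materializes the full (seq, seq) score matrix.
    scores = jnp.einsum("...qhd,...khd->...hqk", q, k) / (q.shape[-1] ** 0.5)
    return jnp.einsum("...hqk,...khd->...qhd", jax.nn.softmax(scores, axis=-1), v)


def benchmark(fn, *args, iters=20):
    fn = jax.jit(fn)
    fn(*args).block_until_ready()   # warm-up: keep compilation out of the timed region
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    out.block_until_ready()         # dispatch is asynchronous; wait for the device
    return (time.perf_counter() - start) / iters


kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
shape = (1, 2048, 8, 64)            # (batch, seq_len, heads, head_dim)
q = jax.random.normal(kq, shape, dtype=jnp.float16)
k = jax.random.normal(kk, shape, dtype=jnp.float16)
v = jax.random.normal(kv, shape, dtype=jnp.float16)
print(f"vanilla attention: {benchmark(vanilla_attention, q, k, v):.6f} s/iter")
```

Timing the first (compiling) call or forgetting the block_until_ready sync is the usual way such comparisons go wrong, which is what the notebook fixes.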

@SamuelGabriel

Wow, this has been open for almost a year...

I think someone could get a lot of citations/clicks if they did a proper benchmark of transformer training/inference across PyTorch and JAX on GPU and TPU with standard tricks. I would cite you straight away for sure. It would also be nice to settle the dispute that Google employees seem to have with everyone else about whether JAX is meaningfully more efficient (it might just come down to TPU vs. GPU!?).

@niemiaszek

> Wow, this has been open for almost a year...
>
> I think someone could get a lot of citations/clicks if they did a proper benchmark of transformer training/inference across PyTorch and JAX on GPU and TPU with standard tricks. I would cite you straight away for sure. It would also be nice to settle the dispute that Google employees seem to have with everyone else about whether JAX is meaningfully more efficient (it might just come down to TPU vs. GPU!?).

It would definitely be nice to see such a benchmark, but I can imagine how hard it is to compare JAX vs. PyTorch (GPU/TPU), with so many optimized implementations for each device. For PyTorch on GPU we have Triton/CUDA, but JAX has also recently added a Triton-like mechanism for writing custom kernels for GPU/TPU: Pallas. You can even find an implementation of attention written in Pallas.
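For context, Pallas lets you write the kernel body in Python against block references, much like Triton. A minimal sketch (essentially the element-wise example from the Pallas quickstart, not an attention kernel, and assuming a recent JAX where jax.experimental.pallas is available):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_kernel(x_ref, y_ref, o_ref):
    # The refs point at blocks staged in fast on-chip memory.
    o_ref[...] = x_ref[...] + y_ref[...]


@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)


x = jnp.arange(1024, dtype=jnp.float32)
print(add(x, x)[:4])  # [0. 2. 4. 6.]
```

A real flash attention kernel adds a grid, block specs, and an online softmax on top of the same pallas_call machinery.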

@evanatyourservice

@niemiaszek I just recently saw they named and added docs for Pallas, and it looks very interesting. The JAX team is also improving our ability to customize how networks are sharded across accelerators and is publishing papers on their efficiency results, which is pretty cool I think. Unfortunately I don't have time to do a fair comparison of attention between torch and JAX, but it seems that whoever takes the time to delve into it, especially JAX's recent improvements, would certainly benefit if they have a need.

Even if we don't take the time, it looks like the JAX team continually folds their efficiency findings into JAX as defaults, so we don't have to implement them ourselves.

@lucidrains (Owner) commented Nov 29, 2023

from what i've heard, flash attention doesn't work well on TPUs, but i haven't kept up with the latest iteration of their chip design.

Pallas is just a wrapper around Triton, which was developed at OpenAI for GPUs. you will basically always be limited by what the Triton compiler can do

@lucidrains (Owner)

while this is a hard pill to swallow, i think the existence of flash attention is a clear victory for finely controlled GPGPU programming.

@evanatyourservice

@lucidrains I'd agree as far as single-device optimizations go. I use JAX solely because my work deals mainly with RL and I've already built everything out, but for things like language and vision models, resources like xformers are hard to beat. I do like JAX's work toward multi-device customization, especially from an RL perspective.

@jon-chuang commented Nov 29, 2023

> while this is a hard pill to swallow, i think the existence of flash attention is a clear victory for finely controlled GPGPU programming.

Well, I would argue that these days it's no longer such a hard pill, given the wide adoption of tiled programming paradigms like Triton (e.g. PyTorch, with both codegen and incoming custom kernels; JAX, e.g. Pallas; and hardware vendors including NVIDIA, AMD, and Intel), which greatly reduces the effort and complexity of getting SOTA performance on GPUs.

@lucidrains (Owner)

@jon-chuang hmm, still a bit early to declare that imho

we'll see, i hope so!

@jon-chuang

Yes, Triton is still not 100% there (some matmul kernel sizes and certain kernels, like the flash attention backward pass, are still not SOTA). But it's certainly the direction the industry is investing in, and IMO it's good news for developers and tinkerers who want hackability at each layer of the stack.

I've already heard of some success stories with customizing flash attention kernels via Triton.

@jon-chuang

I think these newish attention replacements will take time to be adopted, particularly because the dust has not settled on them and it takes a while for wide-scale experimentation and large-scale training to truly prove them out.

IMO all it takes is for a highly-funded industrial lab to go out on a limb and train an LLM with one of these...

For instance, Mistral AI essentially uses a linear-cost attention mechanism based on SWA (sliding window attention); one could of course argue about how effective it is at truly capturing information across a long context.
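For illustration only, the core of sliding window attention is a banded causal mask over the scores; a naive O(n²) JAX sketch (the sliding_window_mask and sliding_window_attention helpers below are hypothetical, and Mistral's actual implementation also relies on a rolling KV cache to realize the memory savings):

```python
import jax
import jax.numpy as jnp


def sliding_window_mask(seq_len, window):
    # Token i may attend to tokens j with i - window < j <= i (causal and banded).
    i = jnp.arange(seq_len)[:, None]
    j = jnp.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


def sliding_window_attention(q, k, v, window):
    # q, k, v: (seq, heads, dim); naive quadratic form, just to show where the mask goes.
    scores = jnp.einsum("qhd,khd->hqk", q, k) / (q.shape[-1] ** 0.5)
    scores = jnp.where(sliding_window_mask(q.shape[0], window), scores, -jnp.inf)
    return jnp.einsum("hqk,khd->qhd", jax.nn.softmax(scores, axis=-1), v)


q = k = v = jax.random.normal(jax.random.PRNGKey(0), (16, 4, 8))
print(sliding_window_attention(q, k, v, window=4).shape)  # (16, 4, 8)
```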

> all these frameworks cannot do.

I think this is an overstatement? It simply has not been tried in Triton yet. It should not be that hard, though whether the performance matches is an open question.

I just hope that more devs become aware of how powerful triton is so that there's more experimentation with implementing these kernels.

@lucidrains (Owner) commented Nov 29, 2023

@jon-chuang yea, let's just agree that we both wish for Triton and the like to succeed so that non-CUDA experts like us can have control over the entire stack

i just know it isn't there yet.

@jon-chuang

Interestingly, a basic building block for Mamba (associative scan) already has support in Triton: pytorch/pytorch#95408 (comment)
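For reference, JAX already exposes this building block as jax.lax.associative_scan; a small sketch showing a cumulative sum and the pair-valued operator behind linear recurrences of the form h_t = a_t * h_{t-1} + b_t (the linear_recurrence_op helper is illustrative, not Mamba itself):

```python
import jax
import jax.numpy as jnp

# associative_scan applies a binary associative operator in O(log n) parallel steps.
xs = jnp.arange(1, 9)
print(jax.lax.associative_scan(jnp.add, xs))  # [ 1  3  6 10 15 21 28 36]


def linear_recurrence_op(left, right):
    # Composition of two affine maps x -> a*x + b; this composition is associative.
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r


a = jnp.full((8,), 0.9)
b = jnp.ones((8,))
_, h = jax.lax.associative_scan(linear_recurrence_op, (a, b))
print(h)  # h_1 .. h_8 of h_t = a_t * h_{t-1} + b_t with h_0 = 0
```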

@lucidrains (Owner) commented Nov 30, 2023

it doesn't support multiple inputs. also i heard it is still buggy in its current state

@lucidrains (Owner)

@jon-chuang anyways, let us take the discussion elsewhere, as this is about flash attention

@MasterSkepticista

Flash attention is now available in jax-nightly with a cuDNN implementation: jax.nn.dot_product_attention. It only supports the Ampere architecture and later.

Note that the default implementation is xla.
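A usage sketch, assuming a JAX build where jax.nn.dot_product_attention accepts the implementation argument and takes inputs in (batch, seq_len, num_heads, head_dim) layout:

```python
import jax
import jax.numpy as jnp

B, T, N, H = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (B, T, N, H), dtype=jnp.bfloat16)
k = jax.random.normal(kk, (B, T, N, H), dtype=jnp.bfloat16)
v = jax.random.normal(kv, (B, T, N, H), dtype=jnp.bfloat16)

# implementation="cudnn" requests the fused flash-attention kernel (per the note above:
# Ampere or later, and the cuDNN path expects fp16/bf16 inputs); implementation="xla"
# is the default that runs on any backend.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation="cudnn")
print(out.shape)  # (2, 1024, 8, 64)
```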
