
Conversation

@ggerganov (Member)

ref #16372 (comment)

The mask that we use in the FA tests is always filled with finite (non-INF) values, while in practice it often contains -INF blocks due to causal masking or padding. Extend the tests to add -INF values so we exercise any logic that relies on detecting such blocks in the mask.
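The kind of extension described above could be sketched as follows. This is an illustrative stand-in, not the actual test-backend-ops code: the function name `add_inf_blocks`, the block geometry, and the ~10% coverage factor are assumptions chosen to mirror the `n_inf_blocks` expression quoted later in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <random>
#include <vector>

// Hypothetical sketch: scatter -INF blocks into an otherwise finite mask,
// mimicking what causal masking or padding produces in practice.
// The block size (blck0 x blck1) and 10% coverage factor are illustrative.
static void add_inf_blocks(std::vector<float> & mask, int ne0, int ne1,
                           int blck0, int blck1, unsigned seed) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::mt19937 rng(seed);
    const int n_inf_blocks = 0.1*(ne0*ne1)/(blck0*blck1); // ~10% of the mask
    for (int b = 0; b < n_inf_blocks; ++b) {
        const int i0 = (int)(rng() % ne0); // block origin, clipped at edges
        const int i1 = (int)(rng() % ne1);
        for (int j = i1; j < std::min(i1 + blck1, ne1); ++j) {
            for (int i = i0; i < std::min(i0 + blck0, ne0); ++i) {
                mask[j*ne0 + i] = neg_inf;
            }
        }
    }
}
```

With this scheme most of the mask stays finite, but backends that special-case -INF regions now get exercised on every test run.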

@ggerganov ggerganov requested a review from slaren as a code owner October 2, 2025 05:44
@github-actions github-actions bot added the testing Everything test related label Oct 2, 2025
@Green-Sky (Collaborator)

I forgot to mention this, but it is possible that CUDA fails this test.

@ggerganov (Member, Author)

@Green-Sky Do you mean you tried it locally and it failed?

@Green-Sky (Collaborator)

I just tried it locally and it works. I will have to go and look at the sd.cpp+chroma+fa bug again, so I actually know what fails there.

@Green-Sky (Collaborator) commented Oct 2, 2025

Ok, did some more debugging. I am not sure if the issue really is with ggml, but flash attention just seems to die when we use the mask in the DiT for chroma.

The mask is:

  • 0 in [0;19] (in my example)
  • -inf in [20;511]
  • and then padded with 0 again for [512;2815]

attention_ext L_q:2816 L_k:2816 n_head:24 C:3072 d_head:128 N:1
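The mask layout described above can be written out directly. This is a sketch of one mask row only, using the boundaries from the example ([0;19] attended, [20;511] masked, [512;2815] zero padding); the function name is hypothetical and not part of sd.cpp or ggml.

```cpp
#include <limits>
#include <vector>

// Sketch of one row of the chroma DiT mask described above: a short
// attended prefix, an -INF region, and zero padding up to L_k = 2816.
static std::vector<float> make_chroma_row_mask() {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(2816, 0.0f); // [0;19] and [512;2815] stay 0
    for (int i = 20; i <= 511; ++i) {
        mask[i] = neg_inf;               // [20;511] masked out
    }
    return mask;
}
```

Note that this mask is finite at both ends with a contiguous -INF block in the middle, which is exactly the shape the extended tests now cover.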

@ggerganov (Member, Author)

Does it show any error? What do you mean by "die"?

@Green-Sky (Collaborator)

> Does it show any error? What do you mean by "die"?

Ah, sorry for that.
By "die", I mean silently fail. Pulling data from a backend is kind of a pain, but the resulting image is always black, so I assume it is pulled to ±INF somewhere, or maybe zero.

@ggerganov ggerganov requested a review from jeffbolznv October 6, 2025 13:04
@jeffbolznv (Collaborator)

I'm seeing some failures in the vulkan path with the current change. I'll look into it.

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
@jeffbolznv (Collaborator)

I'm seeing two classes of failures with vulkan. In the coopmat2 path, the backend will generate some negative infinity results, e.g.:

[FLASH_ATTN_EXT] inf mismatch: Vulkan0=-340282346638528859811704183484516925440.000000 CPU=0.019682   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=1,mask=1,sinks=0,max_bias=8.000000,logit_softcap=10.000000,prec=def,type_KV=q8_0,permute=[0,1,2,3]): FAIL

I can work around this by clamping in the shader. But more concerning are the cases where the CPU backend generates NaN:

[FLASH_ATTN_EXT] NaN at index 18432 (Vulkan0=nan CPU=-nan(ind))   FLASH_ATTN_EXT(hsk=80,hsv=80,nh=4,nr23=[4,1],kv=512,nb=35,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q8_0,permute=[0,2,1,3]): FAIL

It's easy to trigger this - just increase the frequency of the -inf blocks by 100x:

-    const int n_inf_blocks = 0.1*(ne0*ne1*ne2*ne3)/(blck0*blck1);
+    const int n_inf_blocks = 10*(ne0*ne1*ne2*ne3)/(blck0*blck1);

I imagine this may be happening when a whole row has -inf. I assume this will reproduce with other backends too, but I haven't verified it. It's not clear to me what the correct result is in this case.

@ggerganov (Member, Author) commented Oct 6, 2025

> I imagine this may be happening when a whole row has -inf. I assume this will reproduce with other backends too, but I haven't verified it. It's not clear to me what the correct result is in this case.

Yes, I have this fixed for CPU and Metal in: fdedda0

I think the correct thing to do is return 0 in this case.
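The failure mode and the fix can be illustrated with a scalar softmax over a single attention row. This is a minimal sketch, not the actual ggml kernel: when every score in the row is -INF, the running max stays -INF and the sum of exponentials is 0, so a naive implementation computes 0/0 and emits NaN; returning 0 for such rows is the behavior chosen here for the CPU and Metal backends.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Sketch: softmax over one attention row. If the whole row is masked
// (-INF), the max stays -INF and the denominator is 0, so a naive 0/0
// division produces NaN. Returning all zeros instead avoids the NaN.
static std::vector<float> softmax_row(const std::vector<float> & s) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    float smax = neg_inf;
    for (float v : s) smax = std::max(smax, v);
    std::vector<float> p(s.size(), 0.0f);
    if (smax == neg_inf) {
        return p; // fully masked row -> output 0, not NaN
    }
    float sum = 0.0f;
    for (size_t i = 0; i < s.size(); ++i) {
        p[i] = std::exp(s[i] - smax); // exp(-INF - smax) = 0 for masked slots
        sum  += p[i];
    }
    for (float & v : p) v /= sum;
    return p;
}
```

Partially masked rows are unaffected by the guard, since their max is finite and the masked slots simply contribute 0 to the sum.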

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 6, 2025
@ggerganov (Member, Author)

With ad83f06, the CPU backend should no longer generate NaNs.

@jeffbolznv (Collaborator)

After #16447, the vulkan backend should pass the updated tests.
