
_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533

Closed · RalfJung opened this issue Feb 17, 2024 · 19 comments

@RalfJung (Member) commented Feb 17, 2024

Today I learned about the existence of the simd_reduce_add_unordered intrinsic. When called on a float vector, it compiles to LLVM's vector.reduce.fadd with the "fast" flag set, which means that passing in NAN or INF is UB, and that optimizations are allowed "to treat the sign of a zero argument or zero result as insignificant" (which I take to mean that the sign of input zeros may be non-deterministically swapped and that returned zeros have non-deterministic sign).

This intrinsic is not used much in stdarch, but it has a total of 8 uses, all in avx512f.rs. 4 of these are integer intrinsics, where it should be entirely equivalent to simd_reduce_add; it is not clear why the "unordered" version is used. The other 4 are float intrinsics, _mm512_reduce_add_ps being the first:

https://github.com/rust-lang/stdarch/blob/4d9c0bb591336792c4c4baf293d0acc944e57e28/crates/core_arch/src/x86/avx512f.rs#L31262-L31270

/// Reduce the packed single-precision (32-bit) floating-point elements in a by addition. Returns the sum of all elements in a.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=4562)
#[inline]
#[target_feature(enable = "avx512f")]
#[unstable(feature = "stdarch_x86_avx512", issue = "111137")]
pub unsafe fn _mm512_reduce_add_ps(a: __m512) -> f32 {
    simd_reduce_add_unordered(a.as_f32x16())
}

Neither the docs here nor Intel's docs mention that this is UB on NAN or INF, nor the concerns around signed zeros and the unspecified order of addition. Given that the Intel docs should be the authoritative docs (since this is a vendor intrinsic), why is it even correct to use fast-math flags here? Either the docs need to be updated to state the fast-math preconditions, or the implementation needs to be updated to avoid the fast-math flag. Maybe it should use only "reassoc" rather than the full "fast" flag? But even that should probably be mentioned in the docs.
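[Editor's note: a quick sanity check on the integer case mentioned above. Wrapping (two's-complement) addition is associative and commutative, so for the integer reductions the ordered and unordered variants must produce the same result regardless of summation order; this is illustrative only, not stdarch code.]

```rust
fn main() {
    // Wrapping addition is associative and commutative, so any
    // reduction order over these lanes yields the same result.
    let lanes: [u32; 4] = [u32::MAX, 1, 2, 3];
    let left_to_right = lanes.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));
    let tree = lanes[0]
        .wrapping_add(lanes[1])
        .wrapping_add(lanes[2].wrapping_add(lanes[3]));
    assert_eq!(left_to_right, tree); // both are 5 (mod 2^32)
}
```

This is exactly why the "unordered" variant buys nothing for integers: no observable behavior depends on the order.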

@Amanieu (Member) commented Feb 18, 2024

It is definitely incorrect for these intrinsics to be using the fast-math flag.

Here is the LLVM IR that clang generates for this intrinsic:

%0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)

@RalfJung (Member, Author) commented Feb 18, 2024

That allows LLVM to add the elements in any order, and also to perform reassociation optimizations when the result is fed into another reassoc operation. I don't see how that matches the Intel docs, which describe a very particular order of summation:

DEFINE REDUCE_ADD(src, len) {
	IF len == 2
		RETURN src[31:0] + src[63:32]
	FI
	len := len / 2
	FOR j:= 0 to (len-1)
		i := j*32
		src[i+31:i] := src[i+31:i] + src[i+32*len+31:i+32*len]
	ENDFOR
	RETURN REDUCE_ADD(src[32*len-1:0], len)
}
dst[31:0] := REDUCE_ADD(a, 16)

(It seems like Intel uses array[last_elem:first_elem] syntax for bitwise subslicing, which must be the least intuitive subslicing syntax I have ever seen.)
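[Editor's note: the pseudocode above amounts to a pairwise tree reduction, folding the upper half of the vector onto the lower half and halving the length each round. A plain-Rust scalar model of that order (a hypothetical illustration, not the stdarch implementation):]

```rust
// Scalar model of Intel's REDUCE_ADD pseudocode: repeatedly add the
// upper half of the buffer onto the lower half, then recurse on the
// (now half-length) lower half until one element remains.
fn reduce_add_tree(src: [f32; 16]) -> f32 {
    let mut buf = src.to_vec();
    while buf.len() > 1 {
        let half = buf.len() / 2;
        for i in 0..half {
            buf[i] += buf[i + half];
        }
        buf.truncate(half);
    }
    buf[0]
}

fn main() {
    let v: [f32; 16] = core::array::from_fn(|i| (i + 1) as f32);
    // 1 + 2 + ... + 16 = 136, exact here; in general the tree order
    // can round differently from left-to-right summation.
    assert_eq!(reduce_add_tree(v), 136.0);
}
```

Note that this order is neither left-to-right (what a plain serial lowering would do) nor arbitrary (what reassoc permits), which is the crux of the disagreement below.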

RalfJung changed the title from "_mm512_reduce_add_ps and friends are not documenting some relevant safety requirements" to "_mm512_reduce_add_ps and friends are setting fast-math flags they should not set" on Feb 18, 2024
@RalfJung (Member, Author):

Specifically, if I were to do something like

let sum = _mm512_reduce_add_ps(a);
let vec = _mm512_set_ps(sum, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0);
let sum2 = _mm512_reduce_add_ps(vec);

then in my reading of reassoc, LLVM would be allowed to arbitrarily reorder all of the additions feeding into sum2, including the 16 that produced sum. There is nothing that constrains reassociation to stay "within" a single _mm512_reduce_add_ps.

I don't think this is a correct implementation of the Intel vendor intrinsic.
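[Editor's note: for concreteness, floating-point addition is not associative, so the reordering that reassoc permits is observable in the result; a minimal scalar example of the effect:]

```rust
fn main() {
    let big = 1e30f32;
    // Summing left-to-right, big and -big cancel first, so the 1.0 survives.
    let ordered = (big + -big) + 1.0;
    // Reassociated, the 1.0 is absorbed into -big by rounding, then cancelled.
    let reassociated = big + (-big + 1.0);
    assert_eq!(ordered, 1.0);
    assert_eq!(reassociated, 0.0);
}
```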

@workingjubilee (Member):

I do not believe it is, no.

@workingjubilee (Member) commented Feb 21, 2024

@nikic has confirmed to me that Ralf's concerns about reassoc being allowed to potentially do "basically whatever" are accurate, with the caveat that all the relevant operations have to be tagged with reassoc; reassociation can't "jump" across an operation that lacks the flag.

So yes, these implementations are very much not correct.

@workingjubilee (Member) commented Feb 21, 2024

The problem is that dropping reassoc doesn't seem to be correct either: the ordering Intel specifies is not serial.

@RalfJung (Member, Author):

Intel specifies a very particular order of summation. It's not left-to-right, which is what the no-reassoc version would do. From what I understand, there is in any case no single hardware instruction that performs this particular kind of summation, so either LLVM needs to support this specific operation (and lower it to the best instruction sequence), or we need to do the lowering ourselves in the implementation of _mm512_reduce_add_ps.

@workingjubilee (Member):

@RalfJung Hmm. There are a few different instruction sequences a compiler can use, but one of the "obvious" ones just repeatedly applies a "do one round of tree reduction" step, which works much as the Intel pseudocode describes.

@nikic (Contributor) commented Feb 21, 2024

Yes, in practice reassoc on reductions produces a tree reduction. Of course, this is not guaranteed from a semantics perspective.

@workingjubilee (Member):

@RalfJung FWIW, Niki mentions that "perform a tree reduction" was proposed in the past as a possible annotation for the reduce intrinsics, so perhaps that's the tree we should be barking up this time.

@RalfJung (Member, Author):

If that's a possibility then that would make the most sense, yes -- have an intrinsic that reduces in a well-defined order that matches what the Intel docs say (i.e., tree reduction).

@RalfJung (Member, Author):

My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

@Amanieu (Member) commented Feb 23, 2024

> My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

https://godbolt.org/z/7Wbhjeo7n

@RalfJung (Member, Author):

Thanks! Filed an issue: llvm/llvm-project#82813

@RalfJung (Member, Author):

@nikic It seems I don't know how to talk to LLVM people; they don't seem to agree with me on what it even means to have a LangRef. :/ Maybe you can help move the discussion in llvm/llvm-project#82813 somewhere productive?

@sayantn (Contributor) commented Jun 20, 2024

We could also avoid using simd_reduce_add_unordered altogether: as LLVM says (at least in LangRef), there is no guarantee of associativity in vector.reduce.add, so we can do what gcc does and hand-code the reduce-add ourselves. I did a small implementation in Godbolt for _mm512_reduce_add_ps here. It seems like LLVM is doing a spurious zero addition. I am no expert on floating point, but I think addition with +0.0 is a no-op (though strictly it isn't when the running sum is -0.0, whose sign it drops).
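[Editor's note: on the zero question, the likely reason clang passes -0.0 rather than +0.0 as the start value in the IR quoted earlier is that -0.0 is the identity element of IEEE addition, whereas adding +0.0 turns a -0.0 result into +0.0. A quick check:]

```rust
fn main() {
    // -0.0 is the additive identity: x + -0.0 == x bit-for-bit,
    // even when x is itself -0.0.
    let x = -0.0f32;
    assert_eq!((x + -0.0).to_bits(), x.to_bits());
    // +0.0 is not an identity: -0.0 + 0.0 rounds to +0.0, so a
    // spurious "+ 0.0" can flip the sign of a zero result.
    assert_eq!((x + 0.0).to_bits(), 0.0f32.to_bits());
}
```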

@Amanieu (Member) commented Jun 30, 2024

This is now fixed by #1594: all of these are now implemented by explicitly expanding to a sequence of operations instead of using the LLVM intrinsics.

Amanieu closed this as completed Jun 30, 2024

@RalfJung (Member, Author):

Should we remove the unordered intrinsics or are they still useful?

@Amanieu (Member) commented Jul 2, 2024

They may still be useful for generic simd (cc @workingjubilee)
