
_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533

Closed · RalfJung opened this issue Feb 17, 2024 · 19 comments

@RalfJung (Member) commented Feb 17, 2024

Today I learned about the existence of the simd_reduce_add_unordered intrinsic. When called on a float vector, it compiles to LLVM's vector.reduce.fadd with the "fast" flag set, which means that passing in NAN or INF is UB, and that optimizations are allowed "to treat the sign of a zero argument or zero result as insignificant" (which I take to mean that the sign of input zeros may be non-deterministically swapped and that returned zeros have non-deterministic sign).

This intrinsic is not used much in stdarch, but it has a total of 8 uses, all in avx512f.rs. 4 of these are integer intrinsics, where it should be entirely equivalent to simd_reduce_add; it is not clear why the "unordered" version is used. The other 4 are float intrinsics, _mm512_reduce_add_ps being the first:

https://github.com/rust-lang/stdarch/blob/4d9c0bb591336792c4c4baf293d0acc944e57e28/crates/core_arch/src/x86/avx512f.rs#L31262-L31270

/// Reduce the packed single-precision (32-bit) floating-point elements in a by addition. Returns the sum of all elements in a.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=4562)
#[inline]
#[target_feature(enable = "avx512f")]
#[unstable(feature = "stdarch_x86_avx512", issue = "111137")]
pub unsafe fn _mm512_reduce_add_ps(a: __m512) -> f32 {
    simd_reduce_add_unordered(a.as_f32x16())
}

Neither the docs here nor Intel's docs mention that this is UB on NAN or INF, nor the concerns around signed zeros and the unspecified order of addition. Given that the Intel docs should be the authoritative docs (since this is a vendor intrinsic), why is it even correct to use fast-math flags here? Either the docs need to be updated to state the fast-math preconditions, or the implementation needs to be updated to avoid the fast-math flag. Maybe it should use only "reassoc" rather than the full "fast" flag? But even that should probably be mentioned in the docs.
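[Editor's note: a quick sanity check on the integer case mentioned above. Wrapping (two's-complement) addition is associative and commutative, so for the integer reductions the ordered and unordered variants must produce the same result regardless of summation order; this is illustrative only, not stdarch code.]

```rust
fn main() {
    // Wrapping addition is associative and commutative, so any
    // reduction order over these lanes yields the same result.
    let lanes: [u32; 4] = [u32::MAX, 1, 2, 3];
    let left_to_right = lanes.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));
    let tree = lanes[0]
        .wrapping_add(lanes[1])
        .wrapping_add(lanes[2].wrapping_add(lanes[3]));
    assert_eq!(left_to_right, tree); // both are 5 (mod 2^32)
}
```

This is exactly why the "unordered" variant buys nothing for integers: no observable behavior depends on the order.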

@Amanieu (Member) commented Feb 18, 2024

It is definitely incorrect for these intrinsics to be using the fast-math flag.

Here is the LLVM IR that clang generates for this intrinsic:

%0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)

@RalfJung (Member, Author) commented Feb 18, 2024

That allows LLVM to add the elements in any order, and also to perform reassociation optimizations when the result is fed into another reassoc operation. I don't see how that matches the Intel docs, which describe a very particular order of summation:

DEFINE REDUCE_ADD(src, len) {
	IF len == 2
		RETURN src[31:0] + src[63:32]
	FI
	len := len / 2
	FOR j:= 0 to (len-1)
		i := j*32
		src[i+31:i] := src[i+31:i] + src[i+32*len+31:i+32*len]
	ENDFOR
	RETURN REDUCE_ADD(src[32*len-1:0], len)
}
dst[31:0] := REDUCE_ADD(a, 16)

(It seems like Intel uses array[last_elem:first_elem] syntax for bitwise subslicing, which must be the least intuitive subslicing syntax I have ever seen.)
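[Editor's note: the pseudocode above amounts to a pairwise tree reduction, folding the upper half of the vector onto the lower half and halving the length each round. A plain-Rust scalar model of that order (a hypothetical illustration, not the stdarch implementation):]

```rust
// Scalar model of Intel's REDUCE_ADD pseudocode: repeatedly add the
// upper half of the buffer onto the lower half, then recurse on the
// (now half-length) lower half until one element remains.
fn reduce_add_tree(src: [f32; 16]) -> f32 {
    let mut buf = src.to_vec();
    while buf.len() > 1 {
        let half = buf.len() / 2;
        for i in 0..half {
            buf[i] += buf[i + half];
        }
        buf.truncate(half);
    }
    buf[0]
}

fn main() {
    let v: [f32; 16] = core::array::from_fn(|i| (i + 1) as f32);
    // 1 + 2 + ... + 16 = 136, exact here; in general the tree order
    // can round differently from left-to-right summation.
    assert_eq!(reduce_add_tree(v), 136.0);
}
```

Note that this order is neither left-to-right (what a plain serial lowering would do) nor arbitrary (what reassoc permits), which is the crux of the disagreement below.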

RalfJung changed the title from "_mm512_reduce_add_ps and friends are not documenting some relevant safety requirements" to "_mm512_reduce_add_ps and friends are setting fast-math flags they should not set" on Feb 18, 2024
@RalfJung (Member, Author):

Specifically, if I were to do something like

let sum = _mm512_reduce_add_ps(a);
let vec = _mm512_set_ps(sum, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0);
let sum2 = _mm512_reduce_add_ps(vec);

then in my reading of reassoc, LLVM would be allowed to arbitrarily reorder all of the additions feeding into sum2, including the 16 that produced sum. There is nothing that constrains reassociation to stay "within" a single _mm512_reduce_add_ps.

I don't think this is a correct implementation of the Intel vendor intrinsic.
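[Editor's note: for concreteness, floating-point addition is not associative, so the reordering that reassoc permits is observable in the result; a minimal scalar example of the effect:]

```rust
fn main() {
    let big = 1e30f32;
    // Summing left-to-right, big and -big cancel first, so the 1.0 survives.
    let ordered = (big + -big) + 1.0;
    // Reassociated, the 1.0 is absorbed into -big by rounding, then cancelled.
    let reassociated = big + (-big + 1.0);
    assert_eq!(ordered, 1.0);
    assert_eq!(reassociated, 0.0);
}
```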

@workingjubilee (Member):

I do not believe it is, no.

@workingjubilee (Member) commented Feb 21, 2024

@nikic has confirmed to me that Ralf's concerns about reassoc being allowed to potentially do "basically whatever" are accurate, with the caveat that all the relevant operations have to be tagged with reassoc; reassociation can't "jump" across an operation that lacks the flag.

So yes, these implementations are very much not correct.

@workingjubilee (Member) commented Feb 21, 2024

The problem is that dropping reassoc doesn't seem to be correct either: the ordering Intel specifies is not serial.

@RalfJung (Member, Author):

Intel specifies a very particular order of summation. It's not left-to-right, which is what the no-reassoc version would do. From what I understand, there is in any case no single hardware instruction that performs this particular kind of summation, so either LLVM needs to support this specific operation (and lower it to the best instruction sequence), or we need to do the lowering ourselves in the implementation of _mm512_reduce_add_ps.

@workingjubilee (Member):

@RalfJung Hmm. There are a few different instruction sequences a compiler can use, but one of the "obvious" ones just repeatedly applies a "do one round of tree reduction" step, which works much as the Intel pseudocode describes.

@nikic (Contributor) commented Feb 21, 2024

Yes, in practice reassoc on reductions produces a tree reduction. Of course, this is not guaranteed from a semantics perspective.

@workingjubilee (Member):

@RalfJung FWIW, Niki mentions that "perform a tree reduction" was proposed in the past as a possible annotation for the reduce intrinsics, so perhaps that's the tree we should be barking up this time.

@RalfJung (Member, Author):

If that's a possibility then that would make the most sense, yes -- have an intrinsic that reduces in a well-defined order that matches what the Intel docs say (i.e., tree reduction).

@RalfJung (Member, Author):

My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

@Amanieu (Member) commented Feb 23, 2024

> My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

https://godbolt.org/z/7Wbhjeo7n

@RalfJung (Member, Author):

Thanks! Filed an issue: llvm/llvm-project#82813

@RalfJung (Member, Author):

@nikic It seems I don't know how to talk to LLVM people; they don't seem to agree with me on what it even means to have a LangRef. :/ Maybe you can help move the discussion in llvm/llvm-project#82813 somewhere productive?

@sayantn (Contributor) commented Jun 20, 2024

We could also avoid using simd_reduce_add_unordered altogether: as LLVM says (at least in LangRef), there is no guarantee of associativity in vector.reduce.add, so we can do what gcc does and hand-code the reduce-add ourselves. I did a small implementation in Godbolt for _mm512_reduce_add_ps here. It seems like LLVM is doing a spurious zero addition. I am no expert on floating point, but I think addition with +0.0 is a no-op (though strictly it isn't when the running sum is -0.0, whose sign it drops).
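[Editor's note: on the zero question, the likely reason clang passes -0.0 rather than +0.0 as the start value in the IR quoted earlier is that -0.0 is the identity element of IEEE addition, whereas adding +0.0 turns a -0.0 result into +0.0. A quick check:]

```rust
fn main() {
    // -0.0 is the additive identity: x + -0.0 == x bit-for-bit,
    // even when x is itself -0.0.
    let x = -0.0f32;
    assert_eq!((x + -0.0).to_bits(), x.to_bits());
    // +0.0 is not an identity: -0.0 + 0.0 rounds to +0.0, so a
    // spurious "+ 0.0" can flip the sign of a zero result.
    assert_eq!((x + 0.0).to_bits(), 0.0f32.to_bits());
}
```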

@Amanieu (Member) commented Jun 30, 2024

This is now fixed by #1594: all of these are now implemented by explicitly expanding to a sequence of operations instead of using the LLVM intrinsics.

Amanieu closed this as completed Jun 30, 2024

@RalfJung (Member, Author):

Should we remove the unordered intrinsics or are they still useful?

@Amanieu (Member) commented Jul 2, 2024

They may still be useful for generic simd (cc @workingjubilee)
