Stop UB complaints and autovectorization of scalar fletcher4 implementation #13631
Conversation
Force-pushed from 4bfbd1c to 3e7c6a6
This is definitely better than the current situation. Alternatively, I wonder if we could force the correct alignment with the aligned attribute so it could be safely vectorized.
module/zcommon/zfs_fletcher.c
Outdated
__attribute__((optimize("no-tree-vectorize")))
#elif defined(__clang__)
__attribute__((optnone))
#endif
I see the zstd code defines a DONT_VECTORIZE macro in module/zstd/lib/common/compiler.h. We could define our own version of this for use throughout the code.
As you already mentioned, I think we should update the superscalar implementations as well.
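For illustration, a minimal sketch of what such a project-wide macro could look like; the ZFS_NO_VECTORIZE name is invented here and the attribute spellings are the ones already quoted in this thread, so treat it as an assumption rather than what the PR ships:

/*
 * Hypothetical sketch of a shared "don't auto-vectorize this function"
 * macro, in the spirit of zstd's DONT_VECTORIZE; the macro name is an
 * assumption for illustration only.
 */
#if defined(__GNUC__) && !defined(__clang__)
#define	ZFS_NO_VECTORIZE	__attribute__((optimize("no-tree-vectorize")))
#elif defined(__clang__)
#define	ZFS_NO_VECTORIZE	__attribute__((optnone))
#else
#define	ZFS_NO_VECTORIZE
#endif

/* Possible usage on the scalar implementations: */
ZFS_NO_VECTORIZE
static void
fletcher_4_scalar_byteswap(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size);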
Happy apart from the optnone. The rest are nits/debatable (although I'd prefer comments for this; if you feel it doesn't fit with what ZFS usually does, so be it).
We went back and forth on this and while from a purity POV I'd prefer memcpy, I don't think there's anything wrong with this, and like you said, we use the same method elsewhere anyway.
module/zcommon/zfs_fletcher.c
Outdated
@@ -300,33 +300,47 @@ fletcher_2_byteswap(const void *buf, uint64_t size,
	(void) fletcher_2_incremental_byteswap((void *) buf, size, zcp);
}

ZFS_NO_SANITIZE_UNDEFINED
#if defined(__GNUC__) && !defined(__clang__)
This deserves a comment, IMO, as I can absolutely see us questioning this in a few years' time (... just like I did with the UBSAN annotation).
module/zcommon/zfs_fletcher.c
Outdated
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-tree-vectorize")))
#elif defined(__clang__)
__attribute__((optnone))
Requiring absolutely no optimisation for a memcpy might be interesting. Anyway, we want -fno-vectorize here for Clang, as it's less of a sledgehammer than optnone.
And by interesting you mean terrible. On x86, arm, and several other architectures the compiler will often perform the unaligned read directly without needing a copy. Turning off optimization would likely force the copy and probably prevent the compiler builtin from being used (forcing a call to the actual memcpy function).
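As a hedged sketch of the idiom under discussion (not code from this PR), this is the memcpy-based unaligned read that optimizing compilers typically fold into a single load; the helper name is made up for illustration:

#include <stdint.h>
#include <string.h>

/*
 * Sketch only: on x86, arm64, and similar targets an optimizing compiler
 * recognizes this memcpy and emits one plain (unaligned-tolerant) load;
 * with optimization disabled (e.g. optnone) it may instead fall back to
 * an actual memcpy() call.
 */
static inline uint64_t
load_u64_unaligned(const void *p)
{
	uint64_t v;
	(void) memcpy(&v, p, sizeof (v));
	return (v);
}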
Yep.
module/zcommon/zfs_fletcher.c
Outdated
static void
fletcher_4_scalar_byteswap(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size)
{
	zio_cksum_t *zcp = (zio_cksum_t *)ctx;
Suggested change:
-	zio_cksum_t *zcp = (zio_cksum_t *)ctx;
+	/* Drop the huge alignment constraint (64B) from fletcher_4_ctx_t */
+	zio_cksum_t *zcp = (zio_cksum_t *)ctx;
Force-pushed from 62333d9 to 1630bd4
lib/libzfs/libzfs_sendrecv.c
Outdated
@@ -2060,12 +2060,12 @@ send_prelim_records(zfs_handle_t *zhp, const char *from, int fd,
	int err = 0;
	char *packbuf = NULL;
	size_t buflen = 0;
-	zio_cksum_t zc = { {0} };
+	zio_cksum_t *zc = calloc(sizeof (zio_cksum_t), 1);
This seems a little heavy-handed. Is this for alignment purposes? Couldn't you just use a compiler directive to align this on the stack?
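A minimal sketch of what such a stack directive could look like, assuming zio_cksum_t from the ZFS headers and using 64 bytes only because that is what the union currently claims; this is an illustration of the suggestion, not the PR's final choice:

void
example_stack_alignment(void)
{
	/*
	 * Keep the checksum on the stack, but force the alignment the SIMD
	 * context type advertises instead of heap-allocating it (64 is an
	 * assumption mirroring the union's current annotation).
	 */
	zio_cksum_t zc __attribute__((aligned(64))) = { { 0 } };
	(void) zc;
}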
Yes, and then we get to update it every time we change SIMD alignment requirements, and god help us if someone does something clever and they're no longer mutually compatible.
The implementations are all written entirely with unaligned accesses anyway (except possibly the NEON one? I'll go boot my AArch64 testbed and see...), so I'm just going to drop the alignment definition in the union, because right now we're basically just saying "pretty please" and wishing we could have our cake and eat it too.
64-byte alignment is pretty future-proof, no? It might add some padding to your stack location, but it's going to be way better than allocating dynamically.
Having checked, the NEON implementation also appears not to require the alignment at all, so I think I'd prefer suggesting people add alignment in the places where they feel the difference is a significant win, rather than pretending it's aligned everywhere and then having to manually force alignment on the things we're munging into fletcher_4_ctx_t * to make that true.
(That said, UBsan also screams murder on AArch64 about the LZ4 implementation unrelated to these changes, so that's a different problem...)
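As an aside, a hedged illustration (not this PR's code) of the aligned-versus-unaligned distinction being discussed, using the x86 SSE2 intrinsics; the helper names are invented here:

#include <emmintrin.h>

/* Tolerates any address: movdqu-style unaligned load. */
static inline __m128i
load_any(const void *p)
{
	return (_mm_loadu_si128((const __m128i *)p));
}

/*
 * Requires a 16-byte-aligned address: movdqa-style load, which faults
 * if the alignment promise is broken.
 */
static inline __m128i
load_aligned_only(const void *p)
{
	return (_mm_load_si128((const __m128i *)p));
}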
For an access this infrequent, I don't think it's much of a significant win anymore on newer CPUs. For AArch64, particularly on "little" cores, aligned access is still quite a bit faster than unaligned. I get the impression this checksum is only written back to memory after summing a block's worth of bytes, so it's probably not going to be significant either way.
Now, on PowerPC, particularly the older, pre-POWER7 variants, alignment is a requirement, or an unaligned access takes two loads and a permute driven by the permutation vector returned from vec_lvsl. I'm not sure how far back OpenZFS's PowerPC support goes, or whether it needs the VSX extensions or supports the lesser VMX/AltiVec-only ones.
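For context, a rough sketch (not part of this PR) of the pre-POWER7 AltiVec idiom described above, where an unaligned vector load costs two aligned loads plus a permute controlled by vec_lvsl(); the helper name is an illustration:

#include <altivec.h>

/*
 * Sketch only: classic AltiVec has no unaligned vector load, so an
 * arbitrary address is handled with two aligned loads spanning the data
 * and a vec_perm() driven by the alignment vector vec_lvsl() returns
 * for that address.
 */
static inline vector unsigned int
altivec_load_unaligned(const unsigned int *p)
{
	vector unsigned int lo = vec_ld(0, p);		/* aligned block covering the start */
	vector unsigned int hi = vec_ld(15, p);		/* aligned block covering the end   */
	vector unsigned char perm = vec_lvsl(0, p);	/* permute control from the address */
	return (vec_perm(lo, hi, perm));
}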
Huh, my emailed reply got lost.
Anyway, my remark was that presently OpenZFS has no POWER-specific fletcher implementation, and the alignment is based only on the compile-time directives for supporting SSE2/4/AVX/etc., so changing it to not assume it's always aligned would currently only impact x86 and ARM.
(And it just grew the ability to notice if VSX is supported for the BLAKE3 PR.)
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Force-pushed from 87cfc56 to a4dee02
I've pushed it with no alignment requirement at all, after both checking and testing that the implementations use unaligned instructions everywhere.
TIL NEON doesn't care about alignment.
Looks good, thank you!
Apparently you can tell NEON that you promise an access is aligned, and it'll then optimize based on that and fault if you lied, but we don't appear to do that.
So, as far as I can tell, the only way I can tell Clang to not do this is either:
So I guess Clang is why we can't have nice things today. Time to just mark the entire file with
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
I've never seen gcc emit that decoration in the assembly, even when it had absolute certainty of alignment. I have seen the performance benefit of aligned access, but it was always automatic. I suspect much later AArch64 implementations don't need that.
module/Kbuild.in
Outdated
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
Perhaps I'm just overlooking it, but don't we also need to enforce this for the user space build?
Thanks, I missed that hunk in my git add, apparently.
Will push shortly - had my primary machine die yesterday, still getting set up on the replacement one...
Actually, wait, it depends what behavior we want to enforce.
-fno-tree-vectorize makes sense to enforce in userland too, but -mgeneral-regs-only seems unnecessary, since we nop out the FPU save/restore dance as irrelevant there.
I'm fine including both for consistency; I just thought I'd check, because the reasons for one don't apply in userland.
(To be clear, with the alignment requirement in the union dropped, -ftree-vectorize won't segfault any more in userland, and the kernel enforces you not doing that a dozen ways anyway, so this is just making explicit the behavior we desire.)
Staying consistent seems preferable to me. Let's just also include a comment explaining the situation for user space.
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
…sn't it
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Unfortunately, it looks like this breaks the build on CentOS 7 and the old ppc64 builder due to the old

Cute. I'll give it a go later.
Presently, there's no PPC-accelerated fletcher4 at all, and the alignment requirements were imposed per-arch based on what got included into the union definition, so this only affects x86 and ARM at the moment.
- Rich
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
The Linux kernel really should handle this for us by disabling all of the vector instructions, such that there is no need to pass either -mgeneral-regs-only or -fno-tree-vectorize.
-mgeneral-regs-only is a convenience flag meant to avoid requiring people to know about all the various ISA extensions. It is not available on all architectures, and it will break the build on x86_64 on GCC versions older than GCC 7.
-fno-tree-vectorize is only specified in Linux's arch/csky/Makefile and nowhere else in Linux, since it really should not be necessary. That said, I am fine with passing -fno-tree-vectorize here.
I didn't just add it for fun, no.
As the comment above explains, if the user forcibly overrides -march later in the line, then these are overridden as well, and boom goes the dynamite.
Alright. -mgeneral-regs-only is not backward compatible with older compilers, so unless we write an autotools check for it, we probably should drop it in favor of only doing -fno-tree-vectorize.
I was going to write an autoconf check, yes.
I think -fno-tree-vectorize might also need a check of that nature, but it's been a while since I looked, so I'll check when I get back to this.
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
I am not sure how far back FreeBSD's compiler support goes, but the same remark about -mgeneral-regs-only being a potential problem applies here.
# We would keep -mgeneral-regs-only, but on e.g. x64, things like atof
# can't be defined without implying using FPU regs, so it's a compile
# time error.
NOVECTOR := -fno-tree-vectorize
Does this fix a real problem, or is it just a precaution? I would rather not disable auto-vectorization in userspace, since we expect the compiler to attempt these things there, where using SIMD should always be safe.
That said, I know for a fact that the compiler will sometimes vectorize the mixing matrix calculations. I am not sure whether its vectorization makes a difference, but I am inclined to let it try.
#13605 says hello.
# can't be defined without implying using FPU regs, so it's a compile
# time error.
NOVECTOR := -fno-tree-vectorize
$(addprefix module/zcommon/libzpool_la-,zfs_fletcher.$(OBJEXT) zfs_fletcher.l$(OBJEXT)) : CFLAGS += $(NOVECTOR)
The same as above.
This is probably the uncontroversial part of #13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this... but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14649
(I'll rebase with a better description before merge assuming nobody has a reason this is a terrible idea, but since we're doing the opposite cast as input in a number of places, it seems no worse than status quo ante...)
Motivation and Context
Getting rid of "shhh nobody look it's fine we promise" excludes in the code is always good, plus with gcc defaulting to auto-vectorization at -O2 now, the results can be... explosive (#13605, #13620).
Description
When we cast the input to fletcher_4_scalar_native or friends to fletcher_4_ctx_t, we're promising that the thing is up to 64B (!) aligned, as far as the compiler's concerned. But because we're not actually guaranteeing that, auto-vectorizing the code results in trying to do an aligned write to the item, and that is sometimes going to crash (in userland; since the kernel has big flashing NO DON'T signs around auto-vectorizing, there it's probably just going to upset sanitizers).

But it turns out that everywhere we explicitly call the scalar implementation, we're just casting a zio_cksum_t * to fletcher_4_ctx_t *, so if that's ever incorrect behavior, we're going to horrifically crash a dozen ways to Sunday anyway.

I just dropped the forced alignment annotations, because it turns out all the implementations are written entirely with unaligned access instructions anyway; if we really think that feeding unaligned instructions aligned accesses is more performant in some spots, we can just force alignment there, rather than claiming it's aligned everywhere and then lying a few times.
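To make the failure mode concrete, here's a minimal, self-contained sketch using hypothetical stand-in types (not the actual ZFS definitions) of how claiming 64-byte alignment on the context union, while passing a plain stack object, invites aligned vector stores that can fault:

#include <stdint.h>

/* Hypothetical stand-ins for the real ZFS types, for illustration only. */
typedef struct { uint64_t zc_word[4]; } cksum_t;

typedef union {
	cksum_t scalar;
	/* the SIMD lane arrays live here in the real union */
} __attribute__((aligned(64))) ctx_claims_64b_t;

/*
 * The compiler may auto-vectorize this loop with aligned stores, because
 * the parameter type promises 64-byte alignment.
 */
void
update(ctx_claims_64b_t *ctx)
{
	for (int i = 0; i < 4; i++)
		ctx->scalar.zc_word[i] += (uint64_t)i;
}

void
caller(void)
{
	cksum_t zc = { { 0 } };			/* only naturally (8-byte) aligned */
	update((ctx_claims_64b_t *)&zc);	/* UB: the alignment promise is a lie */
}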
How Has This Been Tested?
Before this change, using -ftree-vectorize -march=znver2 would result in a zfs binary that crashed around half the time when trying to send on my 8700K or my 5900X, and removing the UBSan exceptions would result in erroring out hard immediately on send, with or without -ftree-vectorize.

After this change, with or without -ftree-vectorize, UBSan as far as I can see has no complaints about these functions, and I haven't made it crash again.
Checklist:
Signed-off-by
.