Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop UB complaints and autovectorization of scalar fletcher4 implementation #13631

Closed
wants to merge 5 commits into from

Conversation

rincebrain
Copy link
Contributor

@rincebrain rincebrain commented Jul 6, 2022

(I'll rebase with a better description before merge assuming nobody has a reason this is a terrible idea, but since we're doing the opposite cast as input in a number of places, it seems no worse than status quo ante...)

Motivation and Context

Getting rid of "shhh nobody look it's fine we promise" excludes in the code is always good, plus with gcc defaulting to auto-vectorization at O2 now, the results can be...explosive (#13605 #13620).

Description

When we cast the input to fletcher_4_scalar_native or friends to fletcher_4_ctx_t, we're promising that the thing is up to 64B (!) aligned, as far as the compiler's concerned, but because we're not actually guaranteeing that, auto-vectorizing the code results in trying to do an aligned write to the item, and that is sometimes going to crash (in userland; since the kernel has big flashing NO DON'T around auto-vectorizing, it's probably just going to upset sanitizers).

But it turns out, everywhere we explicitly call the scalar implementation, we're just casting a zio_cksum_t * to fletcher_4_ctx_t *, so if that's ever incorrect behavior, we're going to horrifically crash a dozen ways to Sunday anyway.

I just dropped the forced alignment annotations, because it turns out all the implementations are written entirely in unaligned access instructions anyway, so if we really think that giving unaligned instructions aligned accesses is more performant in some spots, we can just do that there, rather than claiming it's aligned everywhere and then lying a few times.

How Has This Been Tested?

Before this change, using -ftree-vectorize -march=znver2 would result in a zfs binary that crashed around half the time on trying to send on my 8700k or my 5900X, and removing the UBSan exceptions would result in erroring out hard immediately on send with or without the -ftree-vectorize.

After this change, with or without the -ftree-vectorize, UBSan as far as I can see has no complaints about these functions, and I haven't made it crash again.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@rincebrain rincebrain force-pushed the scalar_means_scalar branch from 4bfbd1c to 3e7c6a6 Compare July 7, 2022 00:18
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely better that the current situation. Alternately, I wonder if we could force the correct alignment with the aligned attribute so it could be safely vectorized.

module/zcommon/zfs_fletcher.c Outdated Show resolved Hide resolved
__attribute__((optimize("no-tree-vectorize")))
#elif defined(__clang__)
__attribute__((optnone))
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the zstd code defines a DONT_VECTORIZE macro in module/zstd/lib/common/compiler.h. We could define our own version of this for use throughout the code.

As you already mentioned I think we should update the superscaler implementations as well.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 7, 2022
Copy link
Contributor

@thesamesam thesamesam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy apart from the optnone. The rest are nits/debatable (although I'd prefer comments for this, if you feel it doesn't fit with what ZFS usually does, so be it).

We went back and forth on this and while from a purity POV I'd prefer memcpy, I don't think there's anything wrong with this, and like you said, we use the same method elsewhere anyway.

@@ -300,33 +300,47 @@ fletcher_2_byteswap(const void *buf, uint64_t size,
(void) fletcher_2_incremental_byteswap((void *) buf, size, zcp);
}

ZFS_NO_SANITIZE_UNDEFINED
#if defined(__GNUC__) && !defined(__clang__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves a comment, IMO, as I can absolutely see us questioning this in a few years time (... just like I did with the UBSAN annotation).

module/zcommon/zfs_fletcher.c Outdated Show resolved Hide resolved
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-tree-vectorize")))
#elif defined(__clang__)
__attribute__((optnone))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring absolutely no optimisation for a memcpy might be interesting.

Anyway, we want -fno-vectorize here for Clang, as less of a sledgehammer than optnone.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And by interesting you mean terrible. On x86, arm, and several other architectures the compiler will often perform the unaligned read directly without needing a copy. Turning off optimization would likely force the copy and probably disable the compiler builtin from being used (forcing the actual memcpy function).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

static void
fletcher_4_scalar_byteswap(fletcher_4_ctx_t *ctx, const void *buf,
uint64_t size)
{
zio_cksum_t *zcp = (zio_cksum_t *)ctx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
zio_cksum_t *zcp = (zio_cksum_t *)ctx;
/* Drop the huge alignment constraint (64B) from fletcher_4_ctx_t */
zio_cksum_t *zcp = (zio_cksum_t *)ctx;

@rincebrain rincebrain force-pushed the scalar_means_scalar branch from 62333d9 to 1630bd4 Compare July 17, 2022 17:27
@@ -2060,12 +2060,12 @@ send_prelim_records(zfs_handle_t *zhp, const char *from, int fd,
int err = 0;
char *packbuf = NULL;
size_t buflen = 0;
zio_cksum_t zc = { {0} };
zio_cksum_t *zc = calloc(sizeof (zio_cksum_t), 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little heavy handed. Is this for alignment purposes? Couldn't you just use a compiler directive to align this on the stack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and then we get to update it every time we change SIMD alignment requirements, and god help us if someone does something clever and they're no longer mutually compatible.

The implementations are all written entirely with unaligned accesses anyway (except possibly the NEON one? I'll go boot my AArch64 testbed and see...), I'm just going to drop the alignment definition in the union, because we're basically just saying "pretty please" and wishing we had our cake and ate it too, now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 byte alignment is pretty futureproof, no? It might add some padding to your stack location but it's going to be way better than allocating dynamically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having checked, the NEON implementation also appears to not require the alignment at all, so I think I prefer suggesting people add alignment if there are places they feel the difference is a significant win, rather than pretending it's aligned everywhere and then having to manually force things we're munging into fletcher_4_ctx_t * to make that true.

(That said, UBsan also screams murder on AArch64 about the LZ4 implementation unrelated to these changes, so that's a different problem...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for an infrequent enough access, I don't think it's a significant win much anymore on newer CPUs. For AARCH64, particularly on "little" cores, aligned access is still quite a bit faster than unaligned. I get the impression this checksum is only written back to memory after summing a block's worth of bytes, so it's probably not going to be significant either way.

Now, on PowerPC, particularly the older, pre-power7 variants, alignment is a requirement or it takes two loads and a permute from a permutation vector returned from vec_lvsl. I'm not sure how far back OpenZFS's PowerPC support goes back and if it needs the VSX extensions or supports the less VMX/altivec ones.

Copy link
Contributor Author

@rincebrain rincebrain Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, my emailed reply got lost.

Anyway, my remark was that presently OpenZFS has no POWER-specific fletcher implementation, and the alignment is based just on the compile-time directives for supporting SSE2/4/AVX/etc, so changing it to not think it's always aligned would currently only impact x86 and ARM.

(And it just grew the ability to notice if VSX is supported for the BLAKE3 PR.)

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@rincebrain rincebrain force-pushed the scalar_means_scalar branch from 87cfc56 to a4dee02 Compare July 19, 2022 20:53
@rincebrain
Copy link
Contributor Author

I've pushed it just not having an alignment requirement, after both checking and testing that the implementations are entirely using unaligned instructions everywhere.

Copy link
Contributor

@thesamesam thesamesam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL NEON doesn't care about alignment.

Looks good, thank you!

@rincebrain
Copy link
Contributor Author

Apparently you can tell NEON that you promise an access is aligned and it'll then optimize based on it and fault if you lied, but we don't appear to.

@rincebrain
Copy link
Contributor Author

So, as far as I can tell, the only way I can tell Clang to not do this is either:
#pragma clang loop vectorize(disable) (which I can't turn into a preprocessor macro)
or a long platform-specific sea of __attribute__((target("no-sse,no-sse2,..."))), the latter of which not being directly what I mean anyway.

So I guess Clang is why we can't have nice things today. Time to just mark the entire file with -fno-vectorize for Clang and -fno-tree-vectorize for gcc, and if people override that, they get to keep all the pieces...

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@KungFuJesus
Copy link

Apparently you can tell NEON that you promise an access is aligned and it'll then optimize based on it and fault if you lied, but we don't appear to.

I've never seen gcc emit that decoration in the assembly even when it had absolute certainty of alignment. I have seen the performance benefit of aligned access but it was always automatic. I suspect much later implementations of aarch64 don't need that.

module/Kbuild.in Outdated
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I'm just overlooking it, but don't we also need to enforce this for the user space build?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed that hunk in my git add, apparently.

Will push shortly - had my primary machine die yesterday, still getting set up on the replacement one...

Copy link
Contributor Author

@rincebrain rincebrain Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, wait, depends what behavior we want to enforce.

-fno-tree-vectorize makes sense to enforce in userland too, but -mgeneral-regs-only seems unnecessary since we nop out the fpu save/restore dance as irrelevant.

I'm fine including both for consistency, I just thought I'd check because the reasons for one don't apply in userland.

(To be clear, with the alignment requirement in the union dropped, -ftree-vectorize won't segfault any more in userland, and the kernel enforces you not doing that a dozen ways anyway, so this is just making explicit the behavior we desire.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staying consistent seems preferable to me. Let's just also include a comment explaining the situation for user space.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
…sn't it

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@behlendorf
Copy link
Contributor

Unfortunately, it looks like this breaks the build on CentOS 7 and the old ppc64 builder due to the old gcc version. Seems like a configure check will be needed for this.

gcc: error: unrecognized command line option '-mgeneral-regs-only'

@rincebrain
Copy link
Contributor Author

Cute. I'll give it a go later.

@rincebrain
Copy link
Contributor Author

rincebrain commented Oct 11, 2022 via email

# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Linux kernel really should handle this for us by disabling all of the vector instructions such that there is no need to pass either -mgeneral-regs-only or -fno-tree-vectorize.

-mgeneral-regs-only is a convenience flag meant to avoid requiring people to know about all various isa extensions. It is not available on all architectures and it will break the build on x86_64 on GCC versions older than gcc 7.

-fno-tree-vectorize is only specified in Linux's arch/csky/Makefile and no where else in Linux, since it really should not be necessary. That said, I am fine with passing -fno-tree-vectorize here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't just add it for fun, no.

As the comment above explains, if the user forcibly overrides -march later in the line, then these are overridden as well, and boom goes the dynamite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. -mgeneral-regs-only is not backward compatible with older compilers so unless we write an autotools check for it, we probably should drop it in favor of only doing -fno-tree-vectorize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to write an autoconf check, yes.

I think -fno-tree-vectorize might also need a check of that nature, but it's been a while since I looked, so I'll check when I get back to this.

# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing.
#
# Good luck.
CFLAGS += -mgeneral-regs-only -fno-tree-vectorize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how far back FreeBSD's compiler support goes, but the same remark about -mgeneral-regs-only being a potential problem applies here.

# We would keep -mgeneral-regs-only, but on e.g. x64, things like atof
# can't be defined without implying using FPU regs, so it's a compile
# time error.
NOVECTOR := -fno-tree-vectorize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this fix a real problem or is it just a precaution? I would rather not disable auto-vectorization in userspace since we expect the compiler to attempt these things in userspace, where using SIMD should always be safe.

That said, I know for a fact that the compiler will sometimes vectorize the mixing matrix calculations. I am not sure if its vectorization makes a difference, but I am inclined to let it try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#13605 says hello.

# can't be defined without implying using FPU regs, so it's a compile
# time error.
NOVECTOR := -fno-tree-vectorize
$(addprefix module/zcommon/libzpool_la-,zfs_fletcher.$(OBJEXT) zfs_fletcher.l$(OBJEXT)) : CFLAGS += $(NOVECTOR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

rincebrain added a commit to rincebrain/zfs that referenced this pull request Mar 19, 2023
This is probably the uncontroversial part of openzfs#13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 20, 2023
This is probably the uncontroversial part of openzfs#13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
behlendorf pushed a commit that referenced this pull request Mar 24, 2023
This is probably the uncontroversial part of #13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14649
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
This is probably the uncontroversial part of openzfs#13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#14649
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
This is probably the uncontroversial part of openzfs#13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#14649
@rincebrain rincebrain closed this Sep 11, 2023
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
This is probably the uncontroversial part of openzfs#13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#14649
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants