-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop UB complaints and autovectorization of scalar fletcher4 implementation #13631
Conversation
4bfbd1c
to
3e7c6a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely better that the current situation. Alternately, I wonder if we could force the correct alignment with the aligned
attribute so it could be safely vectorized.
module/zcommon/zfs_fletcher.c
Outdated
__attribute__((optimize("no-tree-vectorize"))) | ||
#elif defined(__clang__) | ||
__attribute__((optnone)) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the zstd code defines a DONT_VECTORIZE
macro in module/zstd/lib/common/compiler.h
. We could define our own version of this for use throughout the code.
As you already mentioned I think we should update the superscaler implementations as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy apart from the optnone. The rest are nits/debatable (although I'd prefer comments for this, if you feel it doesn't fit with what ZFS usually does, so be it).
We went back and forth on this and while from a purity POV I'd prefer memcpy, I don't think there's anything wrong with this, and like you said, we use the same method elsewhere anyway.
module/zcommon/zfs_fletcher.c
Outdated
@@ -300,33 +300,47 @@ fletcher_2_byteswap(const void *buf, uint64_t size, | |||
(void) fletcher_2_incremental_byteswap((void *) buf, size, zcp); | |||
} | |||
|
|||
ZFS_NO_SANITIZE_UNDEFINED | |||
#if defined(__GNUC__) && !defined(__clang__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves a comment, IMO, as I can absolutely see us questioning this in a few years time (... just like I did with the UBSAN annotation).
module/zcommon/zfs_fletcher.c
Outdated
#if defined(__GNUC__) && !defined(__clang__) | ||
__attribute__((optimize("no-tree-vectorize"))) | ||
#elif defined(__clang__) | ||
__attribute__((optnone)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requiring absolutely no optimisation for a memcpy might be interesting.
Anyway, we want -fno-vectorize here for Clang, as less of a sledgehammer than optnone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And by interesting you mean terrible. On x86, arm, and several other architectures the compiler will often perform the unaligned read directly without needing a copy. Turning off optimization would likely force the copy and probably disable the compiler builtin from being used (forcing the actual memcpy function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
module/zcommon/zfs_fletcher.c
Outdated
static void | ||
fletcher_4_scalar_byteswap(fletcher_4_ctx_t *ctx, const void *buf, | ||
uint64_t size) | ||
{ | ||
zio_cksum_t *zcp = (zio_cksum_t *)ctx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zio_cksum_t *zcp = (zio_cksum_t *)ctx; | |
/* Drop the huge alignment constraint (64B) from fletcher_4_ctx_t */ | |
zio_cksum_t *zcp = (zio_cksum_t *)ctx; |
62333d9
to
1630bd4
Compare
lib/libzfs/libzfs_sendrecv.c
Outdated
@@ -2060,12 +2060,12 @@ send_prelim_records(zfs_handle_t *zhp, const char *from, int fd, | |||
int err = 0; | |||
char *packbuf = NULL; | |||
size_t buflen = 0; | |||
zio_cksum_t zc = { {0} }; | |||
zio_cksum_t *zc = calloc(sizeof (zio_cksum_t), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little heavy handed. Is this for alignment purposes? Couldn't you just use a compiler directive to align this on the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and then we get to update it every time we change SIMD alignment requirements, and god help us if someone does something clever and they're no longer mutually compatible.
The implementations are all written entirely with unaligned accesses anyway (except possibly the NEON one? I'll go boot my AArch64 testbed and see...), I'm just going to drop the alignment definition in the union, because we're basically just saying "pretty please" and wishing we had our cake and ate it too, now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64 byte alignment is pretty futureproof, no? It might add some padding to your stack location but it's going to be way better than allocating dynamically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having checked, the NEON implementation also appears to not require the alignment at all, so I think I prefer suggesting people add alignment if there are places they feel the difference is a significant win, rather than pretending it's aligned everywhere and then having to manually force things we're munging into fletcher_4_ctx_t *
to make that true.
(That said, UBsan also screams murder on AArch64 about the LZ4 implementation unrelated to these changes, so that's a different problem...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for an infrequent enough access, I don't think it's a significant win much anymore on newer CPUs. For AARCH64, particularly on "little" cores, aligned access is still quite a bit faster than unaligned. I get the impression this checksum is only written back to memory after summing a block's worth of bytes, so it's probably not going to be significant either way.
Now, on PowerPC, particularly the older, pre-power7 variants, alignment is a requirement or it takes two loads and a permute from a permutation vector returned from vec_lvsl. I'm not sure how far back OpenZFS's PowerPC support goes back and if it needs the VSX extensions or supports the less VMX/altivec ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, my emailed reply got lost.
Anyway, my remark was that presently OpenZFS has no POWER-specific fletcher implementation, and the alignment is based just on the compile-time directives for supporting SSE2/4/AVX/etc, so changing it to not think it's always aligned would currently only impact x86 and ARM.
(And it just grew the ability to notice if VSX is supported for the BLAKE3 PR.)
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
87cfc56
to
a4dee02
Compare
I've pushed it just not having an alignment requirement, after both checking and testing that the implementations are entirely using unaligned instructions everywhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL NEON doesn't care about alignment.
Looks good, thank you!
Apparently you can tell NEON that you promise an access is aligned and it'll then optimize based on it and fault if you lied, but we don't appear to. |
So, as far as I can tell, the only way I can tell Clang to not do this is either: So I guess Clang is why we can't have nice things today. Time to just mark the entire file with |
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
I've never seen gcc emit that decoration in the assembly even when it had absolute certainty of alignment. I have seen the performance benefit of aligned access but it was always automatic. I suspect much later implementations of aarch64 don't need that. |
module/Kbuild.in
Outdated
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing. | ||
# | ||
# Good luck. | ||
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I'm just overlooking it, but don't we also need to enforce this for the user space build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I missed that hunk in my git add, apparently.
Will push shortly - had my primary machine die yesterday, still getting set up on the replacement one...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, wait, depends what behavior we want to enforce.
-fno-tree-vectorize
makes sense to enforce in userland too, but -mgeneral-regs-only
seems unnecessary since we nop out the fpu save/restore dance as irrelevant.
I'm fine including both for consistency, I just thought I'd check because the reasons for one don't apply in userland.
(To be clear, with the alignment requirement in the union dropped, -ftree-vectorize
won't segfault any more in userland, and the kernel enforces you not doing that a dozen ways anyway, so this is just making explicit the behavior we desire.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Staying consistent seems preferable to me. Let's just also include a comment explaining the situation for user space.
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
…sn't it Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Unfortunately, it looks like this breaks the build on CentOS 7 and the old ppc64 builder due to the old
|
Cute. I'll give it a go later. |
Presently, there's not a PPC accelerated fletcher4 at all, and the
alignment requirements were imposed per-arch based on what got included
into the union definition, so this only affects x86 and ARM at the moment.
- Rich
…On Sun, Jul 17, 2022 at 10:03 PM Adam Stylinski ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In lib/libzfs/libzfs_sendrecv.c
<#13631 (comment)>:
> @@ -2060,12 +2060,12 @@ send_prelim_records(zfs_handle_t *zhp, const char *from, int fd,
int err = 0;
char *packbuf = NULL;
size_t buflen = 0;
- zio_cksum_t zc = { {0} };
+ zio_cksum_t *zc = calloc(sizeof (zio_cksum_t), 1);
So for an infrequent enough access, I don't think it's a significant win
much anymore on newer CPUs. For AARCH64, particularly on "little" cores,
aligned access is still quite a bit faster than unaligned. I get the
impression this checksum is only written back to memory after summing a
block's worth of bytes, so it's probably not going to be significant either
way.
Now, on PowerPC, particularly the older, pre-power7 variants, alignment is
a requirement or it takes two loads and a permute from a permutation vector
returned from vec_lvsl. I'm not sure how far back OpenZFS's PowerPC support
goes back and if it needs the VSX extensions or supports the less
VMX/altivec ones.
—
Reply to this email directly, view it on GitHub
<#13631 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABUI7ILOV32RNYP2OQYV73VUS3Q5ANCNFSM523PCLZA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing. | ||
# | ||
# Good luck. | ||
ZFS_MODULE_CFLAGS += -mgeneral-regs-only -fno-tree-vectorize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Linux kernel really should handle this for us by disabling all of the vector instructions such that there is no need to pass either -mgeneral-regs-only or -fno-tree-vectorize.
-mgeneral-regs-only
is a convenience flag meant to avoid requiring people to know about all various isa extensions. It is not available on all architectures and it will break the build on x86_64 on GCC versions older than gcc 7.
-fno-tree-vectorize
is only specified in Linux's arch/csky/Makefile
and no where else in Linux, since it really should not be necessary. That said, I am fine with passing -fno-tree-vectorize
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't just add it for fun, no.
As the comment above explains, if the user forcibly overrides -march later in the line, then these are overridden as well, and boom goes the dynamite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. -mgeneral-regs-only is not backward compatible with older compilers so unless we write an autotools check for it, we probably should drop it in favor of only doing -fno-tree-vectorize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to write an autoconf check, yes.
I think -fno-tree-vectorize might also need a check of that nature, but it's been a while since I looked, so I'll check when I get back to this.
# will undo -mgeneral-regs-only, and gcc's -O2 starting in 12 does autovectorizing. | ||
# | ||
# Good luck. | ||
CFLAGS += -mgeneral-regs-only -fno-tree-vectorize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how far back FreeBSD's compiler support goes, but the same remark about -mgeneral-regs-only
being a potential problem applies here.
# We would keep -mgeneral-regs-only, but on e.g. x64, things like atof | ||
# can't be defined without implying using FPU regs, so it's a compile | ||
# time error. | ||
NOVECTOR := -fno-tree-vectorize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this fix a real problem or is it just a precaution? I would rather not disable auto-vectorization in userspace since we expect the compiler to attempt these things in userspace, where using SIMD should always be safe.
That said, I know for a fact that the compiler will sometimes vectorize the mixing matrix calculations. I am not sure if its vectorization makes a difference, but I am inclined to let it try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#13605 says hello.
# can't be defined without implying using FPU regs, so it's a compile | ||
# time error. | ||
NOVECTOR := -fno-tree-vectorize | ||
$(addprefix module/zcommon/libzpool_la-,zfs_fletcher.$(OBJEXT) zfs_fletcher.l$(OBJEXT)) : CFLAGS += $(NOVECTOR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above.
This is probably the uncontroversial part of openzfs#13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
This is probably the uncontroversial part of openzfs#13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
This is probably the uncontroversial part of #13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14649
This is probably the uncontroversial part of openzfs#13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#14649
This is probably the uncontroversial part of openzfs#13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#14649
This is probably the uncontroversial part of openzfs#13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#14649
(I'll rebase with a better description before merge assuming nobody has a reason this is a terrible idea, but since we're doing the opposite cast as input in a number of places, it seems no worse than status quo ante...)
Motivation and Context
Getting rid of "shhh nobody look it's fine we promise" excludes in the code is always good, plus with gcc defaulting to auto-vectorization at O2 now, the results can be...explosive (#13605 #13620).
Description
When we cast the input to
fletcher_4_scalar_native
or friends tofletcher_4_ctx_t
, we're promising that the thing is up to 64B (!) aligned, as far as the compiler's concerned, but because we're not actually guaranteeing that, auto-vectorizing the code results in trying to do an aligned write to the item, and that is sometimes going to crash (in userland; since the kernel has big flashing NO DON'T around auto-vectorizing, it's probably just going to upset sanitizers).But it turns out, everywhere we explicitly call the scalar implementation, we're just casting a
zio_cksum_t *
tofletcher_4_ctx_t *
, so if that's ever incorrect behavior, we're going to horrifically crash a dozen ways to Sunday anyway.I just dropped the forced alignment annotations, because it turns out all the implementations are written entirely in unaligned access instructions anyway, so if we really think that giving unaligned instructions aligned accesses is more performant in some spots, we can just do that there, rather than claiming it's aligned everywhere and then lying a few times.
How Has This Been Tested?
Before this change, using
-ftree-vectorize -march=znver2
would result in azfs
binary that crashed around half the time on trying to send on my 8700k or my 5900X, and removing the UBSan exceptions would result in erroring out hard immediately on send with or without the-ftree-vectorize
.After this change, with or without the
-ftree-vectorize
, UBSan as far as I can see has no complaints about these functions, and I haven't made it crash again.Types of changes
Checklist:
Signed-off-by
.