Move ternlog recognition to lower #86228
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Move ternlog insertion from importervectorization.cpp to lower to also handle user cases + LowerMemcmp vectorization. Handling for other cases where we can insert ternlog is left up for grabs.

As a bonus, it now handles unrollings like these:

```csharp
bool Test(int[] a, int[] b) =>
    a.AsSpan(0, 10).SequenceEqual(b);
```

```diff
  vmovups    ymm0, ymmword ptr [rcx]
  vmovups    ymm1, ymmword ptr [rax]
  vmovups    ymm2, ymmword ptr [rcx+08H]
  vmovups    ymm3, ymmword ptr [rax+08H]
  vpxor      ymm0, ymm0, ymm1
- vpxor      ymm1, ymm2, ymm3
- vpor       ymm0, ymm0, ymm1
+ vpternlogq ymm0, ymm2, ymm3, -10
  vptest     ymm0, ymm0
  sete       al
  movzx      rax, al
```
src/coreclr/jit/lowerxarch.cpp
Outdated
```cpp
case NI_SSE2_Or:
case NI_AVX2_Or:
case NI_AVX512F_Or:
case NI_SSE2_Xor:
case NI_AVX2_Xor:
case NI_AVX512F_Xor:
{
    GenTree* ternLogic = RecognizeHWTernaryLog(node);
    if (ternLogic != nullptr)
    {
        // Revisit this node again.
        return ternLogic->gtPrev;
    }
    break;
}
```
What's the primary motivation for it being in lowering rather than morph? Most of our pattern-based transforms are recognized in morph and done there instead (including ROL/ROR, AND_NOT, etc.).
@tannergooding mainly, two reasons:
- Handle LowerMemCmp which is also in lower
- Handle user patterns with explicit intrinsics (since at this point we'll lower all cross-plat helpers to direct intrinsics)
For 1, it seems like it'd be easier to just special case it. That's what we do for most of our other patterns as well. It might be even better if we had a way to share most of the necessary logic between morph/lowering, that way we can get the early ones (which should also help TP) and also avoid missing any of the late ones; but that's not something we're doing elsewhere atm.
For 2, I don't see how that's impacted. We'll be recognizing the explicit intrinsics either way whether in morph or lowering, right?
src/coreclr/jit/lowerxarch.cpp
Outdated
```cpp
if (IsHWBitwiseOp(node, &oper1))
{
    if (IsHWBitwiseOp(op2, &oper2))
    {
        if ((oper1 == GT_OR) && (oper2 == GT_XOR))
        {
            mask = static_cast<uint8_t>(0xF0 | (0xCC ^ 0xAA));
        }
        else if ((oper1 == GT_XOR) && (oper2 == GT_OR))
        {
            mask = static_cast<uint8_t>(0xF0 ^ (0xCC | 0xAA));
        }
    }
}
```
This is going to become untenable as we need to expand to the full set of operations.

It's also not accounting for things like containment or other factors that can influence which of the variants is best for a given operation and where the operands are coming from (e.g. `a | (b ^ [c])` vs `a | (c ^ [b])` vs `(c ^ b) | [a]`).
I expect instead we need to have something more like an iterative approach. That is, we first get the two bitwise operations being combined and the three operands. We then validate early that `bitwiseOper1` is containable by `bitwiseOper2` (or vice versa). Next do the more expensive work (optimized code only) to determine the optimal order (which operand should be RMW, which should be contained). Finally we pass the operands into a small helper with the operation being done:

```cpp
mask = BuildTernaryLogicMask(bitOper1, maskOp1, maskOp2);
mask = BuildTernaryLogicMask(bitOper2, mask, maskOp3);
```
That avoids the duplication and centralizes the combining logic. The total set of operations supported are:

- `true`: AllBitsSet
- `false`: Zero
- `not`: `~value`
- `and`: `left & right`
- `nand`: `~(left & right)`
- `or`: `left | right`
- `nor`: `~(left | right)`
- `xor`: `left ^ right`
- `xnor`: `~(left ^ right)`
- `cndsel`: `a ? b : c`, aka `(B & A) | (C & ~A)`
- `major`: 1 if two or more input bits are 1
- `minor`: 1 if two or more input bits are 0
@tannergooding I realized that it's more complicated than I can spend my time on this week and the next, so let's just drop it. Low diffs from the Lower part, so unlikely worth the effort. Although, we do need to remove ternlog from importervectorization.cpp where it doesn't belong and uglifies cross-plat code (I don't believe it improves TP by any visible value).