Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ternlog recognition to lower #86228

Closed
wants to merge 5 commits into from
Closed

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented May 14, 2023

Move ternlog insertion from importvectorization.cpp to lower to also handle user's case + LowerMemcmp vectorization. Handling for other cases where we can insert ternlog is left up for grabs.

As a bonus, it now handles unrollings like these:

bool Test(int[] a, int[] b) => 
    a.AsSpan(0, 10).SequenceEqual(b);
       vmovups  ymm0, ymmword ptr [rcx]
       vmovups  ymm1, ymmword ptr [rax]
       vmovups  ymm2, ymmword ptr [rcx+08H]
       vmovups  ymm3, ymmword ptr [rax+08H]
       vpxor    ymm0, ymm0, ymm1
-      vpxor    ymm1, ymm2, ymm3
-      vpor     ymm0, ymm0, ymm1
+      vpternlogq ymm0, ymm2, ymm3, -10
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al

@ghost ghost assigned EgorBo May 14, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 14, 2023
@ghost
Copy link

ghost commented May 14, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Move ternlog insertion from importvectorization.cpp to lower to also handle user's case + LowerMemcmp vectorization. Handling for other cases where we can insert ternlog is left up for grabs.

As a bonus, it now handles unrollings like these:

bool Test(int[] a, int[] b) => 
    a.AsSpan(0, 10).SequenceEqual(b);
       vmovups  ymm0, ymmword ptr [rcx]
       vmovups  ymm1, ymmword ptr [rax]
       vmovups  ymm2, ymmword ptr [rcx+08H]
       vmovups  ymm3, ymmword ptr [rax+08H]
       vpxor    ymm0, ymm0, ymm1
-      vpxor    ymm1, ymm2, ymm3
-      vpor     ymm0, ymm0, ymm1
+      vpternlogq ymm0, ymm2, ymm3, -10
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo EgorBo added the avx512 Related to the AVX-512 architecture label May 14, 2023
Comment on lines 1289 to 1303
case NI_SSE2_Or:
case NI_AVX2_Or:
case NI_AVX512F_Or:
case NI_SSE2_Xor:
case NI_AVX2_Xor:
case NI_AVX512F_Xor:
{
GenTree* ternLogic = RecognizeHWTernaryLog(node);
if (ternLogic != nullptr)
{
// Revisit this node again.
return ternLogic->gtPrev;
}
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the primary motivation for it being in lowering rather than morph? Most of our pattern based transform are recognized in morph and done there instead (including ROL/ROR, AND_NOT, etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding mainly, two reasons:

  1. Handle LowerMemCmp which is also in lower
  2. Handle user patterns with explicit intrinsics (since at this point we'll lower all cross-plat helpers to direct intrinsics)

Copy link
Member

@tannergooding tannergooding May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1, it seems like it'd be easier to just special case it. That's what we do for most of our other patterns as well. It might be even better if we had a way to share most of the necessary logic between morph/lowering, that way we can get the early ones (which should also help TP) and also avoid missing any of the late ones; but that's not something we're doing elsewhere atm.

For 2, I don't see how that's impacted. We'll be recognizing the explicit intrinsics either way whether in morph or lowering, right?

Comment on lines 968 to 981
if (IsHWBitwiseOp(node, &oper1))
{
if (IsHWBitwiseOp(op2, &oper2))
{
if ((oper1 == GT_OR) && (oper2 == GT_XOR))
{
mask = static_cast<uint8_t>(0xF0 | (0xCC ^ 0xAA));
}
else if ((oper1 == GT_XOR) && (oper2 == GT_OR))
{
mask = static_cast<uint8_t>(0xF0 ^ (0xCC | 0xAA));
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to become untenable as we need to expand to the full set of operations.

It's also not accounting for things like containment or other factors that can influence which of the variants are best for a given operation and where the operands are coming from (e.g. a | (b ^ [c]) vs a | (c ^ [b]) vs (c ^ b) | [a])

I expect instead we need to have something more like an iterative approach. That is, we first get the two bitwise operations being combined and the three operands. We then validate early that bitwiseOper1 is containable by bitwiseOper2 (or vice versa). Next do the more expensive work (optimized code only) to determine optimal order (which operand should be RMW, which should be contained). Finally we pass the operands into a small helper with the operation being done:

mask = BuildTernaryLogicMask(bitOper1, maskOp1, maskOp2);
mask = BuildTernaryLogicMask(bitOper2, mask, maskOp3);

That avoids the duplication and centralizes the combining logic. The total set of operations supported are:

  • true: AllBitsSet
  • false: Zero
  • not: ~value
  • and: left & right
  • nand: ~(left & right)
  • or: left | right
  • nor: ~(left | right)
  • xor: left ^ right
  • xnor: ~(left ^ right)
  • cndsel: a ? b : c; aka (B & A) | (C & ~A)
  • major: 0 if two+ input bits are 0
  • minor: 1 if two+ input bits are 0

@EgorBo
Copy link
Member Author

EgorBo commented May 17, 2023

@tannergooding I realized that it's more complicated than I can spend my time on this and next weeks, so let's then just handle the LowerCallMemcmp case via special-casing since it won't be covered by morph anyway - I've just merged it

@EgorBo
Copy link
Member Author

EgorBo commented May 18, 2023

low diffs from the Lower part so unlikely worth the effort. Although, we do need to remove ternary from importervectorizations.cpp where it doesn't belong and uglifies cross-plat code (I don't believe it improves TP by any visible value)

@EgorBo EgorBo closed this May 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants