Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT ARM64: Don't emit mov for zero case in jump tables for shift intrinsics #107322

Merged

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Sep 3, 2024

Fixes #107173. ARM64 right-shift intrinsics don't support encoding shift amounts of zero -- I believe this was the initial motivation for emitting a mov in lieu of the intrinsic when the shift amount is zero. However, HWIntrinsicImmOpHelper should never try to emit a shift amount of zero for right-shift intrinsics, so we shouldn't need the mov case. Furthermore, emitting a mov instead of the intrinsic drops any side effects other than the shift, as demonstrated by the usage of ShiftLeftLogicalWideningLower in the test without this fix.

By removing the mov case, left-shift intrinsics should work correctly with all immediates now, and for cases where we try to right-shift by zero, we either switch over to a fallback implementation during importation if one is available, or we throw an ArgumentOutOfRangeException at runtime.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 3, 2024
@amanasifkhalid amanasifkhalid changed the title JIT ARM64: Don't emit mov for zero case in jump tables for shift intrinsics JIT ARM64: Don't emit mov for zero case in jump tables for left-shift intrinsics Sep 4, 2024
@amanasifkhalid
Copy link
Member Author

I initially neglected that we can do something like Vector128<byte> >> 8, and the JIT will mask the immediate by the size of the base type such that it's zero, and then transform the >> into one of the ShiftRight intrinsics. Thus, it's possible to have a right shift with a shift amount of zero during codegen, and we can't encode this directly, so having the mov fallback is useful. Since this should only be possible when the immediate is known (the jump table helper will never try to emit a zero case for HW_Category_ShiftRightByImmediate intrinsics), we should be able to elide the mov without affecting jump table generation, so I got rid of the TODO.

cc @dotnet/jit-contrib, @tannergooding PTAL. Thanks!

@amanasifkhalid
Copy link
Member Author

ping @tannergooding @dotnet/jit-contrib

Comment on lines 438 to 447
if ((shiftAmount == 0) && (intrin.category == HW_Category_ShiftRightByImmediate))
{
// TODO: Use emitIns_Mov instead.
// We do not use it currently because it will still elide the 'mov'
// even if 'canSkip' is false. We cannot elide the 'mov' here.
GetEmitter()->emitIns_R_R_R(INS_mov, emitTypeSize(node), targetReg, reg, reg);
// For operations like `Vector128<byte> >> 8`, over-shifting behavior will mask
// the immediate so that it's zero, and the shift will be imported as a GT_HWINTRINSIC.
// If this isn't const-folded away, it is possible to try to emit a right-shift by zero,
// but only if the immediate is known (we won't emit a zero case for the jump table).
// Since we cannot encode a right-shift by 0 on ARM64, emit a mov instead.
assert(op->isContainedIntOrIImmed());
GetEmitter()->emitIns_Mov(INS_mov, emitTypeSize(node), targetReg, reg, /* canSkip */ true);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't we emit a case for the jump table?

If the value is unknown and it doesn't throw, then won't we hit the case in the jump table and need handling emitted for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct that for unknown valid immediates, we will emit the jump table. In the table generation logic, we get the table's bounds from HWIntrinsicInfo::lookupImmBounds, which sets the lower bound at 1 for right-shift intrinsics here. So if shiftAmount == 0 for a right-shift intrinsic, we should not be generating a jump table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what are we doing if the value is unknown and 0? Because we're not throwing for the constant case here.

Copy link
Member Author

@amanasifkhalid amanasifkhalid Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I tried the following program:

public class Test
{
    public byte m;
}

public class Program
{
    static Test t = new Test();
    public static void Main()
    {
        var v = Vector128<Byte>.Zero;
        var result = v >> t.m;
        Console.WriteLine(result);
    }
}

The JIT imports the shift as ShiftLogical, hence why we don't throw. I can update the comment to cover this case, if you'd like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I wonder if this is the "right" fix.

We basically have:

  • Import the intrinsic
  • If the shift amount is non-constant, import as ShiftLogical
  • If the shift amount is constant, then mask, which could become zero

So I think we're maybe missing a check that ensures that if its zero, its still treated as "out of range" and gets imported as ShiftLogical instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your proposed fix is better. I initially wanted the fallback implementations we've been adding to kick in for cases like this, but the logic for checking immediates and adding range checks/etc. in Compiler::impHWIntrinsic runs before we check if the intrinsic is table-driven or needs special handling. I guess in GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp, instead of just checking if the immediate is const, we need to also check if it's out-of-range and use ShiftLogical accordingly -- I'll give that a try.

@amanasifkhalid amanasifkhalid changed the title JIT ARM64: Don't emit mov for zero case in jump tables for left-shift intrinsics JIT ARM64: Don't emit mov for zero case in jump tables for shift intrinsics Sep 5, 2024
@amanasifkhalid
Copy link
Member Author

Diffs are from the op_RightShift tests that shift by the size of the base type. Here's a snippet of the jit-analyze output on coreclr_tests:

Top method regressions (bytes):
          32 (7.84 % of base) : 564602.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftByte8:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564632.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftDouble64:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564662.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftInt1616:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564692.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftInt3232:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564722.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftInt6464:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564752.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftSByte8:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564782.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftSingle32:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564812.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftUInt1616:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564842.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftUInt3232:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 564872.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_RightShiftUInt6464:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 565482.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftByte8:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 565512.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftDouble64:RunClassFldScenario():this (FullOpts)
          32 (7.84 % of base) : 565542.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt1616:RunClassFldScenario():this (FullOpts)
          32 (7.69 % of base) : 565543.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt1616:RunStructFldScenario():this (FullOpts)
          32 (7.84 % of base) : 565572.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt3232:RunClassFldScenario():this (FullOpts)
          32 (6.90 % of base) : 565595.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt6464:RunBasicScenario_UnsafeRead():this (FullOpts)
          32 (7.84 % of base) : 565602.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt6464:RunClassFldScenario():this (FullOpts)
          32 (7.77 % of base) : 565599.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt6464:RunClsVarScenario():this (FullOpts)
          32 (7.69 % of base) : 565603.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftInt6464:RunStructFldScenario():this (FullOpts)
          32 (7.84 % of base) : 565632.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorImmBinaryOpTest__op_UnsignedRightShiftSByte8:RunClassFldScenario():this (FullOpts)

I'm guessing these were previously getting folded away.

// and return the correct fallback intrinsic.
if ((op != GT_LSH) && (op2->AsIntCon()->IconValue() == 0))
{
op2 = gtNewSimdCreateBroadcastNode(type, op2, simdBaseJitType, simdSize);
Copy link
Member

@tannergooding tannergooding Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, we prefer the direct zero constant helper in cases like this:

Suggested change
op2 = gtNewSimdCreateBroadcastNode(type, op2, simdBaseJitType, simdSize);
op2 = gtNewZeroConNode(type);

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing these were previously getting folded away

We can get this handled in .NET 10 and ensure that ShiftLogical(x, zero) is elided. It shouldn't be important to preserve for .NET 9

@amanasifkhalid amanasifkhalid merged commit 0fa7321 into dotnet:main Sep 6, 2024
114 of 117 checks passed
@amanasifkhalid amanasifkhalid deleted the emit-shift-zero-correctly branch September 6, 2024 21:02
@amanasifkhalid
Copy link
Member Author

/backport to release/9.0

Copy link
Contributor

github-actions bot commented Sep 6, 2024

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/10745195972

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Bad codegen with AdvSimd.ShiftLeftLogicalWideningLower
2 participants