HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

voinokin · 2018-09-11T08:06:48Z

In the repro code below sizeof(TChar) is known during JIT, still code is suboptimal in disasm - some helper (?) method is invoked instead of instruction and SIMD vector value is placed on stack due to this.
This issue is quite notable with shifts and typecasting during concrete value calculation, but to my memory there were similar cases with other instructions.

Repro code:

[MethodImpl(MethodImplOptions.NoInlining)]
static void Test<TChar>() where TChar : unmanaged
{
    Vector128<byte> v = Sse2.SetZeroVector128<byte>();
    v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar)); // <======
    Console.WriteLine(Sse41.Extract(v, 0));
}

static void Main(string[] args)
{
    Test<byte>();
    Test<ushort>();
}

Disasm:

--- ...\Program2.cs ---
Vector128 v = Sse2.SetZeroVector128();
000007FE75244A20 sub rsp,58h
000007FE75244A24 xor eax,eax
000007FE75244A26 mov qword ptr [rsp+40h],rax
000007FE75244A2B mov qword ptr [rsp+48h],rax
000007FE75244A30 pxor xmm0,xmm0
000007FE75244A34 movaps xmmword ptr [rsp+40h],xmm0

        v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar));

000007FE75244A39 movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A3E movaps xmmword ptr [rsp+30h],xmm0
000007FE75244A43 lea rcx,[rsp+40h]
000007FE75244A48 lea rdx,[rsp+20h]
000007FE75244A4D mov r8,qword ptr [rsp+30h]
000007FE75244A52 mov qword ptr [rdx],r8
000007FE75244A55 mov r8,qword ptr [rsp+38h]
000007FE75244A5A mov qword ptr [rdx+8],r8
000007FE75244A5E lea rdx,[rsp+20h]
000007FE75244A63 mov r8d,1
000007FE75244A69 call 000007FE752421A0 <=================

        Console.WriteLine(Sse41.Extract(v, 0));

000007FE75244A6E movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A73 pextrb ecx,xmm0,0
000007FE75244A79 call 000007FE75241AC8
000007FE75244A7E nop
000007FE75244A7F add rsp,58h
000007FE75244A83 ret
--- No source file -------------------------------------------------------------

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:large

The text was updated successfully, but these errors were encountered:

mikedn · 2018-09-11T08:13:14Z

That's because of the cast to byte.

voinokin · 2018-09-11T08:26:34Z

...which makes things a bit funny in the context of C#/CLR due to roundtrips of most calculations on byte thru int ;-)

4creators · 2018-09-11T08:28:23Z

That's because of the cast to byte.

Yup, importer fails to recognize arg2 as const bcs it does not walk beyond cast:

               [000008] ------------ arg2            \--*  CAST      int <- ubyte <- int
               [000007] ------------                    \--*  CNS_INT   int    1

Whole importer dump

*************** In impImport() for Repro.Program:Test()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ---XG-------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A-XG-------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) sizeof 1B000002
    [ 2]  13 (0x00d) conv.u1
    [ 2]  14 (0x00e) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>
  Known type Vector128<byte>
Calling impNormStructVal on:
               [000006] ------------              *  LCL_VAR   simd16 V00 loc0         
  Known type Vector128<byte>
resulting tree:
               [000012] x-----------              *  OBJ(16)   simd16
               [000011] L-----------              \--*  ADDR      byref 
               [000006] ------------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 1]  19 (0x013) stloc.0

               [000017] ------------              *  STMT      void  (IL 0x006...  ???)
               [000009] S-C-G-------              \--*  CALL      void   System.Runtime.Intrinsics.X86.Sse2.ShiftRightLogical128BitLane
               [000015] L----------- arg0            +--*  ADDR      byref 
               [000014] ------------                 |  \--*  LCL_VAR   simd16 V00 loc0         
               [000012] x----------- arg1            +--*  OBJ(16)   simd16
               [000011] L-----------                 |  \--*  ADDR      byref 
               [000006] ------------                 |     \--*  LCL_VAR   simd16 V00 loc0         
               [000008] ------------ arg2            \--*  CAST      int <- ubyte <- int
               [000007] ------------                    \--*  CNS_INT   int    1

4creators · 2018-09-11T08:46:51Z

After simple fix:

Instructions as they come out of the scheduler


G_M44985_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop

G_M44985_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[F582F0A8] ptr arg pop  0
IN0004: 000016 E845000000           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop

G_M44985_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret
Allocated method code size =   33 , actual size =   33

Fixes #19888

4creators · 2018-09-11T08:52:26Z

No more poor codegen for HW intrinsics 😄

voinokin · 2018-09-11T08:52:58Z

Will this fix also handle the following code ? (attempt to work this issue around)

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void Test<TChar>()
            where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();

            byte C;
            if (sizeof(TChar) == sizeof(byte))
            {
                C = 1;
            }
            else if (sizeof(TChar) == sizeof(ushort))
            {
                C = 2;
            }
            else
            {
                throw new InvalidOperationException();
            }

            v = Sse2.ShiftRightLogical128BitLane(v, C);

            Console.WriteLine(Sse41.Extract(v, 0));
        }

4creators · 2018-09-11T09:01:09Z

Nope. It's too complex for codegen right now, but most importantly simple and equivalent approach works.

voinokin · 2018-09-11T09:03:14Z

...once you know it :-)

mikedn · 2018-09-11T09:05:02Z

..which makes things a bit funny in the context of CLR due to roundtrips of most calculations on byte thru int ;-)

Yeah. Making those parameters byte instead of int was likely a bad idea.

But then JIT's importer could and should probably do some trivial constant folding. constant/cast sequences do appear in other situations as well. Most commonly I've seen this with long constants where the C# compiler decides to save space and emits ldc.i4 and conv.i8. The importer obeys and generates a cast node, that then the optimizer removes.

Usually it's not a good idea to mix IL importing with optimizations but there are some cases where that's reasonable, especially when dealing with IL's own encoding specifics/limitations.

4creators · 2018-09-11T09:11:01Z

But then JIT's importer could and should probably do some trivial constant folding.

In this fix importer does not do any const folding just recognizes that cast is done on const. Later stages take care of const folding sufficiently well to let them do the job. The problem here is that Roslyn does constant folding and evidently it missed arg2.

4creators · 2018-09-11T09:12:13Z

...once you know it :-)

Yeah, right, it's not best of worlds we can have here.

mikedn · 2018-09-11T09:13:32Z

In this fix importer does not do any const folding just recognizes that cast is done on const.

Until someone tries something like 2 * sizeof(TChar).

4creators · 2018-09-11T09:14:51Z

Until someone tries something like 2 * sizeof(TChar).

This should be handled by Roslyn. Will check.

mikedn · 2018-09-11T09:18:52Z

This should be handled by Roslyn. Will check.

It's impossible, sizeof(TChar) is not a compile time constant.

4creators · 2018-09-11T15:12:39Z

@mikedn You are right. Added function to fold const expressions during import and it works as expected:

Test methods

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();
            v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar));
            Console.WriteLine(Sse41.Extract(v, 0));
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test1<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();
            v = Sse2.ShiftRightLogical128BitLane(v, (byte)((3 * sizeof(TChar)) / 2 + 4 - sizeof(TChar)));
            Console.WriteLine(Sse41.Extract(v, 0));
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test2<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();

            byte C;
            if (sizeof(TChar) == sizeof(byte))
            {
                v = Sse2.ShiftRightLogical128BitLane(v, 1);
            }
            else if (sizeof(TChar) == sizeof(ushort))
            {
                v = Sse2.ShiftRightLogical128BitLane(v, 2);
            }
            else
            {
                throw new InvalidOperationException();
            }

            Console.WriteLine(Sse41.Extract(v, 0));
        }

Method Test

Instructions as they come out of the scheduler


G_M44985_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M44985_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A96D338] ptr arg pop  0
IN0004: 000016 E87DE6FFFF           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M44985_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

Method Test1

Instructions as they come out of the scheduler


G_M17352_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M17352_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D805         vpsrldq  xmm0, xmm0, 5
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A97D9A8] ptr arg pop  0
IN0004: 000016 E845000000           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M17352_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

Method Test2

Instructions as they come out of the scheduler


G_M17451_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M17451_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A97DE18] ptr arg pop  0
IN0004: 000016 E855FEFFFF           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M17451_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

4creators · 2018-09-11T15:15:47Z

And in the importer we get:

Method Test:

*************** In impImport() for Repro.Program:Test()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) sizeof 1B000002
    [ 2]  13 (0x00d) conv.u1
    [ 2]  14 (0x00e) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

Folding operator with constant nodes into a constant:
               [000008] ------------              *  CAST      int <- ubyte <- int
               [000007] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000008] ------------              *  CNS_INT   int    2
  Known type Vector128<byte>

    [ 1]  19 (0x013) stloc.0

               [000013] ------------              *  STMT      void  (IL 0x006...  ???)
               [000008] ------------              |     /--*  CNS_INT   int    2
               [000009] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000006] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000012] -A----------              \--*  ASG       simd16 (copy)
               [000010] D-----------                 \--*  LCL_VAR   simd16 V00 loc0

Method Test1

*************** In impImport() for Repro.Program:Test1()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test1()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) ldc.i4.3 3
    [ 2]   8 (0x008) sizeof 1B000002
    [ 3]  14 (0x00e) mul
    [ 2]  15 (0x00f) ldc.i4.2 2
    [ 3]  16 (0x010) div
    [ 2]  17 (0x011) ldc.i4.4 4
    [ 3]  18 (0x012) add
    [ 2]  19 (0x013) sizeof 1B000002
    [ 3]  25 (0x019) sub
    [ 2]  26 (0x01a) conv.u1
    [ 2]  27 (0x01b) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

Folding operator with constant nodes into a constant:
               [000008] ------------              /--*  CNS_INT   int    2
               [000009] ------------              *  MUL       int   
               [000007] ------------              \--*  CNS_INT   int    3
Bashed to int constant:
               [000009] ------------              *  CNS_INT   int    6

Folding operator with constant nodes into a constant:
               [000010] ------------              /--*  CNS_INT   int    2
               [000011] ------------              *  DIV       int   
               [000009] ------------              \--*  CNS_INT   int    6
Bashed to int constant:
               [000011] ------------              *  CNS_INT   int    3

Folding operator with constant nodes into a constant:
               [000012] ------------              /--*  CNS_INT   int    4
               [000013] ------------              *  ADD       int   
               [000011] ------------              \--*  CNS_INT   int    3
Bashed to int constant:
               [000013] ------------              *  CNS_INT   int    7

Folding operator with constant nodes into a constant:
               [000014] ------------              /--*  CNS_INT   int    2
               [000015] ------------              *  SUB       int   
               [000013] ------------              \--*  CNS_INT   int    7
Bashed to int constant:
               [000015] ------------              *  CNS_INT   int    5

Folding operator with constant nodes into a constant:
               [000016] ------------              *  CAST      int <- ubyte <- int
               [000015] ------------              \--*  CNS_INT   int    5
Bashed to int constant:
               [000016] ------------              *  CNS_INT   int    5
  Known type Vector128<byte>

    [ 1]  32 (0x020) stloc.0

               [000021] ------------              *  STMT      void  (IL 0x006...  ???)
               [000016] ------------              |     /--*  CNS_INT   int    5
               [000017] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000006] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000020] -A----------              \--*  ASG       simd16 (copy)
               [000018] D-----------                 \--*  LCL_VAR   simd16 V00 loc0

Method Test2

*************** In impImport() for Repro.Program:Test2()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test2()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) sizeof 1B000002
    [ 1]  12 (0x00c) ldc.i4.1 1
    [ 2]  13 (0x00d) bne.un.s
Folding operator with constant nodes into a constant:
               [000007] ------------              /--*  CNS_INT   int    1
               [000008] N--------U--              *  NE        int   
               [000006] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000008] ------------              *  CNS_INT   int    1

The conditional jump becomes an unconditional jump to BB03

impImportBlockPending for BB03

Importing BB03 (PC=025) of 'Repro.Program:Test2()'
    [ 0]  25 (0x019) sizeof 1B000002
    [ 1]  31 (0x01f) ldc.i4.2 2
    [ 2]  32 (0x020) bne.un.s
Folding operator with constant nodes into a constant:
               [000011] ------------              /--*  CNS_INT   int    2
               [000012] N--------U--              *  NE        int   
               [000010] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000012] ------------              *  CNS_INT   int    0

The block falls through into the next BB04

impImportBlockPending for BB04

Importing BB04 (PC=034) of 'Repro.Program:Test2()'
    [ 0]  34 (0x022) ldloc.0
    [ 1]  35 (0x023) ldc.i4.2 2
    [ 2]  36 (0x024) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>
  Known type Vector128<byte>

    [ 1]  41 (0x029) stloc.0

               [000020] ------------              *  STMT      void  (IL 0x022...  ???)
               [000015] ------------              |     /--*  CNS_INT   int    2
               [000016] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000014] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000019] -A----------              \--*  ASG       simd16 (copy)
               [000017] D-----------                 \--*  LCL_VAR   simd16 V00 loc0

Fixes #19888 This enables expansin of imm HW Intrinsics which otherwise would be emitted with jump tables. Significantly improves user experience and performance

…uring import Fixes #19888 This enables expansion of imm HW Intrinsics which otherwise would be emitted as function calls with jump tables. Folding of const expressions happens during morphing and it means that directly before SSA generation and value numbering HW intrinsics are in final shape allowing them to participate in later optimizations once HW intrinsic support for value numbering is fully implemented. Significantly improves user experience.

…g import Fixes #19888 Solution is based on a call to fgMorphTreeNode without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTreeNode call. Additional benefit of not using gtFoldExpr is that this method was destructive for it's operands when processed. In contrast fgMorphTreeNode returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889

… import Fixes #19888 Solution is based on a call to fgMorphTree without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTree call. Additional benefit of not using gtFoldExpr is that this method is destructive for it's operands when processed. In contrast fgMorphTree returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889

gdkchan · 2018-09-22T19:03:50Z

I also had a similar problem with the following code:

private static void Test()
{
    Vector128<float> vec = Sse.SetZeroVector128();

    vec = VectorInsertSingle(vec, 0, 1.0f);
    vec = VectorInsertSingle(vec, 1, 1.5f);
    vec = VectorInsertSingle(vec, 2, 2.0f);
    vec = VectorInsertSingle(vec, 3, 2.5f);            
}

public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    //Produces sub-optimal code (jump table fallback,
    //JIT doesn't notice that the value is constant after inline).
    return Sse41.Insert(vector, value, (byte)(index << 4));
}

After inline, the index value is constant, however the jump table fallback is still called. This is the assembly produced for Test:

00007FFD4B14361B  vmovss      xmm2,dword ptr [7FFD4B1436E0h]  
00007FFD4B143624  xor         r9d,r9d  
00007FFD4B143627  call        00007FFD4B143360  
00007FFD4B14362C  mov         rcx,qword ptr [rsp+70h]  
00007FFD4B143631  mov         rdx,qword ptr [rsp+78h]  
00007FFD4B143636  lea         r9,[rsp+60h]  
00007FFD4B14363B  lea         rax,[rsp+20h]  
00007FFD4B143640  mov         qword ptr [rax],rcx  
00007FFD4B143643  mov         qword ptr [rax+8],rdx  
00007FFD4B143647  mov         rcx,r9  
00007FFD4B14364A  lea         rdx,[rsp+20h]  
00007FFD4B14364F  vmovss      xmm2,dword ptr [7FFD4B1436E4h]  
00007FFD4B143658  mov         r9d,10h  
00007FFD4B14365E  call        00007FFD4B143360  
00007FFD4B143663  mov         rcx,qword ptr [rsp+60h]  
00007FFD4B143668  mov         rdx,qword ptr [rsp+68h]  
00007FFD4B14366D  lea         r9,[rsp+50h]  
00007FFD4B143672  lea         rax,[rsp+20h]  
00007FFD4B143677  mov         qword ptr [rax],rcx  
00007FFD4B14367A  mov         qword ptr [rax+8],rdx  
00007FFD4B14367E  mov         rcx,r9  
00007FFD4B143681  lea         rdx,[rsp+20h]  
...

The values are constant (0, 0x10 etc), but it still produces the call.

Trying to workaround the issue I wrote another version using a switch case that use the constant values directly instead of relying on the JIT to do constant folding:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    switch (index)
    {
        case 0: return Sse41.Insert(vector, value, 0x00);
        case 1: return Sse41.Insert(vector, value, 0x10);
        case 2: return Sse41.Insert(vector, value, 0x20);
        case 3: return Sse41.Insert(vector, value, 0x30);

        default: throw new ArgumentOutOfRangeException(nameof(index));
    }
}

This also doesn't work well because the JIT is not capable of inlining methods with switch cases so the code produced is also sub-optimal.

Finally we have this version using if/else:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    if (index == 0)
    {
        return Sse41.Insert(vector, value, 0x00);
    }
    else if (index == 1)
    {
        return Sse41.Insert(vector, value, 0x10);
    }
    else if (index == 2)
    {
        return Sse41.Insert(vector, value, 0x20);
    }
    else if (index == 3)
    {
        return Sse41.Insert(vector, value, 0x30);
    }
    else
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }
}

This version produces more or less the expected code (this is the same Test method, calling the above VectorInsertSingle method):

00007FFD4B1235EC  vxorps      xmm0,xmm0,xmm0  
00007FFD4B1235F1  vmovss      xmm1,dword ptr [7FFD4B123668h]  
00007FFD4B1235FA  vinsertps   xmm0,xmm0,xmm1,0  
00007FFD4B123600  vmovapd     xmmword ptr [rsp+50h],xmm0  
00007FFD4B123607  vmovapd     xmm0,xmmword ptr [rsp+50h]  
00007FFD4B12360E  vmovss      xmm1,dword ptr [7FFD4B12366Ch]  
00007FFD4B123617  vinsertps   xmm0,xmm0,xmm1,10h  
00007FFD4B12361D  vmovapd     xmmword ptr [rsp+40h],xmm0  
00007FFD4B123624  vmovapd     xmm0,xmmword ptr [rsp+40h]  
00007FFD4B12362B  vmovss      xmm1,dword ptr [7FFD4B123670h]  
00007FFD4B123634  vinsertps   xmm0,xmm0,xmm1,20h  
00007FFD4B12363A  vmovapd     xmmword ptr [rsp+30h],xmm0  
00007FFD4B123641  vmovapd     xmm0,xmmword ptr [rsp+30h]  
00007FFD4B123648  vmovss      xmm1,dword ptr [7FFD4B123674h]  
00007FFD4B123651  vinsertps   xmm0,xmm0,xmm1,30h  
00007FFD4B123657  vmovapd     xmmword ptr [rsp+20h],xmm0  
00007FFD4B12365E  add         rsp,68h  
00007FFD4B123662  ret

The method was properly inlined, however there are still some register moves that I believe that are not necessary (for example XMM0 is spilled then immediately moved back). Furthermore this instruction also accepts a memory location, so the whole method after inline could be just a single instruction, something like this:

vxorps      xmm0,xmm0,xmm0
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123668h],0
vinsertps   xmm0,xmm0,dword ptr [7FFD4B12366Ch],10h
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123670h],20h  
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123674h],30h

I know there are more efficient ways to load a vector, this was just to demonstrate the usage of the VectorInsertSingle method.

Those tests were made with dotnet version 2.2.100-preview2-009404.

Another thing that may be worth pointing out is that the index argument name on the Vector128<float> overload of the Insert intrinsic method is a bit misleading, since it is not really index, the low bits are a zero mask and only the high bits are the actual index. One could assume it is just the destination index but that's not the case. Maybe its a good idea to rename it, I don't know if its worth opening a issue for that.

fyi @fiigii

Sorry if this is not the right place for that, but the issue is similar so I felt that opening a separate one was not necessary.

4creators · 2018-09-22T20:05:53Z

Those tests were made with dotnet version 2.2.100-preview2-009404.

All new work on HW intrinsics is in master branch what means that it is not available in 2.2 preview2 and will not be available in 2.2 RTM

HW Intrinsic with immediate operand expansion

The method below cannot be handled correctly by Jit even with current not accepted solution from dotnet/coreclr#19898.

public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    //Produces sub-optimal code (jump table fallback,
    //JIT doesn't notice that the value is constant after inline).
    return Sse41.Insert(vector, value, (byte)(index << 4));
}

The first problem is that Jit needs to recognize index variable as a const to be able to trigger HW intrinsic expansion and generate efficient code. Currently this happens only when value passed into intrinsic is a constant. Constant expressions in the form (byte)(index << 4) like in your code are not folded currently but they are handled correctly dotnet/coreclr#19898 under the assumption that index is a const or const expression. Unfortunately method parameter syntactically is not a const. There is a language proposal which could solve this dotnet/csharplang#886 but it is not accepted.

There is still a second problem here caused by method inlining. Inlining happens after decision on intrinsic expansion is made and Jit currently will not reconsider if immediate HW intrinsic should be expanded after it is inlined and arguments semantics changes like in your code. As a result your method despite working on semantically valid assumptions for Jit is not syntactically equivalent to intrinsic with imm8 constant.

I have created issue to track the second problem as this may be important in many situations as long as we do not have const parameters implemented.

fiigii · 2018-09-22T20:57:19Z

@gdkchan Thanks for reporting the codegen issue. The problem is different from this issue but also detected in https://github.com/dotnet/coreclr/issues/17108.

@CarolEidt @AndyAyersMS @mikedn @tannergooding
Currently, all the codegen issues of imm-intrinsic are from two aspects:

the importer does not fold some constant expressions, like GT_CAST
the const arguments flow from the function parameters and the function can be inlined, but we expend intrinsic before inlining.

The first issue can be solved by morphing the expressions in the importer (like dotnet/coreclr#19898), but both the two issues can be naturally solved by https://github.com/dotnet/coreclr/issues/17108. I know there is some difficulty to introduce call-node in lowering, can we just try to expend intrinsic behind inlining and constant propagation/folding?

4creators · 2018-09-22T21:31:34Z

but both the two issues can be naturally solved by #9989.

IMO we should reevaluate several problems:

We have mismatch between semantic requirements of imm intrinsics and C#/VB/F# semantics which will be impacting all future development of any intrinsic with immediate operands.
Current implementation despite providing ability to generate code for non const immediate parameters is undesired and avoided by users in other words it solves problem for compiler writers but does not solve users problems and users want high quality and fast code. The problem of codegen for non-const immediate parameters is raised in at least couple of issues.
We have deficiencies in importer and later compiler stages related to HW intrinsics handling. We do not fold constant expressions in importer, we do not reevaluate HW intrinsics expansion at later stages i.e. during morphing, value numbering, all optimizations happening before lowering and during lowering.

IMHO there is no single simple solution to problems which will be haunting imm HW intrinsic implementation, whether it would be dotnet/coreclr#19898 or #9989 or dotnet/csharplang#886 or current implementation of non-const parameter handling.

Scenarios which as turns out should be supported are much more complex:

support complex const expression folding
support immediate parameters which are jit time const
support semantic changes introduced by inlining
support bona fide immediate non-const parameters

IMHO we should do:
(i) const expression folding in importer,
(ii) reevaluate HW intrinsic expansion once or more times after morphing, value numbering and following optimizations,
(iii) const parameter semantics in C#/VB/F# from dotnet/csharplang#886,
(iv) have HW intrinsic overloads with non-const parameters what would allow for supporting of const expression which are constant only to Jit.

What could be helpful for (ii) it would be HW intrinsic support implementation for value numbering and following optimizations what could form a basis for further optimizations and even in more distant future autovectorization support.

https://github.com/dotnet/coreclr/blob/09cc49e8cac72915b72240c766e25ada171e9fe7/src/jit/valuenum.cpp#L6502

mikedn · 2018-09-22T22:17:39Z

I know there is some difficulty to introduce call-node in lowering, can we just try to expend intrinsic behind inlining and constant propagation/folding?

Basically what you want is to expand on import and then attempt to expand again during lowering, hoping that non-const parameters have become const. Not expanding on import at all is not an option, you really don't want to keep a bunch of call nodes if you can avoid it. Allways expanding on import and falling back to call in lowering is not an option as well, at least not a trivial one. So you really don't have much of a choice.

Well, there's also the choice of telling people - hey, that parameter must be a constant. Period.

echesakov · 2022-03-15T19:30:54Z

Un-assigning myself
cc @BruceForstall

tannergooding · 2022-03-15T19:34:25Z

This and #9989 are basically the same, they could probably be merged.

The rough requirement is that we support transforming a GT_HWINTRINSIC node back into a GT_CALL. We already do this for GT_INTRINSIC nodes in rationalization, but there were issues doing the same with HWIntrinsics. IIRC, the differences had to do with TYP_SIMD and some of the ABI transformations that would otherwise be required for out buffers, shadow copies, and the like. I wasn't familiar enough with the relevant parts of the JIT to make the necessary changes.

4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018

Fix HW Intrinsic imm const arg detection for casts

fd7c83b

Fixes #19888

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

4creators mentioned this issue Jan 31, 2020

Reevaluate HW intrinsic immediate const parameters after method containing HW intrinsics is inlined #11138

Closed

tannergooding mentioned this issue Mar 12, 2020

[jit] Add more SSE2 opcodes. Enable SSE2. #33465

Merged

tannergooding mentioned this issue Apr 15, 2020

API Proposal : Arm Shift and Permute intrinsics #31324

Closed

This was referenced May 6, 2020

Test failure: Assertion failed '!"SIMD intrinsic with unsupported base type."' #35876

Closed

Optimize AdvSimd.Extract() when passed variable that can be const propagated #36070

Closed

tannergooding mentioned this issue Jun 3, 2020

HFA/Vector calling convention representation in the Jit on arm64. #37341

Closed

CarolEidt mentioned this issue Oct 13, 2020

[RyuJIT] Avoid method call to fallback intrinsic method if immediate arg becomes constant #9989

Closed

CarolEidt modified the milestones: Future, 6.0.0 Oct 13, 2020

JulieLeeMSFT assigned echesakov Mar 23, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

tannergooding mentioned this issue Apr 9, 2021

Consider making Interlocked.And and friends into JIT intrinsics #32239

Closed

JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021

echesakov added the Priority:3 Work that is nice to have label Jul 8, 2021

echesakov modified the milestones: 6.0.0, Future Jul 8, 2021

echesakov removed their assignment Mar 15, 2022

BruceForstall removed the enhancement Product code improvement that does NOT require public API changes/additions label Mar 16, 2022

tannergooding mentioned this issue May 17, 2024

Investigate what would be required to move some SIMD functionality out of the JIT and into managed code #102275

Closed

tannergooding mentioned this issue May 27, 2024

Update the JIT to support rewriting more complex intrinsics as user calls #102702

Merged

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label May 27, 2024

tannergooding closed this as completed in #102702 May 29, 2024

tannergooding mentioned this issue May 29, 2024

Allow shuffle and other hwintrinsic that require a constant to stay intrinsic if the operand becomes constant later #102827

Merged

github-actions bot locked and limited conversation to collaborators Jul 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

voinokin commented Sep 11, 2018 •

edited by tannergooding

Loading

mikedn commented Sep 11, 2018

voinokin commented Sep 11, 2018 •

edited

Loading

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

voinokin commented Sep 11, 2018

4creators commented Sep 11, 2018

voinokin commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

gdkchan commented Sep 22, 2018 •

edited

Loading

4creators commented Sep 22, 2018

fiigii commented Sep 22, 2018

4creators commented Sep 22, 2018 •

edited

Loading

mikedn commented Sep 22, 2018

echesakov commented Mar 15, 2022

tannergooding commented Mar 15, 2022

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

Comments

voinokin commented Sep 11, 2018 • edited by tannergooding Loading

mikedn commented Sep 11, 2018

voinokin commented Sep 11, 2018 • edited Loading

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

voinokin commented Sep 11, 2018

4creators commented Sep 11, 2018

voinokin commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

mikedn commented Sep 11, 2018

4creators commented Sep 11, 2018

4creators commented Sep 11, 2018

gdkchan commented Sep 22, 2018 • edited Loading

4creators commented Sep 22, 2018

HW Intrinsic with immediate operand expansion

fiigii commented Sep 22, 2018

4creators commented Sep 22, 2018 • edited Loading

mikedn commented Sep 22, 2018

echesakov commented Mar 15, 2022

tannergooding commented Mar 15, 2022

voinokin commented Sep 11, 2018 •

edited by tannergooding

Loading

voinokin commented Sep 11, 2018 •

edited

Loading

gdkchan commented Sep 22, 2018 •

edited

Loading

4creators commented Sep 22, 2018 •

edited

Loading