Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

Closed
voinokin opened this issue Sep 11, 2018 · 27 comments · Fixed by #102702 or #102827
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged optimization Priority:3 Work that is nice to have
Milestone

Comments

@voinokin
Copy link

voinokin commented Sep 11, 2018

In the repro code below sizeof(TChar) is known during JIT, still code is suboptimal in disasm - some helper (?) method is invoked instead of instruction and SIMD vector value is placed on stack due to this.
This issue is quite notable with shifts and typecasting during concrete value calculation, but to my memory there were similar cases with other instructions.


Repro code:

[MethodImpl(MethodImplOptions.NoInlining)]
static void Test<TChar>() where TChar : unmanaged
{
    Vector128<byte> v = Sse2.SetZeroVector128<byte>();
    v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar)); // <======
    Console.WriteLine(Sse41.Extract(v, 0));
}

static void Main(string[] args)
{
    Test<byte>();
    Test<ushort>();
}

Disasm:

--- ...\Program2.cs ---
Vector128 v = Sse2.SetZeroVector128();
000007FE75244A20 sub rsp,58h
000007FE75244A24 xor eax,eax
000007FE75244A26 mov qword ptr [rsp+40h],rax
000007FE75244A2B mov qword ptr [rsp+48h],rax
000007FE75244A30 pxor xmm0,xmm0
000007FE75244A34 movaps xmmword ptr [rsp+40h],xmm0

        v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar));

000007FE75244A39 movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A3E movaps xmmword ptr [rsp+30h],xmm0
000007FE75244A43 lea rcx,[rsp+40h]
000007FE75244A48 lea rdx,[rsp+20h]
000007FE75244A4D mov r8,qword ptr [rsp+30h]
000007FE75244A52 mov qword ptr [rdx],r8
000007FE75244A55 mov r8,qword ptr [rsp+38h]
000007FE75244A5A mov qword ptr [rdx+8],r8
000007FE75244A5E lea rdx,[rsp+20h]
000007FE75244A63 mov r8d,1
000007FE75244A69 call 000007FE752421A0 <=================

        Console.WriteLine(Sse41.Extract(v, 0));

000007FE75244A6E movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A73 pextrb ecx,xmm0,0
000007FE75244A79 call 000007FE75241AC8
000007FE75244A7E nop
000007FE75244A7F add rsp,58h
000007FE75244A83 ret
--- No source file -------------------------------------------------------------

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:large

@mikedn
Copy link
Contributor

mikedn commented Sep 11, 2018

That's because of the cast to byte.

@voinokin
Copy link
Author

voinokin commented Sep 11, 2018

...which makes things a bit funny in the context of C#/CLR due to roundtrips of most calculations on byte thru int ;-)

@4creators
Copy link
Contributor

That's because of the cast to byte.

Yup, importer fails to recognize arg2 as const bcs it does not walk beyond cast:

               [000008] ------------ arg2            \--*  CAST      int <- ubyte <- int
               [000007] ------------                    \--*  CNS_INT   int    1

Whole importer dump

*************** In impImport() for Repro.Program:Test()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ---XG-------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A-XG-------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) sizeof 1B000002
    [ 2]  13 (0x00d) conv.u1
    [ 2]  14 (0x00e) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>
  Known type Vector128<byte>
Calling impNormStructVal on:
               [000006] ------------              *  LCL_VAR   simd16 V00 loc0         
  Known type Vector128<byte>
resulting tree:
               [000012] x-----------              *  OBJ(16)   simd16
               [000011] L-----------              \--*  ADDR      byref 
               [000006] ------------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 1]  19 (0x013) stloc.0

               [000017] ------------              *  STMT      void  (IL 0x006...  ???)
               [000009] S-C-G-------              \--*  CALL      void   System.Runtime.Intrinsics.X86.Sse2.ShiftRightLogical128BitLane
               [000015] L----------- arg0            +--*  ADDR      byref 
               [000014] ------------                 |  \--*  LCL_VAR   simd16 V00 loc0         
               [000012] x----------- arg1            +--*  OBJ(16)   simd16
               [000011] L-----------                 |  \--*  ADDR      byref 
               [000006] ------------                 |     \--*  LCL_VAR   simd16 V00 loc0         
               [000008] ------------ arg2            \--*  CAST      int <- ubyte <- int
               [000007] ------------                    \--*  CNS_INT   int    1

@4creators
Copy link
Contributor

After simple fix:

Instructions as they come out of the scheduler


G_M44985_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop

G_M44985_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[F582F0A8] ptr arg pop  0
IN0004: 000016 E845000000           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop

G_M44985_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret
Allocated method code size =   33 , actual size =   33

4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
@4creators
Copy link
Contributor

No more poor codegen for HW intrinsics 😄

@voinokin
Copy link
Author

Will this fix also handle the following code ? (attempt to work this issue around)

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void Test<TChar>()
            where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();

            byte C;
            if (sizeof(TChar) == sizeof(byte))
            {
                C = 1;
            }
            else if (sizeof(TChar) == sizeof(ushort))
            {
                C = 2;
            }
            else
            {
                throw new InvalidOperationException();
            }

            v = Sse2.ShiftRightLogical128BitLane(v, C);

            Console.WriteLine(Sse41.Extract(v, 0));
        }

@4creators
Copy link
Contributor

Nope. It's too complex for codegen right now, but most importantly simple and equivalent approach works.

@voinokin
Copy link
Author

...once you know it :-)

@mikedn
Copy link
Contributor

mikedn commented Sep 11, 2018

..which makes things a bit funny in the context of CLR due to roundtrips of most calculations on byte thru int ;-)

Yeah. Making those parameters byte instead of int was likely a bad idea.

But then JIT's importer could and should probably do some trivial constant folding. constant/cast sequences do appear in other situations as well. Most commonly I've seen this with long constants where the C# compiler decides to save space and emits ldc.i4 and conv.i8. The importer obeys and generates a cast node, that then the optimizer removes.

Usually it's not a good idea to mix IL importing with optimizations but there are some cases where that's reasonable, especially when dealing with IL's own encoding specifics/limitations.

@4creators
Copy link
Contributor

But then JIT's importer could and should probably do some trivial constant folding.

In this fix importer does not do any const folding just recognizes that cast is done on const. Later stages take care of const folding sufficiently well to let them do the job. The problem here is that Roslyn does constant folding and evidently it missed arg2.

@4creators
Copy link
Contributor

...once you know it :-)

Yeah, right, it's not best of worlds we can have here.

@mikedn
Copy link
Contributor

mikedn commented Sep 11, 2018

In this fix importer does not do any const folding just recognizes that cast is done on const.

Until someone tries something like 2 * sizeof(TChar).

@4creators
Copy link
Contributor

Until someone tries something like 2 * sizeof(TChar).

This should be handled by Roslyn. Will check.

@mikedn
Copy link
Contributor

mikedn commented Sep 11, 2018

This should be handled by Roslyn. Will check.

It's impossible, sizeof(TChar) is not a compile time constant.

@4creators
Copy link
Contributor

@mikedn You are right. Added function to fold const expressions during import and it works as expected:

Test methods

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();
            v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar));
            Console.WriteLine(Sse41.Extract(v, 0));
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test1<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();
            v = Sse2.ShiftRightLogical128BitLane(v, (byte)((3 * sizeof(TChar)) / 2 + 4 - sizeof(TChar)));
            Console.WriteLine(Sse41.Extract(v, 0));
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static unsafe void Test2<TChar>() where TChar : unmanaged
        {
            Vector128<byte> v = Sse2.SetZeroVector128<byte>();

            byte C;
            if (sizeof(TChar) == sizeof(byte))
            {
                v = Sse2.ShiftRightLogical128BitLane(v, 1);
            }
            else if (sizeof(TChar) == sizeof(ushort))
            {
                v = Sse2.ShiftRightLogical128BitLane(v, 2);
            }
            else
            {
                throw new InvalidOperationException();
            }

            Console.WriteLine(Sse41.Extract(v, 0));
        }

Method Test

Instructions as they come out of the scheduler


G_M44985_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M44985_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A96D338] ptr arg pop  0
IN0004: 000016 E87DE6FFFF           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M44985_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

Method Test1

Instructions as they come out of the scheduler


G_M17352_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M17352_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D805         vpsrldq  xmm0, xmm0, 5
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A97D9A8] ptr arg pop  0
IN0004: 000016 E845000000           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M17352_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

Method Test2

Instructions as they come out of the scheduler


G_M17451_IG01:        ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28             sub      rsp, 40
IN0007: 000004 90                   nop      

G_M17451_IG02:        ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0           vpxor    xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802         vpsrldq  xmm0, xmm0, 2
IN0003: 000010 C4E37914C100         vpextrb  ecx, xmm0, 0
[6A97DE18] ptr arg pop  0
IN0004: 000016 E855FEFFFF           call     System.Console:WriteLine(int)
IN0005: 00001B 90                   nop      

G_M17451_IG03:        ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428             add      rsp, 40
IN0009: 000020 C3                   ret      
Allocated method code size =   33 , actual size =   33

@4creators
Copy link
Contributor

And in the importer we get:

Method Test:

*************** In impImport() for Repro.Program:Test()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) sizeof 1B000002
    [ 2]  13 (0x00d) conv.u1
    [ 2]  14 (0x00e) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

Folding operator with constant nodes into a constant:
               [000008] ------------              *  CAST      int <- ubyte <- int
               [000007] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000008] ------------              *  CNS_INT   int    2
  Known type Vector128<byte>

    [ 1]  19 (0x013) stloc.0

               [000013] ------------              *  STMT      void  (IL 0x006...  ???)
               [000008] ------------              |     /--*  CNS_INT   int    2
               [000009] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000006] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000012] -A----------              \--*  ASG       simd16 (copy)
               [000010] D-----------                 \--*  LCL_VAR   simd16 V00 loc0 

Method Test1

*************** In impImport() for Repro.Program:Test1()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test1()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) ldloc.0
    [ 1]   7 (0x007) ldc.i4.3 3
    [ 2]   8 (0x008) sizeof 1B000002
    [ 3]  14 (0x00e) mul
    [ 2]  15 (0x00f) ldc.i4.2 2
    [ 3]  16 (0x010) div
    [ 2]  17 (0x011) ldc.i4.4 4
    [ 3]  18 (0x012) add
    [ 2]  19 (0x013) sizeof 1B000002
    [ 3]  25 (0x019) sub
    [ 2]  26 (0x01a) conv.u1
    [ 2]  27 (0x01b) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

Folding operator with constant nodes into a constant:
               [000008] ------------              /--*  CNS_INT   int    2
               [000009] ------------              *  MUL       int   
               [000007] ------------              \--*  CNS_INT   int    3
Bashed to int constant:
               [000009] ------------              *  CNS_INT   int    6

Folding operator with constant nodes into a constant:
               [000010] ------------              /--*  CNS_INT   int    2
               [000011] ------------              *  DIV       int   
               [000009] ------------              \--*  CNS_INT   int    6
Bashed to int constant:
               [000011] ------------              *  CNS_INT   int    3

Folding operator with constant nodes into a constant:
               [000012] ------------              /--*  CNS_INT   int    4
               [000013] ------------              *  ADD       int   
               [000011] ------------              \--*  CNS_INT   int    3
Bashed to int constant:
               [000013] ------------              *  CNS_INT   int    7

Folding operator with constant nodes into a constant:
               [000014] ------------              /--*  CNS_INT   int    2
               [000015] ------------              *  SUB       int   
               [000013] ------------              \--*  CNS_INT   int    7
Bashed to int constant:
               [000015] ------------              *  CNS_INT   int    5

Folding operator with constant nodes into a constant:
               [000016] ------------              *  CAST      int <- ubyte <- int
               [000015] ------------              \--*  CNS_INT   int    5
Bashed to int constant:
               [000016] ------------              *  CNS_INT   int    5
  Known type Vector128<byte>

    [ 1]  32 (0x020) stloc.0

               [000021] ------------              *  STMT      void  (IL 0x006...  ???)
               [000016] ------------              |     /--*  CNS_INT   int    5
               [000017] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000006] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000020] -A----------              \--*  ASG       simd16 (copy)
               [000018] D-----------                 \--*  LCL_VAR   simd16 V00 loc0   

Method Test2

*************** In impImport() for Repro.Program:Test2()

impImportBlockPending for BB01

Importing BB01 (PC=000) of 'Repro.Program:Test2()'
    [ 0]   0 (0x000) call 2B000001
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>

    [ 1]   5 (0x005) stloc.0

               [000005] ------------              *  STMT      void  (IL 0x000...  ???)
               [000001] ------------              |  /--*  HWIntrinsic simd16 ubyte SetZeroVector128
               [000004] -A----------              \--*  ASG       simd16 (copy)
               [000002] D-----------                 \--*  LCL_VAR   simd16 V00 loc0         

    [ 0]   6 (0x006) sizeof 1B000002
    [ 1]  12 (0x00c) ldc.i4.1 1
    [ 2]  13 (0x00d) bne.un.s
Folding operator with constant nodes into a constant:
               [000007] ------------              /--*  CNS_INT   int    1
               [000008] N--------U--              *  NE        int   
               [000006] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000008] ------------              *  CNS_INT   int    1

The conditional jump becomes an unconditional jump to BB03

impImportBlockPending for BB03

Importing BB03 (PC=025) of 'Repro.Program:Test2()'
    [ 0]  25 (0x019) sizeof 1B000002
    [ 1]  31 (0x01f) ldc.i4.2 2
    [ 2]  32 (0x020) bne.un.s
Folding operator with constant nodes into a constant:
               [000011] ------------              /--*  CNS_INT   int    2
               [000012] N--------U--              *  NE        int   
               [000010] ------------              \--*  CNS_INT   int    2
Bashed to int constant:
               [000012] ------------              *  CNS_INT   int    0

The block falls through into the next BB04

impImportBlockPending for BB04

Importing BB04 (PC=034) of 'Repro.Program:Test2()'
    [ 0]  34 (0x022) ldloc.0
    [ 1]  35 (0x023) ldc.i4.2 2
    [ 2]  36 (0x024) call 0A000010
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
  Known type Vector128<byte>
  Known type Vector128<byte>

    [ 1]  41 (0x029) stloc.0

               [000020] ------------              *  STMT      void  (IL 0x022...  ???)
               [000015] ------------              |     /--*  CNS_INT   int    2
               [000016] ------------              |  /--*  HWIntrinsic simd16 ubyte ShiftRightLogical128BitLane
               [000014] ------------              |  |  \--*  LCL_VAR   simd16 V00 loc0         
               [000019] -A----------              \--*  ASG       simd16 (copy)
               [000017] D-----------                 \--*  LCL_VAR   simd16 V00 loc0

4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
Fixes #19888

This enables expansin of imm HW Intrinsics which otherwise would be emitted
with jump tables. Significantly improves user experience and performance
4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
…uring import

Fixes #19888

This enables expansion of imm HW Intrinsics which otherwise would be emitted
as function calls with jump tables. Folding of const expressions happens during
morphing and it means that directly before SSA generation and value numbering
HW intrinsics are in final shape allowing them to participate in later optimizations
once HW intrinsic support for value numbering is fully implemented.

Significantly improves user experience.
4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
…uring import

Fixes #19888

This enables expansion of imm HW Intrinsics which otherwise would be emitted
as function calls with jump tables. Folding of const expressions happens during
morphing and it means that directly before SSA generation and value numbering
HW intrinsics are in final shape allowing them to participate in later optimizations
once HW intrinsic support for value numbering is fully implemented.

Significantly improves user experience.
4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
…g import

Fixes #19888

Solution is based on a call to fgMorphTreeNode without using global morphing
and null context. The execution path goes exactly through complete constant
expression folding path used by jit during morphing. The chosen entry point
is significantly better than gtFoldExpr as it is specifically designed to
handle complete const expression tree folding including cast morphing.

There seems to be no need to upgrade gtFoldExpr to have expanded functionality
as it is called as a part of code execution path starting with fgMorphTreeNode
call. Additional benefit of not using gtFoldExpr is that this method was destructive
for it's operands when processed. In contrast fgMorphTreeNode returns new tree
leaving old tree untouched. Therefore, in the case of failed folding, what should
be a supported scenario, importer uses unmodified tree.

This PR replaces dotnet#19889
4creators referenced this issue in dotnetrt/coreclr Sep 11, 2018
… import

Fixes #19888

Solution is based on a call to fgMorphTree without using global morphing
and null context. The execution path goes exactly through complete constant
expression folding path used by jit during morphing. The chosen entry point
is significantly better than gtFoldExpr as it is specifically designed to
handle complete const expression tree folding including cast morphing.

There seems to be no need to upgrade gtFoldExpr to have expanded functionality
as it is called as a part of code execution path starting with fgMorphTree
call. Additional benefit of not using gtFoldExpr is that this method is destructive
for it's operands when processed. In contrast fgMorphTree returns new tree
leaving old tree untouched. Therefore, in the case of failed folding, what should
be a supported scenario, importer uses unmodified tree.

This PR replaces dotnet#19889
4creators referenced this issue in dotnetrt/coreclr Sep 12, 2018
… import

Fixes #19888

Solution is based on a call to fgMorphTree without using global morphing
and null context. The execution path goes exactly through complete constant
expression folding path used by jit during morphing. The chosen entry point
is significantly better than gtFoldExpr as it is specifically designed to
handle complete const expression tree folding including cast morphing.

There seems to be no need to upgrade gtFoldExpr to have expanded functionality
as it is called as a part of code execution path starting with fgMorphTree
call. Additional benefit of not using gtFoldExpr is that this method is destructive
for it's operands when processed. In contrast fgMorphTree returns new tree
leaving old tree untouched. Therefore, in the case of failed folding, what should
be a supported scenario, importer uses unmodified tree.

This PR replaces dotnet#19889
4creators referenced this issue in dotnetrt/coreclr Sep 13, 2018
… import

Fixes #19888

Solution is based on a call to fgMorphTree without using global morphing
and null context. The execution path goes exactly through complete constant
expression folding path used by jit during morphing. The chosen entry point
is significantly better than gtFoldExpr as it is specifically designed to
handle complete const expression tree folding including cast morphing.

There seems to be no need to upgrade gtFoldExpr to have expanded functionality
as it is called as a part of code execution path starting with fgMorphTree
call. Additional benefit of not using gtFoldExpr is that this method is destructive
for it's operands when processed. In contrast fgMorphTree returns new tree
leaving old tree untouched. Therefore, in the case of failed folding, what should
be a supported scenario, importer uses unmodified tree.

This PR replaces dotnet#19889
4creators referenced this issue in dotnetrt/coreclr Sep 13, 2018
… import

Fixes #19888

Solution is based on a call to fgMorphTree without using global morphing
and null context. The execution path goes exactly through complete constant
expression folding path used by jit during morphing. The chosen entry point
is significantly better than gtFoldExpr as it is specifically designed to
handle complete const expression tree folding including cast morphing.

There seems to be no need to upgrade gtFoldExpr to have expanded functionality
as it is called as a part of code execution path starting with fgMorphTree
call. Additional benefit of not using gtFoldExpr is that this method is destructive
for it's operands when processed. In contrast fgMorphTree returns new tree
leaving old tree untouched. Therefore, in the case of failed folding, what should
be a supported scenario, importer uses unmodified tree.

This PR replaces dotnet#19889
@gdkchan
Copy link

gdkchan commented Sep 22, 2018

I also had a similar problem with the following code:

private static void Test()
{
    Vector128<float> vec = Sse.SetZeroVector128();

    vec = VectorInsertSingle(vec, 0, 1.0f);
    vec = VectorInsertSingle(vec, 1, 1.5f);
    vec = VectorInsertSingle(vec, 2, 2.0f);
    vec = VectorInsertSingle(vec, 3, 2.5f);            
}

public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    //Produces sub-optimal code (jump table fallback,
    //JIT doesn't notice that the value is constant after inline).
    return Sse41.Insert(vector, value, (byte)(index << 4));
}

After inline, the index value is constant, however the jump table fallback is still called. This is the assembly produced for Test:

00007FFD4B14361B  vmovss      xmm2,dword ptr [7FFD4B1436E0h]  
00007FFD4B143624  xor         r9d,r9d  
00007FFD4B143627  call        00007FFD4B143360  
00007FFD4B14362C  mov         rcx,qword ptr [rsp+70h]  
00007FFD4B143631  mov         rdx,qword ptr [rsp+78h]  
00007FFD4B143636  lea         r9,[rsp+60h]  
00007FFD4B14363B  lea         rax,[rsp+20h]  
00007FFD4B143640  mov         qword ptr [rax],rcx  
00007FFD4B143643  mov         qword ptr [rax+8],rdx  
00007FFD4B143647  mov         rcx,r9  
00007FFD4B14364A  lea         rdx,[rsp+20h]  
00007FFD4B14364F  vmovss      xmm2,dword ptr [7FFD4B1436E4h]  
00007FFD4B143658  mov         r9d,10h  
00007FFD4B14365E  call        00007FFD4B143360  
00007FFD4B143663  mov         rcx,qword ptr [rsp+60h]  
00007FFD4B143668  mov         rdx,qword ptr [rsp+68h]  
00007FFD4B14366D  lea         r9,[rsp+50h]  
00007FFD4B143672  lea         rax,[rsp+20h]  
00007FFD4B143677  mov         qword ptr [rax],rcx  
00007FFD4B14367A  mov         qword ptr [rax+8],rdx  
00007FFD4B14367E  mov         rcx,r9  
00007FFD4B143681  lea         rdx,[rsp+20h]  
...

The values are constant (0, 0x10 etc), but it still produces the call.

Trying to workaround the issue I wrote another version using a switch case that use the constant values directly instead of relying on the JIT to do constant folding:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    switch (index)
    {
        case 0: return Sse41.Insert(vector, value, 0x00);
        case 1: return Sse41.Insert(vector, value, 0x10);
        case 2: return Sse41.Insert(vector, value, 0x20);
        case 3: return Sse41.Insert(vector, value, 0x30);

        default: throw new ArgumentOutOfRangeException(nameof(index));
    }
}

This also doesn't work well because the JIT is not capable of inlining methods with switch cases so the code produced is also sub-optimal.

Finally we have this version using if/else:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    if (index == 0)
    {
        return Sse41.Insert(vector, value, 0x00);
    }
    else if (index == 1)
    {
        return Sse41.Insert(vector, value, 0x10);
    }
    else if (index == 2)
    {
        return Sse41.Insert(vector, value, 0x20);
    }
    else if (index == 3)
    {
        return Sse41.Insert(vector, value, 0x30);
    }
    else
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }
}

This version produces more or less the expected code (this is the same Test method, calling the above VectorInsertSingle method):

00007FFD4B1235EC  vxorps      xmm0,xmm0,xmm0  
00007FFD4B1235F1  vmovss      xmm1,dword ptr [7FFD4B123668h]  
00007FFD4B1235FA  vinsertps   xmm0,xmm0,xmm1,0  
00007FFD4B123600  vmovapd     xmmword ptr [rsp+50h],xmm0  
00007FFD4B123607  vmovapd     xmm0,xmmword ptr [rsp+50h]  
00007FFD4B12360E  vmovss      xmm1,dword ptr [7FFD4B12366Ch]  
00007FFD4B123617  vinsertps   xmm0,xmm0,xmm1,10h  
00007FFD4B12361D  vmovapd     xmmword ptr [rsp+40h],xmm0  
00007FFD4B123624  vmovapd     xmm0,xmmword ptr [rsp+40h]  
00007FFD4B12362B  vmovss      xmm1,dword ptr [7FFD4B123670h]  
00007FFD4B123634  vinsertps   xmm0,xmm0,xmm1,20h  
00007FFD4B12363A  vmovapd     xmmword ptr [rsp+30h],xmm0  
00007FFD4B123641  vmovapd     xmm0,xmmword ptr [rsp+30h]  
00007FFD4B123648  vmovss      xmm1,dword ptr [7FFD4B123674h]  
00007FFD4B123651  vinsertps   xmm0,xmm0,xmm1,30h  
00007FFD4B123657  vmovapd     xmmword ptr [rsp+20h],xmm0  
00007FFD4B12365E  add         rsp,68h  
00007FFD4B123662  ret

The method was properly inlined, however there are still some register moves that I believe that are not necessary (for example XMM0 is spilled then immediately moved back). Furthermore this instruction also accepts a memory location, so the whole method after inline could be just a single instruction, something like this:

vxorps      xmm0,xmm0,xmm0
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123668h],0
vinsertps   xmm0,xmm0,dword ptr [7FFD4B12366Ch],10h
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123670h],20h  
vinsertps   xmm0,xmm0,dword ptr [7FFD4B123674h],30h

I know there are more efficient ways to load a vector, this was just to demonstrate the usage of the VectorInsertSingle method.

Those tests were made with dotnet version 2.2.100-preview2-009404.

Another thing that may be worth pointing out is that the index argument name on the Vector128<float> overload of the Insert intrinsic method is a bit misleading, since it is not really index, the low bits are a zero mask and only the high bits are the actual index. One could assume it is just the destination index but that's not the case. Maybe its a good idea to rename it, I don't know if its worth opening a issue for that.

fyi @fiigii

Sorry if this is not the right place for that, but the issue is similar so I felt that opening a separate one was not necessary.

@4creators
Copy link
Contributor

Those tests were made with dotnet version 2.2.100-preview2-009404.

All new work on HW intrinsics is in master branch what means that it is not available in 2.2 preview2 and will not be available in 2.2 RTM

HW Intrinsic with immediate operand expansion

The method below cannot be handled correctly by Jit even with current not accepted solution from dotnet/coreclr#19898.

public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
    //Produces sub-optimal code (jump table fallback,
    //JIT doesn't notice that the value is constant after inline).
    return Sse41.Insert(vector, value, (byte)(index << 4));
}

The first problem is that Jit needs to recognize index variable as a const to be able to trigger HW intrinsic expansion and generate efficient code. Currently this happens only when value passed into intrinsic is a constant. Constant expressions in the form (byte)(index << 4) like in your code are not folded currently but they are handled correctly dotnet/coreclr#19898 under the assumption that index is a const or const expression. Unfortunately method parameter syntactically is not a const. There is a language proposal which could solve this dotnet/csharplang#886 but it is not accepted.

There is still a second problem here caused by method inlining. Inlining happens after decision on intrinsic expansion is made and Jit currently will not reconsider if immediate HW intrinsic should be expanded after it is inlined and arguments semantics changes like in your code. As a result your method despite working on semantically valid assumptions for Jit is not syntactically equivalent to intrinsic with imm8 constant.

I have created issue to track the second problem as this may be important in many situations as long as we do not have const parameters implemented.

@fiigii
Copy link
Contributor

fiigii commented Sep 22, 2018

@gdkchan Thanks for reporting the codegen issue. The problem is different from this issue but also detected in https://github.com/dotnet/coreclr/issues/17108.

@CarolEidt @AndyAyersMS @mikedn @tannergooding
Currently, all the codegen issues of imm-intrinsic are from two aspects:

  1. the importer does not fold some constant expressions, like GT_CAST
  2. the const arguments flow from the function parameters and the function can be inlined, but we expend intrinsic before inlining.

The first issue can be solved by morphing the expressions in the importer (like dotnet/coreclr#19898), but both the two issues can be naturally solved by https://github.com/dotnet/coreclr/issues/17108. I know there is some difficulty to introduce call-node in lowering, can we just try to expend intrinsic behind inlining and constant propagation/folding?

@4creators
Copy link
Contributor

4creators commented Sep 22, 2018

but both the two issues can be naturally solved by #9989.

IMO we should reevaluate several problems:

  1. We have mismatch between semantic requirements of imm intrinsics and C#/VB/F# semantics which will be impacting all future development of any intrinsic with immediate operands.

  2. Current implementation despite providing ability to generate code for non const immediate parameters is undesired and avoided by users in other words it solves problem for compiler writers but does not solve users problems and users want high quality and fast code. The problem of codegen for non-const immediate parameters is raised in at least couple of issues.

  3. We have deficiencies in importer and later compiler stages related to HW intrinsics handling. We do not fold constant expressions in importer, we do not reevaluate HW intrinsics expansion at later stages i.e. during morphing, value numbering, all optimizations happening before lowering and during lowering.

IMHO there is no single simple solution to problems which will be haunting imm HW intrinsic implementation, whether it would be dotnet/coreclr#19898 or #9989 or dotnet/csharplang#886 or current implementation of non-const parameter handling.

Scenarios which as turns out should be supported are much more complex:

  • support complex const expression folding
  • support immediate parameters which are jit time const
  • support semantic changes introduced by inlining
  • support bona fide immediate non-const parameters

IMHO we should do:
(i) const expression folding in importer,
(ii) reevaluate HW intrinsic expansion once or more times after morphing, value numbering and following optimizations,
(iii) const parameter semantics in C#/VB/F# from dotnet/csharplang#886,
(iv) have HW intrinsic overloads with non-const parameters what would allow for supporting of const expression which are constant only to Jit.

What could be helpful for (ii) it would be HW intrinsic support implementation for value numbering and following optimizations what could form a basis for further optimizations and even in more distant future autovectorization support.

https://github.com/dotnet/coreclr/blob/09cc49e8cac72915b72240c766e25ada171e9fe7/src/jit/valuenum.cpp#L6502

@mikedn
Copy link
Contributor

mikedn commented Sep 22, 2018

I know there is some difficulty to introduce call-node in lowering, can we just try to expend intrinsic behind inlining and constant propagation/folding?

Basically what you want is to expand on import and then attempt to expand again during lowering, hoping that non-const parameters have become const. Not expanding on import at all is not an option, you really don't want to keep a bunch of call nodes if you can avoid it. Allways expanding on import and falling back to call in lowering is not an option as well, at least not a trivial one. So you really don't have much of a choice.

Well, there's also the choice of telling people - hey, that parameter must be a constant. Period.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Oct 13, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@JulieLeeMSFT JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021
@echesakov echesakov added the Priority:3 Work that is nice to have label Jul 8, 2021
@echesakov echesakov modified the milestones: 6.0.0, Future Jul 8, 2021
@echesakov echesakov removed their assignment Mar 15, 2022
@echesakov
Copy link
Contributor

Un-assigning myself
cc @BruceForstall

@tannergooding
Copy link
Member

This and #9989 are basically the same, they could probably be merged.

The rough requirement is that we support transforming a GT_HWINTRINSIC node back into a GT_CALL. We already do this for GT_INTRINSIC nodes in rationalization, but there were issues doing the same with HWIntrinsics. IIRC, the differences had to do with TYP_SIMD and some of the ABI transformations that would otherwise be required for out buffers, shadow copies, and the like. I wasn't familiar enough with the relevant parts of the JIT to make the necessary changes.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged optimization Priority:3 Work that is nice to have
Projects
None yet