[X64] MihaZupan/runtime/cheaper-vector-narrow #390

Open
MihuBot opened this issue May 27, 2024 · 2 comments

MihuBot commented May 27, 2024

Job completed in 22 minutes.

Diffs

Found 260 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 39663927
Total bytes of diff: 39663933
Total bytes of delta: 6 (0.00 % of base)
Total relative delta: -0.27
    diff is a regression.
    relative diff is an improvement.


Total byte diff includes 97 bytes from reconciling methods
	Base had    0 unique methods,        0 unique bytes
	Diff had    2 unique methods,       97 unique bytes

Top file regressions (bytes):
           6 : System.Private.CoreLib.dasm (0.00 % of base)

1 total files with Code Size differences (0 improved, 1 regressed), 255 unchanged.

Top method regressions (bytes):
          68 (Infinity of base) : System.Private.CoreLib.dasm - System.Text.Ascii:ExtractAsciiVector(System.Runtime.Intrinsics.Vector512`1[ushort],System.Runtime.Intrinsics.Vector512`1[ushort]):System.Runtime.Intrinsics.Vector512`1[ubyte] (FullOpts) (0 base, 1 diff methods)
          29 (Infinity of base) : System.Private.CoreLib.dasm - System.Text.Ascii:ExtractAsciiVector(System.Runtime.Intrinsics.Vector256`1[ushort],System.Runtime.Intrinsics.Vector256`1[ushort]):System.Runtime.Intrinsics.Vector256`1[ubyte] (FullOpts) (0 base, 1 diff methods)

Top method improvements (bytes):
         -36 (-16.00 % of base) : System.Private.CoreLib.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong (FullOpts)
         -36 (-5.99 % of base) : System.Private.CoreLib.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong (FullOpts)
         -19 (-5.40 % of base) : System.Private.CoreLib.dasm - System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte (FullOpts)

Top method regressions (percentages):
          29 (Infinity of base) : System.Private.CoreLib.dasm - System.Text.Ascii:ExtractAsciiVector(System.Runtime.Intrinsics.Vector256`1[ushort],System.Runtime.Intrinsics.Vector256`1[ushort]):System.Runtime.Intrinsics.Vector256`1[ubyte] (FullOpts) (0 base, 1 diff methods)
          68 (Infinity of base) : System.Private.CoreLib.dasm - System.Text.Ascii:ExtractAsciiVector(System.Runtime.Intrinsics.Vector512`1[ushort],System.Runtime.Intrinsics.Vector512`1[ushort]):System.Runtime.Intrinsics.Vector512`1[ubyte] (FullOpts) (0 base, 1 diff methods)

Top method improvements (percentages):
         -36 (-16.00 % of base) : System.Private.CoreLib.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong (FullOpts)
         -36 (-5.99 % of base) : System.Private.CoreLib.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong (FullOpts)
         -19 (-5.40 % of base) : System.Private.CoreLib.dasm - System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte (FullOpts)

5 total methods with Code Size differences (3 improved, 2 regressed), 244928 unchanged.
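
As a sanity check on the summary above, the 6-byte total can be reproduced from the five per-method deltas it lists. The sketch below only re-adds numbers already present in the report; the dictionary and variable names are illustrative and not part of the diff tooling.

```python
# Per-method code size deltas (bytes), copied from the report above.
regressions = {
    "ExtractAsciiVector(Vector512)": 68,  # exists only in the diff (0 bytes in base)
    "ExtractAsciiVector(Vector256)": 29,  # exists only in the diff (0 bytes in base)
}
improvements = {
    "NarrowUtf16ToAscii_Intrinsified_256": -36,
    "NarrowUtf16ToAscii": -36,
    "TryDecodeFromUtf16_Vector128": -19,
}

# Bytes attributed to methods that are unique to the diff.
unique_diff_bytes = sum(regressions.values())

# Net change across all five methods.
total_delta = unique_diff_bytes + sum(improvements.values())

# Totals reported for System.Private.CoreLib and the other compared files.
base_total = 39663927
diff_total = 39663933

print(unique_diff_bytes)
print(total_delta)
print(diff_total - base_total)
```

Read this way, the net regression comes entirely from the two `ExtractAsciiVector` methods that appear only in the diff, while the three pre-existing methods shrink by 91 bytes in total.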

--------------------------------------------------------------------------------

Artifacts:

MihuBot commented May 27, 2024

Top method improvements

-36 (-16.00 % of base) - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong
 ; Assembly listing for method System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Unix
 ; FullOpts code
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
-; 0 inlinees with PGO data; 0 single block inlinees; 4 inlinees without PGO data
+; 0 inlinees with PGO data; 4 single block inlinees; 8 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T04] (  3,  3   )    long  ->  rdi         single-def
 ;  V01 arg1         [V01,T03] (  5,  3.50)    long  ->  rsi         single-def
 ;  V02 arg2         [V02,T05] (  3,  2.50)    long  ->  rdx         single-def
 ;  V03 loc0         [V03,T01] (  5, 10.50)   byref  ->  rdi         single-def
-;  V04 loc1         [V04,T07] ( 12, 17.50)  simd32  ->  mm0         <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;  V04 loc1         [V04,T07] ( 14, 18.50)  simd32  ->  mm0         <System.Runtime.Intrinsics.Vector256`1[ushort]>
 ;  V05 loc2         [V05,T02] (  5,  6   )   byref  ->  rax         single-def
 ;  V06 loc3         [V06,T00] ( 12, 27   )    long  ->  rcx        
 ;  V07 loc4         [V07,T06] (  2,  4.50)    long  ->  rdx        
-;  V08 loc5         [V08,T08] (  3, 12   )  simd32  ->  mm3         <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;  V08 loc5         [V08,T09] (  3, 12   )  simd32  ->  mm2         <System.Runtime.Intrinsics.Vector256`1[ushort]>
 ;# V09 OutArgs      [V09    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;* V10 tmp1         [V10    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V11 tmp2         [V11    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V12 tmp3         [V12    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V13 tmp4         [V13    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V10 tmp1         [V10    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
+;* V11 tmp2         [V11    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
+;  V12 tmp3         [V12,T08] (  2, 16   )  simd32  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
+;* V13 tmp4         [V13    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
 ;* V14 tmp5         [V14    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
 ;* V15 tmp6         [V15    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V16 tmp7         [V16    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V17 tmp8         [V17    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V18 tmp9         [V18    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;  V19 cse0         [V19,T11] (  3,  1.50)  simd32  ->  mm0         "CSE #02: moderate"
-;  V20 cse1         [V20,T12] (  3,  1.50)  simd32  ->  mm0         "CSE #04: moderate"
-;  V21 cse2         [V21,T09] (  7, 10.50)  simd32  ->  mm2         "CSE #01: moderate"
-;  V22 cse3         [V22,T10] (  5,  7   )  simd32  ->  mm1         "CSE #03: moderate"
+;* V16 tmp7         [V16    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V17 tmp8         [V17    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V18 tmp9         [V18    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V19 tmp10        [V19    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V20 tmp11        [V20    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V21 tmp12        [V21    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V22 tmp13        [V22    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V23 tmp14        [V23    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V24 tmp15        [V24    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V25 tmp16        [V25    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V26 tmp17        [V26    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V27 tmp18        [V27    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V28 tmp19        [V28    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V29 tmp20        [V29    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;  V30 cse0         [V30,T10] (  5,  7   )  simd32  ->  mm1         "CSE #01: moderate"
 ;
 ; Lcl frame size = 0
 
 G_M60588_IG01:
        push     rbp
        mov      rbp, rsp
 						;; size=4 bbWeight=1 PerfScore 1.25
 G_M60588_IG02:
        vmovups  ymm0, ymmword ptr [rdi]
        vmovups  ymm1, ymmword ptr [reloc @RWD00]
        vptest   ymm0, ymm1
        jne      G_M60588_IG10
 						;; size=23 bbWeight=1 PerfScore 15.00
 G_M60588_IG03:
        mov      rax, rsi
-       vmovups  ymm2, ymmword ptr [reloc @RWD32]
-       vpand    ymm0, ymm2, ymm0
        vpackuswb ymm0, ymm0, ymm0
        vpermq   ymm0, ymm0, -40
        vmovups  xmmword ptr [rax], xmm0
        mov      ecx, 16
        test     sil, 16
        jne      SHORT G_M60588_IG04
        vmovups  ymm0, ymmword ptr [rdi+0x20]
        vptest   ymm0, ymm1
-       jne      G_M60588_IG08
-       vpand    ymm0, ymm2, ymm0
+       jne      SHORT G_M60588_IG08
        vpackuswb ymm0, ymm0, ymm0
        vpermq   ymm0, ymm0, -40
        vmovups  xmmword ptr [rax+0x10], xmm0
-						;; size=75 bbWeight=0.50 PerfScore 13.71
+						;; size=55 bbWeight=0.50 PerfScore 11.38
 G_M60588_IG04:
        and      rsi, 31
        mov      rcx, rsi
        neg      rcx
        add      rcx, 32
        add      rdx, -32
        align    [0 bytes for IG05]
 						;; size=18 bbWeight=0.50 PerfScore 0.62
 G_M60588_IG05:
        vmovups  ymm0, ymmword ptr [rdi+2*rcx]
-       vmovups  ymm3, ymmword ptr [rdi+2*rcx+0x20]
-       vpor     ymm4, ymm0, ymm3
-       vptest   ymm4, ymm1
+       vmovups  ymm2, ymmword ptr [rdi+2*rcx+0x20]
+       vpor     ymm3, ymm0, ymm2
+       vptest   ymm3, ymm1
        je       SHORT G_M60588_IG07
 						;; size=22 bbWeight=4 PerfScore 65.33
 G_M60588_IG06:
        vptest   ymm0, ymm1
        jne      SHORT G_M60588_IG08
-       vpand    ymm3, ymm2, ymm0
-       vpand    ymm0, ymm2, ymm0
-       vpackuswb ymm2, ymm3, ymm0
-       vpermq   ymm1, ymm2, -40
-       vmovups  xmmword ptr [rax+rcx], xmm1
+       vpackuswb ymm0, ymm0, ymm0
+       vpermq   ymm2, ymm0, -40
+       vmovups  xmmword ptr [rax+rcx], xmm2
        add      rcx, 16
        jmp      SHORT G_M60588_IG08
-						;; size=36 bbWeight=0.50 PerfScore 6.96
+						;; size=28 bbWeight=0.50 PerfScore 6.62
 G_M60588_IG07:
-       vpand    ymm0, ymm2, ymm0
-       vpand    ymm3, ymm2, ymm3
-       vpackuswb ymm0, ymm0, ymm3
+       vpackuswb ymm0, ymm0, ymm2
        vpermq   ymm0, ymm0, -40
        vmovups  ymmword ptr [rax+rcx], ymm0
        add      rcx, 32
        cmp      rcx, rdx
        jbe      SHORT G_M60588_IG05
-						;; size=32 bbWeight=4 PerfScore 28.67
+						;; size=24 bbWeight=4 PerfScore 26.00
 G_M60588_IG08:
        mov      rax, rcx
 						;; size=3 bbWeight=0.50 PerfScore 0.12
 G_M60588_IG09:
        vzeroupper 
        pop      rbp
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 1.25
 G_M60588_IG10:
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
 G_M60588_IG11:
        vzeroupper 
        pop      rbp
        ret      
 						;; size=5 bbWeight=0.50 PerfScore 1.25
 RWD00  	dq	FF80FF80FF80FF80h, FF80FF80FF80FF80h, FF80FF80FF80FF80h, FF80FF80FF80FF80h
-RWD32  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
 
 
-; Total bytes of code 225, prolog size 4, PerfScore 134.29, instruction count 58, allocated bytes for code 225 (MethodHash=910c1353) for method System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong (FullOpts)
+; Total bytes of code 189, prolog size 4, PerfScore 128.96, instruction count 51, allocated bytes for code 189 (MethodHash=910c1353) for method System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_256(ulong,ulong,ulong):ulong (FullOpts)
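
The listing above drops every `vpand` against the `00FF...` constant (`RWD32`) that used to precede `vpackuswb`. One reading of why this is safe: each pack is reached only after a `vptest` against the `FF80...` constant (`RWD00`) has shown that no lane has bits above 0x7F set, so the saturating pack cannot clamp anything. The following is a scalar sketch of that per-lane argument, not the JIT's or the library's implementation; the function names are hypothetical.

```python
ASCII_MASK = 0xFF80     # per-lane value of RWD00 in the listing
LOW_BYTE_MASK = 0x00FF  # per-lane value of the removed RWD32 constant


def is_ascii_lane(lane: int) -> bool:
    """Scalar analogue of the vptest guard for one 16-bit lane."""
    return (lane & ASCII_MASK) == 0


def pack_unsigned_saturate(lane: int) -> int:
    """Per-lane behaviour of vpackuswb: read the 16-bit lane as signed, clamp to 0..255."""
    signed = lane - 0x10000 if lane & 0x8000 else lane
    return max(0, min(255, signed))


def narrow_base(lane: int) -> int:
    """Base codegen: mask to the low byte, then pack."""
    return pack_unsigned_saturate(lane & LOW_BYTE_MASK)


def narrow_diff(lane: int) -> int:
    """Diff codegen: pack directly, relying on the earlier ASCII guard."""
    return pack_unsigned_saturate(lane)


# Every lane value that passes the guard narrows identically either way.
ascii_lanes = [lane for lane in range(0x10000) if is_ascii_lane(lane)]
guarded_results_match = all(
    narrow_base(lane) == narrow_diff(lane) == lane for lane in ascii_lanes
)
print(len(ascii_lanes), guarded_results_match)

# Without the guard the two sequences diverge, e.g. for U+0141.
print(hex(narrow_base(0x0141)), hex(narrow_diff(0x0141)))
```

The second print shows why the guard matters: for a non-ASCII lane the masked form truncates while the unmasked form saturates, so the shorter sequence is only equivalent on the paths where the `vptest` has already passed. In the listing, the three affected instruction groups (IG03 from 75 to 55 bytes, IG06 from 36 to 28, IG07 from 32 to 24) account for the full 36-byte reduction.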
-36 (-5.99 % of base) - System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong
 ; Assembly listing for method System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Unix
 ; FullOpts code
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
-; 0 inlinees with PGO data; 9 single block inlinees; 16 inlinees without PGO data
+; 0 inlinees with PGO data; 13 single block inlinees; 20 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T05] (  8,  8.50)    long  ->  rdi         single-def
 ;  V01 arg1         [V01,T04] ( 12, 10.50)    long  ->  rsi         single-def
 ;  V02 arg2         [V02,T09] (  7,  5   )    long  ->  rdx         single-def
 ;  V03 loc0         [V03,T00] ( 22, 29.50)    long  ->  rax        
 ;  V04 loc1         [V04,T10] ( 13,  6.50)     int  ->  rcx        
 ;* V05 loc2         [V05    ] (  0,  0   )     int  ->  zero-ref   
 ;  V06 loc3         [V06,T03] (  7, 14   )    long  ->  registers  
 ;  V07 loc4         [V07,T18] (  5,  2.50)    long  ->  rdx        
 ;  V08 loc5         [V08,T13] (  2,  4.50)    long  ->  rcx        
 ;# V09 OutArgs      [V09    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V10 tmp1         [V10,T19] (  3,  1.50)    long  ->  rax         "Inline return value spill temp"
 ;  V11 tmp2         [V11,T06] (  5,  9.50)   byref  ->  rax         single-def "Inline stloc first use temp"
-;  V12 tmp3         [V12,T25] ( 12, 16.50)  simd32  ->  mm0         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;  V12 tmp3         [V12,T24] ( 14, 17.50)  simd32  ->  mm0         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
 ;  V13 tmp4         [V13,T11] (  5,  6   )   byref  ->  rcx         single-def "Inline stloc first use temp"
-;  V14 tmp5         [V14,T01] ( 12, 27   )    long  ->   r8         "Inline stloc first use temp"
-;  V15 tmp6         [V15,T14] (  2,  4.50)    long  ->   r9         "Inline stloc first use temp"
-;  V16 tmp7         [V16,T27] (  3, 12   )  simd32  ->  mm3         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V17 tmp8         [V17    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V18 tmp9         [V18    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V19 tmp10        [V19    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V20 tmp11        [V20    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V14 tmp5         [V14    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
+;  V15 tmp6         [V15,T01] ( 12, 27   )    long  ->   r8         "Inline stloc first use temp"
+;  V16 tmp7         [V16,T14] (  2,  4.50)    long  ->   r9         "Inline stloc first use temp"
+;  V17 tmp8         [V17,T28] (  3, 12   )  simd32  ->  mm2         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V18 tmp9         [V18    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
+;  V19 tmp10        [V19,T26] (  2, 16   )  simd32  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
+;* V20 tmp11        [V20    ] (  0,  0   )  simd32  ->  zero-ref    "spilled call-like call argument"
 ;* V21 tmp12        [V21    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
 ;* V22 tmp13        [V22    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V23 tmp14        [V23    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;* V24 tmp15        [V24    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V25 tmp16        [V25    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
-;  V26 tmp17        [V26,T20] (  3,  1.50)    long  ->  rax         "Inline return value spill temp"
-;* V27 tmp18        [V27,T22] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
-;* V28 tmp19        [V28    ] (  0,  0   )    long  ->  zero-ref    "Inline stloc first use temp"
-;  V29 tmp20        [V29,T07] (  5,  9.50)   byref  ->  rax         single-def "Inline stloc first use temp"
-;  V30 tmp21        [V30,T24] ( 14, 17.50)  simd16  ->  mm0         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;  V31 tmp22        [V31,T12] (  5,  6   )   byref  ->  rcx         single-def "Inline stloc first use temp"
-;* V32 tmp23        [V32    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
-;  V33 tmp24        [V33,T02] ( 11, 26.50)    long  ->   r8         "Inline stloc first use temp"
-;  V34 tmp25        [V34,T15] (  2,  4.50)    long  ->   r9         "Inline stloc first use temp"
-;  V35 tmp26        [V35,T28] (  3, 12   )  simd16  ->  mm2         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V36 tmp27        [V36    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
-;  V37 tmp28        [V37,T26] (  2, 16   )  simd16  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
-;* V38 tmp29        [V38    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
-;* V39 tmp30        [V39    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V40 tmp31        [V40    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V41 tmp32        [V41    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
-;* V42 tmp33        [V42    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V43 tmp34        [V43    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V44 tmp35        [V44    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
-;* V45 tmp36        [V45    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V46 tmp37        [V46    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V47 tmp38        [V47    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V48 tmp39        [V48    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;* V49 tmp40        [V49    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;* V50 tmp41        [V50    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
-;* V51 tmp42        [V51    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
-;  V52 tmp43        [V52,T23] (  3, 24   )  simd16  ->  mm0         "dup spill"
-;* V53 tmp44        [V53    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[uint]>
-;* V54 tmp45        [V54    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
-;  V55 tmp46        [V55,T16] (  3,  3   )   byref  ->   r8         single-def "Inlining Arg"
-;  V56 tmp47        [V56,T17] (  3,  3   )   byref  ->  rdx         "Inlining Arg"
-;* V57 tmp48        [V57,T21] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;  V58 cse0         [V58,T08] (  3,  8.50)    long  ->  r10         "CSE #07: moderate"
-;  V59 cse1         [V59,T32] (  3,  1.50)  simd32  ->  mm0         "CSE #02: moderate"
-;  V60 cse2         [V60,T33] (  3,  1.50)  simd32  ->  mm0         "CSE #04: moderate"
-;  V61 cse3         [V61,T29] (  7, 10.50)  simd32  ->  mm2         "CSE #01: aggressive"
-;  V62 cse4         [V62,T30] (  5,  6   )  simd32  ->  mm1         "CSE #03: moderate"
-;  V63 cse5         [V63,T31] (  5,  6   )  simd16  ->  mm1         "CSE #06: moderate"
+;* V23 tmp14        [V23    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V24 tmp15        [V24    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V25 tmp16        [V25    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V26 tmp17        [V26    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V27 tmp18        [V27    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V28 tmp19        [V28    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V29 tmp20        [V29    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V30 tmp21        [V30    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V31 tmp22        [V31    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V32 tmp23        [V32    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V33 tmp24        [V33    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V34 tmp25        [V34    ] (  0,  0   )  simd32  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector256`1[ushort]>
+;* V35 tmp26        [V35    ] (  0,  0   )  simd32  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;* V36 tmp27        [V36    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ubyte]>
+;  V37 tmp28        [V37,T20] (  3,  1.50)    long  ->  rax         "Inline return value spill temp"
+;* V38 tmp29        [V38,T22] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
+;* V39 tmp30        [V39    ] (  0,  0   )    long  ->  zero-ref    "Inline stloc first use temp"
+;  V40 tmp31        [V40,T07] (  5,  9.50)   byref  ->  rax         single-def "Inline stloc first use temp"
+;  V41 tmp32        [V41,T25] ( 14, 17.50)  simd16  ->  mm0         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;  V42 tmp33        [V42,T12] (  5,  6   )   byref  ->  rcx         single-def "Inline stloc first use temp"
+;* V43 tmp34        [V43    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
+;  V44 tmp35        [V44,T02] ( 11, 26.50)    long  ->   r8         "Inline stloc first use temp"
+;  V45 tmp36        [V45,T15] (  2,  4.50)    long  ->   r9         "Inline stloc first use temp"
+;  V46 tmp37        [V46,T29] (  3, 12   )  simd16  ->  mm2         "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V47 tmp38        [V47    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
+;  V48 tmp39        [V48,T27] (  2, 16   )  simd16  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
+;* V49 tmp40        [V49    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
+;* V50 tmp41        [V50    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V51 tmp42        [V51    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V52 tmp43        [V52    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
+;* V53 tmp44        [V53    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V54 tmp45        [V54    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V55 tmp46        [V55    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
+;* V56 tmp47        [V56    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V57 tmp48        [V57    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V58 tmp49        [V58    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V59 tmp50        [V59    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;* V60 tmp51        [V60    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;* V61 tmp52        [V61    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
+;* V62 tmp53        [V62    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
+;  V63 tmp54        [V63,T23] (  3, 24   )  simd16  ->  mm0         "dup spill"
+;* V64 tmp55        [V64    ] (  0,  0   )  simd16  ->  zero-ref    "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[uint]>
+;* V65 tmp56        [V65    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
+;  V66 tmp57        [V66,T16] (  3,  3   )   byref  ->   r8         single-def "Inlining Arg"
+;  V67 tmp58        [V67,T17] (  3,  3   )   byref  ->  rdx         "Inlining Arg"
+;* V68 tmp59        [V68,T21] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;  V69 cse0         [V69,T08] (  3,  8.50)    long  ->  r10         "CSE #03: moderate"
+;  V70 cse1         [V70,T30] (  5,  6   )  simd32  ->  mm1         "CSE #01: moderate"
+;  V71 cse2         [V71,T31] (  5,  6   )  simd16  ->  mm1         "CSE #02: moderate"
 ;
 ; Lcl frame size = 0
 
 G_M6063_IG01:
        push     rbp
        mov      rbp, rsp
 						;; size=4 bbWeight=1 PerfScore 1.25
 G_M6063_IG02:
        xor      eax, eax
        cmp      rdx, 32
        jb       G_M6063_IG18
 						;; size=12 bbWeight=1 PerfScore 1.50
 G_M6063_IG03:
        mov      rcx, qword ptr [rdi]
        mov      r8, 0xD1FFAB1E
        test     rcx, r8
        mov      r8, rcx
        jne      G_M6063_IG20
        cmp      rdx, 64
        jae      G_M6063_IG11
        mov      rax, rdi
        vmovups  xmm0, xmmword ptr [rax]
        vmovups  xmm1, xmmword ptr [reloc @RWD00]
        vptest   xmm0, xmm1
        jne      G_M6063_IG09
        mov      rcx, rsi
        vpackuswb xmm0, xmm0, xmm0
        vmovsd   qword ptr [rcx], xmm0
        mov      r8d, 8
        test     sil, 8
        jne      SHORT G_M6063_IG04
        vmovups  xmm0, xmmword ptr [rax+0x10]
        vptest   xmm0, xmm1
        jne      SHORT G_M6063_IG08
        vpackuswb xmm0, xmm0, xmm0
        vmovsd   qword ptr [rcx+0x08], xmm0
 						;; size=105 bbWeight=0.50 PerfScore 16.00
 G_M6063_IG04:
        mov      r8, rsi
        and      r8, 15
        neg      r8
        add      r8, 16
        lea      r9, [rdx-0x10]
        align    [0 bytes for IG05]
 						;; size=18 bbWeight=0.50 PerfScore 0.75
 G_M6063_IG05:
        vmovups  xmm0, xmmword ptr [rax+2*r8]
        lea      r10, [r8+0x08]
        vmovups  xmm2, xmmword ptr [rax+2*r10]
        vpor     xmm3, xmm0, xmm2
        vptest   xmm3, xmm1
        je       SHORT G_M6063_IG07
 						;; size=27 bbWeight=4 PerfScore 51.33
 G_M6063_IG06:
        vptest   xmm0, xmm1
        jne      SHORT G_M6063_IG08
        vpackuswb xmm0, xmm0, xmm0
        vmovsd   qword ptr [rcx+r8], xmm0
        mov      r8, r10
        jmp      SHORT G_M6063_IG08
        align    [0 bytes for IG13]
 						;; size=22 bbWeight=0.50 PerfScore 4.62
 G_M6063_IG07:
        vpackuswb xmm0, xmm0, xmm2
        vmovups  xmmword ptr [rcx+r8], xmm0
        add      r8, 16
        cmp      r8, r9
        jbe      SHORT G_M6063_IG05
 						;; size=19 bbWeight=4 PerfScore 18.00
 G_M6063_IG08:
        mov      rax, r8
        jmp      SHORT G_M6063_IG10
 						;; size=5 bbWeight=0.50 PerfScore 1.12
 G_M6063_IG09:
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
 G_M6063_IG10:
        jmp      G_M6063_IG18
 						;; size=5 bbWeight=0.50 PerfScore 1.00
 G_M6063_IG11:
        mov      rax, rdi
        vmovups  ymm0, ymmword ptr [rax]
        vmovups  ymm1, ymmword ptr [reloc @RWD32]
        vptest   ymm0, ymm1
        jne      G_M6063_IG17
        mov      rcx, rsi
-       vmovups  ymm2, ymmword ptr [reloc @RWD64]
-       vpand    ymm0, ymm2, ymm0
        vpackuswb ymm0, ymm0, ymm0
        vpermq   ymm0, ymm0, -40
        vmovups  xmmword ptr [rcx], xmm0
        mov      r8d, 16
        test     sil, 16
        jne      SHORT G_M6063_IG12
        vmovups  ymm0, ymmword ptr [rax+0x20]
        vptest   ymm0, ymm1
-       jne      G_M6063_IG16
-       vpand    ymm0, ymm2, ymm0
+       jne      SHORT G_M6063_IG16
        vpackuswb ymm0, ymm0, ymm0
        vpermq   ymm0, ymm0, -40
        vmovups  xmmword ptr [rcx+0x10], xmm0
-						;; size=102 bbWeight=0.50 PerfScore 21.33
+						;; size=82 bbWeight=0.50 PerfScore 19.00
 G_M6063_IG12:
        mov      r8, rsi
        and      r8, 31
        neg      r8
        add      r8, 32
        lea      r9, [rdx-0x20]
 						;; size=18 bbWeight=0.50 PerfScore 0.75
 G_M6063_IG13:
        vmovups  ymm0, ymmword ptr [rax+2*r8]
-       vmovups  ymm3, ymmword ptr [rax+2*r8+0x20]
-       vpor     ymm4, ymm0, ymm3
-       vptest   ymm4, ymm1
+       vmovups  ymm2, ymmword ptr [rax+2*r8+0x20]
+       vpor     ymm3, ymm0, ymm2
+       vptest   ymm3, ymm1
        je       SHORT G_M6063_IG15
 						;; size=24 bbWeight=4 PerfScore 65.33
 G_M6063_IG14:
        vptest   ymm0, ymm1
        jne      SHORT G_M6063_IG16
-       vpand    ymm3, ymm2, ymm0
-       vpand    ymm0, ymm2, ymm0
-       vpackuswb ymm2, ymm3, ymm0
-       vpermq   ymm1, ymm2, -40
-       vmovups  xmmword ptr [rcx+r8], xmm1
+       vpackuswb ymm0, ymm0, ymm0
+       vpermq   ymm2, ymm0, -40
+       vmovups  xmmword ptr [rcx+r8], xmm2
        add      r8, 16
        jmp      SHORT G_M6063_IG16
        align    [0 bytes for IG19]
-						;; size=37 bbWeight=0.50 PerfScore 6.96
+						;; size=29 bbWeight=0.50 PerfScore 6.62
 G_M6063_IG15:
-       vpand    ymm0, ymm2, ymm0
-       vpand    ymm3, ymm2, ymm3
-       vpackuswb ymm0, ymm0, ymm3
+       vpackuswb ymm0, ymm0, ymm2
        vpermq   ymm0, ymm0, -40
        vmovups  ymmword ptr [rcx+r8], ymm0
        add      r8, 32
        cmp      r8, r9
        jbe      SHORT G_M6063_IG13
-						;; size=33 bbWeight=4 PerfScore 28.67
+						;; size=25 bbWeight=4 PerfScore 26.00
 G_M6063_IG16:
        mov      rax, r8
        jmp      SHORT G_M6063_IG18
 						;; size=5 bbWeight=0.50 PerfScore 1.12
 G_M6063_IG17:
        xor      eax, eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
 G_M6063_IG18:
        sub      rdx, rax
        cmp      rdx, 4
        jb       SHORT G_M6063_IG22
        lea      rcx, [rax+rdx-0x04]
 						;; size=14 bbWeight=0.50 PerfScore 1.25
 G_M6063_IG19:
        mov      r8, qword ptr [rdi+2*rax]
        mov      r9, 0xD1FFAB1E
        test     r8, r9
        je       SHORT G_M6063_IG21
 						;; size=19 bbWeight=4 PerfScore 14.00
 G_M6063_IG20:
        mov      ecx, r8d
        test     ecx, 0xD1FFAB1E
        jne      SHORT G_M6063_IG23
        lea      rdx, [rsi+rax]
        mov      byte  ptr [rdx], cl
        shr      ecx, 16
        mov      byte  ptr [rdx+0x01], cl
        shr      r8, 32
        mov      ecx, r8d
        add      rax, 2
        jmp      SHORT G_M6063_IG23
 						;; size=36 bbWeight=0.50 PerfScore 3.75
 G_M6063_IG21:
        vmovd    xmm0, r8
        vpackuswb xmm0, xmm0, xmm0
        vmovd    dword ptr [rsi+rax], xmm0
        add      rax, 4
        cmp      rax, rcx
        jbe      SHORT G_M6063_IG19
 						;; size=23 bbWeight=4 PerfScore 26.00
 G_M6063_IG22:
        test     dl, 2
        je       SHORT G_M6063_IG25
        mov      ecx, dword ptr [rdi+2*rax]
        test     ecx, 0xD1FFAB1E
        je       SHORT G_M6063_IG24
 						;; size=16 bbWeight=0.50 PerfScore 2.25
 G_M6063_IG23:
        test     ecx, 0xFF80
        je       SHORT G_M6063_IG26
        jmp      SHORT G_M6063_IG27
 						;; size=10 bbWeight=0.50 PerfScore 1.62
 G_M6063_IG24:
        lea      r8, [rsi+rax]
        mov      byte  ptr [r8], cl
        shr      ecx, 16
        mov      byte  ptr [r8+0x01], cl
        add      rax, 2
 						;; size=18 bbWeight=0.50 PerfScore 1.62
 G_M6063_IG25:
        test     dl, 1
        je       SHORT G_M6063_IG27
        movzx    rcx, word  ptr [rdi+2*rax]
        cmp      ecx, 127
        ja       SHORT G_M6063_IG27
 						;; size=14 bbWeight=0.50 PerfScore 2.25
 G_M6063_IG26:
        mov      byte  ptr [rsi+rax], cl
        inc      rax
 						;; size=6 bbWeight=0.50 PerfScore 0.62
 G_M6063_IG27:
        vzeroupper 
        pop      rbp
        ret      
 						;; size=5 bbWeight=1 PerfScore 2.50
 RWD00  	dq	FF80FF80FF80FF80h, FF80FF80FF80FF80h
 RWD16  	dd	00000000h, 00000000h, 00000000h, 00000000h
 RWD32  	dq	FF80FF80FF80FF80h, FF80FF80FF80FF80h, FF80FF80FF80FF80h, FF80FF80FF80FF80h
-RWD64  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
 
 
-; Total bytes of code 601, prolog size 4, PerfScore 275.88, instruction count 156, allocated bytes for code 605 (MethodHash=53fae850) for method System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong (FullOpts)
+; Total bytes of code 565, prolog size 4, PerfScore 270.54, instruction count 149, allocated bytes for code 573 (MethodHash=53fae850) for method System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong):ulong (FullOpts)
-19 (-5.40 % of base) - System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte
 ; Assembly listing for method System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX - Unix
 ; FullOpts code
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; No PGO data
 ; 0 inlinees with PGO data; 6 single block inlinees; 6 inlinees without PGO data
 ; Final local variable assignments
 ;
 ;* V00 arg0         [V00    ] (  0,  0   )  struct (16) zero-ref    multireg-arg ld-addr-op single-def <System.ReadOnlySpan`1[ushort]>
 ;* V01 arg1         [V01    ] (  0,  0   )  struct (16) zero-ref    multireg-arg ld-addr-op single-def <System.Span`1[ubyte]>
 ;  V02 arg2         [V02,T06] (  4,  3   )   byref  ->  rbx         single-def
 ;  V03 loc0         [V03,T00] ( 12, 42.50)    long  ->  r15        
 ;  V04 loc1         [V04,T02] (  3,  9   )    long  ->  r13        
 ;* V05 loc2         [V05,T19] (  0,  0   )   byref  ->  zero-ref    single-def
 ;* V06 loc3         [V06,T20] (  0,  0   )   byref  ->  zero-ref    single-def
 ;  V07 loc4         [V07    ] (  2,  1   )     int  ->  [rbp-0x28]  do-not-enreg[X] addr-exposed ld-addr-op
-;  V08 loc5         [V08,T22] (  3, 24   )  simd16  ->  mm8         <System.Runtime.Intrinsics.Vector128`1[ushort]>
-;  V09 loc6         [V09,T23] (  3, 24   )  simd16  ->  mm9         <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;  V08 loc5         [V08,T21] (  3, 24   )  simd16  ->  mm7         <System.Runtime.Intrinsics.Vector128`1[ushort]>
+;  V09 loc6         [V09,T22] (  3, 24   )  simd16  ->  mm8         <System.Runtime.Intrinsics.Vector128`1[ushort]>
 ;* V10 loc7         [V10    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V11 loc8         [V11    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[ubyte]>
-;  V12 loc9         [V12,T25] (  3, 16   )  simd16  ->  mm10         <System.Runtime.Intrinsics.Vector128`1[ubyte]>
+;  V12 loc9         [V12,T24] (  3, 16   )  simd16  ->  mm9         <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V13 loc10        [V13    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V14 loc11        [V14    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[short]>
 ;* V15 loc12        [V15    ] (  0,  0   )  simd16  ->  zero-ref    <System.Runtime.Intrinsics.Vector128`1[short]>
 ;# V16 OutArgs      [V16    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V17 tmp1         [V17,T21] (  3, 48   )  simd16  ->  mm10         "dup spill"
+;  V17 tmp1         [V17,T23] (  3, 24   )  simd16  ->  mm9        
 ;* V18 tmp2         [V18    ] (  0,  0   )  struct (16) zero-ref    "impAppendStmt" <System.ReadOnlySpan`1[ushort]>
 ;* V19 tmp3         [V19    ] (  0,  0   )  struct (16) zero-ref    "spilled call-like call argument" <System.Span`1[ubyte]>
 ;  V20 tmp4         [V20,T12] (  2,  2   )     int  ->  rax         "impAppendStmt"
 ;* V21 tmp5         [V21    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
 ;* V22 tmp6         [V22    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg" <System.ReadOnlySpan`1[ushort]>
 ;* V23 tmp7         [V23    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg" <System.Span`1[ubyte]>
 ;* V24 tmp8         [V24    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V25 tmp9         [V25    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V26 tmp10        [V26    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V27 tmp11        [V27    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;* V28 tmp12        [V28    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ushort]>
 ;* V29 tmp13        [V29    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
 ;* V30 tmp14        [V30    ] (  0,  0   )  simd16  ->  zero-ref    ld-addr-op "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ushort]>
 ;* V31 tmp15        [V31    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
 ;* V32 tmp16        [V32    ] (  0,  0   )  simd16  ->  zero-ref    "Inline return value spill temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
 ;  V33 tmp17        [V33,T07] (  4,  4   )     int  ->   r8         "Inlining Arg"
 ;* V34 tmp18        [V34    ] (  0,  0   )  struct (16) zero-ref    multireg-arg ld-addr-op "NewObj constructor temp" <System.ReadOnlySpan`1[ushort]>
 ;  V35 tmp19        [V35,T10] (  2,  2   )   byref  ->  rdi         single-def "Inlining Arg"
 ;  V36 tmp20        [V36,T13] (  2,  2   )     int  ->  rsi         "Inlining Arg"
 ;  V37 tmp21        [V37,T08] (  4,  4   )     int  ->   r8         "Inlining Arg"
 ;* V38 tmp22        [V38    ] (  0,  0   )  struct (16) zero-ref    multireg-arg ld-addr-op "NewObj constructor temp" <System.Span`1[ubyte]>
 ;  V39 tmp23        [V39,T11] (  2,  2   )   byref  ->  rdx         single-def "Inlining Arg"
 ;  V40 tmp24        [V40,T14] (  2,  2   )     int  ->  rcx         "Inlining Arg"
 ;  V41 tmp25        [V41,T01] (  4, 17.50)   byref  ->  rdi         single-def "field V00._reference (fldOffset=0x0)" P-INDEP
 ;  V42 tmp26        [V42,T05] (  5,  3.50)     int  ->  rsi         single-def "field V00._length (fldOffset=0x8)" P-INDEP
 ;  V43 tmp27        [V43,T03] (  3,  5.50)   byref  ->  rdx         single-def "field V01._reference (fldOffset=0x0)" P-INDEP
 ;  V44 tmp28        [V44,T09] (  3,  2   )     int  ->  rcx         single-def "field V01._length (fldOffset=0x8)" P-INDEP
 ;* V45 tmp29        [V45    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V18._reference (fldOffset=0x0)" P-INDEP
 ;* V46 tmp30        [V46    ] (  0,  0   )     int  ->  zero-ref    "field V18._length (fldOffset=0x8)" P-INDEP
 ;* V47 tmp31        [V47    ] (  0,  0   )   byref  ->  zero-ref    "field V19._reference (fldOffset=0x0)" P-INDEP
 ;* V48 tmp32        [V48    ] (  0,  0   )     int  ->  zero-ref    "field V19._length (fldOffset=0x8)" P-INDEP
 ;* V49 tmp33        [V49    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V22._reference (fldOffset=0x0)" P-INDEP
 ;* V50 tmp34        [V50    ] (  0,  0   )     int  ->  zero-ref    "field V22._length (fldOffset=0x8)" P-INDEP
 ;* V51 tmp35        [V51    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V23._reference (fldOffset=0x0)" P-INDEP
 ;* V52 tmp36        [V52    ] (  0,  0   )     int  ->  zero-ref    "field V23._length (fldOffset=0x8)" P-INDEP
 ;  V53 tmp37        [V53,T15] (  2,  1   )   byref  ->  rdi         single-def "field V34._reference (fldOffset=0x0)" P-INDEP
 ;  V54 tmp38        [V54,T17] (  2,  1   )     int  ->  rsi         "field V34._length (fldOffset=0x8)" P-INDEP
 ;  V55 tmp39        [V55,T16] (  2,  1   )   byref  ->  rdx         single-def "field V38._reference (fldOffset=0x0)" P-INDEP
 ;  V56 tmp40        [V56,T18] (  2,  1   )     int  ->  rcx         "field V38._length (fldOffset=0x8)" P-INDEP
-;  V57 cse0         [V57,T24] (  3, 17   )  simd16  ->  mm0         hoist "CSE #01: aggressive"
+;  V57 cse0         [V57,T25] (  2,  9   )  simd16  ->  mm0         hoist "CSE #01: aggressive"
 ;  V58 cse1         [V58,T26] (  2,  9   )  simd16  ->  mm1         hoist "CSE #02: aggressive"
 ;  V59 cse2         [V59,T27] (  2,  9   )  simd16  ->  mm2         hoist "CSE #03: aggressive"
 ;  V60 cse3         [V60,T28] (  2,  9   )  simd16  ->  mm3         hoist "CSE #04: aggressive"
 ;  V61 cse4         [V61,T29] (  2,  9   )  simd16  ->  mm4         hoist "CSE #05: aggressive"
 ;  V62 cse5         [V62,T30] (  2,  9   )  simd16  ->  mm5         hoist "CSE #06: aggressive"
 ;  V63 cse6         [V63,T31] (  2,  9   )  simd16  ->  mm6         hoist "CSE #07: aggressive"
-;  V64 cse7         [V64,T32] (  2,  9   )  simd16  ->  mm7         hoist "CSE #08: aggressive"
-;  V65 cse8         [V65,T04] (  3,  6   )    long  ->  r14         "CSE #09: aggressive"
+;  V64 cse7         [V64,T04] (  3,  6   )    long  ->  r14         "CSE #08: aggressive"
 ;
 ; Lcl frame size = 16
 
 G_M6966_IG01:
        push     rbp
        push     r15
        push     r14
        push     r13
        push     rbx
        sub      rsp, 16
        lea      rbp, [rsp+0x30]
        mov      rbx, r8
 						;; size=20 bbWeight=1 PerfScore 6.00
 G_M6966_IG02:
        xor      r15d, r15d
        mov      r14d, esi
        lea      r13, [r14-0x10]
        vmovups  xmm0, xmmword ptr [reloc @RWD00]
        vmovups  xmm1, xmmword ptr [reloc @RWD16]
        vmovups  xmm2, xmmword ptr [reloc @RWD32]
        vmovups  xmm3, xmmword ptr [reloc @RWD48]
        vmovups  xmm4, xmmword ptr [reloc @RWD64]
        vmovups  xmm5, xmmword ptr [reloc @RWD80]
        vmovups  xmm6, xmmword ptr [reloc @RWD96]
-       vmovups  xmm7, xmmword ptr [reloc @RWD112]
        jmp      SHORT G_M6966_IG04
        align    [0 bytes for IG03]
-						;; size=76 bbWeight=1 PerfScore 27.00
+						;; size=68 bbWeight=1 PerfScore 24.00
 G_M6966_IG03:
        mov      r15, r13
 						;; size=3 bbWeight=4 PerfScore 1.00
 G_M6966_IG04:
-       vmovups  xmm8, xmmword ptr [rdi+2*r15]
-       vmovups  xmm9, xmmword ptr [rdi+2*r15+0x10]
-       vpand    xmm10, xmm0, xmm8
-       vpand    xmm11, xmm0, xmm9
-       vpackuswb xmm10, xmm10, xmm11
-       vpaddb   xmm11, xmm1, xmm10
-       vpsubusb xmm11, xmm11, xmm2
-       vpsubb   xmm11, xmm11, xmm3
-       vpand    xmm10, xmm4, xmm10
-       vpsubb   xmm10, xmm10, xmm5
-       vpaddusb xmm10, xmm10, xmm6
-       vpminub  xmm10, xmm11, xmm10
-       vpor     xmm8, xmm8, xmm9
-       vptest   xmm8, xmm7
+       vmovups  xmm7, xmmword ptr [rdi+2*r15]
+       vmovups  xmm8, xmmword ptr [rdi+2*r15+0x10]
+       vpackuswb xmm9, xmm7, xmm8
+       vpaddb   xmm10, xmm0, xmm9
+       vpsubusb xmm10, xmm10, xmm1
+       vpsubb   xmm10, xmm10, xmm2
+       vpand    xmm9, xmm3, xmm9
+       vpsubb   xmm9, xmm9, xmm4
+       vpaddusb xmm9, xmm9, xmm5
+       vpminub  xmm9, xmm10, xmm9
+       vpor     xmm7, xmm7, xmm8
+       vptest   xmm7, xmm6
        jne      SHORT G_M6966_IG06
-						;; size=71 bbWeight=8 PerfScore 132.00
+						;; size=61 bbWeight=8 PerfScore 126.67
 G_M6966_IG05:
-       vpaddusb xmm8, xmm10, xmmword ptr [reloc @RWD128]
-       vpmovmskb r8d, xmm8
+       vpaddusb xmm7, xmm9, xmmword ptr [reloc @RWD112]
+       vpmovmskb r8d, xmm7
        test     r8d, r8d
        je       SHORT G_M6966_IG08
-						;; size=18 bbWeight=4 PerfScore 21.00
+						;; size=17 bbWeight=4 PerfScore 21.00
 G_M6966_IG06:
        mov      r8d, r15d
        cmp      r8d, esi
        ja       G_M6966_IG11
        mov      eax, r8d
        lea      rdi, bword ptr [rdi+2*rax]
        sub      esi, r8d
        mov      r8, r15
        shr      r8, 1
        cmp      r8d, ecx
        ja       SHORT G_M6966_IG11
        mov      eax, r8d
        add      rdx, rax
        sub      ecx, r8d
        lea      r8, [rbp-0x28]
        mov      rax, 0xD1FFAB1E      ; code for System.HexConverter:TryDecodeFromUtf16_Scalar(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte
        call     [rax]System.HexConverter:TryDecodeFromUtf16_Scalar(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte
        add      r15d, dword ptr [rbp-0x28]
        mov      dword ptr [rbx], r15d
 						;; size=65 bbWeight=0.50 PerfScore 6.00
 G_M6966_IG07:
        add      rsp, 16
        pop      rbx
        pop      r13
        pop      r14
        pop      r15
        pop      rbp
        ret      
 						;; size=13 bbWeight=0.50 PerfScore 1.88
 G_M6966_IG08:
-       vpmaddubsw xmm8, xmm10, xmmword ptr [reloc @RWD144]
-       vpshufb  xmm8, xmm8, xmmword ptr [reloc @RWD160]
+       vpmaddubsw xmm7, xmm9, xmmword ptr [reloc @RWD128]
+       vpshufb  xmm7, xmm7, xmmword ptr [reloc @RWD144]
        mov      rax, r15
        shr      rax, 1
-       vmovd    qword ptr [rdx+rax], xmm8
+       vmovd    qword ptr [rdx+rax], xmm7
        add      r15, 16
        cmp      r15, r14
        je       SHORT G_M6966_IG09
        cmp      r15, r13
        jbe      G_M6966_IG04
        jmp      G_M6966_IG03
 						;; size=53 bbWeight=4 PerfScore 62.00
 G_M6966_IG09:
        mov      dword ptr [rbx], esi
        mov      eax, 1
 						;; size=7 bbWeight=0.50 PerfScore 0.62
 G_M6966_IG10:
        add      rsp, 16
        pop      rbx
        pop      r13
        pop      r14
        pop      r15
        pop      rbp
        ret      
 						;; size=13 bbWeight=0.50 PerfScore 1.88
 G_M6966_IG11:
        mov      rax, 0xD1FFAB1E      ; code for System.ThrowHelper:ThrowArgumentOutOfRangeException()
        call     [rax]System.ThrowHelper:ThrowArgumentOutOfRangeException()
        int3     
 						;; size=13 bbWeight=0 PerfScore 0.00
-RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh
-RWD16  	dq	C6C6C6C6C6C6C6C6h, C6C6C6C6C6C6C6C6h
-RWD32  	dq	0606060606060606h, 0606060606060606h
-RWD48  	dq	F0F0F0F0F0F0F0F0h, F0F0F0F0F0F0F0F0h
-RWD64  	dq	DFDFDFDFDFDFDFDFh, DFDFDFDFDFDFDFDFh
-RWD80  	dq	4141414141414141h, 4141414141414141h
-RWD96  	dq	0A0A0A0A0A0A0A0Ah, 0A0A0A0A0A0A0A0Ah
-RWD112 	dq	FF80FF80FF80FF80h, FF80FF80FF80FF80h
-RWD128 	dq	7070707070707070h, 7070707070707070h
-RWD144 	dq	0110011001100110h, 0110011001100110h
-RWD160 	dq	0E0C0A0806040200h, 0000000000000000h
+RWD00  	dq	C6C6C6C6C6C6C6C6h, C6C6C6C6C6C6C6C6h
+RWD16  	dq	0606060606060606h, 0606060606060606h
+RWD32  	dq	F0F0F0F0F0F0F0F0h, F0F0F0F0F0F0F0F0h
+RWD48  	dq	DFDFDFDFDFDFDFDFh, DFDFDFDFDFDFDFDFh
+RWD64  	dq	4141414141414141h, 4141414141414141h
+RWD80  	dq	0A0A0A0A0A0A0A0Ah, 0A0A0A0A0A0A0A0Ah
+RWD96  	dq	FF80FF80FF80FF80h, FF80FF80FF80FF80h
+RWD112 	dq	7070707070707070h, 7070707070707070h
+RWD128 	dq	0110011001100110h, 0110011001100110h
+RWD144 	dq	0E0C0A0806040200h, 0000000000000000h
 
 
-; Total bytes of code 352, prolog size 20, PerfScore 259.38, instruction count 89, allocated bytes for code 352 (MethodHash=bb7ae4c9) for method System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte (FullOpts)
+; Total bytes of code 333, prolog size 20, PerfScore 251.04, instruction count 86, allocated bytes for code 333 (MethodHash=bb7ae4c9) for method System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte],byref):ubyte (FullOpts)
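
Note on the dropped vpand: both listings above remove the 00FF00FF... mask that used to precede vpackuswb. A minimal scalar sketch of why this appears safe, assuming the usual PACKUSWB semantics (each 16-bit lane is read as signed and saturated to 0..255). The helper names packuswb_lane, narrow_base and narrow_diff are hypothetical and only model one lane at a time; they are not part of the runtime.

```python
def packuswb_lane(word: int) -> int:
    """Model one lane of (v)packuswb: read the 16-bit lane as signed,
    then saturate it to the unsigned 8-bit range 0..255."""
    signed = word - 0x10000 if word & 0x8000 else word
    return max(0, min(255, signed))


def narrow_base(words):
    # Base codegen: vpand with 0x00FF per lane, then vpackuswb (truncation).
    return [packuswb_lane(w & 0x00FF) for w in words]


def narrow_diff(words):
    # Diff codegen: vpackuswb applied directly (saturation).
    return [packuswb_lane(w) for w in words]


# Every lane in 0x0000..0x00FF narrows identically, which covers ASCII input.
for w in range(0x100):
    assert narrow_base([w]) == narrow_diff([w])

# Lanes above 0x00FF differ: the base truncates, the diff saturates.
non_ascii = [0x0141, 0x8041]
assert narrow_base(non_ascii) == [0x41, 0x41]
assert narrow_diff(non_ascii) == [0xFF, 0x00]
```

The two sequences only disagree on lanes above 0x00FF. In the listings above, a vptest against the FF80FF80... constant branches away for such input (to G_M6063_IG16 in NarrowUtf16ToAscii, and to the scalar fallback at G_M6966_IG06 in TryDecodeFromUtf16_Vector128), so the differing result does not look like it is ever consumed on these paths.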

@MihuBot

MihuBot commented May 27, 2024

@MihaZupan
