Mark SIMD assignments as related to SIMD intrinsics #51731
An example of a gain is in the method below.
Before:
; Assembly listing for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 27 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 RetBuf [V00,T04] ( 4, 4 ) byref -> rcx
; V01 arg0 [V01,T02] ( 3, 6 ) byref -> rdx
; V02 arg1 [V02,T03] ( 3, 6 ) byref -> r8
; V03 arg2 [V03,T00] ( 13, 16 ) byref -> r9
; V04 arg3 [V04,T05] ( 1, 1 ) byref -> [rsp+140H]
; V05 arg4 [V05,T06] ( 1, 1 ) byref -> [rsp+148H]
; V06 loc0 [V06,T07] ( 8, 6 ) simd12 -> mm0 ld-addr-op
; V07 loc1 [V07,T18] ( 3, 2.50) float -> mm2
; V08 loc2 [V08,T19] ( 3, 2.50) simd12 -> mm2
; V09 loc3 [V09,T14] ( 5, 3 ) simd12 -> mm0
; V10 loc4 [V10,T08] ( 7, 4 ) simd12 -> mm4
; V11 loc5 [V11,T15] ( 4, 3 ) float -> registers
; V12 loc6 [V12,T01] ( 9, 9 ) struct (64) [rsp+B0H] do-not-enreg[SFB] ld-addr-op
;# V13 OutArgs [V13 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;* V14 tmp1 [V14 ] ( 0, 0 ) simd12 -> zero-ref do-not-enreg[SB] "struct address for call/obj"
;* V15 tmp2 [V15 ] ( 0, 0 ) simd12 -> zero-ref do-not-enreg[SB] "struct address for call/obj"
;* V16 tmp3 [V16 ] ( 0, 0 ) simd12 -> zero-ref do-not-enreg[SB] "struct address for call/obj"
;* V17 tmp4 [V17 ] ( 0, 0 ) simd12 -> zero-ref do-not-enreg[SB] "struct address for call/obj"
; V18 tmp5 [V18,T35] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V19 tmp6 [V19 ] ( 3, 1.50) simd12 -> [rsp+A0H] do-not-enreg[SB]
; V20 tmp7 [V20,T36] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V21 tmp8 [V21,T37] ( 2, 2 ) simd12 -> mm2 "Inlining Arg"
; V22 tmp9 [V22 ] ( 2, 2 ) simd12 -> [rsp+90H] do-not-enreg[SB] "Inlining Arg"
;* V23 tmp10 [V23 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V24 tmp11 [V24,T38] ( 2, 2 ) simd12 -> mm0 "Inlining Arg"
;* V25 tmp12 [V25 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V26 tmp13 [V26,T39] ( 2, 2 ) simd12 -> mm2 "NewObj constructor temp"
; V27 tmp14 [V27,T40] ( 2, 2 ) float -> mm4 "Inlining Arg"
;* V28 tmp15 [V28 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
; V29 tmp16 [V29 ] ( 7, 7 ) simd12 -> [rsp+80H] do-not-enreg[SB] "Inlining Arg"
; V30 tmp17 [V30,T41] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V31 tmp18 [V31,T09] ( 4, 4 ) simd12 -> mm0 ld-addr-op "Inlining Arg"
; V32 tmp19 [V32 ] ( 2, 2 ) simd12 -> [rsp+70H] do-not-enreg[SB] "impAppendStmt"
; V33 tmp20 [V33,T80] ( 2, 1 ) float -> mm0 "Inline stloc first use temp"
; V34 tmp21 [V34,T42] ( 2, 2 ) simd12 -> mm3 "Inlining Arg"
;* V35 tmp22 [V35 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V36 tmp23 [V36,T43] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V37 tmp24 [V37 ] ( 7, 7 ) simd12 -> [rsp+60H] do-not-enreg[SB] "Inlining Arg"
;* V38 tmp25 [V38 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
; V39 tmp26 [V39,T44] ( 2, 2 ) simd12 -> mm3 "NewObj constructor temp"
; V40 tmp27 [V40,T10] ( 4, 4 ) simd12 -> mm3 ld-addr-op "Inlining Arg"
; V41 tmp28 [V41 ] ( 2, 2 ) simd12 -> [rsp+50H] do-not-enreg[SB] "impAppendStmt"
; V42 tmp29 [V42,T81] ( 2, 1 ) float -> mm3 "Inline stloc first use temp"
; V43 tmp30 [V43,T45] ( 2, 2 ) simd12 -> mm4 "Inlining Arg"
;* V44 tmp31 [V44 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V45 tmp32 [V45,T46] ( 2, 2 ) simd12 -> mm3 "NewObj constructor temp"
;* V46 tmp33 [V46 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
; V47 tmp34 [V47 ] ( 7, 7 ) simd12 -> [rsp+40H] do-not-enreg[SB] "Inlining Arg"
; V48 tmp35 [V48,T47] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V49 tmp36 [V49,T11] ( 4, 4 ) simd12 -> mm0 ld-addr-op "Inlining Arg"
; V50 tmp37 [V50 ] ( 2, 2 ) simd12 -> [rsp+30H] do-not-enreg[SB] "impAppendStmt"
; V51 tmp38 [V51,T82] ( 2, 1 ) float -> mm0 "Inline stloc first use temp"
; V52 tmp39 [V52,T48] ( 2, 2 ) simd12 -> mm3 "Inlining Arg"
;* V53 tmp40 [V53 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V54 tmp41 [V54,T49] ( 2, 2 ) simd12 -> mm0 "NewObj constructor temp"
; V55 tmp42 [V55 ] ( 7, 7 ) simd12 -> [rsp+20H] do-not-enreg[SB] "Inlining Arg"
; V56 tmp43 [V56 ] ( 7, 7 ) simd12 -> [rsp+10H] do-not-enreg[SB] "Inlining Arg"
; V57 tmp44 [V57,T50] ( 2, 2 ) simd12 -> mm3 "NewObj constructor temp"
; V58 tmp45 [V58,T12] ( 4, 4 ) simd12 -> mm3 ld-addr-op "Inlining Arg"
; V59 tmp46 [V59 ] ( 2, 2 ) simd12 -> [rsp+00H] do-not-enreg[SB] "impAppendStmt"
; V60 tmp47 [V60,T83] ( 2, 1 ) float -> mm3 "Inline stloc first use temp"
; V61 tmp48 [V61,T51] ( 2, 2 ) simd12 -> mm4 "Inlining Arg"
;* V62 tmp49 [V62 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V63 tmp50 [V63,T52] ( 2, 2 ) simd12 -> mm3 "NewObj constructor temp"
;* V64 tmp51 [V64 ] ( 0, 0 ) float -> zero-ref V124.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V65 tmp52 [V65 ] ( 0, 0 ) float -> zero-ref V124.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;* V66 tmp53 [V66 ] ( 0, 0 ) float -> zero-ref V124.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;* V67 tmp54 [V67 ] ( 0, 0 ) float -> zero-ref V125.X(offs=0x00) P-INDEP "field V05.X (fldOffset=0x0)"
;* V68 tmp55 [V68 ] ( 0, 0 ) float -> zero-ref V125.Y(offs=0x04) P-INDEP "field V05.Y (fldOffset=0x4)"
;* V69 tmp56 [V69 ] ( 0, 0 ) float -> zero-ref V125.Z(offs=0x08) P-INDEP "field V05.Z (fldOffset=0x8)"
;* V70 tmp57 [V70 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V14.X(offs=0x00) P-DEP "field V14.X (fldOffset=0x0)"
;* V71 tmp58 [V71 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V14.Y(offs=0x04) P-DEP "field V14.Y (fldOffset=0x4)"
;* V72 tmp59 [V72 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V14.Z(offs=0x08) P-DEP "field V14.Z (fldOffset=0x8)"
;* V73 tmp60 [V73 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V15.X(offs=0x00) P-DEP "field V15.X (fldOffset=0x0)"
;* V74 tmp61 [V74 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V15.Y(offs=0x04) P-DEP "field V15.Y (fldOffset=0x4)"
;* V75 tmp62 [V75 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V15.Z(offs=0x08) P-DEP "field V15.Z (fldOffset=0x8)"
;* V76 tmp63 [V76 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V16.X(offs=0x00) P-DEP "field V16.X (fldOffset=0x0)"
;* V77 tmp64 [V77 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V16.Y(offs=0x04) P-DEP "field V16.Y (fldOffset=0x4)"
;* V78 tmp65 [V78 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V16.Z(offs=0x08) P-DEP "field V16.Z (fldOffset=0x8)"
;* V79 tmp66 [V79 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V17.X(offs=0x00) P-DEP "field V17.X (fldOffset=0x0)"
;* V80 tmp67 [V80 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V17.Y(offs=0x04) P-DEP "field V17.Y (fldOffset=0x4)"
;* V81 tmp68 [V81 ] ( 0, 0 ) float -> zero-ref do-not-enreg[] V17.Z(offs=0x08) P-DEP "field V17.Z (fldOffset=0x8)"
; V82 tmp69 [V82,T68] ( 3, 1.50) float -> [rsp+A0H] do-not-enreg[] V19.X(offs=0x00) P-DEP "field V19.X (fldOffset=0x0)"
; V83 tmp70 [V83,T69] ( 3, 1.50) float -> [rsp+A4H] do-not-enreg[] V19.Y(offs=0x04) P-DEP "field V19.Y (fldOffset=0x4)"
; V84 tmp71 [V84,T70] ( 3, 1.50) float -> [rsp+A8H] do-not-enreg[] V19.Z(offs=0x08) P-DEP "field V19.Z (fldOffset=0x8)"
; V85 tmp72 [V85,T53] ( 2, 2 ) float -> [rsp+90H] do-not-enreg[] V22.X(offs=0x00) P-DEP "field V22.X (fldOffset=0x0)"
; V86 tmp73 [V86,T54] ( 2, 2 ) float -> [rsp+94H] do-not-enreg[] V22.Y(offs=0x04) P-DEP "field V22.Y (fldOffset=0x4)"
; V87 tmp74 [V87,T55] ( 2, 2 ) float -> [rsp+98H] do-not-enreg[] V22.Z(offs=0x08) P-DEP "field V22.Z (fldOffset=0x8)"
; V88 tmp75 [V88,T71] ( 3, 1.50) float -> mm0 V28.X(offs=0x00) P-INDEP "field V28.X (fldOffset=0x0)"
; V89 tmp76 [V89,T72] ( 3, 1.50) float -> mm3 V28.Y(offs=0x04) P-INDEP "field V28.Y (fldOffset=0x4)"
; V90 tmp77 [V90,T73] ( 3, 1.50) float -> mm5 V28.Z(offs=0x08) P-INDEP "field V28.Z (fldOffset=0x8)"
; V91 tmp78 [V91,T20] ( 3, 2 ) float -> [rsp+80H] do-not-enreg[] V29.X(offs=0x00) P-DEP "field V29.X (fldOffset=0x0)"
; V92 tmp79 [V92,T21] ( 3, 2 ) float -> [rsp+84H] do-not-enreg[] V29.Y(offs=0x04) P-DEP "field V29.Y (fldOffset=0x4)"
; V93 tmp80 [V93,T22] ( 3, 2 ) float -> [rsp+88H] do-not-enreg[] V29.Z(offs=0x08) P-DEP "field V29.Z (fldOffset=0x8)"
; V94 tmp81 [V94,T56] ( 2, 2 ) float -> [rsp+70H] do-not-enreg[] V32.X(offs=0x00) P-DEP "field V32.X (fldOffset=0x0)"
; V95 tmp82 [V95,T57] ( 2, 2 ) float -> [rsp+74H] do-not-enreg[] V32.Y(offs=0x04) P-DEP "field V32.Y (fldOffset=0x4)"
; V96 tmp83 [V96,T58] ( 2, 2 ) float -> [rsp+78H] do-not-enreg[] V32.Z(offs=0x08) P-DEP "field V32.Z (fldOffset=0x8)"
; V97 tmp84 [V97,T23] ( 3, 2 ) float -> [rsp+60H] do-not-enreg[] V37.X(offs=0x00) P-DEP "field V37.X (fldOffset=0x0)"
; V98 tmp85 [V98,T24] ( 3, 2 ) float -> [rsp+64H] do-not-enreg[] V37.Y(offs=0x04) P-DEP "field V37.Y (fldOffset=0x4)"
; V99 tmp86 [V99,T25] ( 3, 2 ) float -> [rsp+68H] do-not-enreg[] V37.Z(offs=0x08) P-DEP "field V37.Z (fldOffset=0x8)"
; V100 tmp87 [V100,T74] ( 3, 1.50) float -> mm3 V38.X(offs=0x00) P-INDEP "field V38.X (fldOffset=0x0)"
; V101 tmp88 [V101,T75] ( 3, 1.50) float -> mm4 V38.Y(offs=0x04) P-INDEP "field V38.Y (fldOffset=0x4)"
; V102 tmp89 [V102,T76] ( 3, 1.50) float -> mm5 V38.Z(offs=0x08) P-INDEP "field V38.Z (fldOffset=0x8)"
; V103 tmp90 [V103,T59] ( 2, 2 ) float -> [rsp+50H] do-not-enreg[] V41.X(offs=0x00) P-DEP "field V41.X (fldOffset=0x0)"
; V104 tmp91 [V104,T60] ( 2, 2 ) float -> [rsp+54H] do-not-enreg[] V41.Y(offs=0x04) P-DEP "field V41.Y (fldOffset=0x4)"
; V105 tmp92 [V105,T61] ( 2, 2 ) float -> [rsp+58H] do-not-enreg[] V41.Z(offs=0x08) P-DEP "field V41.Z (fldOffset=0x8)"
; V106 tmp93 [V106,T77] ( 3, 1.50) float -> mm4 V46.X(offs=0x00) P-INDEP "field V46.X (fldOffset=0x0)"
; V107 tmp94 [V107,T78] ( 3, 1.50) float -> mm3 V46.Y(offs=0x04) P-INDEP "field V46.Y (fldOffset=0x4)"
; V108 tmp95 [V108,T79] ( 3, 1.50) float -> mm5 V46.Z(offs=0x08) P-INDEP "field V46.Z (fldOffset=0x8)"
; V109 tmp96 [V109,T26] ( 3, 2 ) float -> [rsp+40H] do-not-enreg[] V47.X(offs=0x00) P-DEP "field V47.X (fldOffset=0x0)"
; V110 tmp97 [V110,T27] ( 3, 2 ) float -> [rsp+44H] do-not-enreg[] V47.Y(offs=0x04) P-DEP "field V47.Y (fldOffset=0x4)"
; V111 tmp98 [V111,T28] ( 3, 2 ) float -> [rsp+48H] do-not-enreg[] V47.Z(offs=0x08) P-DEP "field V47.Z (fldOffset=0x8)"
; V112 tmp99 [V112,T62] ( 2, 2 ) float -> [rsp+30H] do-not-enreg[] V50.X(offs=0x00) P-DEP "field V50.X (fldOffset=0x0)"
; V113 tmp100 [V113,T63] ( 2, 2 ) float -> [rsp+34H] do-not-enreg[] V50.Y(offs=0x04) P-DEP "field V50.Y (fldOffset=0x4)"
; V114 tmp101 [V114,T64] ( 2, 2 ) float -> [rsp+38H] do-not-enreg[] V50.Z(offs=0x08) P-DEP "field V50.Z (fldOffset=0x8)"
; V115 tmp102 [V115,T29] ( 3, 2 ) float -> [rsp+20H] do-not-enreg[] V55.X(offs=0x00) P-DEP "field V55.X (fldOffset=0x0)"
; V116 tmp103 [V116,T30] ( 3, 2 ) float -> [rsp+24H] do-not-enreg[] V55.Y(offs=0x04) P-DEP "field V55.Y (fldOffset=0x4)"
; V117 tmp104 [V117,T31] ( 3, 2 ) float -> [rsp+28H] do-not-enreg[] V55.Z(offs=0x08) P-DEP "field V55.Z (fldOffset=0x8)"
; V118 tmp105 [V118,T32] ( 3, 2 ) float -> [rsp+10H] do-not-enreg[] V56.X(offs=0x00) P-DEP "field V56.X (fldOffset=0x0)"
; V119 tmp106 [V119,T33] ( 3, 2 ) float -> [rsp+14H] do-not-enreg[] V56.Y(offs=0x04) P-DEP "field V56.Y (fldOffset=0x4)"
; V120 tmp107 [V120,T34] ( 3, 2 ) float -> [rsp+18H] do-not-enreg[] V56.Z(offs=0x08) P-DEP "field V56.Z (fldOffset=0x8)"
; V121 tmp108 [V121,T65] ( 2, 2 ) float -> [rsp+00H] do-not-enreg[] V59.X(offs=0x00) P-DEP "field V59.X (fldOffset=0x0)"
; V122 tmp109 [V122,T66] ( 2, 2 ) float -> [rsp+04H] do-not-enreg[] V59.Y(offs=0x04) P-DEP "field V59.Y (fldOffset=0x4)"
; V123 tmp110 [V123,T67] ( 2, 2 ) float -> [rsp+08H] do-not-enreg[] V59.Z(offs=0x08) P-DEP "field V59.Z (fldOffset=0x8)"
;* V124 tmp111 [V124 ] ( 0, 0 ) simd12 -> zero-ref "Promoted implicit byref"
;* V125 tmp112 [V125 ] ( 0, 0 ) simd12 -> zero-ref "Promoted implicit byref"
; V126 cse0 [V126,T17] ( 3, 3 ) simd12 -> mm1 "CSE - moderate"
; V127 cse1 [V127,T13] ( 4, 3.50) simd12 -> mm3 "CSE - moderate"
; V128 cse2 [V128,T16] ( 4, 3 ) float -> mm5 "CSE - moderate"
;
; Lcl frame size = 280
G_M27508_IG01:
sub rsp, 280
vzeroupper
vmovaps qword ptr [rsp+100H], xmm6
vmovaps qword ptr [rsp+F0H], xmm7
;; bbWeight=1 PerfScore 7.25
G_M27508_IG02:
vmovss xmm0, dword ptr [rdx+8]
vmovsd xmm1, qword ptr [rdx]
vshufps xmm1, xmm0, 68
vmovss xmm0, dword ptr [r8+8]
vmovsd xmm2, qword ptr [r8]
vshufps xmm2, xmm0, 68
vsubps xmm0, xmm1, xmm2
vdpps xmm2, xmm0, xmm0, 113
vmovss xmm3, dword ptr [reloc @RWD32]
vucomiss xmm3, xmm2
jbe SHORT G_M27508_IG04
;; bbWeight=1 PerfScore 29.00
G_M27508_IG03:
mov rax, bword ptr [rsp+140H]
vmovss xmm0, dword ptr [rax+8]
vmovsd xmm2, qword ptr [rax]
vshufps xmm2, xmm0, 68
vxorps xmm0, xmm0, xmm0
vsubps xmm0, xmm0, xmm2
jmp SHORT G_M27508_IG05
;; bbWeight=0.50 PerfScore 5.67
G_M27508_IG04:
vmovapd xmmword ptr [rsp+90H], xmm0
vmovapd xmm0, xmmword ptr [rsp+90H]
vsqrtss xmm2, xmm2
vmovss xmm3, dword ptr [reloc @RWD16]
vdivss xmm2, xmm3, xmm2
vinsertps xmm2, xmm2, 14
vshufps xmm2, xmm2, 64
vmulps xmm0, xmm0, xmm2
;; bbWeight=0.50 PerfScore 16.50
G_M27508_IG05:
vmovss xmm2, dword ptr [r9+8]
vmovsd xmm3, qword ptr [r9]
vshufps xmm3, xmm2, 68
vmovaps xmm2, xmm3
vdpps xmm4, xmm3, xmm0, 113
vandps xmm4, xmm4, dword ptr [reloc @RWD48]
vmovss xmm5, dword ptr [reloc @RWD64]
vucomiss xmm4, xmm5
jbe G_M27508_IG10
;; bbWeight=1 PerfScore 23.25
G_M27508_IG06:
mov rax, bword ptr [rsp+148H]
vmovss xmm0, dword ptr [rax+8]
vmovsd xmm4, qword ptr [rax]
vshufps xmm4, xmm0, 68
vdpps xmm0, xmm3, xmm4, 113
vandps xmm0, xmm0, dword ptr [reloc @RWD48]
vucomiss xmm0, xmm5
jbe SHORT G_M27508_IG09
vmovss xmm4, dword ptr [r9+8]
vandps xmm4, xmm4, dword ptr [reloc @RWD48]
vucomiss xmm4, xmm5
ja SHORT G_M27508_IG07
vmovupd xmm0, xmmword ptr [reloc @RWD00]
vmovapd xmmword ptr [rsp+A0H], xmm0
jmp SHORT G_M27508_IG08
;; bbWeight=0.50 PerfScore 17.50
G_M27508_IG07:
vmovupd xmm0, xmmword ptr [reloc @RWD16]
vmovapd xmmword ptr [rsp+A0H], xmm0
;; bbWeight=0.50 PerfScore 2.50
G_M27508_IG08:
vmovapd xmm4, xmmword ptr [rsp+A0H]
;; bbWeight=0.50 PerfScore 1.50
G_M27508_IG09:
vmovss xmm0, dword ptr [r9]
vmovss xmm3, dword ptr [r9+4]
vmovss xmm5, dword ptr [r9+8]
vmovapd xmmword ptr [rsp+80H], xmm4
vmulss xmm4, xmm3, dword ptr [rsp+88H]
vmulss xmm6, xmm5, dword ptr [rsp+84H]
vsubss xmm4, xmm4, xmm6
vmulss xmm5, xmm5, dword ptr [rsp+80H]
vmulss xmm6, xmm0, dword ptr [rsp+88H]
vsubss xmm5, xmm5, xmm6
vmulss xmm0, xmm0, dword ptr [rsp+84H]
vmulss xmm3, xmm3, dword ptr [rsp+80H]
vsubss xmm0, xmm0, xmm3
vxorps xmm3, xmm3
vmovss xmm3, xmm3, xmm0
vpslldq xmm3, 4
vmovss xmm3, xmm3, xmm5
vpslldq xmm3, 4
vmovss xmm3, xmm3, xmm4
vmovaps xmm0, xmm3
vmovapd xmmword ptr [rsp+70H], xmm0
vdpps xmm0, xmm0, xmm0, 113
vmovapd xmm3, xmmword ptr [rsp+70H]
vsqrtss xmm0, xmm0
vinsertps xmm0, xmm0, 14
vshufps xmm0, xmm0, 64
vdivps xmm0, xmm3, xmm0
vpslldq xmm0, xmm0, 4
vpsrldq xmm0, xmm0, 4
vmovapd xmmword ptr [rsp+60H], xmm0
vmovss xmm3, dword ptr [r9]
vmovss xmm4, dword ptr [r9+4]
vmovss xmm5, dword ptr [r9+8]
vmulss xmm6, xmm5, dword ptr [rsp+64H]
vmulss xmm7, xmm4, dword ptr [rsp+68H]
vsubss xmm6, xmm6, xmm7
vmulss xmm7, xmm3, dword ptr [rsp+68H]
vmulss xmm5, xmm5, dword ptr [rsp+60H]
vsubss xmm5, xmm7, xmm5
vmulss xmm4, xmm4, dword ptr [rsp+60H]
vmulss xmm3, xmm3, dword ptr [rsp+64H]
vsubss xmm3, xmm4, xmm3
vxorps xmm4, xmm4
vmovss xmm4, xmm4, xmm3
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm5
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm6
vmovaps xmm3, xmm4
vmovapd xmmword ptr [rsp+50H], xmm3
vdpps xmm3, xmm3, xmm3, 113
vmovapd xmm4, xmmword ptr [rsp+50H]
vsqrtss xmm3, xmm3
vinsertps xmm3, xmm3, 14
vshufps xmm3, xmm3, 64
vdivps xmm3, xmm4, xmm3
vpslldq xmm3, xmm3, 4
vpsrldq xmm4, xmm3, 4
jmp G_M27508_IG11
;; bbWeight=0.50 PerfScore 95.58
G_M27508_IG10:
vmovss xmm4, dword ptr [r9]
vmovss xmm3, dword ptr [r9+4]
vmovss xmm5, dword ptr [r9+8]
vmovapd xmmword ptr [rsp+40H], xmm0
vmulss xmm0, xmm3, dword ptr [rsp+48H]
vmulss xmm6, xmm5, dword ptr [rsp+44H]
vsubss xmm0, xmm0, xmm6
vmulss xmm5, xmm5, dword ptr [rsp+40H]
vmulss xmm6, xmm4, dword ptr [rsp+48H]
vsubss xmm5, xmm5, xmm6
vmulss xmm4, xmm4, dword ptr [rsp+44H]
vmulss xmm3, xmm3, dword ptr [rsp+40H]
vsubss xmm3, xmm4, xmm3
vxorps xmm4, xmm4
vmovss xmm4, xmm4, xmm3
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm5
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm0
vmovaps xmm0, xmm4
vmovapd xmmword ptr [rsp+30H], xmm0
vdpps xmm0, xmm0, xmm0, 113
vmovapd xmm3, xmmword ptr [rsp+30H]
vsqrtss xmm0, xmm0
vinsertps xmm0, xmm0, 14
vshufps xmm0, xmm0, 64
vdivps xmm0, xmm3, xmm0
vpslldq xmm0, xmm0, 4
vpsrldq xmm0, xmm0, 4
vmovapd xmmword ptr [rsp+20H], xmm0
vmovapd xmmword ptr [rsp+10H], xmm2
vmovss xmm3, dword ptr [rsp+24H]
vmulss xmm3, xmm3, dword ptr [rsp+18H]
vmovss xmm4, dword ptr [rsp+28H]
vmulss xmm4, xmm4, dword ptr [rsp+14H]
vsubss xmm3, xmm3, xmm4
vmovss xmm4, dword ptr [rsp+28H]
vmulss xmm4, xmm4, dword ptr [rsp+10H]
vmovss xmm5, dword ptr [rsp+20H]
vmulss xmm5, xmm5, dword ptr [rsp+18H]
vsubss xmm4, xmm4, xmm5
vmovss xmm5, dword ptr [rsp+20H]
vmulss xmm5, xmm5, dword ptr [rsp+14H]
vmovss xmm6, dword ptr [rsp+24H]
vmulss xmm6, xmm6, dword ptr [rsp+10H]
vsubss xmm5, xmm5, xmm6
vxorps xmm6, xmm6
vmovss xmm6, xmm6, xmm5
vpslldq xmm6, 4
vmovss xmm6, xmm6, xmm4
vpslldq xmm6, 4
vmovss xmm6, xmm6, xmm3
vmovaps xmm3, xmm6
vmovapd xmmword ptr [rsp], xmm3
vdpps xmm3, xmm3, xmm3, 113
vmovapd xmm4, xmmword ptr [rsp]
vsqrtss xmm3, xmm3
vinsertps xmm3, xmm3, 14
vshufps xmm3, xmm3, 64
vdivps xmm3, xmm4, xmm3
vpslldq xmm3, xmm3, 4
vpsrldq xmm4, xmm3, 4
;; bbWeight=0.50 PerfScore 98.58
G_M27508_IG11:
vmovsd qword ptr [rsp+B0H], xmm0
vpshufd xmm3, xmm0, 2
vmovss dword ptr [rsp+B8H], xmm3
vxorps xmm0, xmm0
vmovss dword ptr [rsp+BCH], xmm0
vmovsd qword ptr [rsp+C0H], xmm2
vpshufd xmm0, xmm2, 2
vmovss dword ptr [rsp+C8H], xmm0
vxorps xmm0, xmm0
vmovss dword ptr [rsp+CCH], xmm0
vmovsd qword ptr [rsp+D0H], xmm4
vpshufd xmm0, xmm4, 2
vmovss dword ptr [rsp+D8H], xmm0
vxorps xmm0, xmm0
vmovss dword ptr [rsp+DCH], xmm0
vmovsd qword ptr [rsp+E0H], xmm1
vpshufd xmm0, xmm1, 2
vmovss dword ptr [rsp+E8H], xmm0
vmovss xmm0, dword ptr [reloc @RWD16]
vmovss dword ptr [rsp+ECH], xmm0
vmovdqu xmm0, xmmword ptr [rsp+B0H]
vmovdqu xmmword ptr [rcx], xmm0
vmovdqu xmm0, xmmword ptr [rsp+C0H]
vmovdqu xmmword ptr [rcx+16], xmm0
vmovdqu xmm0, xmmword ptr [rsp+D0H]
vmovdqu xmmword ptr [rcx+32], xmm0
vmovdqu xmm0, xmmword ptr [rsp+E0H]
vmovdqu xmmword ptr [rcx+48], xmm0
mov rax, rcx
;; bbWeight=1 PerfScore 21.25
G_M27508_IG12:
vmovaps xmm6, qword ptr [rsp+100H]
vmovaps xmm7, qword ptr [rsp+F0H]
add rsp, 280
ret
;; bbWeight=1 PerfScore 9.25
RWD00 dd 00000000h ; 0
dd 00000000h ; 0
dd BF800000h ; -1
dd 00000000h ; 0
RWD16 dd 3F800000h ; 1
dd 00000000h ; 0
dd 00000000h ; 0
dd 00000000h ; 0
RWD32 dd 38D1B717h ; 0.0001
RWD36 dd 00000000h, 00000000h, 00000000h
RWD48 dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
RWD64 dd 3F7F8D9Eh ; 0.998255
; Total bytes of code 1195, prolog size 28, PerfScore 464.43, instruction count 211, allocated bytes for code 1366 (MethodHash=17ab948b) for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x1E
CountOfUnwindCodes: 6
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x1E UnwindOp: UWOP_SAVE_XMM128 (8) OpInfo: XMM7 (7)
Scaled Small Offset: 15 * 16 = 240 = 0x000F0
CodeOffset: 0x14 UnwindOp: UWOP_SAVE_XMM128 (8) OpInfo: XMM6 (6)
Scaled Small Offset: 16 * 16 = 256 = 0x00100
CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_LARGE (1) OpInfo: 0 - Scaled small
Size: 35 * 8 = 280 = 0x00118
After:
; Assembly listing for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 27 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 RetBuf [V00,T04] ( 4, 4 ) byref -> rcx
; V01 arg0 [V01,T02] ( 3, 6 ) byref -> rdx
; V02 arg1 [V02,T03] ( 3, 6 ) byref -> r8
; V03 arg2 [V03,T01] ( 4, 7 ) byref -> r9
; V04 arg3 [V04,T05] ( 1, 1 ) byref -> [rsp+A0H]
; V05 arg4 [V05,T06] ( 1, 1 ) byref -> [rsp+A8H]
; V06 loc0 [V06,T10] ( 8, 6 ) simd12 -> mm0 ld-addr-op
; V07 loc1 [V07,T25] ( 3, 2.50) float -> mm2
; V08 loc2 [V08,T26] ( 2, 2 ) simd12 -> mm2
; V09 loc3 [V09,T07] ( 15, 8 ) simd12 -> registers
; V10 loc4 [V10,T19] ( 7, 4 ) simd12 -> registers
; V11 loc5 [V11,T22] ( 4, 3 ) float -> registers
; V12 loc6 [V12,T00] ( 9, 9 ) struct (64) [rsp+00H] do-not-enreg[SFB] ld-addr-op
;# V13 OutArgs [V13 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;* V14 tmp1 [V14 ] ( 0, 0 ) simd12 -> zero-ref "struct address for call/obj"
;* V15 tmp2 [V15 ] ( 0, 0 ) simd12 -> zero-ref "struct address for call/obj"
;* V16 tmp3 [V16 ] ( 0, 0 ) simd12 -> zero-ref "struct address for call/obj"
;* V17 tmp4 [V17 ] ( 0, 0 ) simd12 -> zero-ref "struct address for call/obj"
; V18 tmp5 [V18,T27] ( 2, 2 ) simd12 -> mm4 "NewObj constructor temp"
; V19 tmp6 [V19,T36] ( 3, 1.50) simd12 -> mm4
; V20 tmp7 [V20,T28] ( 2, 2 ) simd12 -> mm4 "NewObj constructor temp"
; V21 tmp8 [V21,T29] ( 2, 2 ) simd12 -> mm2 "Inlining Arg"
;* V22 tmp9 [V22 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V23 tmp10 [V23 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
;* V24 tmp11 [V24 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V25 tmp12 [V25 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V26 tmp13 [V26,T30] ( 2, 2 ) simd12 -> mm2 "NewObj constructor temp"
; V27 tmp14 [V27,T31] ( 2, 2 ) float -> mm4 "Inlining Arg"
; V28 tmp15 [V28,T20] ( 4, 4 ) simd12 -> mm3 "Inlining Arg"
; V29 tmp16 [V29,T08] ( 7, 7 ) simd12 -> mm4 "Inlining Arg"
; V30 tmp17 [V30,T15] ( 4, 4 ) simd12 -> mm4 "NewObj constructor temp"
;* V31 tmp18 [V31,T41] ( 0, 0 ) simd12 -> zero-ref ld-addr-op "Inlining Arg"
;* V32 tmp19 [V32 ] ( 0, 0 ) simd12 -> zero-ref "impAppendStmt"
; V33 tmp20 [V33,T37] ( 2, 1 ) float -> mm5 "Inline stloc first use temp"
;* V34 tmp21 [V34 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V35 tmp22 [V35 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V36 tmp23 [V36,T32] ( 2, 2 ) simd12 -> mm5 "NewObj constructor temp"
;* V37 tmp24 [V37,T42] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V38 tmp25 [V38,T43] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
; V39 tmp26 [V39,T16] ( 4, 4 ) simd12 -> mm0 "NewObj constructor temp"
;* V40 tmp27 [V40,T44] ( 0, 0 ) simd12 -> zero-ref ld-addr-op "Inlining Arg"
;* V41 tmp28 [V41 ] ( 0, 0 ) simd12 -> zero-ref "impAppendStmt"
; V42 tmp29 [V42,T38] ( 2, 1 ) float -> mm3 "Inline stloc first use temp"
;* V43 tmp30 [V43 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V44 tmp31 [V44 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V45 tmp32 [V45,T33] ( 2, 2 ) simd12 -> mm3 "NewObj constructor temp"
; V46 tmp33 [V46,T21] ( 4, 4 ) simd12 -> mm3 "Inlining Arg"
; V47 tmp34 [V47,T09] ( 7, 7 ) simd12 -> mm0 "Inlining Arg"
; V48 tmp35 [V48,T17] ( 4, 4 ) simd12 -> mm0 "NewObj constructor temp"
;* V49 tmp36 [V49,T45] ( 0, 0 ) simd12 -> zero-ref ld-addr-op "Inlining Arg"
;* V50 tmp37 [V50 ] ( 0, 0 ) simd12 -> zero-ref "impAppendStmt"
; V51 tmp38 [V51,T39] ( 2, 1 ) float -> mm5 "Inline stloc first use temp"
;* V52 tmp39 [V52 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V53 tmp40 [V53 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V54 tmp41 [V54,T34] ( 2, 2 ) simd12 -> mm5 "NewObj constructor temp"
;* V55 tmp42 [V55,T46] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V56 tmp43 [V56,T47] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
; V57 tmp44 [V57,T18] ( 4, 4 ) simd12 -> mm3 "NewObj constructor temp"
;* V58 tmp45 [V58,T48] ( 0, 0 ) simd12 -> zero-ref ld-addr-op "Inlining Arg"
;* V59 tmp46 [V59 ] ( 0, 0 ) simd12 -> zero-ref "impAppendStmt"
; V60 tmp47 [V60,T40] ( 2, 1 ) float -> mm4 "Inline stloc first use temp"
;* V61 tmp48 [V61 ] ( 0, 0 ) simd12 -> zero-ref "Inlining Arg"
;* V62 tmp49 [V62 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
; V63 tmp50 [V63,T35] ( 2, 2 ) simd12 -> mm4 "NewObj constructor temp"
; V64 cse0 [V64,T24] ( 3, 3 ) simd12 -> mm1 "CSE - moderate"
; V65 cse1 [V65,T14] ( 6, 4.50) simd12 -> mm3 "CSE - moderate"
; V66 cse2 [V66,T11] ( 10, 5 ) float -> mm6 "CSE - moderate"
; V67 cse3 [V67,T12] ( 10, 5 ) float -> mm3 "CSE - moderate"
; V68 cse4 [V68,T13] ( 10, 5 ) float -> registers "CSE - moderate"
; V69 cse5 [V69,T23] ( 4, 3 ) float -> mm5 "CSE - moderate"
;
; Lcl frame size = 120
G_M27508_IG01:
sub rsp, 120
vzeroupper
vmovaps qword ptr [rsp+60H], xmm6
vmovaps qword ptr [rsp+50H], xmm7
vmovaps qword ptr [rsp+40H], xmm8
;; bbWeight=1 PerfScore 10.25
G_M27508_IG02:
vmovss xmm0, dword ptr [rdx+8]
vmovsd xmm1, qword ptr [rdx]
vshufps xmm1, xmm0, 68
vmovss xmm0, dword ptr [r8+8]
vmovsd xmm2, qword ptr [r8]
vshufps xmm2, xmm0, 68
vsubps xmm0, xmm1, xmm2
vdpps xmm2, xmm0, xmm0, 113
vmovss xmm3, dword ptr [reloc @RWD32]
vucomiss xmm3, xmm2
jbe SHORT G_M27508_IG04
;; bbWeight=1 PerfScore 29.00
G_M27508_IG03:
mov rax, bword ptr [rsp+A0H]
vmovss xmm0, dword ptr [rax+8]
vmovsd xmm2, qword ptr [rax]
vshufps xmm2, xmm0, 68
vxorps xmm0, xmm0, xmm0
vsubps xmm0, xmm0, xmm2
jmp SHORT G_M27508_IG05
;; bbWeight=0.50 PerfScore 5.67
G_M27508_IG04:
vsqrtss xmm2, xmm2
vmovss xmm3, dword ptr [reloc @RWD16]
vdivss xmm2, xmm3, xmm2
vinsertps xmm2, xmm2, 14
vshufps xmm2, xmm2, 64
vmulps xmm0, xmm0, xmm2
;; bbWeight=0.50 PerfScore 14.00
G_M27508_IG05:
vmovss xmm2, dword ptr [r9+8]
vmovsd xmm3, qword ptr [r9]
vshufps xmm3, xmm2, 68
vmovaps xmm2, xmm3
vdpps xmm4, xmm3, xmm0, 113
vandps xmm4, xmm4, dword ptr [reloc @RWD48]
vmovss xmm5, dword ptr [reloc @RWD64]
vucomiss xmm4, xmm5
jbe G_M27508_IG09
;; bbWeight=1 PerfScore 23.25
G_M27508_IG06:
mov rax, bword ptr [rsp+A8H]
vmovss xmm0, dword ptr [rax+8]
vmovsd xmm4, qword ptr [rax]
vshufps xmm4, xmm0, 68
vdpps xmm0, xmm3, xmm4, 113
vandps xmm0, xmm0, dword ptr [reloc @RWD48]
vucomiss xmm0, xmm5
jbe SHORT G_M27508_IG08
vmovss xmm4, dword ptr [r9+8]
vandps xmm4, xmm4, dword ptr [reloc @RWD48]
vucomiss xmm4, xmm5
ja SHORT G_M27508_IG07
vmovupd xmm4, xmmword ptr [reloc @RWD00]
jmp SHORT G_M27508_IG08
;; bbWeight=0.50 PerfScore 16.50
G_M27508_IG07:
vmovupd xmm4, xmmword ptr [reloc @RWD16]
;; bbWeight=0.50 PerfScore 1.50
G_M27508_IG08:
vmovaps xmm0, xmm3
vpsrldq xmm0, 4
vmovaps xmm5, xmm4
vpsrldq xmm5, 8
vmulss xmm5, xmm0, xmm5
vmovaps xmm6, xmm3
vpsrldq xmm6, 8
vmovaps xmm7, xmm4
vpsrldq xmm7, 4
vmulss xmm7, xmm6, xmm7
vsubss xmm5, xmm5, xmm7
vmovaps xmm7, xmm4
vmulss xmm7, xmm6, xmm7
vmovaps xmm8, xmm4
vpsrldq xmm8, 8
vmulss xmm8, xmm3, xmm8
vsubss xmm7, xmm7, xmm8
vmovaps xmm8, xmm4
vpsrldq xmm8, 4
vmulss xmm8, xmm3, xmm8
vmulss xmm4, xmm0, xmm4
vsubss xmm4, xmm8, xmm4
vxorps xmm8, xmm8
vmovss xmm8, xmm8, xmm4
vpslldq xmm8, 4
vmovss xmm8, xmm8, xmm7
vpslldq xmm8, 4
vmovss xmm8, xmm8, xmm5
vmovaps xmm4, xmm8
vdpps xmm5, xmm4, xmm4, 113
vsqrtss xmm5, xmm5
vinsertps xmm5, xmm5, 14
vshufps xmm5, xmm5, 64
vdivps xmm4, xmm4, xmm5
vpslldq xmm4, xmm4, 4
vpsrldq xmm4, xmm4, 4
vmovaps xmm5, xmm4
vpsrldq xmm5, 4
vmulss xmm5, xmm5, xmm6
vmovaps xmm7, xmm4
vpsrldq xmm7, 8
vmulss xmm7, xmm7, xmm0
vsubss xmm5, xmm5, xmm7
vmovaps xmm7, xmm4
vpsrldq xmm7, 8
vmulss xmm7, xmm7, xmm3
vmovaps xmm8, xmm4
vmulss xmm6, xmm8, xmm6
vsubss xmm6, xmm7, xmm6
vmovaps xmm7, xmm4
vmulss xmm0, xmm7, xmm0
vmovaps xmm7, xmm4
vpsrldq xmm7, 4
vmulss xmm3, xmm7, xmm3
vsubss xmm0, xmm0, xmm3
vxorps xmm3, xmm3
vmovss xmm3, xmm3, xmm0
vpslldq xmm3, 4
vmovss xmm3, xmm3, xmm6
vpslldq xmm3, 4
vmovss xmm3, xmm3, xmm5
vmovaps xmm0, xmm3
vdpps xmm3, xmm0, xmm0, 113
vsqrtss xmm3, xmm3
vinsertps xmm3, xmm3, 14
vshufps xmm3, xmm3, 64
vdivps xmm0, xmm0, xmm3
vpslldq xmm0, xmm0, 4
vpsrldq xmm0, xmm0, 4
jmp G_M27508_IG10
;; bbWeight=0.50 PerfScore 77.21
G_M27508_IG09:
vmovaps xmm4, xmm3
vpsrldq xmm4, 4
vmovaps xmm5, xmm0
vpsrldq xmm5, 8
vmulss xmm5, xmm4, xmm5
vmovaps xmm6, xmm3
vpsrldq xmm6, 8
vmovaps xmm7, xmm0
vpsrldq xmm7, 4
vmulss xmm7, xmm6, xmm7
vsubss xmm5, xmm5, xmm7
vmovaps xmm7, xmm0
vmulss xmm7, xmm6, xmm7
vmovaps xmm8, xmm0
vpsrldq xmm8, 8
vmulss xmm8, xmm3, xmm8
vsubss xmm7, xmm7, xmm8
vmovaps xmm8, xmm0
vpsrldq xmm8, 4
vmulss xmm8, xmm3, xmm8
vmulss xmm0, xmm4, xmm0
vsubss xmm0, xmm8, xmm0
vxorps xmm8, xmm8
vmovss xmm8, xmm8, xmm0
vpslldq xmm8, 4
vmovss xmm8, xmm8, xmm7
vpslldq xmm8, 4
vmovss xmm8, xmm8, xmm5
vmovaps xmm0, xmm8
vdpps xmm5, xmm0, xmm0, 113
vsqrtss xmm5, xmm5
vinsertps xmm5, xmm5, 14
vshufps xmm5, xmm5, 64
vdivps xmm0, xmm0, xmm5
vpslldq xmm0, xmm0, 4
vpsrldq xmm0, xmm0, 4
vmovaps xmm5, xmm0
vpsrldq xmm5, 4
vmulss xmm5, xmm5, xmm6
vmovaps xmm7, xmm0
vpsrldq xmm7, 8
vmulss xmm7, xmm7, xmm4
vsubss xmm5, xmm5, xmm7
vmovaps xmm7, xmm0
vpsrldq xmm7, 8
vmulss xmm7, xmm7, xmm3
vmovaps xmm8, xmm0
vmulss xmm6, xmm8, xmm6
vsubss xmm6, xmm7, xmm6
vmovaps xmm7, xmm0
vmulss xmm4, xmm7, xmm4
vmovaps xmm7, xmm0
vpsrldq xmm7, 4
vmulss xmm3, xmm7, xmm3
vsubss xmm3, xmm4, xmm3
vxorps xmm4, xmm4
vmovss xmm4, xmm4, xmm3
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm6
vpslldq xmm4, 4
vmovss xmm4, xmm4, xmm5
vmovaps xmm3, xmm4
vdpps xmm4, xmm3, xmm3, 113
vsqrtss xmm4, xmm4
vinsertps xmm4, xmm4, 14
vshufps xmm4, xmm4, 64
vdivps xmm3, xmm3, xmm4
vpslldq xmm3, xmm3, 4
vpsrldq xmm3, xmm3, 4
vmovaps xmm4, xmm0
vmovaps xmm0, xmm3
;; bbWeight=0.50 PerfScore 76.46
G_M27508_IG10:
vmovsd qword ptr [rsp], xmm4
vpshufd xmm3, xmm4, 2
vmovss dword ptr [rsp+08H], xmm3
vxorps xmm3, xmm3
vmovss dword ptr [rsp+0CH], xmm3
vmovsd qword ptr [rsp+10H], xmm2
vpshufd xmm3, xmm2, 2
vmovss dword ptr [rsp+18H], xmm3
vxorps xmm2, xmm2
vmovss dword ptr [rsp+1CH], xmm2
vmovsd qword ptr [rsp+20H], xmm0
vpshufd xmm2, xmm0, 2
vmovss dword ptr [rsp+28H], xmm2
vxorps xmm0, xmm0
vmovss dword ptr [rsp+2CH], xmm0
vmovsd qword ptr [rsp+30H], xmm1
vpshufd xmm0, xmm1, 2
vmovss dword ptr [rsp+38H], xmm0
vmovss xmm0, dword ptr [reloc @RWD16]
vmovss dword ptr [rsp+3CH], xmm0
vmovdqu xmm0, xmmword ptr [rsp]
vmovdqu xmmword ptr [rcx], xmm0
vmovdqu xmm0, xmmword ptr [rsp+10H]
vmovdqu xmmword ptr [rcx+16], xmm0
vmovdqu xmm0, xmmword ptr [rsp+20H]
vmovdqu xmmword ptr [rcx+32], xmm0
vmovdqu xmm0, xmmword ptr [rsp+30H]
vmovdqu xmmword ptr [rcx+48], xmm0
mov rax, rcx
;; bbWeight=1 PerfScore 21.25
G_M27508_IG11:
vmovaps xmm6, qword ptr [rsp+60H]
vmovaps xmm7, qword ptr [rsp+50H]
vmovaps xmm8, qword ptr [rsp+40H]
add rsp, 120
ret
;; bbWeight=1 PerfScore 13.25
RWD00 dd 00000000h ; 0
dd 00000000h ; 0
dd BF800000h ; -1
dd 00000000h ; 0
RWD16 dd 3F800000h ; 1
dd 00000000h ; 0
dd 00000000h ; 0
dd 00000000h ; 0
RWD32 dd 38D1B717h ; 0.0001
RWD36 dd 00000000h, 00000000h, 00000000h
RWD48 dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
dd 7FFFFFFFh ; nan
RWD64 dd 3F7F8D9Eh ; 0.998255
; Total bytes of code 1092, prolog size 25, PerfScore 415.43, instruction count 228, allocated bytes for code 1271 (MethodHash=17ab948b) for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x1C
CountOfUnwindCodes: 7
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x1C UnwindOp: UWOP_SAVE_XMM128 (8) OpInfo: XMM8 (8)
Scaled Small Offset: 4 * 16 = 64 = 0x00040
CodeOffset: 0x15 UnwindOp: UWOP_SAVE_XMM128 (8) OpInfo: XMM7 (7)
Scaled Small Offset: 5 * 16 = 80 = 0x00050
CodeOffset: 0x0E UnwindOp: UWOP_SAVE_XMM128 (8) OpInfo: XMM6 (6)
Scaled Small Offset: 6 * 16 = 96 = 0x00060
CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 14 * 8 + 8 = 120 = 0x78
The regression in In particular I believe these few regressions are reasonable and will be easily fixed by migrating the functions to use the newer
CC. @sandreenko, @echesakovMSFT
else if (op->OperIs(GT_OBJ))
{
    GenTree* addr = op->AsIndir()->Addr();

    if (addr->OperIs(GT_ADDR))
    {
        setLclRelatedToSIMDIntrinsic(op->AsOp()->gtOp1->AsOp()->gtOp1);
        GenTree* addrOp1 = addr->AsOp()->gtGetOp1();

        if (addrOp1->OperIsLocal())
        {
            setLclRelatedToSIMDIntrinsic(addrOp1);
        }
It isn't clear why we were only checking for `OBJ(ADDR(LCL))` here. We also sometimes generate `BLK(ADDR(LCL))` or other nodes, so something like the following seems like it might be a better choice, given that it will cover all indirections over locals:
else if (op->OperIsIndir())
{
    GenTree* lcl = op->AsIndir()->Addr()->IsLocalAddrExpr();
    if (lcl != nullptr)
    {
        setLclRelatedToSIMDIntrinsic(lcl);
    }
}
We don't seem to be very consistent on what we check before calling `setLclRelatedToSIMDIntrinsic` either. The paths that create `SIMD` or `HWINTRINSIC` nodes all call `SetOpLclRelatedToSIMDIntrinsic`, while several other paths call `setLclRelatedToSIMDIntrinsic` directly.
Of those that call `setLclRelatedToSIMDIntrinsic` directly, they vary on what they check before calling it: sometimes simply checking `OperIsLocal`, sometimes checking for specific kinds of indirections, and sometimes checking all indirections. It would be nice (assuming nothing is blocking it) if all of those checks could be handled centrally here so they are consistent.
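To make the "handle it centrally" idea concrete, here is a minimal sketch using a toy tree model. Everything in it (`Node`, `Oper`, `LocalFromAddr`, `MarkOperandRelatedToSimd`) is a hypothetical stand-in invented for illustration; it is not the JIT's `GenTree` API and only mirrors the shape of the check discussed above: accept either a bare local or any indirection over `ADDR(LCL)`, and mark the local in one place.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-ins for JIT tree nodes.
// These are NOT the real GenTree types; they exist only for this sketch.
enum class Oper { Local, Addr, Obj, Blk, Ind, Other };

struct Node
{
    Oper  oper;
    Node* op1;            // single operand (the address, or the addressed node)
    bool  relatedToSimd;  // only meaningful when oper == Oper::Local

    explicit Node(Oper o, Node* operand = nullptr)
        : oper(o), op1(operand), relatedToSimd(false)
    {
    }
};

// Any kind of indirection counts, not just OBJ.
inline bool IsIndir(const Node* n)
{
    return n->oper == Oper::Obj || n->oper == Oper::Blk || n->oper == Oper::Ind;
}

// Returns the local under ADDR(LCL), or nullptr for any other address shape.
inline Node* LocalFromAddr(Node* addr)
{
    if (addr != nullptr && addr->oper == Oper::Addr && addr->op1 != nullptr &&
        addr->op1->oper == Oper::Local)
    {
        return addr->op1;
    }
    return nullptr;
}

// Single central entry point: callers pass any operand and the same rules
// always apply. Returns true if a local was found and marked.
inline bool MarkOperandRelatedToSimd(Node* op)
{
    if (op == nullptr)
    {
        return false;
    }

    Node* lcl = nullptr;
    if (op->oper == Oper::Local)
    {
        lcl = op;
    }
    else if (IsIndir(op))
    {
        lcl = LocalFromAddr(op->op1);
    }

    if (lcl == nullptr)
    {
        return false;
    }

    lcl->relatedToSimd = true;
    return true;
}
```

With a helper shaped like this, `BLK(ADDR(LCL))` and `OBJ(ADDR(LCL))` are treated identically, and call sites no longer need their own ad hoc checks.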
@@ -4927,6 +4927,9 @@ struct GenTreeJitIntrinsic : public GenTreeOp
    GenTreeJitIntrinsic(
        genTreeOps oper, var_types type, GenTree* op1, GenTree* op2, CorInfoType simdBaseJitType, unsigned simdSize)
        : GenTreeOp(oper, type, op1, op2)
        , gtLayout(nullptr)
        , gtAuxiliaryJitType(CORINFO_TYPE_UNDEF)
This is needed due to an assert encountered while running PMI diffs; it's related to the refactoring that happened in #50832.
If these aren't explicitly zeroed, they may be `0xDD` in debug builds and cause assertions when calling the helper functions used in hashing the trees.
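As a rough illustration of why the explicit initializers matter, the sketch below constructs an object inside storage that has been pre-filled with `0xDD`, which is an assumption standing in for what a debug allocator might do. `IntrinsicNode` and `ConstructInPoisonedStorage` are hypothetical names for this example only and are not the JIT's types; the point is just that members named in the initializer list come out zeroed regardless of what the underlying memory held.

```cpp
#include <cassert>
#include <cstring>
#include <new>

// Hypothetical stand-in for a node type with two auxiliary members.
struct IntrinsicNode
{
    const void*   layout;
    unsigned char auxiliaryType;

    // Explicitly initialize every member so none inherits the fill pattern.
    IntrinsicNode() : layout(nullptr), auxiliaryType(0)
    {
    }
};

// Fills the caller-provided storage with 0xDD (an assumed debug fill pattern)
// and then constructs the node in place on top of it.
inline IntrinsicNode* ConstructInPoisonedStorage(void* storage)
{
    std::memset(storage, 0xDD, sizeof(IntrinsicNode));
    return new (storage) IntrinsicNode();
}
```

Had the constructor left either member out of the initializer list, that member's value would be indeterminate after construction, which is the kind of state that later trips assertions in code that inspects every field.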
because if you have `Vector4 v; v.x = something;` you have the field accessed and can't do the promotion, so the logic is:
@sandreenko, sorry, I'm not following here. My question is explicitly why we would ever want to promote here. Promotion of the SIMD types is bad and causes it to be spilled to memory or split amongst multiple registers. We instead want it to be enregistered to a single SIMD register and for field accesses to be morphed into This is because SIMD is special and we support getting or setting any field from an enregistered struct via an intrinsic and so we should never need to promote, even when an individual field is accessed or assigned.
that is a lot of text to scroll so I could have missed it but if the question was why the condition at
then it could be a historical reason that we did not have logic for
12df420 to a4c5ca3
This should be ready for review. It provides some decent improvements and resolves #50939 by improving the codegen for copies across inlined code.
LGTM
@tannergooding could you keep an eye on the outerloop and stress job results with your change just in case?
Will do.
Outerloop jobs look to have all passed with regard to this change. JitStress jobs will run this coming Sunday.
This resolves #50939 by ensuring that SIMD assignments are treated as related to SIMD intrinsics to avoid promotion. More details are listed here: #51569 (comment)
Treating SIMD assignment as an intrinsic and avoiding promotion makes sense because a SIMD copy is logically an intrinsic operation and will generate a `movaps` or `movups`. We already handle similar scenarios for `blockOpInit`, `simdInit`, and others by doing similar checks and calling `setLclRelatedToSIMDIntrinsic`.
-- Noting that SIMD types are spilled to the stack as 8, 16, or 32-byte arguments. When loading from a field, array, or byref, `Vector2` generates a `movsd` and `Vector3` generates a `movsd` + `movss` instead.
Notably, we only really allow promotion for `Vector2/3/4`. We don't currently allow it for the "opaque" kinds, meaning `Vector<T>`, `Vector64<T>`, `Vector128<T>`, or `Vector256<T>`. This leads to poorer code quality for `Vector2/3/4`, particularly when inlining is involved, due to additional copies inserted for passing the arguments around.
It also isn't clear to me why we would want to promote SIMD types anyways, as it will almost certainly be more efficient to keep the entire value enregistered (when registers are available) and to perform the relevant insertions/extractions using the appropriate SIMD instructions instead.
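For readers less familiar with the `movsd` + `movss` note above, here is a small portable sketch of what that two-part load amounts to for a 12-byte `Vector3`. It does not use real SIMD instructions: `Xmm` is a hypothetical 16-byte model of a register, and `LoadVector3` is an illustrative helper, so this only models the data movement and assumes `float` is 4 bytes with no padding in the struct.

```cpp
#include <cassert>
#include <cstring>

// Plain 12-byte struct standing in for a Vector3-style type.
struct Vector3
{
    float x, y, z;
};
static_assert(sizeof(Vector3) == 12, "expected three packed 4-byte floats");

// Hypothetical model of a 16-byte SIMD register as four float lanes.
struct Xmm
{
    float lane[4];
};

// Models loading a Vector3 from memory in two pieces:
// an 8-byte read (x, y) followed by a 4-byte read (z), upper lane zeroed.
inline Xmm LoadVector3(const Vector3* src)
{
    Xmm reg = {{0.0f, 0.0f, 0.0f, 0.0f}};
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(src);
    std::memcpy(&reg.lane[0], bytes, 8);      // analogous to movsd: low 8 bytes
    std::memcpy(&reg.lane[2], bytes + 8, 4);  // analogous to movss: next 4 bytes
    return reg;
}
```

Once the whole value sits in one register image like this, reading or writing a single field is a lane extract or insert, which is the alternative to promotion that the description argues for.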
PMI Diffs
The gains listed below are all from no longer promoting.
Frameworks
Noting that there was an assert that fired which I logged under: #51728
Benchmarks
Tests
dotnet/performance - MicroBenchmarks.dll