Mark SIMD assignments as related to SIMD intrinsics #51731

Merged: 5 commits, Apr 27, 2021


tannergooding
Member

This resolves #50939 by ensuring that SIMD assignments are treated as related to SIMD intrinsics to avoid promotion. More details are listed here: #51569 (comment)

Treating a SIMD assignment as an intrinsic and avoiding promotion makes sense because a SIMD copy is logically an intrinsic operation and will generate a movaps or movups. We already handle similar scenarios for blockOpInit, simdInit, and others by doing similar checks and calling setLclRelatedToSIMDIntrinsic. Note that SIMD types are spilled to the stack as 8, 16, or 32-byte arguments. When loading from a field, array, or byref, Vector2 generates a movsd and Vector3 generates a movsd+movss instead.

Notably, we only really allow promotion for Vector2/3/4. We don't currently allow it for the "opaque" kinds, meaning Vector<T>, Vector64<T>, Vector128<T>, or Vector256<T>. This leads to poorer code quality for Vector2/3/4, particularly when inlining is involved, due to the additional copies inserted for passing the arguments around.

It also isn't clear to me why we would want to promote SIMD types anyway, as it will almost certainly be more efficient to keep the entire value enregistered (when registers are available) and to perform the relevant insertions/extractions using the appropriate SIMD instructions instead.
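
To make the rule described above concrete, here is a minimal toy model, not the actual JIT code. Every name in it (`SimdKind`, `LocalDesc`, `SimdAssignment`, `markSimdAssignment`, `canPromoteSimdLocal`) is hypothetical and exists only for illustration; the real change goes through setLclRelatedToSIMDIntrinsic. The sketch assumes that a SIMD-to-SIMD copy marks both locals, and that only unmarked Vector2/3/4 locals remain promotion candidates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SIMD kinds, named after the types mentioned in the description.
enum class SimdKind { None, Vector2, Vector3, Vector4, VectorT, Vector64, Vector128, Vector256 };

// Hypothetical, simplified stand-in for a JIT local-variable descriptor.
struct LocalDesc {
    SimdKind kind = SimdKind::None;
    bool relatedToSimdIntrinsic = false; // set when the local takes part in a SIMD operation
};

// Hypothetical copy node: locals[dst] = locals[src].
struct SimdAssignment {
    std::size_t dst;
    std::size_t src;
};

// Treat a SIMD-to-SIMD copy like an intrinsic: mark both locals involved.
inline void markSimdAssignment(std::vector<LocalDesc>& locals, const SimdAssignment& asg) {
    assert(asg.dst < locals.size() && asg.src < locals.size());
    if (locals[asg.dst].kind != SimdKind::None && locals[asg.src].kind != SimdKind::None) {
        locals[asg.dst].relatedToSimdIntrinsic = true;
        locals[asg.src].relatedToSimdIntrinsic = true;
    }
}

// Toy promotion rule: only Vector2/3/4 are candidates, and only while unmarked.
inline bool canPromoteSimdLocal(const LocalDesc& lcl) {
    const bool promotableKind = lcl.kind == SimdKind::Vector2 ||
                                lcl.kind == SimdKind::Vector3 ||
                                lcl.kind == SimdKind::Vector4;
    return promotableKind && !lcl.relatedToSimdIntrinsic;
}
```

In this model, a Vector3 local is a promotion candidate until it appears on either side of a SIMD copy. After `markSimdAssignment` runs, `canPromoteSimdLocal` returns false for both locals, which is the behavior the diffs below attribute the gains to.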

PMI Diffs

The gains listed below are all from no longer promoting.

Frameworks

Note that an assert fired during this run, which I logged under #51728.

Found 273 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 51225304
Total bytes of diff: 51224803
Total bytes of delta: -501 (-0.00% of base)
    diff is an improvement.


Top file improvements (bytes):
        -426 : System.Private.CoreLib.dasm (-0.02% of base)
         -75 : System.Drawing.Common.dasm (-0.02% of base)

2 total files with Code Size differences (2 improved, 0 regressed), 269 unchanged.

Top method regressions (bytes):
          59 ( 3.61% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
          13 ( 3.29% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Quaternion):Plane

Top method improvements (bytes):
        -103 (-8.62% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -76 (-12.82% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
         -50 (-5.43% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref,bool,byref):this
         -47 (-37.01% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
         -40 (-35.09% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
         -31 (-37.35% of base) : System.Private.CoreLib.dasm - Vector3:Multiply(float,Vector3):Vector3
         -23 (-4.65% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateReflection(Plane):Matrix4x4
         -21 (-2.79% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
         -16 (-16.84% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref):this
         -15 (-53.57% of base) : System.Private.CoreLib.dasm - Vector2:get_Zero():Vector2
         -15 (-34.09% of base) : System.Private.CoreLib.dasm - Vector2:Multiply(float,Vector2):Vector2
         -15 (-23.44% of base) : System.Private.CoreLib.dasm - Vector2:Normalize(Vector2):Vector2
         -15 (-23.81% of base) : System.Private.CoreLib.dasm - Vector2:Reflect(Vector2,Vector2):Vector2
         -14 (-2.31% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -12 (-2.20% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateShadow(Vector3,Plane):Matrix4x4
         -12 (-14.29% of base) : System.Private.CoreLib.dasm - Plane:Dot(Plane,Vector4):float
         -12 (-2.54% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Matrix4x4):Plane
         -11 (-3.86% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
         -10 (-7.69% of base) : System.Private.CoreLib.dasm - Plane:op_Equality(Plane,Plane):bool
         -10 (-22.22% of base) : System.Private.CoreLib.dasm - Vector2:CopyTo(Span`1):this

Top method regressions (percentages):
          59 ( 3.61% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
          13 ( 3.29% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Quaternion):Plane

Top method improvements (percentages):
         -15 (-53.57% of base) : System.Private.CoreLib.dasm - Vector2:get_Zero():Vector2
         -31 (-37.35% of base) : System.Private.CoreLib.dasm - Vector3:Multiply(float,Vector3):Vector3
         -47 (-37.01% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
         -40 (-35.09% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
         -15 (-34.09% of base) : System.Private.CoreLib.dasm - Vector2:Multiply(float,Vector2):Vector2
         -10 (-25.64% of base) : System.Private.CoreLib.dasm - Vector2:TryCopyTo(Span`1):bool:this
         -15 (-23.81% of base) : System.Private.CoreLib.dasm - Vector2:Reflect(Vector2,Vector2):Vector2
         -15 (-23.44% of base) : System.Private.CoreLib.dasm - Vector2:Normalize(Vector2):Vector2
         -10 (-22.22% of base) : System.Private.CoreLib.dasm - Vector2:CopyTo(Span`1):this
         -16 (-16.84% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref):this
         -12 (-14.29% of base) : System.Private.CoreLib.dasm - Plane:Dot(Plane,Vector4):float
         -76 (-12.82% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
          -9 (-10.00% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref,byref):this
        -103 (-8.62% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -10 (-7.69% of base) : System.Private.CoreLib.dasm - Plane:op_Equality(Plane,Plane):bool
         -50 (-5.43% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref,bool,byref):this
         -23 (-4.65% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateReflection(Plane):Matrix4x4
          -6 (-4.44% of base) : System.Private.CoreLib.dasm - Plane:op_Inequality(Plane,Plane):bool
         -11 (-3.86% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
         -21 (-2.79% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4

25 total methods with Code Size differences (23 improved, 2 regressed), 258893 unchanged.
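
The per-method lines above report only a byte delta and a percentage, so the absolute method sizes have to be inferred. The following sketch shows one way to estimate them. The helper names are illustrative and not part of any diff tool, and the result is an estimate because the reported percentages are rounded to two decimals.

```cpp
#include <cassert>
#include <cmath>

// Estimate the base method size from a reported delta and "% of base".
// Both helpers are hypothetical and only illustrate the arithmetic.
inline long long impliedBaseBytes(long long deltaBytes, double percentOfBase) {
    assert(percentOfBase != 0.0); // a 0.00% line carries no size information
    return std::llround(100.0 * static_cast<double>(deltaBytes) / percentOfBase);
}

// Size after the change is the estimated base plus the signed delta.
inline long long impliedDiffBytes(long long deltaBytes, double percentOfBase) {
    return impliedBaseBytes(deltaBytes, percentOfBase) + deltaBytes;
}
```

For example, the `Vector3:Reflect` line (-47 bytes, -37.01% of base) implies a method of roughly 127 bytes shrinking to about 80 bytes.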

Benchmarks

Found 84 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 505974
Total bytes of diff: 505594
Total bytes of delta: -380 (-0.08% of base)
    diff is an improvement.


Top file improvements (bytes):
        -380 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-1.51% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 81 unchanged.

Top method improvements (bytes):
         -84 (-7.19% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
         -66 (-7.28% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -63 (-6.50% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:Shade(ISect,Scene,int):Color:this
         -42 (-13.04% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -31 (-35.63% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Times(double,Vector):Vector
         -21 (-5.57% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetReflectionColor(SceneObject,Vector,Vector,Vector,Scene,int):Color:this
         -18 (-11.84% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
         -18 (-14.06% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Norm(Vector):Vector
          -4 (-1.79% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Cross(Vector,Vector):Vector
          -1 (-0.43% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_0(Vector):Color:this
          -1 (-0.94% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_2(Vector):double:this

Top method improvements (percentages):
         -31 (-35.63% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Times(double,Vector):Vector
         -18 (-14.06% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Norm(Vector):Vector
         -42 (-13.04% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -18 (-11.84% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
         -66 (-7.28% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -84 (-7.19% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
         -63 (-6.50% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:Shade(ISect,Scene,int):Color:this
         -21 (-5.57% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetReflectionColor(SceneObject,Vector,Vector,Vector,Scene,int):Color:this
          -4 (-1.79% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Cross(Vector,Vector):Vector
          -1 (-0.94% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_2(Vector):double:this
          -1 (-0.43% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_0(Vector):Color:this

12 total methods with Code Size differences (12 improved, 0 regressed), 1886 unchanged.

Tests

Found 3585 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 137419378
Total bytes of diff: 137417151
Total bytes of delta: -2227 (-0.00% of base)
    diff is an improvement.


Top file regressions (bytes):
          27 : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm (0.65% of base)
           4 : JIT\SIMD\Dup_r\Dup_r.dasm (1.98% of base)

Top file improvements (bytes):
        -759 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm (-5.71% of base)
        -380 : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm (-1.51% of base)
        -293 : JIT\Regression\JitBlue\GitHub_21546\GitHub_21546\GitHub_21546.dasm (-17.27% of base)
        -292 : Interop\PInvoke\Vector2_3_4\Vector2_3_4\Vector2_3_4.dasm (-6.62% of base)
        -158 : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm (-3.35% of base)
         -60 : JIT\SIMD\VectorReturn_r\VectorReturn_r.dasm (-0.07% of base)
         -51 : JIT\Regression\JitBlue\GitHub_8220\GitHub_8220\GitHub_8220.dasm (-1.56% of base)
         -28 : JIT\HardwareIntrinsics\General\Vector128_1\Vector128_1_ro\Vector128_1_ro.dasm (-0.01% of base)
         -24 : JIT\SIMD\VectorAbs_r\VectorAbs_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorMin_r\VectorMin_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorMul_r\VectorMul_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorDiv_r\VectorDiv_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorAdd_r\VectorAdd_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorSub_r\VectorSub_r.dasm (-0.03% of base)
         -24 : JIT\SIMD\VectorMax_r\VectorMax_r.dasm (-0.03% of base)
         -16 : JIT\SIMD\Ldfld_ro\Ldfld_ro.dasm (-4.04% of base)
         -12 : JIT\SIMD\CircleInConvex_r\CircleInConvex_r.dasm (-0.17% of base)
         -11 : JIT\Regression\JitBlue\GitHub_7508\Vector3Test\Vector3Test.dasm (-0.67% of base)
         -11 : JIT\SIMD\Plane_ro\Plane_ro.dasm (-3.49% of base)
         -10 : JIT\SIMD\Dup_ro\Dup_ro.dasm (-8.62% of base)

23 total files with Code Size differences (21 improved, 2 regressed), 3519 unchanged.

Top method regressions (bytes):
          22 (48.89% of base) : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm - Runtime_31615:G3():Vector3
          13 (22.41% of base) : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm - Runtime_31615:G4():Vector4
          10 ( 1.95% of base) : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm - test:y_radius(float,List`1,List`1,byref):float
           4 ( 2.55% of base) : JIT\SIMD\Dup_r\Dup_r.dasm - Program:Main(ref):int

Top method improvements (bytes):
        -311 (-20.18% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:Main():int
        -209 (-44.95% of base) : JIT\Regression\JitBlue\GitHub_21546\GitHub_21546\GitHub_21546.dasm - test:FailureCase(List`1)
        -187 (-16.33% of base) : Interop\PInvoke\Vector2_3_4\Vector2_3_4\Vector2_3_4.dasm - Vector2_3_4Test:RunVector2Tests()
        -133 (-7.93% of base) : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm - test:convex_hull(List`1)
        -114 (-24.36% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F1_v3(float):Vector3
        -114 (-13.35% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F2_v3(float):Vector3
        -110 (-27.85% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F1_v2(float):Vector2
        -110 (-16.64% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F2_v2(float):Vector2
        -105 (-6.79% of base) : Interop\PInvoke\Vector2_3_4\Vector2_3_4\Vector2_3_4.dasm - Vector2_3_4Test:RunVector3Tests()
         -84 (-7.19% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
         -84 (-15.70% of base) : JIT\Regression\JitBlue\GitHub_21546\GitHub_21546\GitHub_21546.dasm - test:Main():int
         -66 (-7.28% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -63 (-6.50% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:Shade(ISect,Scene,int):Color:this
         -52 (-3.66% of base) : JIT\SIMD\VectorReturn_r\VectorReturn_r.dasm - VectorTest:Main():int
         -51 (-9.43% of base) : JIT\Regression\JitBlue\GitHub_8220\GitHub_8220\GitHub_8220.dasm - Program:testDotProduct(Vector3):int
         -42 (-13.04% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -31 (-35.63% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Times(double,Vector):Vector
         -27 (-2.34% of base) : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm - test:FindCircle(List`1,byref,byref):bool
         -21 (-5.57% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetReflectionColor(SceneObject,Vector,Vector,Vector,Scene,int):Color:this

Top method regressions (percentages):
          22 (48.89% of base) : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm - Runtime_31615:G3():Vector3
          13 (22.41% of base) : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm - Runtime_31615:G4():Vector4
           4 ( 2.55% of base) : JIT\SIMD\Dup_r\Dup_r.dasm - Program:Main(ref):int
          10 ( 1.95% of base) : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm - test:y_radius(float,List`1,List`1,byref):float

Top method improvements (percentages):
        -209 (-44.95% of base) : JIT\Regression\JitBlue\GitHub_21546\GitHub_21546\GitHub_21546.dasm - test:FailureCase(List`1)
         -31 (-35.63% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Times(double,Vector):Vector
        -110 (-27.85% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F1_v2(float):Vector2
        -114 (-24.36% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F1_v3(float):Vector3
        -311 (-20.18% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:Main():int
        -110 (-16.64% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F2_v2(float):Vector2
        -187 (-16.33% of base) : Interop\PInvoke\Vector2_3_4\Vector2_3_4\Vector2_3_4.dasm - Vector2_3_4Test:RunVector2Tests()
         -84 (-15.70% of base) : JIT\Regression\JitBlue\GitHub_21546\GitHub_21546\GitHub_21546.dasm - test:Main():int
         -18 (-14.06% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Norm(Vector):Vector
        -114 (-13.35% of base) : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:F2_v3(float):Vector3
         -42 (-13.04% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -18 (-11.84% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
         -20 (-11.05% of base) : JIT\HardwareIntrinsics\General\Vector128_1\Vector128_1_ro\Vector128_1_ro.dasm - VectorAs__AsVector2Single:RunBasicScenario():this
          -5 (-10.87% of base) : JIT\Regression\JitBlue\Runtime_31615\Runtime_31615\Runtime_31615.dasm - Runtime_31615:G2():Vector2
         -51 (-9.43% of base) : JIT\Regression\JitBlue\GitHub_8220\GitHub_8220\GitHub_8220.dasm - Program:testDotProduct(Vector3):int
         -10 (-8.70% of base) : JIT\SIMD\Dup_ro\Dup_ro.dasm - Program:Main(ref):int
        -133 (-7.93% of base) : JIT\SIMD\CircleInConvex_ro\CircleInConvex_ro.dasm - test:convex_hull(List`1)
         -66 (-7.28% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -84 (-7.19% of base) : JIT\Performance\CodeQuality\SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this

63 total methods with Code Size differences (59 improved, 4 regressed), 484444 unchanged.

dotnet/performance - MicroBenchmarks.dll

Found 2 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1817262
Total bytes of diff: 1816306
Total bytes of delta: -956 (-0.05% of base)
    diff is an improvement.


Top file improvements (bytes):
        -956 : MicroBenchmarks.dasm (-0.05% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method regressions (bytes):
          36 (17.48% of base) : MicroBenchmarks.dasm - Perf_Vector3:CrossBenchmark():Vector3:this
          18 (10.29% of base) : MicroBenchmarks.dasm - Perf_Vector3:TransformByQuaternionBenchmark():Vector3:this
          18 ( 9.89% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformByQuaternionBenchmark():Vector4:this
          18 ( 9.89% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformVector3ByQuaternionBenchmark():Vector4:this
          12 ( 7.74% of base) : MicroBenchmarks.dasm - Perf_Plane:EqualityOperatorBenchmark():bool:this
           7 ( 4.93% of base) : MicroBenchmarks.dasm - Perf_Plane:DotBenchmark():float:this
           6 ( 4.88% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateScaleFromVectorBenchmark():Matrix4x4:this
           3 ( 3.45% of base) : MicroBenchmarks.dasm - Perf_Matrix3x2:CreateScaleFromVectorBenchmark():Matrix3x2:this
           1 ( 0.93% of base) : MicroBenchmarks.dasm - Perf_Vector2:TransformByQuaternionBenchmark():Vector2:this
           1 ( 0.69% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformVector2ByQuaternionBenchmark():Vector4:this

Top method improvements (bytes):
         -84 (-7.19% of base) : MicroBenchmarks.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
         -66 (-7.28% of base) : MicroBenchmarks.dasm - Camera:Create(Vector,Vector):Camera
         -63 (-6.50% of base) : MicroBenchmarks.dasm - RayTracer:Shade(ISect,Scene,int):Color:this
         -54 (-28.57% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():Matrix4x4:this
         -52 (-38.81% of base) : MicroBenchmarks.dasm - Perf_Vector2:DistanceBenchmark():float:this
         -52 (-31.14% of base) : MicroBenchmarks.dasm - Perf_Vector2:LerpBenchmark():Vector2:this
         -48 (-30.77% of base) : MicroBenchmarks.dasm - Perf_Vector3:DistanceBenchmark():float:this
         -48 (-23.30% of base) : MicroBenchmarks.dasm - Perf_Vector3:LerpBenchmark():Vector3:this
         -42 (-13.04% of base) : MicroBenchmarks.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -36 (-25.53% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateBillboardBenchmark():Matrix4x4:this
         -35 (-12.20% of base) : MicroBenchmarks.dasm - Perf_Plane:CreateFromVerticesBenchmark():Plane:this
         -31 (-35.63% of base) : MicroBenchmarks.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : MicroBenchmarks.dasm - Vector:Times(double,Vector):Vector
         -26 (-24.07% of base) : MicroBenchmarks.dasm - Perf_Vector2:DivideByScalarBenchmark():Vector2:this
         -26 (-29.21% of base) : MicroBenchmarks.dasm - Perf_Vector2:NegateBenchmark():Vector2:this
         -24 (-17.65% of base) : MicroBenchmarks.dasm - Perf_Vector3:DivideByScalarBenchmark():Vector3:this
         -24 (-19.05% of base) : MicroBenchmarks.dasm - Perf_Vector3:MultiplyByScalarBenchmark():Vector3:this
         -24 (-20.51% of base) : MicroBenchmarks.dasm - Perf_Vector3:NegateBenchmark():Vector3:this
         -21 (-5.57% of base) : MicroBenchmarks.dasm - RayTracer:GetReflectionColor(SceneObject,Vector,Vector,Vector,Scene,int):Color:this
         -18 (-11.84% of base) : MicroBenchmarks.dasm - Sphere:Normal(Vector):Vector:this

Top method regressions (percentages):
          36 (17.48% of base) : MicroBenchmarks.dasm - Perf_Vector3:CrossBenchmark():Vector3:this
          18 (10.29% of base) : MicroBenchmarks.dasm - Perf_Vector3:TransformByQuaternionBenchmark():Vector3:this
          18 ( 9.89% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformByQuaternionBenchmark():Vector4:this
          18 ( 9.89% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformVector3ByQuaternionBenchmark():Vector4:this
          12 ( 7.74% of base) : MicroBenchmarks.dasm - Perf_Plane:EqualityOperatorBenchmark():bool:this
           7 ( 4.93% of base) : MicroBenchmarks.dasm - Perf_Plane:DotBenchmark():float:this
           6 ( 4.88% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateScaleFromVectorBenchmark():Matrix4x4:this
           3 ( 3.45% of base) : MicroBenchmarks.dasm - Perf_Matrix3x2:CreateScaleFromVectorBenchmark():Matrix3x2:this
           1 ( 0.93% of base) : MicroBenchmarks.dasm - Perf_Vector2:TransformByQuaternionBenchmark():Vector2:this
           1 ( 0.69% of base) : MicroBenchmarks.dasm - Perf_Vector4:TransformVector2ByQuaternionBenchmark():Vector4:this

Top method improvements (percentages):
         -52 (-38.81% of base) : MicroBenchmarks.dasm - Perf_Vector2:DistanceBenchmark():float:this
         -31 (-35.63% of base) : MicroBenchmarks.dasm - Color:Times(double,Color):Color
         -31 (-35.63% of base) : MicroBenchmarks.dasm - Vector:Times(double,Vector):Vector
         -52 (-31.14% of base) : MicroBenchmarks.dasm - Perf_Vector2:LerpBenchmark():Vector2:this
         -48 (-30.77% of base) : MicroBenchmarks.dasm - Perf_Vector3:DistanceBenchmark():float:this
         -26 (-29.21% of base) : MicroBenchmarks.dasm - Perf_Vector2:NegateBenchmark():Vector2:this
         -54 (-28.57% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():Matrix4x4:this
         -36 (-25.53% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateBillboardBenchmark():Matrix4x4:this
         -26 (-24.07% of base) : MicroBenchmarks.dasm - Perf_Vector2:DivideByScalarBenchmark():Vector2:this
         -48 (-23.30% of base) : MicroBenchmarks.dasm - Perf_Vector3:LerpBenchmark():Vector3:this
         -24 (-20.51% of base) : MicroBenchmarks.dasm - Perf_Vector3:NegateBenchmark():Vector3:this
         -24 (-19.05% of base) : MicroBenchmarks.dasm - Perf_Vector3:MultiplyByScalarBenchmark():Vector3:this
         -24 (-17.65% of base) : MicroBenchmarks.dasm - Perf_Vector3:DivideByScalarBenchmark():Vector3:this
         -18 (-14.06% of base) : MicroBenchmarks.dasm - Vector:Norm(Vector):Vector
         -42 (-13.04% of base) : MicroBenchmarks.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
         -12 (-12.90% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateLookAtBenchmark():Matrix4x4:this
         -12 (-12.90% of base) : MicroBenchmarks.dasm - Perf_Matrix4x4:CreateWorldBenchmark():Matrix4x4:this
         -12 (-12.50% of base) : MicroBenchmarks.dasm - Perf_Quaternion:CreateFromAxisAngleBenchmark():Quaternion:this
         -35 (-12.20% of base) : MicroBenchmarks.dasm - Perf_Plane:CreateFromVerticesBenchmark():Plane:this
         -18 (-11.84% of base) : MicroBenchmarks.dasm - Sphere:Normal(Vector):Vector:this

52 total methods with Code Size differences (42 improved, 10 regressed), 8728 unchanged
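
As a consistency check on the four summaries above, the reported deltas can be recomputed from the base and diff totals. This is a small sketch. `DiffSummary` and the helper functions are illustrative names, and rounding the percentage to two decimals is an assumption about how the report formats it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Totals taken from one "Summary of Code Size diffs" block (illustrative type).
struct DiffSummary {
    std::int64_t baseBytes;
    std::int64_t diffBytes;
};

// Signed delta: negative means the diff is smaller than the base.
inline std::int64_t deltaBytes(const DiffSummary& s) {
    return s.diffBytes - s.baseBytes;
}

// Delta as a percentage of the base size.
inline double deltaPercent(const DiffSummary& s) {
    assert(s.baseBytes > 0);
    return 100.0 * static_cast<double>(deltaBytes(s)) / static_cast<double>(s.baseBytes);
}

// Round to two decimals, assumed to match the precision shown in the report.
inline double roundTo2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

With the totals listed above, the deltas come out to -501 (Frameworks), -380 (Benchmarks), -2227 (Tests), and -956 (MicroBenchmarks) bytes, matching the reported values. The Frameworks and Tests deltas are small enough relative to their bases that the percentage rounds to 0.00%.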

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Apr 23, 2021
@tannergooding
Member Author

An example of a gain is in Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4

Before

; Assembly listing for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 27 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T04] (  4,  4   )   byref  ->  rcx        
;  V01 arg0         [V01,T02] (  3,  6   )   byref  ->  rdx        
;  V02 arg1         [V02,T03] (  3,  6   )   byref  ->   r8        
;  V03 arg2         [V03,T00] ( 13, 16   )   byref  ->   r9        
;  V04 arg3         [V04,T05] (  1,  1   )   byref  ->  [rsp+140H]  
;  V05 arg4         [V05,T06] (  1,  1   )   byref  ->  [rsp+148H]  
;  V06 loc0         [V06,T07] (  8,  6   )  simd12  ->  mm0         ld-addr-op
;  V07 loc1         [V07,T18] (  3,  2.50)   float  ->  mm2        
;  V08 loc2         [V08,T19] (  3,  2.50)  simd12  ->  mm2        
;  V09 loc3         [V09,T14] (  5,  3   )  simd12  ->  mm0        
;  V10 loc4         [V10,T08] (  7,  4   )  simd12  ->  mm4        
;  V11 loc5         [V11,T15] (  4,  3   )   float  ->  registers  
;  V12 loc6         [V12,T01] (  9,  9   )  struct (64) [rsp+B0H]   do-not-enreg[SFB] ld-addr-op
;# V13 OutArgs      [V13    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V14 tmp1         [V14    ] (  0,  0   )  simd12  ->  zero-ref    do-not-enreg[SB] "struct address for call/obj"
;* V15 tmp2         [V15    ] (  0,  0   )  simd12  ->  zero-ref    do-not-enreg[SB] "struct address for call/obj"
;* V16 tmp3         [V16    ] (  0,  0   )  simd12  ->  zero-ref    do-not-enreg[SB] "struct address for call/obj"
;* V17 tmp4         [V17    ] (  0,  0   )  simd12  ->  zero-ref    do-not-enreg[SB] "struct address for call/obj"
;  V18 tmp5         [V18,T35] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V19 tmp6         [V19    ] (  3,  1.50)  simd12  ->  [rsp+A0H]   do-not-enreg[SB]
;  V20 tmp7         [V20,T36] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V21 tmp8         [V21,T37] (  2,  2   )  simd12  ->  mm2         "Inlining Arg"
;  V22 tmp9         [V22    ] (  2,  2   )  simd12  ->  [rsp+90H]   do-not-enreg[SB] "Inlining Arg"
;* V23 tmp10        [V23    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V24 tmp11        [V24,T38] (  2,  2   )  simd12  ->  mm0         "Inlining Arg"
;* V25 tmp12        [V25    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V26 tmp13        [V26,T39] (  2,  2   )  simd12  ->  mm2         "NewObj constructor temp"
;  V27 tmp14        [V27,T40] (  2,  2   )   float  ->  mm4         "Inlining Arg"
;* V28 tmp15        [V28    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;  V29 tmp16        [V29    ] (  7,  7   )  simd12  ->  [rsp+80H]   do-not-enreg[SB] "Inlining Arg"
;  V30 tmp17        [V30,T41] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V31 tmp18        [V31,T09] (  4,  4   )  simd12  ->  mm0         ld-addr-op "Inlining Arg"
;  V32 tmp19        [V32    ] (  2,  2   )  simd12  ->  [rsp+70H]   do-not-enreg[SB] "impAppendStmt"
;  V33 tmp20        [V33,T80] (  2,  1   )   float  ->  mm0         "Inline stloc first use temp"
;  V34 tmp21        [V34,T42] (  2,  2   )  simd12  ->  mm3         "Inlining Arg"
;* V35 tmp22        [V35    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V36 tmp23        [V36,T43] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V37 tmp24        [V37    ] (  7,  7   )  simd12  ->  [rsp+60H]   do-not-enreg[SB] "Inlining Arg"
;* V38 tmp25        [V38    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;  V39 tmp26        [V39,T44] (  2,  2   )  simd12  ->  mm3         "NewObj constructor temp"
;  V40 tmp27        [V40,T10] (  4,  4   )  simd12  ->  mm3         ld-addr-op "Inlining Arg"
;  V41 tmp28        [V41    ] (  2,  2   )  simd12  ->  [rsp+50H]   do-not-enreg[SB] "impAppendStmt"
;  V42 tmp29        [V42,T81] (  2,  1   )   float  ->  mm3         "Inline stloc first use temp"
;  V43 tmp30        [V43,T45] (  2,  2   )  simd12  ->  mm4         "Inlining Arg"
;* V44 tmp31        [V44    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V45 tmp32        [V45,T46] (  2,  2   )  simd12  ->  mm3         "NewObj constructor temp"
;* V46 tmp33        [V46    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;  V47 tmp34        [V47    ] (  7,  7   )  simd12  ->  [rsp+40H]   do-not-enreg[SB] "Inlining Arg"
;  V48 tmp35        [V48,T47] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V49 tmp36        [V49,T11] (  4,  4   )  simd12  ->  mm0         ld-addr-op "Inlining Arg"
;  V50 tmp37        [V50    ] (  2,  2   )  simd12  ->  [rsp+30H]   do-not-enreg[SB] "impAppendStmt"
;  V51 tmp38        [V51,T82] (  2,  1   )   float  ->  mm0         "Inline stloc first use temp"
;  V52 tmp39        [V52,T48] (  2,  2   )  simd12  ->  mm3         "Inlining Arg"
;* V53 tmp40        [V53    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V54 tmp41        [V54,T49] (  2,  2   )  simd12  ->  mm0         "NewObj constructor temp"
;  V55 tmp42        [V55    ] (  7,  7   )  simd12  ->  [rsp+20H]   do-not-enreg[SB] "Inlining Arg"
;  V56 tmp43        [V56    ] (  7,  7   )  simd12  ->  [rsp+10H]   do-not-enreg[SB] "Inlining Arg"
;  V57 tmp44        [V57,T50] (  2,  2   )  simd12  ->  mm3         "NewObj constructor temp"
;  V58 tmp45        [V58,T12] (  4,  4   )  simd12  ->  mm3         ld-addr-op "Inlining Arg"
;  V59 tmp46        [V59    ] (  2,  2   )  simd12  ->  [rsp+00H]   do-not-enreg[SB] "impAppendStmt"
;  V60 tmp47        [V60,T83] (  2,  1   )   float  ->  mm3         "Inline stloc first use temp"
;  V61 tmp48        [V61,T51] (  2,  2   )  simd12  ->  mm4         "Inlining Arg"
;* V62 tmp49        [V62    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V63 tmp50        [V63,T52] (  2,  2   )  simd12  ->  mm3         "NewObj constructor temp"
;* V64 tmp51        [V64    ] (  0,  0   )   float  ->  zero-ref    V124.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V65 tmp52        [V65    ] (  0,  0   )   float  ->  zero-ref    V124.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;* V66 tmp53        [V66    ] (  0,  0   )   float  ->  zero-ref    V124.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;* V67 tmp54        [V67    ] (  0,  0   )   float  ->  zero-ref    V125.X(offs=0x00) P-INDEP "field V05.X (fldOffset=0x0)"
;* V68 tmp55        [V68    ] (  0,  0   )   float  ->  zero-ref    V125.Y(offs=0x04) P-INDEP "field V05.Y (fldOffset=0x4)"
;* V69 tmp56        [V69    ] (  0,  0   )   float  ->  zero-ref    V125.Z(offs=0x08) P-INDEP "field V05.Z (fldOffset=0x8)"
;* V70 tmp57        [V70    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V14.X(offs=0x00) P-DEP "field V14.X (fldOffset=0x0)"
;* V71 tmp58        [V71    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V14.Y(offs=0x04) P-DEP "field V14.Y (fldOffset=0x4)"
;* V72 tmp59        [V72    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V14.Z(offs=0x08) P-DEP "field V14.Z (fldOffset=0x8)"
;* V73 tmp60        [V73    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V15.X(offs=0x00) P-DEP "field V15.X (fldOffset=0x0)"
;* V74 tmp61        [V74    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V15.Y(offs=0x04) P-DEP "field V15.Y (fldOffset=0x4)"
;* V75 tmp62        [V75    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V15.Z(offs=0x08) P-DEP "field V15.Z (fldOffset=0x8)"
;* V76 tmp63        [V76    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V16.X(offs=0x00) P-DEP "field V16.X (fldOffset=0x0)"
;* V77 tmp64        [V77    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V16.Y(offs=0x04) P-DEP "field V16.Y (fldOffset=0x4)"
;* V78 tmp65        [V78    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V16.Z(offs=0x08) P-DEP "field V16.Z (fldOffset=0x8)"
;* V79 tmp66        [V79    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V17.X(offs=0x00) P-DEP "field V17.X (fldOffset=0x0)"
;* V80 tmp67        [V80    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V17.Y(offs=0x04) P-DEP "field V17.Y (fldOffset=0x4)"
;* V81 tmp68        [V81    ] (  0,  0   )   float  ->  zero-ref    do-not-enreg[] V17.Z(offs=0x08) P-DEP "field V17.Z (fldOffset=0x8)"
;  V82 tmp69        [V82,T68] (  3,  1.50)   float  ->  [rsp+A0H]   do-not-enreg[] V19.X(offs=0x00) P-DEP "field V19.X (fldOffset=0x0)"
;  V83 tmp70        [V83,T69] (  3,  1.50)   float  ->  [rsp+A4H]   do-not-enreg[] V19.Y(offs=0x04) P-DEP "field V19.Y (fldOffset=0x4)"
;  V84 tmp71        [V84,T70] (  3,  1.50)   float  ->  [rsp+A8H]   do-not-enreg[] V19.Z(offs=0x08) P-DEP "field V19.Z (fldOffset=0x8)"
;  V85 tmp72        [V85,T53] (  2,  2   )   float  ->  [rsp+90H]   do-not-enreg[] V22.X(offs=0x00) P-DEP "field V22.X (fldOffset=0x0)"
;  V86 tmp73        [V86,T54] (  2,  2   )   float  ->  [rsp+94H]   do-not-enreg[] V22.Y(offs=0x04) P-DEP "field V22.Y (fldOffset=0x4)"
;  V87 tmp74        [V87,T55] (  2,  2   )   float  ->  [rsp+98H]   do-not-enreg[] V22.Z(offs=0x08) P-DEP "field V22.Z (fldOffset=0x8)"
;  V88 tmp75        [V88,T71] (  3,  1.50)   float  ->  mm0         V28.X(offs=0x00) P-INDEP "field V28.X (fldOffset=0x0)"
;  V89 tmp76        [V89,T72] (  3,  1.50)   float  ->  mm3         V28.Y(offs=0x04) P-INDEP "field V28.Y (fldOffset=0x4)"
;  V90 tmp77        [V90,T73] (  3,  1.50)   float  ->  mm5         V28.Z(offs=0x08) P-INDEP "field V28.Z (fldOffset=0x8)"
;  V91 tmp78        [V91,T20] (  3,  2   )   float  ->  [rsp+80H]   do-not-enreg[] V29.X(offs=0x00) P-DEP "field V29.X (fldOffset=0x0)"
;  V92 tmp79        [V92,T21] (  3,  2   )   float  ->  [rsp+84H]   do-not-enreg[] V29.Y(offs=0x04) P-DEP "field V29.Y (fldOffset=0x4)"
;  V93 tmp80        [V93,T22] (  3,  2   )   float  ->  [rsp+88H]   do-not-enreg[] V29.Z(offs=0x08) P-DEP "field V29.Z (fldOffset=0x8)"
;  V94 tmp81        [V94,T56] (  2,  2   )   float  ->  [rsp+70H]   do-not-enreg[] V32.X(offs=0x00) P-DEP "field V32.X (fldOffset=0x0)"
;  V95 tmp82        [V95,T57] (  2,  2   )   float  ->  [rsp+74H]   do-not-enreg[] V32.Y(offs=0x04) P-DEP "field V32.Y (fldOffset=0x4)"
;  V96 tmp83        [V96,T58] (  2,  2   )   float  ->  [rsp+78H]   do-not-enreg[] V32.Z(offs=0x08) P-DEP "field V32.Z (fldOffset=0x8)"
;  V97 tmp84        [V97,T23] (  3,  2   )   float  ->  [rsp+60H]   do-not-enreg[] V37.X(offs=0x00) P-DEP "field V37.X (fldOffset=0x0)"
;  V98 tmp85        [V98,T24] (  3,  2   )   float  ->  [rsp+64H]   do-not-enreg[] V37.Y(offs=0x04) P-DEP "field V37.Y (fldOffset=0x4)"
;  V99 tmp86        [V99,T25] (  3,  2   )   float  ->  [rsp+68H]   do-not-enreg[] V37.Z(offs=0x08) P-DEP "field V37.Z (fldOffset=0x8)"
;  V100 tmp87       [V100,T74] (  3,  1.50)   float  ->  mm3         V38.X(offs=0x00) P-INDEP "field V38.X (fldOffset=0x0)"
;  V101 tmp88       [V101,T75] (  3,  1.50)   float  ->  mm4         V38.Y(offs=0x04) P-INDEP "field V38.Y (fldOffset=0x4)"
;  V102 tmp89       [V102,T76] (  3,  1.50)   float  ->  mm5         V38.Z(offs=0x08) P-INDEP "field V38.Z (fldOffset=0x8)"
;  V103 tmp90       [V103,T59] (  2,  2   )   float  ->  [rsp+50H]   do-not-enreg[] V41.X(offs=0x00) P-DEP "field V41.X (fldOffset=0x0)"
;  V104 tmp91       [V104,T60] (  2,  2   )   float  ->  [rsp+54H]   do-not-enreg[] V41.Y(offs=0x04) P-DEP "field V41.Y (fldOffset=0x4)"
;  V105 tmp92       [V105,T61] (  2,  2   )   float  ->  [rsp+58H]   do-not-enreg[] V41.Z(offs=0x08) P-DEP "field V41.Z (fldOffset=0x8)"
;  V106 tmp93       [V106,T77] (  3,  1.50)   float  ->  mm4         V46.X(offs=0x00) P-INDEP "field V46.X (fldOffset=0x0)"
;  V107 tmp94       [V107,T78] (  3,  1.50)   float  ->  mm3         V46.Y(offs=0x04) P-INDEP "field V46.Y (fldOffset=0x4)"
;  V108 tmp95       [V108,T79] (  3,  1.50)   float  ->  mm5         V46.Z(offs=0x08) P-INDEP "field V46.Z (fldOffset=0x8)"
;  V109 tmp96       [V109,T26] (  3,  2   )   float  ->  [rsp+40H]   do-not-enreg[] V47.X(offs=0x00) P-DEP "field V47.X (fldOffset=0x0)"
;  V110 tmp97       [V110,T27] (  3,  2   )   float  ->  [rsp+44H]   do-not-enreg[] V47.Y(offs=0x04) P-DEP "field V47.Y (fldOffset=0x4)"
;  V111 tmp98       [V111,T28] (  3,  2   )   float  ->  [rsp+48H]   do-not-enreg[] V47.Z(offs=0x08) P-DEP "field V47.Z (fldOffset=0x8)"
;  V112 tmp99       [V112,T62] (  2,  2   )   float  ->  [rsp+30H]   do-not-enreg[] V50.X(offs=0x00) P-DEP "field V50.X (fldOffset=0x0)"
;  V113 tmp100      [V113,T63] (  2,  2   )   float  ->  [rsp+34H]   do-not-enreg[] V50.Y(offs=0x04) P-DEP "field V50.Y (fldOffset=0x4)"
;  V114 tmp101      [V114,T64] (  2,  2   )   float  ->  [rsp+38H]   do-not-enreg[] V50.Z(offs=0x08) P-DEP "field V50.Z (fldOffset=0x8)"
;  V115 tmp102      [V115,T29] (  3,  2   )   float  ->  [rsp+20H]   do-not-enreg[] V55.X(offs=0x00) P-DEP "field V55.X (fldOffset=0x0)"
;  V116 tmp103      [V116,T30] (  3,  2   )   float  ->  [rsp+24H]   do-not-enreg[] V55.Y(offs=0x04) P-DEP "field V55.Y (fldOffset=0x4)"
;  V117 tmp104      [V117,T31] (  3,  2   )   float  ->  [rsp+28H]   do-not-enreg[] V55.Z(offs=0x08) P-DEP "field V55.Z (fldOffset=0x8)"
;  V118 tmp105      [V118,T32] (  3,  2   )   float  ->  [rsp+10H]   do-not-enreg[] V56.X(offs=0x00) P-DEP "field V56.X (fldOffset=0x0)"
;  V119 tmp106      [V119,T33] (  3,  2   )   float  ->  [rsp+14H]   do-not-enreg[] V56.Y(offs=0x04) P-DEP "field V56.Y (fldOffset=0x4)"
;  V120 tmp107      [V120,T34] (  3,  2   )   float  ->  [rsp+18H]   do-not-enreg[] V56.Z(offs=0x08) P-DEP "field V56.Z (fldOffset=0x8)"
;  V121 tmp108      [V121,T65] (  2,  2   )   float  ->  [rsp+00H]   do-not-enreg[] V59.X(offs=0x00) P-DEP "field V59.X (fldOffset=0x0)"
;  V122 tmp109      [V122,T66] (  2,  2   )   float  ->  [rsp+04H]   do-not-enreg[] V59.Y(offs=0x04) P-DEP "field V59.Y (fldOffset=0x4)"
;  V123 tmp110      [V123,T67] (  2,  2   )   float  ->  [rsp+08H]   do-not-enreg[] V59.Z(offs=0x08) P-DEP "field V59.Z (fldOffset=0x8)"
;* V124 tmp111      [V124    ] (  0,  0   )  simd12  ->  zero-ref    "Promoted implicit byref"
;* V125 tmp112      [V125    ] (  0,  0   )  simd12  ->  zero-ref    "Promoted implicit byref"
;  V126 cse0        [V126,T17] (  3,  3   )  simd12  ->  mm1         "CSE - moderate"
;  V127 cse1        [V127,T13] (  4,  3.50)  simd12  ->  mm3         "CSE - moderate"
;  V128 cse2        [V128,T16] (  4,  3   )   float  ->  mm5         "CSE - moderate"
;
; Lcl frame size = 280

G_M27508_IG01:
       sub      rsp, 280
       vzeroupper 
       vmovaps  qword ptr [rsp+100H], xmm6
       vmovaps  qword ptr [rsp+F0H], xmm7
						;; bbWeight=1    PerfScore 7.25
G_M27508_IG02:
       vmovss   xmm0, dword ptr [rdx+8]
       vmovsd   xmm1, qword ptr [rdx]
       vshufps  xmm1, xmm0, 68
       vmovss   xmm0, dword ptr [r8+8]
       vmovsd   xmm2, qword ptr [r8]
       vshufps  xmm2, xmm0, 68
       vsubps   xmm0, xmm1, xmm2
       vdpps    xmm2, xmm0, xmm0, 113
       vmovss   xmm3, dword ptr [reloc @RWD32]
       vucomiss xmm3, xmm2
       jbe      SHORT G_M27508_IG04
						;; bbWeight=1    PerfScore 29.00
G_M27508_IG03:
       mov      rax, bword ptr [rsp+140H]
       vmovss   xmm0, dword ptr [rax+8]
       vmovsd   xmm2, qword ptr [rax]
       vshufps  xmm2, xmm0, 68
       vxorps   xmm0, xmm0, xmm0
       vsubps   xmm0, xmm0, xmm2
       jmp      SHORT G_M27508_IG05
						;; bbWeight=0.50 PerfScore 5.67
G_M27508_IG04:
       vmovapd  xmmword ptr [rsp+90H], xmm0
       vmovapd  xmm0, xmmword ptr [rsp+90H]
       vsqrtss  xmm2, xmm2
       vmovss   xmm3, dword ptr [reloc @RWD16]
       vdivss   xmm2, xmm3, xmm2
       vinsertps xmm2, xmm2, 14
       vshufps  xmm2, xmm2, 64
       vmulps   xmm0, xmm0, xmm2
						;; bbWeight=0.50 PerfScore 16.50
G_M27508_IG05:
       vmovss   xmm2, dword ptr [r9+8]
       vmovsd   xmm3, qword ptr [r9]
       vshufps  xmm3, xmm2, 68
       vmovaps  xmm2, xmm3
       vdpps    xmm4, xmm3, xmm0, 113
       vandps   xmm4, xmm4, dword ptr [reloc @RWD48]
       vmovss   xmm5, dword ptr [reloc @RWD64]
       vucomiss xmm4, xmm5
       jbe      G_M27508_IG10
						;; bbWeight=1    PerfScore 23.25
G_M27508_IG06:
       mov      rax, bword ptr [rsp+148H]
       vmovss   xmm0, dword ptr [rax+8]
       vmovsd   xmm4, qword ptr [rax]
       vshufps  xmm4, xmm0, 68
       vdpps    xmm0, xmm3, xmm4, 113
       vandps   xmm0, xmm0, dword ptr [reloc @RWD48]
       vucomiss xmm0, xmm5
       jbe      SHORT G_M27508_IG09
       vmovss   xmm4, dword ptr [r9+8]
       vandps   xmm4, xmm4, dword ptr [reloc @RWD48]
       vucomiss xmm4, xmm5
       ja       SHORT G_M27508_IG07
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       vmovapd  xmmword ptr [rsp+A0H], xmm0
       jmp      SHORT G_M27508_IG08
						;; bbWeight=0.50 PerfScore 17.50
G_M27508_IG07:
       vmovupd  xmm0, xmmword ptr [reloc @RWD16]
       vmovapd  xmmword ptr [rsp+A0H], xmm0
						;; bbWeight=0.50 PerfScore 2.50
G_M27508_IG08:
       vmovapd  xmm4, xmmword ptr [rsp+A0H]
						;; bbWeight=0.50 PerfScore 1.50
G_M27508_IG09:
       vmovss   xmm0, dword ptr [r9]
       vmovss   xmm3, dword ptr [r9+4]
       vmovss   xmm5, dword ptr [r9+8]
       vmovapd  xmmword ptr [rsp+80H], xmm4
       vmulss   xmm4, xmm3, dword ptr [rsp+88H]
       vmulss   xmm6, xmm5, dword ptr [rsp+84H]
       vsubss   xmm4, xmm4, xmm6
       vmulss   xmm5, xmm5, dword ptr [rsp+80H]
       vmulss   xmm6, xmm0, dword ptr [rsp+88H]
       vsubss   xmm5, xmm5, xmm6
       vmulss   xmm0, xmm0, dword ptr [rsp+84H]
       vmulss   xmm3, xmm3, dword ptr [rsp+80H]
       vsubss   xmm0, xmm0, xmm3
       vxorps   xmm3, xmm3
       vmovss   xmm3, xmm3, xmm0
       vpslldq  xmm3, 4
       vmovss   xmm3, xmm3, xmm5
       vpslldq  xmm3, 4
       vmovss   xmm3, xmm3, xmm4
       vmovaps  xmm0, xmm3
       vmovapd  xmmword ptr [rsp+70H], xmm0
       vdpps    xmm0, xmm0, xmm0, 113
       vmovapd  xmm3, xmmword ptr [rsp+70H]
       vsqrtss  xmm0, xmm0
       vinsertps xmm0, xmm0, 14
       vshufps  xmm0, xmm0, 64
       vdivps   xmm0, xmm3, xmm0
       vpslldq  xmm0, xmm0, 4
       vpsrldq  xmm0, xmm0, 4
       vmovapd  xmmword ptr [rsp+60H], xmm0
       vmovss   xmm3, dword ptr [r9]
       vmovss   xmm4, dword ptr [r9+4]
       vmovss   xmm5, dword ptr [r9+8]
       vmulss   xmm6, xmm5, dword ptr [rsp+64H]
       vmulss   xmm7, xmm4, dword ptr [rsp+68H]
       vsubss   xmm6, xmm6, xmm7
       vmulss   xmm7, xmm3, dword ptr [rsp+68H]
       vmulss   xmm5, xmm5, dword ptr [rsp+60H]
       vsubss   xmm5, xmm7, xmm5
       vmulss   xmm4, xmm4, dword ptr [rsp+60H]
       vmulss   xmm3, xmm3, dword ptr [rsp+64H]
       vsubss   xmm3, xmm4, xmm3
       vxorps   xmm4, xmm4
       vmovss   xmm4, xmm4, xmm3
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm5
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm6
       vmovaps  xmm3, xmm4
       vmovapd  xmmword ptr [rsp+50H], xmm3
       vdpps    xmm3, xmm3, xmm3, 113
       vmovapd  xmm4, xmmword ptr [rsp+50H]
       vsqrtss  xmm3, xmm3
       vinsertps xmm3, xmm3, 14
       vshufps  xmm3, xmm3, 64
       vdivps   xmm3, xmm4, xmm3
       vpslldq  xmm3, xmm3, 4
       vpsrldq  xmm4, xmm3, 4
       jmp      G_M27508_IG11
						;; bbWeight=0.50 PerfScore 95.58
G_M27508_IG10:
       vmovss   xmm4, dword ptr [r9]
       vmovss   xmm3, dword ptr [r9+4]
       vmovss   xmm5, dword ptr [r9+8]
       vmovapd  xmmword ptr [rsp+40H], xmm0
       vmulss   xmm0, xmm3, dword ptr [rsp+48H]
       vmulss   xmm6, xmm5, dword ptr [rsp+44H]
       vsubss   xmm0, xmm0, xmm6
       vmulss   xmm5, xmm5, dword ptr [rsp+40H]
       vmulss   xmm6, xmm4, dword ptr [rsp+48H]
       vsubss   xmm5, xmm5, xmm6
       vmulss   xmm4, xmm4, dword ptr [rsp+44H]
       vmulss   xmm3, xmm3, dword ptr [rsp+40H]
       vsubss   xmm3, xmm4, xmm3
       vxorps   xmm4, xmm4
       vmovss   xmm4, xmm4, xmm3
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm5
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm0
       vmovaps  xmm0, xmm4
       vmovapd  xmmword ptr [rsp+30H], xmm0
       vdpps    xmm0, xmm0, xmm0, 113
       vmovapd  xmm3, xmmword ptr [rsp+30H]
       vsqrtss  xmm0, xmm0
       vinsertps xmm0, xmm0, 14
       vshufps  xmm0, xmm0, 64
       vdivps   xmm0, xmm3, xmm0
       vpslldq  xmm0, xmm0, 4
       vpsrldq  xmm0, xmm0, 4
       vmovapd  xmmword ptr [rsp+20H], xmm0
       vmovapd  xmmword ptr [rsp+10H], xmm2
       vmovss   xmm3, dword ptr [rsp+24H]
       vmulss   xmm3, xmm3, dword ptr [rsp+18H]
       vmovss   xmm4, dword ptr [rsp+28H]
       vmulss   xmm4, xmm4, dword ptr [rsp+14H]
       vsubss   xmm3, xmm3, xmm4
       vmovss   xmm4, dword ptr [rsp+28H]
       vmulss   xmm4, xmm4, dword ptr [rsp+10H]
       vmovss   xmm5, dword ptr [rsp+20H]
       vmulss   xmm5, xmm5, dword ptr [rsp+18H]
       vsubss   xmm4, xmm4, xmm5
       vmovss   xmm5, dword ptr [rsp+20H]
       vmulss   xmm5, xmm5, dword ptr [rsp+14H]
       vmovss   xmm6, dword ptr [rsp+24H]
       vmulss   xmm6, xmm6, dword ptr [rsp+10H]
       vsubss   xmm5, xmm5, xmm6
       vxorps   xmm6, xmm6
       vmovss   xmm6, xmm6, xmm5
       vpslldq  xmm6, 4
       vmovss   xmm6, xmm6, xmm4
       vpslldq  xmm6, 4
       vmovss   xmm6, xmm6, xmm3
       vmovaps  xmm3, xmm6
       vmovapd  xmmword ptr [rsp], xmm3
       vdpps    xmm3, xmm3, xmm3, 113
       vmovapd  xmm4, xmmword ptr [rsp]
       vsqrtss  xmm3, xmm3
       vinsertps xmm3, xmm3, 14
       vshufps  xmm3, xmm3, 64
       vdivps   xmm3, xmm4, xmm3
       vpslldq  xmm3, xmm3, 4
       vpsrldq  xmm4, xmm3, 4
						;; bbWeight=0.50 PerfScore 98.58
G_M27508_IG11:
       vmovsd   qword ptr [rsp+B0H], xmm0
       vpshufd  xmm3, xmm0, 2
       vmovss   dword ptr [rsp+B8H], xmm3
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+BCH], xmm0
       vmovsd   qword ptr [rsp+C0H], xmm2
       vpshufd  xmm0, xmm2, 2
       vmovss   dword ptr [rsp+C8H], xmm0
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+CCH], xmm0
       vmovsd   qword ptr [rsp+D0H], xmm4
       vpshufd  xmm0, xmm4, 2
       vmovss   dword ptr [rsp+D8H], xmm0
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+DCH], xmm0
       vmovsd   qword ptr [rsp+E0H], xmm1
       vpshufd  xmm0, xmm1, 2
       vmovss   dword ptr [rsp+E8H], xmm0
       vmovss   xmm0, dword ptr [reloc @RWD16]
       vmovss   dword ptr [rsp+ECH], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+B0H]
       vmovdqu  xmmword ptr [rcx], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+C0H]
       vmovdqu  xmmword ptr [rcx+16], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+D0H]
       vmovdqu  xmmword ptr [rcx+32], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+E0H]
       vmovdqu  xmmword ptr [rcx+48], xmm0
       mov      rax, rcx
						;; bbWeight=1    PerfScore 21.25
G_M27508_IG12:
       vmovaps  xmm6, qword ptr [rsp+100H]
       vmovaps  xmm7, qword ptr [rsp+F0H]
       add      rsp, 280
       ret      
						;; bbWeight=1    PerfScore 9.25
RWD00  	dd	00000000h		;         0
	dd	00000000h		;         0
	dd	BF800000h		;        -1
	dd	00000000h		;         0
RWD16  	dd	3F800000h		;         1
	dd	00000000h		;         0
	dd	00000000h		;         0
	dd	00000000h		;         0
RWD32  	dd	38D1B717h		;    0.0001
RWD36  	dd	00000000h, 00000000h, 00000000h
RWD48  	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
RWD64  	dd	3F7F8D9Eh		;  0.998255


; Total bytes of code 1195, prolog size 28, PerfScore 464.43, instruction count 211, allocated bytes for code 1366 (MethodHash=17ab948b) for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x1E
  CountOfUnwindCodes: 6
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x1E UnwindOp: UWOP_SAVE_XMM128 (8)     OpInfo: XMM7 (7)
      Scaled Small Offset: 15 * 16 = 240 = 0x000F0
    CodeOffset: 0x14 UnwindOp: UWOP_SAVE_XMM128 (8)     OpInfo: XMM6 (6)
      Scaled Small Offset: 16 * 16 = 256 = 0x00100
    CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_LARGE (1)     OpInfo: 0 - Scaled small  
      Size: 35 * 8 = 280 = 0x00118

After

; Assembly listing for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 27 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T04] (  4,  4   )   byref  ->  rcx        
;  V01 arg0         [V01,T02] (  3,  6   )   byref  ->  rdx        
;  V02 arg1         [V02,T03] (  3,  6   )   byref  ->   r8        
;  V03 arg2         [V03,T01] (  4,  7   )   byref  ->   r9        
;  V04 arg3         [V04,T05] (  1,  1   )   byref  ->  [rsp+A0H]  
;  V05 arg4         [V05,T06] (  1,  1   )   byref  ->  [rsp+A8H]  
;  V06 loc0         [V06,T10] (  8,  6   )  simd12  ->  mm0         ld-addr-op
;  V07 loc1         [V07,T25] (  3,  2.50)   float  ->  mm2        
;  V08 loc2         [V08,T26] (  2,  2   )  simd12  ->  mm2        
;  V09 loc3         [V09,T07] ( 15,  8   )  simd12  ->  registers  
;  V10 loc4         [V10,T19] (  7,  4   )  simd12  ->  registers  
;  V11 loc5         [V11,T22] (  4,  3   )   float  ->  registers  
;  V12 loc6         [V12,T00] (  9,  9   )  struct (64) [rsp+00H]   do-not-enreg[SFB] ld-addr-op
;# V13 OutArgs      [V13    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V14 tmp1         [V14    ] (  0,  0   )  simd12  ->  zero-ref    "struct address for call/obj"
;* V15 tmp2         [V15    ] (  0,  0   )  simd12  ->  zero-ref    "struct address for call/obj"
;* V16 tmp3         [V16    ] (  0,  0   )  simd12  ->  zero-ref    "struct address for call/obj"
;* V17 tmp4         [V17    ] (  0,  0   )  simd12  ->  zero-ref    "struct address for call/obj"
;  V18 tmp5         [V18,T27] (  2,  2   )  simd12  ->  mm4         "NewObj constructor temp"
;  V19 tmp6         [V19,T36] (  3,  1.50)  simd12  ->  mm4        
;  V20 tmp7         [V20,T28] (  2,  2   )  simd12  ->  mm4         "NewObj constructor temp"
;  V21 tmp8         [V21,T29] (  2,  2   )  simd12  ->  mm2         "Inlining Arg"
;* V22 tmp9         [V22    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V23 tmp10        [V23    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;* V24 tmp11        [V24    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V25 tmp12        [V25    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V26 tmp13        [V26,T30] (  2,  2   )  simd12  ->  mm2         "NewObj constructor temp"
;  V27 tmp14        [V27,T31] (  2,  2   )   float  ->  mm4         "Inlining Arg"
;  V28 tmp15        [V28,T20] (  4,  4   )  simd12  ->  mm3         "Inlining Arg"
;  V29 tmp16        [V29,T08] (  7,  7   )  simd12  ->  mm4         "Inlining Arg"
;  V30 tmp17        [V30,T15] (  4,  4   )  simd12  ->  mm4         "NewObj constructor temp"
;* V31 tmp18        [V31,T41] (  0,  0   )  simd12  ->  zero-ref    ld-addr-op "Inlining Arg"
;* V32 tmp19        [V32    ] (  0,  0   )  simd12  ->  zero-ref    "impAppendStmt"
;  V33 tmp20        [V33,T37] (  2,  1   )   float  ->  mm5         "Inline stloc first use temp"
;* V34 tmp21        [V34    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V35 tmp22        [V35    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V36 tmp23        [V36,T32] (  2,  2   )  simd12  ->  mm5         "NewObj constructor temp"
;* V37 tmp24        [V37,T42] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V38 tmp25        [V38,T43] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;  V39 tmp26        [V39,T16] (  4,  4   )  simd12  ->  mm0         "NewObj constructor temp"
;* V40 tmp27        [V40,T44] (  0,  0   )  simd12  ->  zero-ref    ld-addr-op "Inlining Arg"
;* V41 tmp28        [V41    ] (  0,  0   )  simd12  ->  zero-ref    "impAppendStmt"
;  V42 tmp29        [V42,T38] (  2,  1   )   float  ->  mm3         "Inline stloc first use temp"
;* V43 tmp30        [V43    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V44 tmp31        [V44    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V45 tmp32        [V45,T33] (  2,  2   )  simd12  ->  mm3         "NewObj constructor temp"
;  V46 tmp33        [V46,T21] (  4,  4   )  simd12  ->  mm3         "Inlining Arg"
;  V47 tmp34        [V47,T09] (  7,  7   )  simd12  ->  mm0         "Inlining Arg"
;  V48 tmp35        [V48,T17] (  4,  4   )  simd12  ->  mm0         "NewObj constructor temp"
;* V49 tmp36        [V49,T45] (  0,  0   )  simd12  ->  zero-ref    ld-addr-op "Inlining Arg"
;* V50 tmp37        [V50    ] (  0,  0   )  simd12  ->  zero-ref    "impAppendStmt"
;  V51 tmp38        [V51,T39] (  2,  1   )   float  ->  mm5         "Inline stloc first use temp"
;* V52 tmp39        [V52    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V53 tmp40        [V53    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V54 tmp41        [V54,T34] (  2,  2   )  simd12  ->  mm5         "NewObj constructor temp"
;* V55 tmp42        [V55,T46] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V56 tmp43        [V56,T47] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;  V57 tmp44        [V57,T18] (  4,  4   )  simd12  ->  mm3         "NewObj constructor temp"
;* V58 tmp45        [V58,T48] (  0,  0   )  simd12  ->  zero-ref    ld-addr-op "Inlining Arg"
;* V59 tmp46        [V59    ] (  0,  0   )  simd12  ->  zero-ref    "impAppendStmt"
;  V60 tmp47        [V60,T40] (  2,  1   )   float  ->  mm4         "Inline stloc first use temp"
;* V61 tmp48        [V61    ] (  0,  0   )  simd12  ->  zero-ref    "Inlining Arg"
;* V62 tmp49        [V62    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
;  V63 tmp50        [V63,T35] (  2,  2   )  simd12  ->  mm4         "NewObj constructor temp"
;  V64 cse0         [V64,T24] (  3,  3   )  simd12  ->  mm1         "CSE - moderate"
;  V65 cse1         [V65,T14] (  6,  4.50)  simd12  ->  mm3         "CSE - moderate"
;  V66 cse2         [V66,T11] ( 10,  5   )   float  ->  mm6         "CSE - moderate"
;  V67 cse3         [V67,T12] ( 10,  5   )   float  ->  mm3         "CSE - moderate"
;  V68 cse4         [V68,T13] ( 10,  5   )   float  ->  registers   "CSE - moderate"
;  V69 cse5         [V69,T23] (  4,  3   )   float  ->  mm5         "CSE - moderate"
;
; Lcl frame size = 120

G_M27508_IG01:
       sub      rsp, 120
       vzeroupper 
       vmovaps  qword ptr [rsp+60H], xmm6
       vmovaps  qword ptr [rsp+50H], xmm7
       vmovaps  qword ptr [rsp+40H], xmm8
						;; bbWeight=1    PerfScore 10.25
G_M27508_IG02:
       vmovss   xmm0, dword ptr [rdx+8]
       vmovsd   xmm1, qword ptr [rdx]
       vshufps  xmm1, xmm0, 68
       vmovss   xmm0, dword ptr [r8+8]
       vmovsd   xmm2, qword ptr [r8]
       vshufps  xmm2, xmm0, 68
       vsubps   xmm0, xmm1, xmm2
       vdpps    xmm2, xmm0, xmm0, 113
       vmovss   xmm3, dword ptr [reloc @RWD32]
       vucomiss xmm3, xmm2
       jbe      SHORT G_M27508_IG04
						;; bbWeight=1    PerfScore 29.00
G_M27508_IG03:
       mov      rax, bword ptr [rsp+A0H]
       vmovss   xmm0, dword ptr [rax+8]
       vmovsd   xmm2, qword ptr [rax]
       vshufps  xmm2, xmm0, 68
       vxorps   xmm0, xmm0, xmm0
       vsubps   xmm0, xmm0, xmm2
       jmp      SHORT G_M27508_IG05
						;; bbWeight=0.50 PerfScore 5.67
G_M27508_IG04:
       vsqrtss  xmm2, xmm2
       vmovss   xmm3, dword ptr [reloc @RWD16]
       vdivss   xmm2, xmm3, xmm2
       vinsertps xmm2, xmm2, 14
       vshufps  xmm2, xmm2, 64
       vmulps   xmm0, xmm0, xmm2
						;; bbWeight=0.50 PerfScore 14.00
G_M27508_IG05:
       vmovss   xmm2, dword ptr [r9+8]
       vmovsd   xmm3, qword ptr [r9]
       vshufps  xmm3, xmm2, 68
       vmovaps  xmm2, xmm3
       vdpps    xmm4, xmm3, xmm0, 113
       vandps   xmm4, xmm4, dword ptr [reloc @RWD48]
       vmovss   xmm5, dword ptr [reloc @RWD64]
       vucomiss xmm4, xmm5
       jbe      G_M27508_IG09
						;; bbWeight=1    PerfScore 23.25
G_M27508_IG06:
       mov      rax, bword ptr [rsp+A8H]
       vmovss   xmm0, dword ptr [rax+8]
       vmovsd   xmm4, qword ptr [rax]
       vshufps  xmm4, xmm0, 68
       vdpps    xmm0, xmm3, xmm4, 113
       vandps   xmm0, xmm0, dword ptr [reloc @RWD48]
       vucomiss xmm0, xmm5
       jbe      SHORT G_M27508_IG08
       vmovss   xmm4, dword ptr [r9+8]
       vandps   xmm4, xmm4, dword ptr [reloc @RWD48]
       vucomiss xmm4, xmm5
       ja       SHORT G_M27508_IG07
       vmovupd  xmm4, xmmword ptr [reloc @RWD00]
       jmp      SHORT G_M27508_IG08
						;; bbWeight=0.50 PerfScore 16.50
G_M27508_IG07:
       vmovupd  xmm4, xmmword ptr [reloc @RWD16]
						;; bbWeight=0.50 PerfScore 1.50
G_M27508_IG08:
       vmovaps  xmm0, xmm3
       vpsrldq  xmm0, 4
       vmovaps  xmm5, xmm4
       vpsrldq  xmm5, 8
       vmulss   xmm5, xmm0, xmm5
       vmovaps  xmm6, xmm3
       vpsrldq  xmm6, 8
       vmovaps  xmm7, xmm4
       vpsrldq  xmm7, 4
       vmulss   xmm7, xmm6, xmm7
       vsubss   xmm5, xmm5, xmm7
       vmovaps  xmm7, xmm4
       vmulss   xmm7, xmm6, xmm7
       vmovaps  xmm8, xmm4
       vpsrldq  xmm8, 8
       vmulss   xmm8, xmm3, xmm8
       vsubss   xmm7, xmm7, xmm8
       vmovaps  xmm8, xmm4
       vpsrldq  xmm8, 4
       vmulss   xmm8, xmm3, xmm8
       vmulss   xmm4, xmm0, xmm4
       vsubss   xmm4, xmm8, xmm4
       vxorps   xmm8, xmm8
       vmovss   xmm8, xmm8, xmm4
       vpslldq  xmm8, 4
       vmovss   xmm8, xmm8, xmm7
       vpslldq  xmm8, 4
       vmovss   xmm8, xmm8, xmm5
       vmovaps  xmm4, xmm8
       vdpps    xmm5, xmm4, xmm4, 113
       vsqrtss  xmm5, xmm5
       vinsertps xmm5, xmm5, 14
       vshufps  xmm5, xmm5, 64
       vdivps   xmm4, xmm4, xmm5
       vpslldq  xmm4, xmm4, 4
       vpsrldq  xmm4, xmm4, 4
       vmovaps  xmm5, xmm4
       vpsrldq  xmm5, 4
       vmulss   xmm5, xmm5, xmm6
       vmovaps  xmm7, xmm4
       vpsrldq  xmm7, 8
       vmulss   xmm7, xmm7, xmm0
       vsubss   xmm5, xmm5, xmm7
       vmovaps  xmm7, xmm4
       vpsrldq  xmm7, 8
       vmulss   xmm7, xmm7, xmm3
       vmovaps  xmm8, xmm4
       vmulss   xmm6, xmm8, xmm6
       vsubss   xmm6, xmm7, xmm6
       vmovaps  xmm7, xmm4
       vmulss   xmm0, xmm7, xmm0
       vmovaps  xmm7, xmm4
       vpsrldq  xmm7, 4
       vmulss   xmm3, xmm7, xmm3
       vsubss   xmm0, xmm0, xmm3
       vxorps   xmm3, xmm3
       vmovss   xmm3, xmm3, xmm0
       vpslldq  xmm3, 4
       vmovss   xmm3, xmm3, xmm6
       vpslldq  xmm3, 4
       vmovss   xmm3, xmm3, xmm5
       vmovaps  xmm0, xmm3
       vdpps    xmm3, xmm0, xmm0, 113
       vsqrtss  xmm3, xmm3
       vinsertps xmm3, xmm3, 14
       vshufps  xmm3, xmm3, 64
       vdivps   xmm0, xmm0, xmm3
       vpslldq  xmm0, xmm0, 4
       vpsrldq  xmm0, xmm0, 4
       jmp      G_M27508_IG10
						;; bbWeight=0.50 PerfScore 77.21
G_M27508_IG09:
       vmovaps  xmm4, xmm3
       vpsrldq  xmm4, 4
       vmovaps  xmm5, xmm0
       vpsrldq  xmm5, 8
       vmulss   xmm5, xmm4, xmm5
       vmovaps  xmm6, xmm3
       vpsrldq  xmm6, 8
       vmovaps  xmm7, xmm0
       vpsrldq  xmm7, 4
       vmulss   xmm7, xmm6, xmm7
       vsubss   xmm5, xmm5, xmm7
       vmovaps  xmm7, xmm0
       vmulss   xmm7, xmm6, xmm7
       vmovaps  xmm8, xmm0
       vpsrldq  xmm8, 8
       vmulss   xmm8, xmm3, xmm8
       vsubss   xmm7, xmm7, xmm8
       vmovaps  xmm8, xmm0
       vpsrldq  xmm8, 4
       vmulss   xmm8, xmm3, xmm8
       vmulss   xmm0, xmm4, xmm0
       vsubss   xmm0, xmm8, xmm0
       vxorps   xmm8, xmm8
       vmovss   xmm8, xmm8, xmm0
       vpslldq  xmm8, 4
       vmovss   xmm8, xmm8, xmm7
       vpslldq  xmm8, 4
       vmovss   xmm8, xmm8, xmm5
       vmovaps  xmm0, xmm8
       vdpps    xmm5, xmm0, xmm0, 113
       vsqrtss  xmm5, xmm5
       vinsertps xmm5, xmm5, 14
       vshufps  xmm5, xmm5, 64
       vdivps   xmm0, xmm0, xmm5
       vpslldq  xmm0, xmm0, 4
       vpsrldq  xmm0, xmm0, 4
       vmovaps  xmm5, xmm0
       vpsrldq  xmm5, 4
       vmulss   xmm5, xmm5, xmm6
       vmovaps  xmm7, xmm0
       vpsrldq  xmm7, 8
       vmulss   xmm7, xmm7, xmm4
       vsubss   xmm5, xmm5, xmm7
       vmovaps  xmm7, xmm0
       vpsrldq  xmm7, 8
       vmulss   xmm7, xmm7, xmm3
       vmovaps  xmm8, xmm0
       vmulss   xmm6, xmm8, xmm6
       vsubss   xmm6, xmm7, xmm6
       vmovaps  xmm7, xmm0
       vmulss   xmm4, xmm7, xmm4
       vmovaps  xmm7, xmm0
       vpsrldq  xmm7, 4
       vmulss   xmm3, xmm7, xmm3
       vsubss   xmm3, xmm4, xmm3
       vxorps   xmm4, xmm4
       vmovss   xmm4, xmm4, xmm3
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm6
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm5
       vmovaps  xmm3, xmm4
       vdpps    xmm4, xmm3, xmm3, 113
       vsqrtss  xmm4, xmm4
       vinsertps xmm4, xmm4, 14
       vshufps  xmm4, xmm4, 64
       vdivps   xmm3, xmm3, xmm4
       vpslldq  xmm3, xmm3, 4
       vpsrldq  xmm3, xmm3, 4
       vmovaps  xmm4, xmm0
       vmovaps  xmm0, xmm3
						;; bbWeight=0.50 PerfScore 76.46
G_M27508_IG10:
       vmovsd   qword ptr [rsp], xmm4
       vpshufd  xmm3, xmm4, 2
       vmovss   dword ptr [rsp+08H], xmm3
       vxorps   xmm3, xmm3
       vmovss   dword ptr [rsp+0CH], xmm3
       vmovsd   qword ptr [rsp+10H], xmm2
       vpshufd  xmm3, xmm2, 2
       vmovss   dword ptr [rsp+18H], xmm3
       vxorps   xmm2, xmm2
       vmovss   dword ptr [rsp+1CH], xmm2
       vmovsd   qword ptr [rsp+20H], xmm0
       vpshufd  xmm2, xmm0, 2
       vmovss   dword ptr [rsp+28H], xmm2
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+2CH], xmm0
       vmovsd   qword ptr [rsp+30H], xmm1
       vpshufd  xmm0, xmm1, 2
       vmovss   dword ptr [rsp+38H], xmm0
       vmovss   xmm0, dword ptr [reloc @RWD16]
       vmovss   dword ptr [rsp+3CH], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp]
       vmovdqu  xmmword ptr [rcx], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+10H]
       vmovdqu  xmmword ptr [rcx+16], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+20H]
       vmovdqu  xmmword ptr [rcx+32], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+30H]
       vmovdqu  xmmword ptr [rcx+48], xmm0
       mov      rax, rcx
						;; bbWeight=1    PerfScore 21.25
G_M27508_IG11:
       vmovaps  xmm6, qword ptr [rsp+60H]
       vmovaps  xmm7, qword ptr [rsp+50H]
       vmovaps  xmm8, qword ptr [rsp+40H]
       add      rsp, 120
       ret      
						;; bbWeight=1    PerfScore 13.25
RWD00  	dd	00000000h		;         0
	dd	00000000h		;         0
	dd	BF800000h		;        -1
	dd	00000000h		;         0
RWD16  	dd	3F800000h		;         1
	dd	00000000h		;         0
	dd	00000000h		;         0
	dd	00000000h		;         0
RWD32  	dd	38D1B717h		;    0.0001
RWD36  	dd	00000000h, 00000000h, 00000000h
RWD48  	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
	dd	7FFFFFFFh		;       nan
RWD64  	dd	3F7F8D9Eh		;  0.998255


; Total bytes of code 1092, prolog size 25, PerfScore 415.43, instruction count 228, allocated bytes for code 1271 (MethodHash=17ab948b) for method Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x1C
  CountOfUnwindCodes: 7
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x1C UnwindOp: UWOP_SAVE_XMM128 (8)     OpInfo: XMM8 (8)
      Scaled Small Offset: 4 * 16 = 64 = 0x00040
    CodeOffset: 0x15 UnwindOp: UWOP_SAVE_XMM128 (8)     OpInfo: XMM7 (7)
      Scaled Small Offset: 5 * 16 = 80 = 0x00050
    CodeOffset: 0x0E UnwindOp: UWOP_SAVE_XMM128 (8)     OpInfo: XMM6 (6)
      Scaled Small Offset: 6 * 16 = 96 = 0x00060
    CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 14 * 8 + 8 = 120 = 0x78

@tannergooding
Member Author

The regression in Matrix4x4:Decompose is because the method is not actually SIMD "aware" and instead effectively just does scalar math. We are generating less than ideal code for the intrinsics that are still implemented via the "legacy" SIMD path (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdintrinsiclist.h), rather than being implemented via SimdAsHWIntrinsic (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsiclistxarch.h).

In particular SIMDIntrinsicGetItem is not "ideal" because it does everything according to SSE2 and doesn't take advantage of SSE4.1 or newer instructions where available: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simd.cpp#L2280-L2319 and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdcodegenxarch.cpp#L1608-L1892. This not only leads to code bloat, but also leads to worse performance on modern hardware.
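As a rough illustration of the SSE2-only pattern described above (it mirrors the `vmovaps` + `vpsrldq` pairs in the disassembly), the sketch below reads one lane of a 4-float vector by shifting the whole register right and then reading element 0. `GetLane` and `SKETCH_HAS_SSE2` are hypothetical names for this example, not JIT code. SSE4.1 adds dedicated lane instructions such as `extractps` and `insertps`, which is the kind of improvement the comment refers to.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper (not part of the JIT): read lane Index of four packed floats.
template <int Index>
float GetLane(const float (&values)[4])
{
    static_assert(Index >= 0 && Index < 4, "lane index must be 0..3");
#if SKETCH_HAS_SSE2
    __m128 v = _mm_loadu_ps(values);
    // psrldq: shift the full 128-bit register right by Index * 4 bytes,
    // which moves the requested lane into element 0.
    __m128i shifted = _mm_srli_si128(_mm_castps_si128(v), Index * 4);
    // Read element 0 as a scalar float.
    return _mm_cvtss_f32(_mm_castsi128_ps(shifted));
#else
    // Portable fallback so the sketch still builds on non-x86 targets.
    return values[Index];
#endif
}
```

For example, with `const float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};`, calling `GetLane<1>(v)` copies and shifts the full register just to read a single float, which is the overhead visible in the listing.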

I believe these few regressions are acceptable and should be easily fixed by migrating the remaining functions to the newer SimdAsHWIntrinsic support that most of the other functions already use.

@tannergooding
Member Author

CC. @sandreenko, @echesakovMSFT

Comment on lines +18961 to +18974
else if (op->OperIs(GT_OBJ))
{
    GenTree* addr = op->AsIndir()->Addr();

    if (addr->OperIs(GT_ADDR))
    {
        setLclRelatedToSIMDIntrinsic(op->AsOp()->gtOp1->AsOp()->gtOp1);
        GenTree* addrOp1 = addr->AsOp()->gtGetOp1();

        if (addrOp1->OperIsLocal())
        {
            setLclRelatedToSIMDIntrinsic(addrOp1);
        }
Member Author

It isn't clear why we were only checking for OBJ(ADDR(LCL)) here. We also sometimes generate BLK(ADDR(LCL)) or other nodes and so something like the following seems like it might be a better choice, given that it will cover all indirections over locals:

else if (op->OperIsIndir())
{
    GenTree* lcl = op->AsIndir()->Addr()->IsLocalAddrExpr();

    if (lcl != nullptr)
    {
        setLclRelatedToSIMDIntrinsic(lcl);
    }
}
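
To make the difference concrete, here is a small self-contained model of the idea, assuming a toy tree type rather than the JIT's real `GenTree`. All names in it (`Node`, `Oper`, `IsIndir`, `LocalBehindAddr`, `MarkIfIndirOverLocal`) are hypothetical, and the address check only recognizes the plain ADDR(LCL) shape, which is narrower than what a real local-address helper would accept.

```cpp
#include <cassert>
#include <set>

// Toy stand-ins for JIT tree nodes; every name here is hypothetical.
enum class Oper { LclVar, Addr, Obj, Blk, Ind, Add };

struct Node
{
    Oper  oper;
    Node* op1;    // single operand (nullptr for leaves)
    int   lclNum; // meaningful only when oper == Oper::LclVar
};

// True for any node that reads or writes through an address.
bool IsIndir(const Node& node)
{
    return node.oper == Oper::Obj || node.oper == Oper::Blk || node.oper == Oper::Ind;
}

// Returns the local behind ADDR(LCL), or nullptr when the address has any other shape.
const Node* LocalBehindAddr(const Node* addr)
{
    if (addr != nullptr && addr->oper == Oper::Addr && addr->op1 != nullptr &&
        addr->op1->oper == Oper::LclVar)
    {
        return addr->op1;
    }
    return nullptr;
}

// Records the local as SIMD-related for every indirection kind, not only OBJ.
void MarkIfIndirOverLocal(const Node& op, std::set<int>& simdRelatedLocals)
{
    if (!IsIndir(op))
    {
        return;
    }

    const Node* lcl = LocalBehindAddr(op.op1);

    if (lcl != nullptr)
    {
        simdRelatedLocals.insert(lcl->lclNum);
    }
}
```

In this model, both OBJ(ADDR(LCL)) and BLK(ADDR(LCL)) mark their local, while an indirection over a non-local address leaves the set unchanged.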

Member Author

We also don't seem to be very consistent about what we check before calling setLclRelatedToSIMDIntrinsic. The paths that create SIMD or HWINTRINSIC nodes all call SetOpLclRelatedToSIMDIntrinsic, while several other paths call setLclRelatedToSIMDIntrinsic directly.

The paths that call setLclRelatedToSIMDIntrinsic directly vary in what they check first: some simply check OperIsLocal, some check for specific kinds of indirections, and some check all indirections. It would be nice (assuming nothing is blocking it) if all of those checks could be handled centrally here so that they are consistent.

@@ -4927,6 +4927,9 @@ struct GenTreeJitIntrinsic : public GenTreeOp
GenTreeJitIntrinsic(
genTreeOps oper, var_types type, GenTree* op1, GenTree* op2, CorInfoType simdBaseJitType, unsigned simdSize)
: GenTreeOp(oper, type, op1, op2)
, gtLayout(nullptr)
, gtAuxiliaryJitType(CORINFO_TYPE_UNDEF)
Member Author

This is needed due to an assert encountered while running PMI diffs; it's related to the refactoring that happened in #50832.
If these aren't explicitly zeroed, they may be 0xDD in debug builds and cause assertions when calling the helper functions used in hashing the trees.
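
A minimal sketch of why the explicit initializers matter, assuming a simplified stand-in type: `IntrinsicNodeModel`, its members, and `HashNode` are hypothetical and only mimic the shape of the real node. The point is that a hashing helper reads every member, so a member left at a debug fill pattern such as 0xDD would fail a validity check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for a tree node with optional members.
struct IntrinsicNodeModel
{
    const void*  layout;        // optional layout pointer, may legitimately be null
    std::uint8_t auxiliaryType; // 0 plays the role of "undefined"
    std::uint8_t baseType;
    std::uint8_t simdSize;

    // Every member gets a defined value, including the optional ones.
    IntrinsicNodeModel(std::uint8_t base, std::uint8_t size)
        : layout(nullptr), auxiliaryType(0), baseType(base), simdSize(size)
    {
    }
};

// A hash helper touches all members, so each one must be initialized.
std::size_t HashNode(const IntrinsicNodeModel& node)
{
    // A 0xDD fill pattern (221) would trip a range check like this one.
    assert(node.auxiliaryType < 32);

    std::size_t hash = static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(node.layout));
    hash = hash * 31 + node.auxiliaryType;
    hash = hash * 31 + node.baseType;
    hash = hash * 31 + node.simdSize;
    return hash;
}
```

Two nodes built with the same arguments hash identically here only because the optional members are zeroed rather than left indeterminate.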

@sandreenko
Contributor

It also isn't clear to me why we would want to promote SIMD types anyways, as it will almost certainly be more efficient to keep the entire value enregistered (when registers are available) and to perform the relevant insertions/extractions using the appropriate SIMD instructions instead.

because if you have "Vector4 v; v.x = something;" you have the field accessed and can't do the promotion, so the logic is:

  1. is the field accessed? yes -> can't enregister, try to promote; no -> go to 2.
  2. is it used in an intrinsic? yes -> don't promote; no -> go to 3.
  3. try to promote.
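
One possible reading of the steps above, written out as a sketch. It assumes the "no" branch of step 2 falls through to step 3, and the names `StructHandling` and `DecideHandling` are hypothetical, not JIT identifiers.

```cpp
#include <cassert>

// Hypothetical outcome of the decision for a SIMD-typed local.
enum class StructHandling
{
    TryPromote, // split the struct into independent field locals
    KeepWhole   // keep the value as one unit so it can live in a SIMD register
};

StructHandling DecideHandling(bool fieldAccessed, bool usedInIntrinsic)
{
    // Step 1: a direct field access is treated as blocking enregistration.
    if (fieldAccessed)
    {
        return StructHandling::TryPromote;
    }

    // Step 2: a local consumed by a SIMD intrinsic is not promoted.
    if (usedInIntrinsic)
    {
        return StructHandling::KeepWhole;
    }

    // Step 3: otherwise promotion is attempted.
    return StructHandling::TryPromote;
}
```

Under this reading, a local that has a field access is promoted even when it is also used in an intrinsic, which is the behavior the reply below questions.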

@tannergooding
Member Author

tannergooding commented Apr 23, 2021

because if you have "Vector4 v; v.x = something;" you have the field accessed and can't do the promotion, so the logic is:

@sandreenko, sorry, I'm not following here.

My question is explicitly why we would ever want to promote here. Promotion of the SIMD types is bad and causes them to be spilled to memory or split amongst multiple registers.
This is why we track isUsedinSIMDIntrinsic and block promotion: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L18153-L18158

We instead want it to be enregistered to a single SIMD register and for field accesses to be morphed into SIMDIntrinsicGetItem intrinsics, which we do in fgMorphFieldToSIMDIntrinsicGet: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12027-L12044
The reverse (setting a field) is done in fgMorphFieldAssignToSIMDIntrinsicSet: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12058-L12120

This is because SIMD is special and we support getting or setting any field from an enregistered struct via an intrinsic and so we should never need to promote, even when an individual field is accessed or assigned.
On certain platforms, the underlying hardware even has specialized instructions that allow performing an operation against a "selected lane" of another register (e.g. MultiplyBySelectedScalar: FMUL Vd.2D, Vn.2D, Vm.D[lane]), which we correctly optimize for when the value is not promoted.

@sandreenko
Contributor

that is a lot of text to scroll so I could have missed it but if the question was why the condition at
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L18153-L18158
is not

        if (varDsc->lvIsSIMDType())
        {
            varDsc->lvRegStruct = true;
        }

then it could be for historical reasons: we did not have logic for SIMDIntrinsicGetItem/SetItem when this code was added. You can change it and look at the regressions.

@tannergooding
Member Author

This should be ready for review. It provides some decent improvements and resolves #50939 by improving the codegen for copies across inlined code.
Treating them as SIMD copies for x86 looks to require some improvements around multi-reg returns. Changing morph to always enregister Vector2/3/4 likewise needs some improvements to where the field access is converted to an extract/insert (there are a few places this doesn't happen today).

Contributor

@sandreenko sandreenko left a comment

LGTM

@tannergooding tannergooding merged commit 9a31832 into dotnet:main Apr 27, 2021
@sandreenko
Contributor

@tannergooding could you keep an eye on the outerloop and stress job results with your change just in case?

@tannergooding
Member Author

Will do.

@tannergooding
Member Author

could you keep an eye on the outerloop and stress job results with your change just in case?

Outerloop jobs look to have all passed with regard to this change. JitStress jobs will run this coming Sunday.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Perf -1,517%] System.Numerics.Tests.Perf_Vector2.DistanceBenchmark
3 participants