Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DivRem intrinsic for X86 #66551

Merged
merged 85 commits into from
Feb 15, 2023
Merged

Implement DivRem intrinsic for X86 #66551

merged 85 commits into from
Feb 15, 2023

Conversation

huoyaoyuan
Copy link
Member

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 13, 2022
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation and removed community-contribution Indicates that the PR has been added by a community member labels Mar 13, 2022
@ghost
Copy link

ghost commented Mar 13, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Author: huoyaoyuan
Assignees: -
Labels:

area-CodeGen-coreclr, new-api-needs-documentation

Milestone: -

@huoyaoyuan huoyaoyuan added area-System.Runtime.Intrinsics and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 13, 2022
@ghost
Copy link

ghost commented Mar 13, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Author: huoyaoyuan
Assignees: -
Labels:

area-System.Runtime.Intrinsics, new-api-needs-documentation

Milestone: -

src/coreclr/jit/gentree.h Show resolved Hide resolved
Comment on lines 493 to 497
case InstructionSet_Vector128:
case InstructionSet_X86Base:
return impBaseIntrinsic(intrinsic, clsHnd, method, sig, simdBaseJitType, retType, simdSize);
case InstructionSet_X86Base:
case InstructionSet_X86Base_X64:
return impX86BaseIntrinsic(intrinsic, method, sig, simdBaseJitType);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had an interesting side effect: if FeatureSIMD is turned off, X86Base.Pause will also be turned off.

Comment on lines 6779 to 6782
// For local stores on XARCH we can't handle another lclVar source.
// If the source was another lclVar similarly promoted, we would
// have broken it into multiple stores.
if (lclNode->OperIs(GT_STORE_LCL_VAR) && !lclNode->gtGetOp1()->OperIs(GT_CALL))
if (lclNode->OperIs(GT_STORE_LCL_VAR) && lclNode->gtGetOp1()->OperIs(GT_LCL_VAR))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is doing, but not doing this will cause crossgen2 failing for Utf8Formatter.TryFormat(DateTime).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try to find the root cause of this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have time for this. I brought it from another PR mentioned earlier and it just works. It's better to ask the original author.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt can you help explaining this? Not doing this isn't causing failures now, but the code looks necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huoyaoyuan - Carol doesn't work on this code base anymore. As the comment suggests, " If the source was another lclVar similarly promoted, we would have broken it into multiple stores.". So, I think you should revert this change, given that it works now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was just a means to try which nodes are caught if op1 != GT_CALL. So the reason it is failing (as expected) is because the tree now contains the op1 == GT_HWINTRINSIC.

                                                            ┌──▌  t10    int    
                                                            ├──▌  t13    int    
                                                            ├──▌  t58    int    
N007 (  8,  9) [000014] -----------                   t14 = ▌  HWINTRINSIC struct int DivRem $140
                                                            ┌──▌  t14    struct 
N009 ( 18, 16) [000017] MA---------                         ▌  STORE_LCL_VAR struct<System.ValueTuple`2[System.Int32, System.Int32], 8>(P) V05 tmp3         
                                                            ▌    int    V05.Item1 (offs=0x00) -> V13 tmp11        
                                                            ▌    int    V05.Item2 (offs=0x04) -> V14 tmp12   

We do want to enregister in such cases and not doing it (retaining op1 != GT_CALL) spills the result on stack.

https://www.diffchecker.com/TUY5sGjV

Copy link
Contributor

@SingleAccretion SingleAccretion Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same code on xarch?

Yes and no. My suggestion includes the (significant) addition that we will DNER all locals, not just promoted ones. It is to prevent the case I described above from arising: struct retyped as long = multi-reg-node and other assigns codegen doesn't support. Alternatively, we could of course support them in codegen, but that seems out of scope for this change.

Also, for the below code, do you mean to say use the current check of independent promotion too?

Yes, the independent promotion check (and the field count check as well) must stay. In fact, I think they should be expanded. Consider struct { promoted field<double> } = HWI(int, int) and other mismatched cases. Right now they would pass, but we need to DNER them, by passing the register count to the method instead of the return type descriptor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now new failures and I don't have enough time to investigate in more depth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look hopefully next week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code as per @SingleAccretion suggestion.

src/coreclr/jit/lowerxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lowerxarch.cpp Show resolved Hide resolved
// DIV implicitly put op1(lower) to EAX and op2(upper) to EDX
srcCount += BuildOperandUses(op1, RBM_EAX);
srcCount += BuildOperandUses(op2, RBM_EDX);
srcCount += BuildOperandUses(op3);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly does RMW mean? Something like ADD src, dst that src is 1 register represented by both operand and result?
I tried BuildDelayFreeUses but causes register conflict in optimized code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly does RMW mean? Something like ADD src, dst that src is 1 register represented by both operand and result?

Correct.

I tried BuildDelayFreeUses but causes register conflict in optimized code.

Do you have an example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example?

I did false assert when op3 is spilled onto stack, the regNum was remained as the register it was spilled from. This shouldn't be an issue now.

@@ -317,13 +317,25 @@ public static int DivRem(int a, int b, out int result)
// Restore to using % and / when the JIT is able to eliminate one of the idivs.
// In the meantime, a * and - is measurably faster than an extra /.

if (X86Base.IsSupported)
{
(int quitient, result) = X86Base.DivRem((uint)a, a >> 31, b);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There lacks a way to represent the CDQ/CQO instruction. Is there any way better than >> 31?

@@ -1261,6 +1261,20 @@ private static readonly (string templateFileName, Dictionary<string, string> tem
("ScalarTernOpBinResTest.template", new Dictionary<string, string> { ["Isa"] = "Bmi2.X64", ["Method"] = "MultiplyNoFlags", ["RetBaseType"] = "UInt64", ["Op1BaseType"] = "UInt64", ["Op2BaseType"] = "UInt64", ["Op3BaseType"] = "UInt64", ["NextValueOp1"] = "UInt64.MaxValue", ["NextValueOp2"] = "UInt64.MaxValue", ["NextValueOp3"] = "0", ["ValidateResult"] = "ulong expectedHigher = 18446744073709551614, expectedLower = 1; isUnexpectedResult = (expectedHigher != higher) || (expectedLower != lower);" }),
};

private static readonly (string templateFileName, Dictionary<string, string> templateData)[] X86BaseInputs = new []
{
("ScalarTernOpTupleBinRetTest.template", new Dictionary<string, string> { ["Isa"] = "X86Base", ["Method"] = "DivRem", ["RetBaseType"] = "Int32", ["Op1BaseType"] = "UInt32", ["Op2BaseType"] = "Int32", ["Op3BaseType"] = "Int32", ["NextValueOp1"] = "UInt32.MaxValue", ["NextValueOp2"] = "-2", ["NextValueOp3"] = "-0x10001", ["ValidateResult"] = " int expectedQuotient = 0xFFFF; int expectedReminder = -2; isUnexpectedResult = (expectedQuotient != ret1) || (expectedReminder != ret2);" }),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests were picked to verify correct signed-ness is used.

@huoyaoyuan
Copy link
Member Author

Code gen for sample method:

        public static ulong XL(ulong a, ulong b)
        {
            (ulong q, ulong r) = Math.DivRem(a, b);
            return q + r;
        }
G_M52626_IG01:
       mov      qword ptr [rsp+10H], rdx
						;; bbWeight=1    PerfScore 1.00

G_M52626_IG02:
       xor      edx, edx
       mov      rax, rcx
       div      rdx:rax, qword ptr [rsp+10H]
       add      rax, rdx
						;; bbWeight=1    PerfScore 61.75

G_M52626_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 19

In one commit it uses R8 instead of stack, but fails with register confliction in optimized code.

It can be worse if RDX is used in calling convention. For the following method:

        public long DivRemInt64(long a, long b)
        {
            (long q, long r) = Math.DivRem(a, b);
            return q + r;
        }

It saves a copy to stack:

       mov       [rsp+10],rdx
       sar       rdx,3F
       mov       rax,[rsp+10]
       idiv      r8
       add       rax,rdx
       ret
; Total bytes of code 21

And causes micro benchmark regresses. But real world sample can differ by register assignation.

@huoyaoyuan
Copy link
Member Author

Do I need to implement for Mono in this PR or create an issue to track it? This is an already implemented ISA and used in corelib. Lack of implementation would cause existing tests to fail. /cc @vargaz

@huoyaoyuan huoyaoyuan marked this pull request as draft March 13, 2022 13:34
@lambdageek
Copy link
Member

When running the aot compiler, the MONO_PATH env var needs to be set in such a way that the input corlib assembly is the same assembly as the one found in the directories in MONO_PATH. Not sure why this only fails for this one test.

I do see that this path is setup correctly.

2023-02-13T18:51:19.8150781Z   aot-compile: compiling /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r/System.Private.CoreLib.dll; MONO_PATH: /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r:/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/Tests/Core_Root
2023-02-13T18:51:19.8580478Z EXEC : error : Loaded assembly '/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/Tests/Core_Root/System.Private.CoreLib.dll' doesn't match original file name '/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_r/System.Private.CoreLib.dll'. Set MONO_PATH to the assembly's location. [/__w/1/s/src/mono/msbuild/aot-compile.proj]

@lambdageek

The issue is there's two different libraries in the MONO_PATH both called CoreLib.dll - the "real" one and the fake one that the testsuite adds in order to allow the tests to call APIs that are nto in System.Runtime.

That is very likely to confuse the AOT compiler. It would be much better if the fake one didn't end up in the directory for the test case.

I think it's worth investigating if the fake one can be a reference assembly, instead. But I'm not sure even that would be enough.

I'll try to build this PR (or #80297, which uses the same trick) locally to see if there's some way to make the AOT compiler happy

@lambdageek
Copy link
Member

Left a comment on the other PR that does the same fake corelib trick with one solution: #80297 (comment)

Unfortunately that solution just exposed an issue where the new intrinsic created some condition that the AOT compiler did not expect to see.

But maybe it'll work on this PR. Here's the commit (for the other PR) that shows the approach lambdageek@347d943

Make the DivRem tests reference the DivRem fake
CoreLib as reference assembly; and ensure that it is not included as a
ProjectReference by the toplevel HardwareIntrinsics merged test
runners.

The upshot is that the DivRem tests can call the extra APIs via
a direct reference to CoreLib (instead of through System.Runtime), but
the fake library is not copied into any test artifact directories, and
the Mono AOT compiler never sees it.

RefPosition* op3RefPosition;
srcCount += BuildDelayFreeUses(op3, op1, RBM_NONE, &op3RefPosition);
if ((op3RefPosition != nullptr) && !op3RefPosition->delayRegFree)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to also set it as delay free for op3 because we don't track delayFree with regards to a specific operand, is that right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. We just want to make sure that op3 is marked as delayFree so neither of op1 or op2's registers conflict with that of op3.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have tracking issues covering the mono work and any work required to use this from Math.DivRem before merging.

@kunalspathak
Copy link
Member

Would be good to have tracking issues covering the mono work and any work required to use this from Math.DivRem before merging.

Created:

@kunalspathak
Copy link
Member

Failure is known #81123

@kunalspathak kunalspathak merged commit 45c314f into dotnet:main Feb 15, 2023
@@ -0,0 +1,76 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- https://learn.microsoft.com/en-us/dotnet/fundamentals/package-validation/diagnostic-ids -->
<Suppressions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have all these compatibility suppressions instead of adding these API to ref assemblies?

<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<OutputType>Library</OutputType>
<CLRTestKind>SharedLibrary</CLRTestKind>
<AssemblyName>System.Private.CoreLib</AssemblyName>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referencing CoreLib from the tests is anti-pattern. Tests should only reference public surface. If the tests need to hack on internal implementation details, they should use private reflection. Why are we doing this?

@jkotas
Copy link
Member

jkotas commented Feb 16, 2023

The structure introduce by this PR is very non-standard and full of anti-patterns. I am tempted to revert it so that it can be done properly. @ViktorHofer @akoeplinger Thoughts?

@jkotas
Copy link
Member

jkotas commented Feb 16, 2023

We don't want the new APIs to get consumed publically just yet,

Why not?

If you are worried that the Mono implementation is not in place yet, it would be fine to add a temporary non-intrinsic implementation under MONO ifdef. It would be very local change and it would avoid spreading the CoreLib hacks through many places.

@kunalspathak
Copy link
Member

I agree and I too do not like these hacks. Unfortunately, we had to go this route because there are few things, we need to optimize in order to consume it in Math.DivRem (Related #66551 (comment)). We wanted to get in this work, but not expose it just yet and so @tannergooding , @pentp and me agreed to go this route. This is not even about mono at this point but also applicable to coreclr. If you think we should revert it, we can revert it now, fix the optimizations that would make DivRem ready and consumable (e.g. Math.DivRem) and re-merge this one in?

@kunalspathak
Copy link
Member

We also had planned to do similar thing in #80297 since the new APIs that I am adding are not yet approved, but I decided to rather wait for API approval than to merge such anti-patterns of consuming them in test projects.

@jkotas
Copy link
Member

jkotas commented Feb 16, 2023

It is common to add new public APIs with simpler implementation with less-than-ideal performance first, and then optimize the implementation in subsequent PRs. We have done that many times. For example, #77799 added initial implementation of FrozenColllections and then number of subsequent PRs optimized it further, and I am sure we will get some more optimizations before .NET 8 ships. We should follow the same pattern for codegen intrinsics.

If you are really worried about the API being used accidentally before it is ready for prime time, you can slap RequiresPreviewFeaturesAttribute on it, like how it is done for AvxVnni Intrinsics currently.

@jkotas
Copy link
Member

jkotas commented Feb 16, 2023

We also had planned to do similar thing in #80297 since the new APIs that I am adding are not yet approved, but I decided to rather wait for API approval than to merge such anti-patterns of consuming them in test projects.

Yes, please do wait for the API shape to be approved first before merging the implementation.

@kunalspathak
Copy link
Member

If you are really worried about the API being used accidentally before it is ready for prime time, you can slap RequiresPreviewFeaturesAttribute on it, like how it is done for AvxVnni Intrinsics currently.

I see. Ok, I will submit a PR to revert part of the hacks and add RequiresPreviewFeaturesAttribute.

@tannergooding
Copy link
Member

Just noting these are hardware intrinsics and providing a software fallback is itself problematic and against the general design goal of the APIs and can lead users into a pit of failure. If one is provided, it needs to very explicitly still throw where the underlying platform is not x86/x64 and when X86Base.IsSupported reports false.

That being said, exposing a platform specific intrinsic API publicly without it being usable for its intended purpose is likewise problematic and leads users into a different pit of failure. This is why I originally pushed for the general support to be completed and for the API to be used from the primary consumption sites (such as Math.DivRem) before it got merged.

I only relented on that with the premise that these would not be public until after the other work was completed. So if we are reverting this, then any new PR should likely just complete the additional work so we don't risk shipping something that isn't actually functioning the way users expect.

@kunalspathak
Copy link
Member

kunalspathak commented Feb 16, 2023

So if we are reverting this

I was just planning to do the following:

  • add back the DivRem methods in ref
  • add RequiresPreviewFeaturesAttribute on the new APIs.
  • revert the fake library workaround added to make test projects work

Do we agree that it is the best course or should we completely revert this PR, address the codegen issues and likely close #82194? I am fine with either solution.

@jkotas
Copy link
Member

jkotas commented Feb 16, 2023

Yes, I think that it is the best course of action.

@huoyaoyuan huoyaoyuan deleted the divrem branch February 16, 2023 17:05
@kunalspathak
Copy link
Member

Yes, I think that it is the best course of action.

#82221

@huoyaoyuan
Copy link
Member Author

Glad to see this finally get merged!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add DivMod instrinct for intel x86/x64