
Remove extra UnmanagedCallersOnly overhead on x86 #46238

Closed

Conversation

@jkoritzinsky (Member) commented Dec 19, 2020

This PR teaches the JIT to emit methods with the native x86 calling convention and then uses that capability to remove the argument-shuffling stubs from both the Windows and Unix x86 Reverse P/Invoke thunks.

As part of this work, the JIT helpers for Reverse P/Invokes now report transitions to the profiler and the debugger as the x86 Reverse P/Invoke thunk did.

This allows us to remove the rest of the overhead that x86 has for Reverse P/Invokes and UnmanagedCallersOnly methods in comparison to other platforms.
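For context (not code from this PR), a minimal sketch of the kind of UnmanagedCallersOnly entry point affected: on x86, native calls into such a method previously went through an extra argument-shuffling stub before reaching the managed body.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class NativeExports
{
    // Native code obtains a pointer to this method and calls it with the cdecl
    // convention. After this change the JIT emits the native x86 calling
    // convention for the method itself, so no separate shuffling stub is needed.
    [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) })]
    public static int Add(int a, int b) => a + b;
}
```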

Since I changed the signature of the JIT_ReversePInvokeEnter helper, I tentatively changed the JIT-EE interface ID. If bumping it isn't actually required for this change, we can revert that part.

I've validated that there are no crossgen diffs. The results from a framework crossgen JIT diff on Windows x86 are included below.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 26561274
Total bytes of diff: 26561274
Total bytes of delta: 0 (0.00% of base)

0 total files with Code Size differences (0 improved, 0 regressed), 267 unchanged.

0 total methods with Code Size differences (0 improved, 0 regressed), 230772 unchanged.
Found 18 files with textual diffs.

Fixes #33582
Fixes #46177
Fixes #47037

Commits pushed:
…he correct argument order and register usage on x86.
…ve Windows x86 unmanaged callers only to not have extra overhead and put reverse P/Invoke stubs for Windows x86 on the UnmanagedCallersOnly plan.
…ntime; branch 'master' of github.com:dotnet/runtime into x86-reverse-stub-cleanup
@jkotas (Member) commented Dec 19, 2020

It would be useful to collect some performance numbers: marshalled reverse P/Invoke no-op delegate, UnmanagedCallersOnly no-op, on both x86 and x64.


I love the simplification!
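A rough sketch of one possible shape for those measurements, using BenchmarkDotNet (a hypothetical harness, not the one used for this PR). Each call goes managed → unmanaged → managed, so the reverse P/Invoke transition is exercised alongside one forward transition.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public unsafe class ReverseCallBenchmarks
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate void NoOpDelegate();

    private static void NoOpManaged() { }

    [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) })]
    private static void NoOpCallersOnly() { }

    // Keep the marshalled delegate alive for the lifetime of the benchmark.
    private static readonly NoOpDelegate s_keepAlive = NoOpManaged;

    private delegate* unmanaged[Cdecl]<void> _callersOnlyPtr;
    private delegate* unmanaged[Cdecl]<void> _marshalledDelegatePtr;

    [GlobalSetup]
    public void Setup()
    {
        _callersOnlyPtr = &NoOpCallersOnly;
        _marshalledDelegatePtr = (delegate* unmanaged[Cdecl]<void>)(void*)
            Marshal.GetFunctionPointerForDelegate(s_keepAlive);
    }

    [Benchmark]
    public void UnmanagedCallersOnlyNoOp() => _callersOnlyPtr();

    [Benchmark]
    public void MarshalledDelegateNoOp() => _marshalledDelegatePtr();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ReverseCallBenchmarks>();
}
```

Running this on both x86 and x64 would give the no-op numbers requested above for the UnmanagedCallersOnly and marshalled-delegate reverse P/Invoke paths.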

Commit pushed: …emitted unmanaged thunk stub, we need to handle the x86 differences for copy-constructed parameters in the IL stub.
@jkoritzinsky (Member Author)

I've started taking a look at the test failures. The Vector2_3_4 test failure is due to CodeGen::genSIMDSplitReturn in the JIT not correctly handling the case where we are returning a Vector2 in EAX:EDX. It attempts to emit the following instructions:

IN001e:        vmovaps  eax, xmm0
IN001f:        vmovaps  edx, eax
IN0020:        vshufpd  xedx, xedx, 1

However, these instructions are invalid (vmovaps cannot target a general-purpose register), so the instructions actually encoded are:

17EE382C  vmovaps     xmm0,xmm0  
17EE3830  vmovaps     xmm2,xmm0  
17EE3834  vshufpd     xmm2,xmm2,xmm2,1

As a result, the return value isn't put into the correct registers and the test fails.

We've had a number of these types of bugs when doing this work for x86. @sandreenko can you help out with fixing this one after you fix the bug I created the last time I tried to fix this type of bug?
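For reference, a hypothetical repro shape for this path (assumed names; not the actual Vector2_3_4 test): an 8-byte Vector2 returned from an UnmanagedCallersOnly method under the native x86 calling convention comes back in EAX:EDX, which is exactly the split-return case genSIMDSplitReturn handles.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static unsafe class SplitReturnRepro
{
    // The 8-byte Vector2 return value must be moved from an XMM register into
    // EAX:EDX when this method is invoked through its unmanaged entry point.
    [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) })]
    public static Vector2 MakeVector(float x, float y) => new Vector2(x, y);

    public static void Main()
    {
        // Round-trip through the unmanaged entry point rather than calling the
        // managed method directly (direct managed calls are not allowed).
        delegate* unmanaged[Cdecl]<float, float, Vector2> fn = &MakeVector;
        Console.WriteLine(fn(1.0f, 2.0f)); // expected output: <1, 2>
    }
}
```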

@jkoritzinsky (Member Author)

I've validated as of the last commit (69e14c6) that there are still no JIT diffs on x86.

@sandreenko (Contributor) left a comment
JIT changes LGTM with one question.

@AaronRobinsonMSFT (Member) left a comment

LGTM.

The one final thing I would like to confirm is validation of the IJW scenario. I would think we have existing tests for this, but I want to confirm that so we don't give any surprises to our WPF friends.

@jkoritzinsky (Member Author)

I've validated that this PR passes all of the C++/CLI tests that the C++ team runs.

@jkotas (Member) previously requested changes Jan 19, 2021

@jkoritzinsky dismissed jkotas’s stale review January 20, 2021 02:18

Addressed linked comment.

@jkoritzinsky (Member Author)

Merged in 5ec0c7a

GitHub messed up and didn’t mark the PR as merged. I’ll close the PR.

MatthewSteeples pushed a commit to MatthewSteeples/runtime that referenced this pull request Jan 20, 2021
* Implement emitting an unmanaged calling convention entry point with the correct argument order and register usage on x86.

* Move Unix x86 to the UnmanagedCallersOnly plan now that we don't need to do argument shuffling.

* Add SEH hookup and profiler/debugger hooks to Reverse P/Invoke entry helper to match custom x86 thunk.

Fixes dotnet#46177

* Remove Windows x86 assembly stub for individual reverse p/invokes. Move Windows x86 unmanaged callers only to not have extra overhead and put reverse P/Invoke stubs for Windows x86 on the UnmanagedCallersOnly plan.

* Further cleanup

* Remove extraneous UnmanagedCallersOnly block now that x86 UnmanagedCallersOnly has been simplified.

* Undo ArgOrder size specifier since it isn't needed and it doesn't work.

* Fix copy constructor reverse marshalling. Now that we don't have the emitted unmanaged thunk stub, we need to handle the x86 differences for copy-constructed parameters in the IL stub.

* Fix version guid syntax.

* Remove FastNExportHandler.

* Revert "Remove FastNExportHandler."

This reverts commit 423f70e.

* Fix setting up entry frame for new thread.

* Allow the NExportSEH record to live below ESP so we don't need to create a new stack frame.

* Fix formatting.

* Assign an offset for the return buffer on x86 since it might come in on the stack.

* Make sure we use the TC block we just put in on x86 as well.

* Shrink the ReversePInvokeFrame on non-x86 back to master's size.

* Fix arch-specific R2R constant.

* Pass the return address of the ReversePInvokeEnter helper to TraceCall instead of the entry point and call TraceCall from all JIT_ReversePInvokeEnter* helpers.

* Fix ILVerification and ILVerify

* fix R2R constants for crossgen1

* Don't assert ReversePInvokeFrame size for cross-bitness scenarios.
// As a result, we specially decorate this method to have the correct calling convention
// and argument ordering for an HCALL, but we don't use the HCALL macros and contracts
// since this method doesn't follow the contracts.
void F_CALL_CONV HCCALL3(JIT_ReversePInvokeEnter, ReversePInvokeFrame* frame, CORINFO_METHOD_HANDLE handle, void* secretArg)
A reviewer (Member) commented:

You have not addressed my comment about the additional performance overhead of passing the extra method handle and secret arg arguments. It introduces a performance regression, it is a breaking change for R2R (the helper has a different signature now), and this design does not work for NativeAOT.

We should be paying this overhead only when somebody turns on the specific profiling notifications (they are rare and slow by design):

  • Add a flag to the JIT/EE interface that controls whether these extra arguments are needed
  • Keep the existing helper signature as-is; introduce a new helper that takes these arguments
  • Call the new helper from the JIT/EE interface only when the new flag is set

Could you please fix that?

A reviewer (Member) commented:

Opened #47223

@jkoritzinsky (Member Author)

I will take a look at this ASAP.

{
    _ASSERTE(frame != NULL && handle != NULL);

    void* traceAddr = _ReturnAddress();
@jkotas (Member) commented Jan 20, 2021

We should not need to fetch the return address upfront. It is only needed on the slow path when we are calling the rare helper.

@jkoritzinsky deleted the x86-reverse-stub-cleanup branch January 21, 2021 19:29
@ghost locked as resolved and limited conversation to collaborators Mar 14, 2021