Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement stubs to improve performance #65738

Merged
merged 11 commits into from
Mar 17, 2022
Merged

Conversation

janvorli
Copy link
Member

This change implements FixupPrecodeStub, PrecodeStub, CallCountingStub and VSD stubs LookupStub, DispatchStub and ResolveStub using a new mechanism with fixed code and separate RW data. The LoaderHeap was updated to support a new kind of allocation using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses indirection slot to jump to the methods using FixupPrecode, improving performance of the ASPNet plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated.

There is now only a single variant of each stub, there are no long / short ones anymore as they are not needed - the indirect jumps we use now are not range limited.

Most of the stubs stuff is now target agnostic and the originally split implementation is now in single place for all targets. Only a few constants are defined as target specific in these.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as a source to copy the code from. The x86 is a bit more complex than that due to the fact that it doesn't support PC relative indirect addressing, so we need to relocate all access to the data slots when generating the code pages.

As a further improvement, we could generate just a single page of the code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are various page sizes being used. So the asm templates are generated for 4k..64k page sizes and the variant is then picked at runtime based on the page size extracted from the OS.

This also removes a lot of writeable mappings created for modifications of the stub code when W^X is enabled, in the plaintext benchmark they were reduced by 75%. That results in a significant reducing of the .NET application startup time with W^X enabled.

I think the LoaderHeap would benefit from some refactoring, but I'd prefer leaving it for a follow up. It seems that for the sake of the review, it is better to keep it as is.

The change also implements logging of number of mappings and their exact locations. This helped me to drive the work and I am planning to use it for further changes. It can be removed in the future once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.

@janvorli janvorli requested a review from jkotas February 22, 2022 21:21
@janvorli janvorli self-assigned this Feb 22, 2022
@janvorli
Copy link
Member Author

Performance improvements - plaintext benchmark server side start time and client side read throughput (which is linearly proportional to requests/s). Please take the results with a grain of salt, they vary a lot and these are averages of 7 runs with the lowest outlier removed. But the trend is stable.

Win x64

Main This PR
W^X off W^X on W^X off W^X on
Start time [ms] 324 386 324 357
Read throughput [MB/s] 662.27 640.35 674.22 679.10

Linux Intel x64

Main This PR
W^X off W^X on W^X off W^X on
Start time [ms] 194 251 197 212
Read throughput [MB/s] 532.87 485.32 556.29 549.11

Linux AMD x64

Main This PR
W^X off W^X on W^X off W^X on
Start time [ms] 143 182 143 155
Read throughput [MB/s] 933.55 882.35 960.85 921.60

@janvorli
Copy link
Member Author

I've measured similar trends on arm64 Linux in the past, but I need to re-measure the stuff after the recent cleanup and code unification changes.

@janvorli
Copy link
Member Author

It seems disabling the mapping loggings has broken the build, I am looking into it.

@jkotas jkotas requested a review from davidwrighton February 22, 2022 22:21
src/coreclr/vm/amd64/thunktemplates.asm Outdated Show resolved Hide resolved
src/coreclr/inc/jithelpers.h Outdated Show resolved Hide resolved
src/coreclr/utilcode/executableallocator.cpp Outdated Show resolved Hide resolved
jmp QWORD PTR [DATA_SLOT(LookupStub, ResolveWorkerTarget)]
LEAF_END_MARKED LookupStubCode, _TEXT

LEAF_ENTRY DispatchStubCode, _TEXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that you were able to get away with the extra indirection in DispatchStubCode. It will be interesting to see the results of the microbenchmark runs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what indirection you mean. This was the original code:

BYTE _entryPoint [2]; // 48 B8 mov rax,
size_t _expectedMT; // xx xx xx xx xx xx xx xx 64-bit address
BYTE part1 [3]; // 48 39 XX cmp [THIS_REG], rax
BYTE nopOp; // 90 nop ; 1-byte nop to align _implTarget

BYTE part1[2]; // 48 B8 mov rax,
size_t _implTarget; // xx xx xx xx xx xx xx xx 64-bit address
BYTE part2 [1]; // 75 jne
BYTE _failDispl; // xx failLabel
BYTE part3 [2]; // FF E0 jmp rax
// failLabel:
BYTE part4 [2]; // 48 B8 mov rax,
size_t _failTarget; // xx xx xx xx xx xx xx xx 64-bit address
BYTE part5 [2]; // FF E0 jmp rax

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the performance overhead from the extra indirections introduced by this refactoring. The original code had two immediates, the new code has two indirections.

@adamsitnik What's the best way to do dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it, I've thought you meant it somehow the other way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the best way to do dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

We have two options:

  1. Run the benchmarks using two different coreruns (before & after) and store them in dedicated folders, then compare them using ResultsComparer tool: https://github.com/dotnet/performance/blob/ef3edbc52d92c6b30ba1c316082f46865e5ff1d6/docs/benchmarking-workflow-dotnet-runtime.md#preventing-regressions It's the best solution if the implementation might change and you might need to re-run the benchmarks using updated corerun.
  2. Run the benchmarks using two different coreruns without storing them in dedicated dolders, use BDN to perform the statistical test on the fly: https://github.com/dotnet/performance/blob/03207f183f042f6fc6b9f341df7a0e36b7175f5d/src/benchmarks/micro/README.md#private-runtime-builds it's good if you have only few benchmarks or don't intend to change the implementation

src/coreclr/vm/loaderallocator.cpp Outdated Show resolved Hide resolved
@@ -8804,7 +8804,15 @@ void CEEInfo::getFunctionEntryPoint(CORINFO_METHOD_HANDLE ftnHnd,
// Resolve methodImpl.
ftn = ftn->GetMethodTable()->MapMethodDeclToMethodImpl(ftn);

ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
if (!ftn->IsFCall() && ftn->MayHavePrecode() && ftn->GetPrecodeType() == PRECODE_FIXUP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this condition is not right for the tiered compilation disabled case (not default, but still used by some first parties). Does this need to call into code version manager?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible (although all the tests pass with tiered compilation disabled except for one where the failure looks unrelated to this). What do you think is the specific problem here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the tiered compilation disabled case will go through unnecessary indirection when the target method is JITed already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to add !ftn->HasStableEntryPoint() to the condition. It severely degraded the plaintext benchmark performance. I've also thought that !ftn->IsPointingToStableNativeCode() might be needed, but the function seems to never report true together with the preexisting condition (based on my testing when I've added an assert and none of the coreclr pri1 tests hit it, with or without tiered compilation enabled).

Copy link
Member

@jkotas jkotas Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic should be something like this:

if (ftn->IsVersionable())
{
    IAT_PVALUE // must go via indirection to enable versioning 
}
else
{
    if (the target has final code)
        IAT_VALUE for the final code // this includes FCalls and methods that are already JITed (w/o tiering)
    else
        IAT_PVALUE // must go via indirection to pick up the final code once it is ready
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give it a try

@EgorBo
Copy link
Member

EgorBo commented Feb 24, 2022

@janvorli since you seem to be refactoring a lot of stuff is it possible now to get an understanding where a currently jitted method will end up in memory precisely (or at lest approximately)?
Basically, exposed AllocPtr of a LoaderCodeHeap

to be able to rely on it for "relocs" in jitted code

@janvorli
Copy link
Member Author

@EgorBo this change only changes allocation of stubs, it doesn't change in any way where jitted code will end up. But I am not sure if I fully understood what you are after.

This change implements `FixupPrecodeStub`, `PrecodeStub`,
`CallCountingStub` and VSD stubs `LookupStub`, `DispatchStub` and
`ResolveStub` using a new mechanism with fixed code and separate RW
data. The `LoaderHeap` was updated to support a new kind of allocation
using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses indirection slot to jump to the
methods using `FixupPrecode`, improving performance of the ASPNet
plaintext benchmark by 3-4% depending on the target platform (measured
on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly
aligned due to the way they are allocated.

There is now only a single variant of each stub, there are no long /
short ones anymore as they are not needed - the indirect jumps we use
now are not range limited.

Most of the stubs stuff is now target agnostic and the originally split
implementation is now in single place for all targets. Only a few
constants are defined as target specific in these.

The code for the stubs is no longer generated as bytes by C++ code, but
rather written in asm and compiled. These precompiled templates are then
used as a source to copy the code from. The x86 is a bit more complex
than that due to the fact that it doesn't support PC relative indirect
addressing, so we need to relocate all access to the data slots when
generating the code pages.

As a further improvement, we could generate just a single page of the
code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are
various page sizes being used. So the asm templates are generated for
4k..64k page sizes and the variant is then picked at runtime based on
the page size extracted from the OS.

This also removes a lot of writeable mappings created for modifications
of the stub code when W^X is enabled, in the plaintext benchmark they
were reduced by 75%. That results in a significant reducing of the .NET
application startup time with W^X enabled.

I think the `LoaderHeap` would benefit from some refactoring, but I'd
prefer leaving it for a follow up. It seems that for the sake of the
review, it is better to keep it as is.

The change also implements logging of number of mappings and their exact
locations. This helped me to drive the work and I am planning to use it
for further changes. It can be removed in the future once we reach a
final state.

There are still opportunities for improvement, but these stubs allowed
me to scrape off the most significant portion of the mappings.
* Change the CallCountingStub to not to use return address of an
unbalanced call as a stub identifying token. The counter address is used
instead on all targets.
* Fix some tabs instead of spaces
* Fix getTargetMethodDesc - in some cases, we get address of the start
of the FixupPrecode too.
* Remove a leftover comment
The assembler was generating 32 bit conditional relative jumps instead
of ones with 8 bit displacement. I've found that a presence of a global
label between the jump site and the destination makes the assembler to
do that. Changing PATCH_LABEL macro fixed it.
{
Precode* pPrecode = Precode::GetPrecodeFromEntryPoint((PCODE)hlpDynamicFuncTable[dynamicFtnNum].pfnHelper);
_ASSERTE(pPrecode->GetType() == PRECODE_FIXUP);
*ppIndirection = ((FixupPrecode*)pPrecode)->GetTargetSlot();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees that we have the final code of the method by this point?

If this is a valid optimization, we should be doing it when filling hlpDynamicFuncTable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to have the final code. This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the slot. When we have the JITted code, the slot points to it. Without this optimization, JIT would jump to the beginning of the fixup stub and the indirect jump in there would jump through the slot. So we save one jump by this. I have found this using the performance repo microbenchmarks where the casting benchmarks were consistently slower with my change. With this fix, they are now consistently faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can modify the hlpDynamicFuncTable entry to be able to carry either the indirection or the target address and a flag indicating which one it is. I was going back and forth in my head whether to do that or do it the way I've ended up doing it.

However, there is a _AddrIsJITHelper function on 32 bit Windows in the debugger controller code that compares addresses of all the helpers with an address that the debugger has stepped into to avoid the debugger stopping in unmanaged runtime code. While I believe we should not call this methods that code for prefix stubs (we should get TRACE_STUB tracetype), I need to double check that. I am talking about the case when the helper was not JITted yet and so the indirection would go to the middle of the fixup stub and the code there would not detect it was in a helper.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the slot.

Ah ok, I have missed that.

@janvorli
Copy link
Member Author

I have removed the VSD stubs change, it was causing steady state performance degradation in more complex ASPNet benchmarks like the Orchard or Fortunes. I am planning to re-add the Lookup stub in a follow up change, but for this PR, it was safer to remove it all.

Extensive benchmarking has shown that the new model for VSD stubs is
causing steady state performance degradation for code using a lot of
virtual calls. It wasn't showing up in the plaintext benchmark I have
used as the driver of the change.
This change looses part of the startup perf improvements, but I am
planning to add back the Lookup stubs in a follow up change. Those
should not be perf critical and they are likely the ones where the
startup improvements are gained from.
src/coreclr/vm/common.h Outdated Show resolved Hide resolved
@janvorli
Copy link
Member Author

Before merging this change, I would like to share some new details on the performance. It turned out that while the plaintext and JSON benchmarks that I was using as a driver were showing gains all over the place, Orchard and Fortunes that I've tested after this PR was created are showing 2.5-4.5% regression in the read throughput on Citrine Intel Windows and Linux. But on Citrine AMD Linux, the same metric is showing 12-17% gain. The AMD machine has more CPU cores, so it might be one of the reasons.
I wasn't able to pinpoint the source of the regression yet, I'll keep investigating these regressions after the PR is merged.

@janvorli janvorli merged commit eb8460f into dotnet:main Mar 17, 2022
@EgorBo
Copy link
Member

EgorBo commented Mar 22, 2022

image

I assume this perf improvement is because of this PR 👍 (according to the graph, the commit range is 05cb7f5...110cb9f)

@kunalspathak
Copy link
Member

kunalspathak commented Mar 22, 2022

@kunalspathak
Copy link
Member

kunalspathak commented Mar 22, 2022

@DrewScoggins
Copy link
Member

DrewScoggins commented Mar 24, 2022

@DrewScoggins
Copy link
Member

DrewScoggins commented Mar 24, 2022

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
* Reimplement stubs to improve performance

This change implements `FixupPrecodeStub`, `PrecodeStub`
and `CallCountingStub` using a new mechanism with fixed code and separate RW
data. The `LoaderHeap` was updated to support a new kind of allocation
using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses indirection slot to jump to the
methods using `FixupPrecode`, improving performance of the ASPNet
plaintext benchmark by 3-4% depending on the target platform (measured
on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly
aligned due to the way they are allocated.

There is now only a single variant of each stub, there are no long /
short ones anymore as they are not needed - the indirect jumps we use
now are not range limited.

Most of the stubs stuff is now target agnostic and the originally split
implementation is now in single place for all targets. Only a few
constants are defined as target specific in these.

The code for the stubs is no longer generated as bytes by C++ code, but
rather written in asm and compiled. These precompiled templates are then
used as a source to copy the code from. The x86 is a bit more complex
than that due to the fact that it doesn't support PC relative indirect
addressing, so we need to relocate all access to the data slots when
generating the code pages.

As a further improvement, we could generate just a single page of the
code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are
various page sizes being used. So the asm templates are generated for
4k..64k page sizes and the variant is then picked at runtime based on
the page size extracted from the OS.

This also removes a lot of writeable mappings created for modifications
of the stub code when W^X is enabled, in the plaintext benchmark they
were reduced by 75%. That results in a significant reducing of the .NET
application startup time with W^X enabled.

I think the `LoaderHeap` would benefit from some refactoring, but I'd
prefer leaving it for a follow up. It seems that for the sake of the
review, it is better to keep it as is.

The change also implements logging of number of mappings and their exact
locations. This helped me to drive the work and I am planning to use it
for further changes. It can be removed in the future once we reach a
final state.

There are still opportunities for improvement, but these stubs allowed
me to scrape off the most significant portion of the mappings.
@ghost ghost locked as resolved and limited conversation to collaborators Apr 23, 2022
@jozkee
Copy link
Member

jozkee commented Oct 14, 2022

@janvorli we detected a slight x64 regression for the System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List) benchmark in the 7.0 RC2 vs 6.0 perf report that seems to be related to this commit range c032e0d...731d936. Do you suspect this PR could be to blame?

newplot (1)

System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List)

Result Base Diff Ratio Alloc Delta Operating System Bit Processor Name Modality
Same 2740.76 2694.02 1.02 +0 ubuntu 18.04 Arm64 Unknown processor
Same 621.78 599.29 1.04 +0 Windows 11 Arm64 Unknown processor
Same 956.34 949.61 1.01 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 1009.64 987.95 1.02 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 561.31 561.70 1.00 +0 macOS Monterey 12.6 Arm64 Apple M1
Same 546.88 545.49 1.00 +0 macOS Monterey 12.6 Arm64 Apple M1 Max
Slower 917.99 1120.40 0.82 +0 Windows 10 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower 732.11 826.22 0.89 +0 Windows 11 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower 499.89 572.45 0.87 +0 Windows 11 X64 AMD Ryzen 9 5900X
Same 455.45 463.50 0.98 +0 Windows 11 X64 AMD Ryzen 9 7950X
Same 716.52 775.36 0.92 +0 Windows 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 699.97 829.86 0.84 +0 debian 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Same 496.60 540.29 0.92 +0 ubuntu 18.04 X64 AMD Ryzen 9 5900X
Slower 757.61 911.97 0.83 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Slower 497.21 559.91 0.89 +0 ubuntu 20.04 X64 AMD Ryzen 9 5900X
Same 1078.91 1143.66 0.94 +0 ubuntu 20.04 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Same 755.07 779.39 0.97 +0 ubuntu 20.04 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 1031.91 1159.22 0.89 +0 macOS Big Sur 11.7 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower 866.00 981.43 0.88 +0 macOS Monterey 12.6 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell)

@janvorli
Copy link
Member Author

@jozkee I am sorry for missing your message before. Yes, I believe the regression was caused by that change. There are few corner cases where the new stubs implementation has slight negative effect on micro-benchmarks. Some delegate calls are one of those cases and this test was one of those where I've seen it to have visible impact when I was developing the change.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants