Regressions in System.Text.Json.Document.Tests.Perf_EnumerateArray #67176
Tagging subscribers to this area: @JulieLeeMSFT

Issue details: broken out of #67101.

Commit range: 6bf873a...bc5e386.

Seems the only relevant change is "JIT: enable reading PGO data when switching to optimized" (https://github.com/dotnet/runtime/pull/66618, commit https://github.com/dotnet/runtime/commit/2737da5817ab121fb43ff5577d1e146e0ec6d213).
Hopefully addressed by the PGO update we're waiting for; cc @EgorBo. If not, I'll take a look.
Windows ARM64: the related benchmark shows a similar but smaller regression. Windows x64 shows a similar but much smaller regression. These are likely the same issues we see in #67318, where block reordering driven by PGO data interferes with loop optimizations. Will need to drill in to be sure.
The regression has persisted even after we deferred some block reordering in #69878. Need to investigate this one.
Can repro this locally, digging in.
Trying to find a way to run these with the
Got past it by updating BDN to use
Assembly for the key loops. Main differences are:
Likely it is the difference in the call to
Hoisting the IAT_PVALUE indir may end up being risky, as the indir target presumably might get changed. Perhaps for frequently running loops we could split the loop somehow and refresh the hoisted value every so often. But we could easily hoist/CSE the creation of the indir addresses.
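To make the hazard concrete, here is a minimal C sketch (all names hypothetical; the IAT_PVALUE slot is modeled as a plain pointer to a function-pointer cell, with no runtime backpatching actually happening):

```c
/* Model of an IAT_PVALUE-style call: the runtime publishes the real
   target through an indirection cell, and the cell's contents may be
   rewritten later (e.g. when a stub is backpatched to the final
   method address). */
static int add_one(int x) { return x + 1; }
static int (*indirection_cell)(int) = add_one;    /* the cell itself */
static int (**iat_slot)(int) = &indirection_cell; /* address of the cell */

int sum_unhoisted(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (*iat_slot)(i);  /* reload the cell every iteration: safe */
    return sum;
}

int sum_hoisted(int n) {
    /* Risky version: the loaded target is hoisted out of the loop, so
       if the cell were backpatched mid-loop we would keep calling the
       stale target. Hoisting/CSEing only the address computation
       (&indirection_cell) would stay correct. */
    int (*target)(int) = *iat_slot;
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += target(i);
    return sum;
}
```

In this toy model both versions agree because the cell never changes; the difference only matters if the cell can be rewritten concurrently, which is the risk described above.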
I cherry-picked the changes for #70988 and while I see a more compact call in the AFTER case, it doesn't make the benchmark any faster. So seemingly the perf issue lies elsewhere? Profiling is strongly implicating code within the method. Comparing two BDN runs with the same number of iterations/invocations, we see all the extra exclusive cycles in
Looking at the raw sample data from the ETL, it seems to suggest this bit of code in the AFTER version, right after the

```asm
F9401A60 ldr x0, [x19,#48]
B4000480 cbz x0, G_M9251_IG25
B9400801 ldr w1, [x0,#8]
6B14003F cmp w1, w20
54000283 blo G_M9251_IG23
```

The BEFORE code layout inverts the first branch.
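Those five instructions read like the usual array null check plus bounds check. A hedged C rendering (field offsets, register roles, and names are my assumptions, and a `-1` return stands in for the throw helper blocks at `IG23`/`IG25`):

```c
#include <stddef.h>

/* Assumed shape of the object being enumerated: x19 holds the object,
   [x19,#48] is an array field, [x0,#8] is the array length, w20 the
   index. The BEFORE layout inverts the first branch, changing which
   block is the fall-through; the semantics are identical. */
typedef struct { int length; int data[8]; } Array;
typedef struct { Array *arr; } Obj;

int get_checked(Obj *o, unsigned idx) {
    Array *a = o->arr;               /* ldr x0, [x19,#48] */
    if (a == NULL)                   /* cbz x0, G_M9251_IG25 */
        return -1;                   /* stand-in for the throw block */
    if ((unsigned)a->length <= idx)  /* ldr w1,[x0,#8]; cmp; blo */
        return -1;
    return a->data[idx];
}
```

The point of the sketch is only the block layout: with PGO-driven reordering, which of the two throw paths sits on the fall-through edge can flip, and that is exactly the inversion the BEFORE/AFTER disassembly shows.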
Ubuntu arm64 shows a similar regression at the same change. Feels like I must be missing something obvious here. cc @EgorBo
Also (note to self): perfview will dump the raw counts. If you tell it to annotate the source for a method and look in the log, you'll see something like:
This gives similar (though not exactly the same) data as my instructions retired tool with the new -show-samples option, which for the AFTER run above produces:
The difference may be that my tool uses the exact BDN events to filter the samples, while the perfview report above is using an "include filter" on
Similar results on the M1
@AndyAyersMS so if I understand you correctly, you blame an indirect call to
Good news: turns out my PR #70988 doesn't need
However, the actual improvement from the direct call is still small (~1%).
Right, the older version of #70988 had the same effect on the codegen and didn't fix the perf issue here.
The benchmark is basically "fixed" with #71534.
Interesting. It would still be nice to understand why there was a regression here. Since it repros on many different devices, it seems like something fundamental and not just a microarchitectural quirk. Seems like there might be something important to be learned.
My current hunch is that something in the stub chain between this method and
Btw, performance was slightly faster (>2%) if I remove this condition + #70988 (without
It means: just call the absolute address, don't try to reloc it (or emit jump stubs).
Running on an Ampere. From perf, an overview:
So it seems to be primarily a stall issue. Drilling in a bit:
So it looks like a backend stall. I was not able to pin this down to any particular backend event, even following the approach in "Arm Neoverse N1 Core: Performance Analysis Methodology". Also consulted the Arm_Neoverse_N1_PMU_Guide. Sampling on
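For reference, the stall-rate metrics I was computing, as I understand them from the N1 methodology (formulas assumed; `STALL_FRONTEND`/`STALL_BACKEND` are the N1 PMU event names):

```c
/* Top-down-style stall rates from raw PMU counts:
   frontend stall rate = STALL_FRONTEND / CPU_CYCLES
   backend  stall rate = STALL_BACKEND  / CPU_CYCLES
   A high backend rate with no single dominant backend sub-event is
   the situation described above. */
typedef struct {
    unsigned long long cycles;         /* CPU_CYCLES */
    unsigned long long stall_frontend; /* STALL_FRONTEND */
    unsigned long long stall_backend;  /* STALL_BACKEND */
} PmuCounts;

double frontend_stall_rate(const PmuCounts *c) {
    return c->cycles ? (double)c->stall_frontend / (double)c->cycles : 0.0;
}

double backend_stall_rate(const PmuCounts *c) {
    return c->cycles ? (double)c->stall_backend / (double)c->cycles : 0.0;
}
```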
We no longer benchmark that bit of hardware, so not much we can do at this point. |
category:performance
theme:benchmarks