Increased variance in some benchmark tests #87324
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

We have recently noticed a mysterious increase in variance for a number of benchmark tests. For example MicroBenchmarks.Serializers.Json_FromStream.SystemTextJson_SourceGen_

Zooming in to the recent history, we see that the variance changes started around June 3.

One hypothesis is that the variance is caused by Dynamic PGO, but Dynamic PGO was enabled on May 19th, and performance improved at that point and remained steady for two weeks afterwards. So while Dynamic PGO may be involved, there is seemingly some other causal factor here too.

Looking at this benchmark across os/isa, we see that for this particular test the variance was only an issue for win-x64-intel (pink).

I will collate a few more examples here to see if anything obvious jumps out.
Looking at secondary counters, it seems some aspect of GC may have changed -- fewer Gen0 GCs. If we zoom in like above, we see the GC change is well correlated with the perf change (hard to get the x-axis exactly the same here, but it's close). A plausible commit range here is 7febca6...2f18320, but there does not seem to be any obvious culprit in that range.
Many of the other tests in #87179 have similar behaviors, though most of those seem to have stabilized as of late.
System.Collections.ContainsTrue<Int32>.ICollection(Size: 512)'s variance seems to be related to enabling Dynamic PGO; comparing against the NonPGO history (until May 19), this is not new behavior. @EgorBo, possibly a case for testing if the reservoir size is too small.
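For context, class probes keep a small fixed-size sample ("reservoir") of the types observed at a call site. A minimal reservoir-sampling sketch (illustrative only; the class and member names below are made up and this is not the runtime's actual implementation) shows why a small table can yield a different dominant type from run to run:

```csharp
using System;
using System.Linq;

// Classic reservoir sampling (Algorithm R): once the table is full, each new
// observation replaces a random slot with probability size/seen. A small table
// therefore produces a noisier sample, and the "dominant" type the JIT would
// guess for devirtualization can differ between runs.
class ClassProbeReservoir
{
    private readonly string[] _table;   // sampled type names (stand-in for class handles)
    private int _seen;                  // total observations so far
    private readonly Random _rng = new Random();

    public ClassProbeReservoir(int size) => _table = new string[size];

    public void Observe(string typeName)
    {
        _seen++;
        if (_seen <= _table.Length)
        {
            _table[_seen - 1] = typeName;   // fill phase
        }
        else
        {
            int slot = _rng.Next(_seen);    // replace with probability size/seen
            if (slot < _table.Length)
                _table[slot] = typeName;
        }
    }

    // A consumer (here, conceptually the JIT) picks the most frequent sampled type.
    public string DominantType() =>
        _table.Where(t => t != null)
              .GroupBy(t => t)
              .OrderByDescending(g => g.Count())
              .Select(g => g.Key)
              .FirstOrDefault();
}
```

With a small reservoir at a call site that sees several types, the dominant entry can change between runs, which would change guarded devirtualization decisions and hence the measured times.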
The first test in this issue now seems to have calmed down, possibly because of the larger reservoir table size for class probes in #87332.
Another example (from #87179): System.Collections.CtorFromCollection(Int32).SortedSet(Size: 512)
Some analysis I did for #87179 suggests one idea to try. With the scalable profile mode #84427 and pervasive sparse profiling #80481, during count reconstruction we often compute the difference between counter values; if that difference is negative, we set the count to zero. But a zero count has a special meaning for PGO (see e.g. #48778), and it may be that the JIT is getting overly hung up on the zero, and this happens randomly from run to run. We could validate this by seeing if the PGO counts for the benchmarks above are large enough that we switch from deterministic to randomized counting (currently this happens when the count reaches 8192). So one idea is to set the diff count for these inconsistent cases to something small but nonzero, say 1% of the larger/smaller value being subtracted, since large counts have a 2% or so margin of error.
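A rough sketch of what that clamping might look like (the names and the exact 1% floor here are illustrative assumptions, not the actual JIT change):

```csharp
using System;

static class ProfileReconstruction
{
    // Illustrative sketch of the idea above -- not the actual JIT implementation.
    // When two approximate counters are subtracted during count reconstruction,
    // clamp a negative result to a small positive value instead of zero, since a
    // zero count has special meaning to the JIT.
    public static long ReconstructedDiff(long counterA, long counterB)
    {
        long diff = counterA - counterB;
        if (diff >= 0)
            return diff;

        // Counters past the deterministic threshold (8192 in the note above) carry
        // roughly a 2% margin of error, so treat the result as "rarely run" rather
        // than "never run": use ~1% of the smaller operand, and at least 1.
        long smaller = Math.Min(counterA, counterB);
        return Math.Max(1, smaller / 100);
    }
}
```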
Looking at the hottest method in System.Collections.CtorFromCollection(Int32).SortedSet(Size: 512), it certainly seems likely that we could run into the problem noted above, where we subtract two large approximate counts and end up with a negative result. Hypothetically, if each side of the subtraction is around 100,000 and each counter carries a ~2% error, the computed difference can easily come out a few thousand negative and then get clamped to zero.
With the advent of scalable profile counters (or even with the old racing counters) counts might be approximate or inconsistent. When we run across a negative count during reconstruction, set the afflicted count to a small positive value instead of to zero, since zero has special meaning to the JIT. The aim is to reduce some of the benchmark instability we are seeing in #87324. Depending on exact counter values, we can see different sets of edge weights (and hence block weights) from run to run.
So possibly the variance comes from using median results in the lab, where the iteration-to-iteration times drop sharply once everything is finally tiered up (note that on my machine PGO is quite typically faster).
Here's the entire set of these tests, locally. "diff" is with PGO, "base" is without.

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
Adding [MinWarmupCount(6, forceAutoWarmup: true)] and [MaxWarmupCount(10, forceAutoWarmup: true)] gives better numbers with PGO enabled (diff).

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
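For reference, here is how those attributes would sit on a benchmark class; a minimal sketch using a simplified stand-in for the SortedSet constructor benchmark (the class, field, and method names below are made up, not the actual MicroBenchmarks source):

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

// The warmup attributes match the comment above; the benchmark body is a placeholder.
[MinWarmupCount(6, forceAutoWarmup: true)]
[MaxWarmupCount(10, forceAutoWarmup: true)]
public class SortedSetCtorSketch
{
    private int[] _values;

    [Params(512)]
    public int Size;

    [GlobalSetup]
    public void Setup() => _values = Enumerable.Range(0, Size).ToArray();

    // Extra forced warmup iterations give tiering and Dynamic PGO time to settle
    // before BenchmarkDotNet starts measuring.
    [Benchmark]
    public SortedSet<int> CtorFromCollection() => new SortedSet<int>(_values);
}
```

The forced warmup lets tiered compilation and Dynamic PGO settle before the measured iterations start, which is consistent with the steadier numbers reported above.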
It looks like there are a variety of explanations: class probe reservoir sizing (possibly addressed by #87332), negative counts being zeroed during profile reconstruction, and warmup/tiering effects interacting with the lab's median-based measurements.
I am going to close this as I don't see anything actionable right now; we can revisit particular cases if they prove problematic.