Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize performance measurement #43227

Open
11 of 20 tasks
kunalspathak opened this issue Oct 9, 2020 · 24 comments
Open
11 of 20 tasks

Stabilize performance measurement #43227

kunalspathak opened this issue Oct 9, 2020 · 24 comments
Assignees
Labels
area-Meta Bottom Up Work Not part of a theme, epic, or user story tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Oct 9, 2020

The code quality and performance of RyuJIT is tracked internally by running MicroBenchmarks in our performance lab. We regularly triage the performance issues opened by the .NET performance team. After going through these issues for past several months, we have identified some key points.

Stability

Many times, the set of commits that are flagged as introducing regression in a benchmark, do not touch the code that is tested in the benchmark. In fact, the assembly code generated for the .NET code that is being tested is often identical and yet the measurements show differences. Some of our investigation reveals that the fluctuation in the benchmark measurements happen because of the misalignment of generated JIT code in process memory. Below is an example of LoopReturn benchmark that shows such behavior.

image

It is very time consuming for .NET developers to do the analysis of benchmarks that regressed because of things that are out of control of .NET runtime. In the past, we have closed several issues like #13770, #39721 and #39722 because they were regressions because of code alignment. A great example that we found out while investigating those issues was the change introduced in #38586 eliminated a test instruction and should have showed improvement in the benchmarks, but introduced regression because the code (loop code inside method) now gets misaligned and the method runs slower.

Alignment issues was brought up few times in #9912 and #8108 and this issue tracks the progress towards the goal of stabilizing and possibly improving the performance of .NET apps that are heavily affected because of code alignment.

Performance lab infrastructure

Once we address the code alignment issue, the next big thing will be to identify and make required infrastructure changes in our performance lab to make sure that it can easily flag such issues without needing much interaction from .NET developers. For example, dotnet/BenchmarkDotNet#1513 proposes to make memory alignment in the benchmark run random to catch these issues early and once we address the underlying problem in .NET, we should never see bimodal behavior of those benchmarks. After that, if the performance lab does find a regression in the benchmark, we need to have robust tooling support to get possible metrics from performance runs so that a developer doing the investigation can easily identify the cause of regression. For example, identifying the time spent in various phases of .NET runtime like Jitting, Jit interface, Tier0/Tier1 JIT code, hot methods, instructions retired during benchmark execution and so forth.

Reliable benchmarks collection

Lastly, for developers working on JIT, we want to identify set of benchmarks that are stable enough and can be trusted to give us reliable measurement whenever there is a need to verify the performance for changes done to the JIT codebase. This will help us conduct performance testing ahead of time and identify potential regressions rather than waiting it to happen in performance lab.

Here are set of work items that we have identified to achieve all the above:

Code alignment work

Future work

  • Make method entries of JIT code having loops aligned at relevant boundary for arm32/arm64. See section 4.8 of https://developer.arm.com/documentation/swog309707/a.
  • Combine branch tightening and loop alignment adjustments in single phase.
  • Advanced: If padding proves to be costly, have a way to spread the padding throughout the method so the loop header gets aligned. Account the padding while doing branch tensioning.
    • Add padding at the blind spot like after jmp or ret instruction that comes before align instruction.
    • Add padding at the end of blocks that has lower weight than the block that precedes the loop header block.
    • Explore option to have a method misaligned such that we can skip the padding needed for loops and they get auto-aligned or aligned with minimal padding.
  • Perform loop alignment similar to Align inner loops #44370 for R2R code.
  • For x86, we should improve the encoding we use for multiple size NOP instructions. Today, we just output repeated single byte 90 , but could do better like we do for x64.
  • Explore option to align enclosing loops if it can borrow some of the padding needed for inner loop. E.g. in a nested loop, if there is an inner loop that needs padding of 10 bytes and the outer loop can be aligned by adding padding of 4 bytes, then add padding of 4 bytes to outer loop and 6 bytes to inner loop. That way, both the loops are aligned.

Performance tooling work

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib , @danmosemsft

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Oct 9, 2020
@danmoseley
Copy link
Member

cc @DrewScoggins @billwert @adamsitnik

@danmoseley
Copy link
Member

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

@adamsitnik adamsitnik added the tenet-performance-benchmarks Issue from performance benchmark label Oct 9, 2020
@kunalspathak
Copy link
Member Author

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

Yes, there will definitely be more reason for bimodality, but alignment will fix most of the obvious ones that we know of and then, it will be easier for us to focus one remaining ones. Currently, there are just too many bimodal benchmarks.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 9, 2020
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Oct 9, 2020
@JulieLeeMSFT JulieLeeMSFT added the Epic Groups multiple user stories. Can be grouped under a theme. label Oct 9, 2020
@JulieLeeMSFT JulieLeeMSFT added Team Epic and removed Epic Groups multiple user stories. Can be grouped under a theme. labels Oct 19, 2020
@JulieLeeMSFT JulieLeeMSFT reopened this Oct 19, 2020
@JulieLeeMSFT
Copy link
Member

Bolded WIP texts in the plan.

@JulieLeeMSFT JulieLeeMSFT added User Story A single user-facing feature. Can be grouped under an epic. Bottom Up Work Not part of a theme, epic, or user story and removed Team Epic labels Nov 16, 2020
@danmoseley
Copy link
Member

As an aside, is there more than BDN can do to help regularize data alignment, when the buffer is allocated by the test and reused?

Never mind, I forgot about dotnet/performance#1587 etc. we're attacking both sides. 😊

@kunalspathak
Copy link
Member Author

Another round of analysis.

Stability

I went through the stability data for windows x64 and linux x64 that @DrewScoggins shared with me. In this report, "regression" means that loop alignment introduced more variance to the measurements and thus made the benchmark instable, while "improvements" meant that the loop alignment work actually stabilized the benchmark and reduced the variance. I analyzed around 150 benchmarks to see if the regressions were real by checking at the historical data of Windows x64 and Linux x64 for each of the 150 benchmark. If I see that the benchmark had lot of variance historically, I concluded that the loop alignment didn't added to the variance. There were few exceptions like Perf_Enumerable.Count and Single.Max where post-loop alignment we saw some variance, but other than that, most of the regression pointed out in report were not related to the loop alignment work.

Performance

Here are some of the take away points:

  • JIT maintains a loop table to track all the loops it has seen and use it to perform various optimizations and analysis. It turns out that there is a bug (or a missing feature, whatever you call) where we do not record the cloned loop inside loop table. With that, we miss out opportunities to do the required analysis on cloned loops. Tracking issue: Certain loops do not get recorded in optLoopTable #43713

  • We sometimes add "NOP compensation" for the over-estimated instruction. Most of the over-estimated instructions are jumps and that that happens because branch tightening happens before loop alignment adjustment phase. In branch tightening we adjust the size of jump instructions depending on the target offset, but later, the loop alignment adjustment can further shorten the IG offsets resulting in over-estimated jump instruction. To address this, we need to combine branch tightening and loop alignment adjustment in same phase rather than separating them. I have updated the work item list to capture this explicitly. After doing this, we might further think about adding the compensation behind jmp or in cold block.

  • Often the encoding of jump instruction before vs. after loop alignment might increase because of added padding instructions. I do not think we can do much about it.

FannkuchRedux_9

image

The regression happens because of extra padding we added to align some of the loops inside benchmark code. Asmdiffs: https://www.diffchecker.com/ryZqqO6I.

MulMatrix

image

The regression happens because of addition of 5 "NOP overestimate compensation" in hot blocks. The overestimation happened for jump instructions. The way to address is by combining branch tightening with loop alignment adjustments (already planned work). Asmdiffs: https://www.diffchecker.com/IcktyNPi

TryGetUInt16:

image

In this method, TryParseUInt16D() is the hot method as seen from VTune screenshot:

image
image

With loop alignment, we ended up adding padding at the end of hot block that VTune pointed above (Block 2).
Asmdiffs: https://www.diffchecker.com/jHt5pv99

GetTypeCode

image

Although this shows regression, I verified that there is no loop inside the benchmark, nor a loop in the libraries method that it test. Further, I also compared the asm diffs and there is no "align" instruction.

ContainsTrue/ContainsFalse

image

image

image

This one is interesting. The hot method for all these benchmarks is GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf. In this method, we clone a loop, but later decide to not align it because it had call. However, we do not record that fact in the cloned loop because of which we have to align the cloned loop containing call. Additionally, until we emit the padding in the cloned loop, we have to also compensate the over-estimated instructions. One of the place where such overestimated instruction is compensated happen to be the actual hot loop that the method executes. You can see my detail analysis in #43713 (comment).

@JulieLeeMSFT
Copy link
Member

JulieLeeMSFT commented Feb 4, 2021 via email

@kunalspathak
Copy link
Member Author

What is cloned loop in the loop table? Thanks, Julie

Sometimes, if a loop has checks like array's bound check, we will clone a loop. One variant of the loop will have no bound checks (and hence will be fast loop) while other variant will have bound checks (and will be a slow loop). Depending on the input, we decide whether to execute fast loop or the cloned (slow) loop. You can see an example here. Here, fast loop is L003e ~ L0058 while cloned slow loop is L005c ~ L0080. During JIT, we track all the loops in a data structure called optLoopTable. Whenever we want to do any analysis on loops, we iterate over this table to decide whether the given loop needs any update/optimization. When we clone a loop, we do not inserted an entry of cloned loop inside optLoopTable because of which further analysis doesn't happen on cloned loop. Towards the end of my comment in #43713 (comment), I have listed what type of analysis we miss out on cloned loop because of lack of recording its entry inside optLoopTable.

@JulieLeeMSFT
Copy link
Member

Moved future since .NET 6 scope items are complete now.

@MithrilMan
Copy link

I read your article here about loop alignment here https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/ and it mentions that a loop like this

for (int l = 0; l < M; l++) {
        // body
        OtherMethod();
    }

isn't a candidate for inlining because contains a call to a method
What about methods that are inlined?
Shouldn't inline method be considered like loop body and be eligible for alignment if their body is small enough?

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 13, 2021
@kunalspathak
Copy link
Member Author

What about methods that are inlined?

Inlining decision happens in early phase of compilation and when we decide whether to align a loop or not, we already know that it was inlined or not. We won't align the loop if we know for sure that the method will not be inlined.

@krwq krwq removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 13, 2021
@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 4, 2021
@buyaa-n buyaa-n removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Meta Bottom Up Work Not part of a theme, epic, or user story tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark User Story A single user-facing feature. Can be grouped under an epic.
Projects
Archived in project
Status: Needs Consultation
Development

No branches or pull requests

8 participants