
Improved register promotion in the presence of exception handling #6212

Open
GSPP opened this issue Jun 25, 2016 · 8 comments
Assignees
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
enhancement (Product code improvement that does NOT require public API changes/additions)
optimization
tenet-performance (Performance related issue)
Milestone

Comments

GSPP commented Jun 25, 2016

Here's a micro benchmark to test the performance effect of an exception handler. The handler is never executed, it's just there.

    static void Main()
    {
        var outerIterationCount = 100 * 1000;
        var innerIterationCount = 10000; //use a high value so that the handler is entered rarely

        {
            var sw = Stopwatch.StartNew();
            Handler0(outerIterationCount, innerIterationCount);
            sw.Stop();
            Console.WriteLine($"Handler0: {sw.Elapsed.TotalMilliseconds:F1}ms");
        }

        {
            var sw = Stopwatch.StartNew();
            Handler1(outerIterationCount, innerIterationCount);
            sw.Stop();
            Console.WriteLine($"Handler1: {sw.Elapsed.TotalMilliseconds:F1}ms");
        }
    }

    static void Handler1(int outerIterationCount, int innerIterationCount)
    {
        unchecked
        {
            uint result = 0;
            for (int iOuter = 0; iOuter < outerIterationCount; iOuter++)
            {
                try
                {
                    for (int iInner = 0; iInner < innerIterationCount; iInner++)
                    {
                        result += 0x1FEFE590; //dummy work that is not optimized out
                        result ^= 0x59209DFC;
                    }
                }
                catch (Exception ex)
                {
                }
            }
            GC.KeepAlive(result); //sink result
        }
    }

    static void Handler0(int outerIterationCount, int innerIterationCount)
    {
        unchecked
        {
            uint result = 0;
            for (int iOuter = 0; iOuter < outerIterationCount; iOuter++)
            {
                for (int iInner = 0; iInner < innerIterationCount; iInner++)
                {
                    result += 0x1FEFE590; //dummy work that is not optimized out
                    result ^= 0x59209DFC;
                }
            }
            GC.KeepAlive(result); //sink result
        }
    }

Handler0: 590,4ms
Handler1: 3745,1ms

The slowdown is quite severe (6x). Inspecting the disassembly I see that the inner loop in Handler1 spills to the stack after each arithmetic operation. This is so strange and undesirable that I'm reporting this as a bug.

This is .NET Desktop, 4.6.1, x64, Release, no debugger attached.

category:cq
theme:eh
skill-level:expert
cost:extra-large

GSPP commented Jun 25, 2016

Here's a variant that demonstrates a different issue: innerIterationCount is not enregistered and is reloaded from the stack on each inner-loop iteration. This is a code-gen quality issue:

    static void Handler1_Variant2(int outerIterationCount, int innerIterationCount)
    {
        unchecked
        {
            uint result = 0;
            for (int iOuter = 0; iOuter < outerIterationCount; iOuter++)
            {
                try
                {
                    uint result1 = 0; //use a local accumulator
                    for (int iInner = 0; iInner < innerIterationCount; iInner++)
                    {
                        result1 += 0x1FEFE590; //write to local
                        result1 ^= 0x59209DFC; //write to local
                    }
                    result += result1;
                }
                catch (Exception ex)
                {
                }
            }
            GC.KeepAlive(result); //sink result
        }
    }

The inner loop comes out as:

000007FE970D466F  add         ecx,1FEFE590h  
                            result1 ^= 0x59209DFC;
000007FE970D4675  xor         ecx,59209DFCh  
                        for (int iInner = 0; iInner < innerIterationCount; iInner++)
000007FE970D467B  inc         eax  
000007FE970D467D  mov         edx,dword ptr [rbp+18h]  //unnecessary
000007FE970D4680  cmp         eax,edx  
000007FE970D4682  jl          000007FE970D466F  

And indeed this variant is still a little bit slower than Handler0.

pgavlin commented Jun 25, 2016

cc @BruceForstall @cmckinsey

pgavlin commented Jun 25, 2016

(and of course @dotnet/jit-contrib)

@cmckinsey
Contributor

@GSPP

This is a known issue and we have a work item on our TFS side for it. I will say it's "by design" right now in that it is intentional, although we would like to improve the design in the future. The current implementation was chosen to be easy to get correct and to stay reliable. I believe this was inherited from the x86 JIT's approach to handling register allocation in the presence of EH. JIT64, which was based on the C++ compiler, has more sophisticated handling of this problem, albeit with more complexity and slower compile time.

It is not true in general that you can wrap a section of code in a TRY/CATCH and expect similar performance, even when no exception is taken. The compiler must account for the fact that a thrown exception transfers control into the runtime stack walker, and register state is not preserved when the handler is invoked or when flow resumes at the continuation after the exception is handled. So in general, for any variable live on an exception edge, a store must be inserted to the frame so that its value can be restored; the default assumption is that any instruction inside the TRY may transfer control to the handler. Both cases you mention above are in fact the same case, and this explains the stores that you see. The reloads, however, are not necessary, and they are an opportunity for improvement in our implementation; eliminating them requires more sophisticated algorithms.
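To illustrate which variables fall into that category, here is a hypothetical sketch (not from the issue, and not a claim about the exact code the JIT emits today; it assumes the same usings as the repro above):

    static int LiveAcrossEH(int n)
    {
        // 'sum' is live on the exception edge: if the body throws, the catch runs and
        // 'sum' is still needed afterwards, so per the explanation above its current
        // value must be stored to the frame rather than kept only in a register.
        int sum = 0;
        try
        {
            for (int i = 0; i < n; i++)
            {
                // 'tmp' dies inside the loop body and is never needed by the handler
                // or after it, so it is free to live entirely in a register.
                int tmp = i * 2;
                sum += tmp;
            }
        }
        catch (Exception)
        {
            Console.WriteLine("handled");
        }
        return sum; // the use after the catch keeps 'sum' live across the EH region
    }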

Improving this involves major design decisions in the compiler's fundamental data-flow representation and model, and given our current ship deliverables we have not tackled it yet. I don't believe this is an area where we can put the work up for grabs. It is something we want to fix in a future release.

One additional point before someone else makes it: in your case it can be proven that no exception occurs in the TRY, so the compiler could eliminate the TRY/CATCH entirely, at which point the JIT could generate fully enregistered code. That is a separate optimization that we also don't do today.

BTW, thank you for the stand-alone benchmark that illustrates the problem. I would be very happy to accept someone adding this as an xunit JIT benchmark for us to run to keep reminding the team until we get better.
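For reference, a minimal sketch of what such a test could look like, assuming a plain xunit [Fact] with Stopwatch rather than whatever harness the JIT benchmarks actually use, and assuming the Handler0/Handler1 methods from the repro above are accessible to this class:

    using System;
    using System.Diagnostics;
    using Xunit;

    public class EhRegisterPromotionBenchmark
    {
        const int OuterIterations = 100 * 1000;
        const int InnerIterations = 10000;

        [Fact]
        public void TryCatchAroundHotLoop()
        {
            // Warm up both methods so the timed runs measure steady-state code.
            Handler0(1, 1);
            Handler1(1, 1);

            var sw = Stopwatch.StartNew();
            Handler0(OuterIterations, InnerIterations);
            sw.Stop();
            Console.WriteLine($"Handler0: {sw.Elapsed.TotalMilliseconds:F1}ms");

            sw = Stopwatch.StartNew();
            Handler1(OuterIterations, InnerIterations);
            sw.Stop();
            Console.WriteLine($"Handler1: {sw.Elapsed.TotalMilliseconds:F1}ms");
        }
    }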

cmckinsey changed the title from "Exception handling causes significant slowdown" to "Improved register promotion in the presence of exception handling" on Jun 25, 2016

GSPP commented Jun 25, 2016

Thank you for that enlightening explanation.

I just tried it with finally and found the same issue, which is logical. Does this mean that whatever is inside a try (or using) is potentially really slow? That would be important to know as an optimization guideline. So far I have only found "people on the web" claiming that unused EH is "zero-cost". It's important to know that, at least right now, this is not the case.

It seems to me that the only way to make it fast again would be to track the correspondence between registers and locals and to generate code that spills the registers to stack slots only when an exception actually transfers control out of the try.

Surely, the existence of exception filters does not make this easier.

And in turn this seems to make it harder to coalesce stores because the effect of stores could be visible to an exception filter.

Indeed I was not able to find any case where the JIT coalesces multiple stores to the same location.
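To make the filter point above concrete, here is a hypothetical sketch (the method and the ThrowMaybe helper are made up, not from the issue, and it assumes the same usings as the repro above). The filter expression runs at the throw point, while the protected frame is still live, so it can observe the current value of result; if the JIT deferred or merged the stores below, the filter could observe a value that never existed in the source program:

    static uint FilterObservesIntermediateState(int n)
    {
        uint result = 0;
        try
        {
            for (int i = 0; i < n; i++)
            {
                result += 0x1FEFE590; // this store must be visible in the frame...
                ThrowMaybe(i);        // ...if this call throws, because the filter reads 'result'
                result ^= 0x59209DFC;
            }
        }
        catch (Exception) when (result != 0) // filter reads the local at the throw point
        {
        }
        return result;
    }

    static void ThrowMaybe(int i)
    {
        if (i == 12345)
            throw new InvalidOperationException();
    }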

@cmckinsey
Contributor

@GSPP

I'm not sure how to define "potentially really slow"; it really depends on the workload. I wouldn't suggest wrapping your critical compute-bound inner loops in TRY/FINALLY; rather, factor them out into a separate method. But for most cases I would typically tend to think of it as a small overhead that we'll chip away at in the JIT. Generally people say our EH model is "zero cost" compared to older dynamic EH models that require code overhead on entry to and exit from TRY regions (remember SETJMP/LONGJMP-based EH models from back in the day). The 64-bit Windows model is static and so generally cheaper at runtime. Yes, finally/catch/filter/using all present similar complications to the JIT optimizer.

Experience suggests the most critical thing is to make sure we can avoid unnecessary reloads in the TRY bodies, since loads typically cause latency. While stores do take up memory ports and instruction-encoding space in the cache, they have near-zero latency. So focusing on store coalescing typically has 2nd- or 3rd-order impact.
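A hypothetical sketch of that restructuring of the repro (the method names are made up; it assumes the same usings as the repro above plus System.Runtime.CompilerServices for the NoInlining attribute):

    static void Handler1_Hoisted(int outerIterationCount, int innerIterationCount)
    {
        unchecked
        {
            uint result = 0;
            for (int iOuter = 0; iOuter < outerIterationCount; iOuter++)
            {
                try
                {
                    // The hot loop lives in its own method, so its locals are not live
                    // across this method's EH region and are free to stay in registers.
                    result += InnerLoop(innerIterationCount);
                }
                catch (Exception)
                {
                }
            }
            GC.KeepAlive(result); //sink result
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the loop body out of the TRY region
    static uint InnerLoop(int innerIterationCount)
    {
        unchecked
        {
            uint result = 0;
            for (int iInner = 0; iInner < innerIterationCount; iInner++)
            {
                result += 0x1FEFE590; //dummy work that is not optimized out
                result ^= 0x59209DFC;
            }
            return result;
        }
    }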

GSPP commented Jun 27, 2016

Makes sense. Thank you! These information tidbits are really valuable.

jakobbotsch commented Jul 24, 2023

I get the following on main today, on a 5950X:

Handler0: 426.4ms
Handler1: 434.2ms

But still doesn't look like EH write-thru kicks in (diffs), so maybe we should keep this open anyway? Will leave it up to @kunalspathak

TIHan removed the JitUntriaged (CLR JIT issues needing additional triage) label on Nov 1, 2023
TIHan modified the milestones: Future, 9.0.0 on Nov 1, 2023
kunalspathak removed this from the 9.0.0 milestone on May 3, 2024
JulieLeeMSFT added this to the Future milestone on May 16, 2024