Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible OSX/Linux JIT Inlining Optimization Bug .Net Core 5.0.200 #49078

Closed
kevinmv opened this issue Mar 3, 2021 · 14 comments · Fixed by #49256
Closed

Possible OSX/Linux JIT Inlining Optimization Bug .Net Core 5.0.200 #49078

kevinmv opened this issue Mar 3, 2021 · 14 comments · Fixed by #49256
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kevinmv
Copy link

kevinmv commented Mar 3, 2021

Hello!

We are seeing what looks to be a JIT optimization bug in .Net Core 5.0.200 but only for OSX and Linux, Windows does not present any issues.

Description

We have a function ScheduleFlushSend(JobHandle dep) which takes a small struct as an input parameter:

    [StructLayout(LayoutKind.Sequential)]
    public struct JobHandle
    {
        public IntPtr JobGroup;
        public uint Version;
        //<member functions removed for brevity>
     }

Inside ScheduleFlushSend we pass along the dep JobHandle parameter to an implemented interface method ScheduleSend however depending on optimization levels the dep jobhandle is corrupted.

Release builds can be forced to work by decorating ScheduleFlushSend with [MethodImpl(MethodImplOptions.NoInlining)], or alternatively running with COMPlus_JITMinOpts=1

The crash would only happen after calling ScheduleFlushSend 30+ times, however by running with COMPlus_TieredCompilation=0 the corruption is immediate further supporting our suspicion this is a JIT optimization bug.

Printing out the values of the JobHandle members before passing / after entering functions shows what looks to be an issue where inlining the function results in the parameter being read from a memory address that is off by 8 bytes.

(in this example RPCSystem.OnUpdate calls ScheduleFlushSend which calls ScheduleSend)

RPCSystem:	                jobgroup=76 version=0
   |--ScheduleFlushSend:	jobgroup=76 version=0
      |--ScheduleSend:	        jobgroup=139823864677200 version=76 

Here you can see the JobGroup field's value has shifted into the second member of the struct, and the first member is a random value (it actually looks like a valid, irrelevant, memory address)

I can send a working (forcing noinlining on ScheduleFlushSend) repo and the crashing repo (removing the forced inlining) if provided an address.

Configuration

  • Which version of .NET is the code running on?
    • 5.0.200
  • What OS and version, and what distro if applicable?
    • Ubuntu 20.04
    • OSX Mojave
  • What is the architecture (x64, x86, ARM, ARM64)?
    • x64
  • Do you know whether it is specific to that configuration?
    • Yes, only reproduces with optimizations on. Debug builds work correctly, and forcing inlining off hides the issue

Other information

  • We do not see this issue on Windows, OSX or Linux when using .Net Core 6.0.0-preview.1
  • We are compiling the application as a netstandard 2.0 and running via netcorerun.exe (a netcore app that loads all assemblies into an AssemblyLoadContext and then executes main)
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 3, 2021
@kevinmv
Copy link
Author

kevinmv commented Mar 3, 2021

Adding area owners @BruceForstall @dotnet/jit-contrib

@BruceForstall
Copy link
Member

@kevinmv Thank you for the report, and investigation. It does smell like a codegen bug. Given that it's not reproducing in .NET 6, perhaps we just need to figure out what fixed the problem and backport the fix. Let me figure out how to get an upload from you of the repros.

In the meantime, if you have the capability to build your own version of .NET from source, and can build a Checked build, running with "export COMPlus_JitDump=ScheduleFlushSend ScheduleSend" would be be the first thing we would likely do, to see the JIT internal output for these functions. (You could upload the output as a gist if there's nothing private you can't share publicly.)

@JulieLeeMSFT JulieLeeMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 4, 2021
@kevinmv
Copy link
Author

kevinmv commented Mar 4, 2021

Thanks! I made a local build and it seems really quite informative. The Checked build does not fail immediately the first time ScheduleFlushSend is called but after ~40 invocations the method is recompiled and crashes right after. I've never read this log before but it seems like a possible bug with the ScheduleFlushSend trying to be devirtualized ScheduleSend.
jit-opt-bug-out.txt

It's unclear to me if this bug has been fixed in .Net 6 or just doesn't present itself. Let me know if I can provide any more information to help.

All the best,
Kev

@BruceForstall BruceForstall added this to the 6.0.0 milestone Mar 5, 2021
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Mar 5, 2021
@BruceForstall
Copy link
Member

This looks like bad code generation in ScheduleFlushSend(), where it passes arguments to ScheduleSend() that don't match what ScheduleSend() expects to see.

In particular, the problem is when we think we can make a fast (optimistic) tailcall from ScheduleFlushSend() to ScheduleSend() (because ScheduleFlushSend() just returns the value of ScheduleSend() and does no further processing), but, late, we realize we can't because don't have enough available stack space to place the outgoing arguments. So we back out, and attempt to create a normal call. But, in the special case the called function returns a struct in multiple registers (in this case, the 16 byte JobHandle struct is passed in consecutive registers on Linux x64), we do a little extra processing to handle the return value. One more special-case: the call to ScheduleSend() is a virtual stub dispatch call where we previously added a special, hidden argument (the address of the VSD cell). When we process the call again (morph the call + call args), we forget about that, and think this special arg is a normal arg, so we erroneously add another special arg.

So, in summary, for a call A->B, to see this failure, we need:

  1. The call is considered a potential tailcall (by the importer)
  2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
  3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg to ScheduleSend is a large struct)
  4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI)
  5. Given all these, we forget about the non-standard args we added, then add them again (in fgMorphCall)

I'm not sure why this wouldn't repro in .NET 6; the problem still seems to exist. I need to try and construct a small repro case.

@kevinmv
Copy link
Author

kevinmv commented Mar 5, 2021

Thanks @BruceForstall that is a great breakdown. Let me know if I can provide anymore information.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 8, 2021
BruceForstall added a commit to BruceForstall/runtime that referenced this issue Mar 11, 2021
The problem was when a VSD interface call returning a multi-byte struct
in registers was initially considered to be a tailcall, but the tailcall
was abandoned in morph due to not enough stack space to store outgoing
arguments, in which case we create a new call return local to store the
return struct, and re-morph the call. In doing so, we forget that we had
already added VSD non-standard args, and re-add them, leaving the originally
added arg as a "normal" arg that shouldn't be there.

So, in summary, for a call A->B, to see this failure, we need:
1. The call is considered a potential tailcall (by the importer)
2. The call requires non-standard arguments that add call argument IR in
   fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough
   incoming arg stack space in A to store B's outgoing args), in this case
   because the first arg is a large struct. We can't reject it earlier,
   due to things like address exposed locals -- we must get far enough
   through the checks to have called fgInitArgInfo() to add the extra
   non-standard arg.
4. B returns a struct in multiple registers (e.g., a 16-byte struct in
   Linux x64 ABI)

The fix is to remove the previously-added non-standard VSD argument IR when
we are preparing to re-morph a call.

There was one other location that was already doing this. I'm a little
worried that there are other places where the non-standard added IR is
not being considered when we clear out the arg info before remorphing.
It seems like there is some risk here. Probably, we should consider a
better way of handling the non-standard arg IR given the need in some
cases to re-morph calls.

I commented out a few cases of:
```
assert(!fgCanFastTailCall(call, nullptr));
```
because this function can call `fgInitArgInfo` which can alter the IR.
That seems dangerous in an assert, which should have any side-effects
like modifying the IR.

Fixes dotnet#49078

No SPMI asm diffs.
BruceForstall added a commit that referenced this issue Mar 11, 2021
…#49256)

The problem was when a VSD interface call returning a multi-byte struct
in registers was initially considered to be a tailcall, but the tailcall
was abandoned in morph due to not enough stack space to store outgoing
arguments, in which case we create a new call return local to store the
return struct, and re-morph the call. In doing so, we forget that we had
already added VSD non-standard args, and re-add them, leaving the originally
added arg as a "normal" arg that shouldn't be there.

So, in summary, for a call A->B, to see this failure, we need:
1. The call is considered a potential tailcall (by the importer)
2. The call requires non-standard arguments that add call argument IR in
   fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough
   incoming arg stack space in A to store B's outgoing args), in this case
   because the first arg is a large struct. We can't reject it earlier,
   due to things like address exposed locals -- we must get far enough
   through the checks to have called fgInitArgInfo() to add the extra
   non-standard arg.
4. B returns a struct in multiple registers (e.g., a 16-byte struct in
   Linux x64 ABI)

The fix is to remove the previously-added non-standard VSD argument IR when
we are preparing to re-morph a call.

There was one other location that was already doing this. I'm a little
worried that there are other places where the non-standard added IR is
not being considered when we clear out the arg info before remorphing.
It seems like there is some risk here. Probably, we should consider a
better way of handling the non-standard arg IR given the need in some
cases to re-morph calls.

I commented out a few cases of:
```
assert(!fgCanFastTailCall(call, nullptr));
```
because this function can call `fgInitArgInfo` which can alter the IR.
That seems dangerous in an assert, which should have any side-effects
like modifying the IR.

Fixes #49078

No SPMI asm diffs.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 11, 2021
@ecuzzillo
Copy link

Is there some way for us to be notified or find out when this fix is backported to .NET 5?

@BruceForstall BruceForstall reopened this Mar 11, 2021
@BruceForstall
Copy link
Member

Reopening for .NET 5 port.

Is there some way for us to be notified or find out when this fix is backported to .NET 5?

The best way would be to follow this issue (I'll submit a PR with the .NET 5 port of the fix soon)

BruceForstall added a commit to BruceForstall/runtime that referenced this issue Mar 12, 2021
The problem was when a VSD interface call returning a multi-byte struct
in registers was initially considered to be a tailcall, but the tailcall
was abandoned in morph due to not enough stack space to store outgoing
arguments, in which case we create a new call return local to store the
return struct, and re-morph the call. In doing so, we forget that we had
already added VSD non-standard args, and re-add them, leaving the originally
added arg as a "normal" arg that shouldn't be there.

So, in summary, for a call A->B, to see this failure, we need:
1. The call is considered a potential tailcall (by the importer)
2. The call requires non-standard arguments that add call argument IR in
   fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough
   incoming arg stack space in A to store B's outgoing args), in this case
   because the first arg is a large struct. We can't reject it earlier,
   due to things like address exposed locals -- we must get far enough
   through the checks to have called fgInitArgInfo() to add the extra
   non-standard arg.
4. B returns a struct in multiple registers (e.g., a 16-byte struct in
   Linux x64 ABI)

The fix is to remove the previously-added non-standard VSD argument IR when
we are preparing to re-morph a call.

There was one other location that was already doing this. I'm a little
worried that there are other places where the non-standard added IR is
not being considered when we clear out the arg info before remorphing.
It seems like there is some risk here. Probably, we should consider a
better way of handling the non-standard arg IR given the need in some
cases to re-morph calls.

I commented out a few cases of:
```
assert(!fgCanFastTailCall(call, nullptr));
```
because this function can call `fgInitArgInfo` which can alter the IR.
That seems dangerous in an assert, which should have any side-effects
like modifying the IR.

Fixes dotnet#49078

No SPMI asm diffs.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 12, 2021
@BruceForstall BruceForstall removed this from the 6.0.0 milestone Mar 30, 2021
@BruceForstall BruceForstall added this to the 5.0.x milestone Mar 30, 2021
@acgaudette
Copy link

hey! any updates on the backport? :)

Anipik pushed a commit that referenced this issue Apr 6, 2021
…#49552)

The problem was when a VSD interface call returning a multi-byte struct
in registers was initially considered to be a tailcall, but the tailcall
was abandoned in morph due to not enough stack space to store outgoing
arguments, in which case we create a new call return local to store the
return struct, and re-morph the call. In doing so, we forget that we had
already added VSD non-standard args, and re-add them, leaving the originally
added arg as a "normal" arg that shouldn't be there.

So, in summary, for a call A->B, to see this failure, we need:
1. The call is considered a potential tailcall (by the importer)
2. The call requires non-standard arguments that add call argument IR in
   fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough
   incoming arg stack space in A to store B's outgoing args), in this case
   because the first arg is a large struct. We can't reject it earlier,
   due to things like address exposed locals -- we must get far enough
   through the checks to have called fgInitArgInfo() to add the extra
   non-standard arg.
4. B returns a struct in multiple registers (e.g., a 16-byte struct in
   Linux x64 ABI)

The fix is to remove the previously-added non-standard VSD argument IR when
we are preparing to re-morph a call.

There was one other location that was already doing this. I'm a little
worried that there are other places where the non-standard added IR is
not being considered when we clear out the arg info before remorphing.
It seems like there is some risk here. Probably, we should consider a
better way of handling the non-standard arg IR given the need in some
cases to re-morph calls.

I commented out a few cases of:
```
assert(!fgCanFastTailCall(call, nullptr));
```
because this function can call `fgInitArgInfo` which can alter the IR.
That seems dangerous in an assert, which should have any side-effects
like modifying the IR.

Fixes #49078

No SPMI asm diffs.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2021
@BruceForstall
Copy link
Member

hey! any updates on the backport? :)

@acgaudette It just got merged to the 5.0 branch.

@acgaudette
Copy link

acgaudette commented Apr 6, 2021

hey! any updates on the backport? :)

@acgaudette It just got merged to the 5.0 branch.

@BruceForstall awesome. do you know when the next release might appear?

@BruceForstall
Copy link
Member

The previews are monthly, as mentioned in the last release blog post here: https://devblogs.microsoft.com/dotnet/announcing-net-6-preview-2/. So... soon.

@wackoisgod
Copy link

@BruceForstall would that be true even for the backport ?

@BruceForstall
Copy link
Member

@BruceForstall would that be true even for the backport ?

oops, sorry, got confused between releases there...

However, service patch releases for 5.0 are also release on an approximately monthly cadence, as can be seen from the release page: https://dotnet.microsoft.com/download/dotnet/5.0. One was released today, so this change will be in the next one, in about a month.

@ghost ghost locked as resolved and limited conversation to collaborators May 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants