-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible OSX/Linux JIT Inlining Optimization Bug .Net Core 5.0.200 #49078
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Adding area owners @BruceForstall @dotnet/jit-contrib |
@kevinmv Thank you for the report, and investigation. It does smell like a codegen bug. Given that it's not reproducing in .NET 6, perhaps we just need to figure out what fixed the problem and backport the fix. Let me figure out how to get an upload from you of the repros. In the meantime, if you have the capability to build your own version of .NET from source, and can build a Checked build, running with "export COMPlus_JitDump=ScheduleFlushSend ScheduleSend" would be be the first thing we would likely do, to see the JIT internal output for these functions. (You could upload the output as a gist if there's nothing private you can't share publicly.) |
Thanks! I made a local build and it seems really quite informative. The Checked build does not fail immediately the first time It's unclear to me if this bug has been fixed in .Net 6 or just doesn't present itself. Let me know if I can provide any more information to help. All the best, |
This looks like bad code generation in In particular, the problem is when we think we can make a fast (optimistic) tailcall from So, in summary, for a call A->B, to see this failure, we need:
I'm not sure why this wouldn't repro in .NET 6; the problem still seems to exist. I need to try and construct a small repro case. |
Thanks @BruceForstall that is a great breakdown. Let me know if I can provide anymore information. |
The problem was when a VSD interface call returning a multi-byte struct in registers was initially considered to be a tailcall, but the tailcall was abandoned in morph due to not enough stack space to store outgoing arguments, in which case we create a new call return local to store the return struct, and re-morph the call. In doing so, we forget that we had already added VSD non-standard args, and re-add them, leaving the originally added arg as a "normal" arg that shouldn't be there. So, in summary, for a call A->B, to see this failure, we need: 1. The call is considered a potential tailcall (by the importer) 2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call) 3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg is a large struct. We can't reject it earlier, due to things like address exposed locals -- we must get far enough through the checks to have called fgInitArgInfo() to add the extra non-standard arg. 4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI) The fix is to remove the previously-added non-standard VSD argument IR when we are preparing to re-morph a call. There was one other location that was already doing this. I'm a little worried that there are other places where the non-standard added IR is not being considered when we clear out the arg info before remorphing. It seems like there is some risk here. Probably, we should consider a better way of handling the non-standard arg IR given the need in some cases to re-morph calls. I commented out a few cases of: ``` assert(!fgCanFastTailCall(call, nullptr)); ``` because this function can call `fgInitArgInfo` which can alter the IR. That seems dangerous in an assert, which should have any side-effects like modifying the IR. Fixes dotnet#49078 No SPMI asm diffs.
…#49256) The problem was when a VSD interface call returning a multi-byte struct in registers was initially considered to be a tailcall, but the tailcall was abandoned in morph due to not enough stack space to store outgoing arguments, in which case we create a new call return local to store the return struct, and re-morph the call. In doing so, we forget that we had already added VSD non-standard args, and re-add them, leaving the originally added arg as a "normal" arg that shouldn't be there. So, in summary, for a call A->B, to see this failure, we need: 1. The call is considered a potential tailcall (by the importer) 2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call) 3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg is a large struct. We can't reject it earlier, due to things like address exposed locals -- we must get far enough through the checks to have called fgInitArgInfo() to add the extra non-standard arg. 4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI) The fix is to remove the previously-added non-standard VSD argument IR when we are preparing to re-morph a call. There was one other location that was already doing this. I'm a little worried that there are other places where the non-standard added IR is not being considered when we clear out the arg info before remorphing. It seems like there is some risk here. Probably, we should consider a better way of handling the non-standard arg IR given the need in some cases to re-morph calls. I commented out a few cases of: ``` assert(!fgCanFastTailCall(call, nullptr)); ``` because this function can call `fgInitArgInfo` which can alter the IR. That seems dangerous in an assert, which should have any side-effects like modifying the IR. Fixes #49078 No SPMI asm diffs.
Is there some way for us to be notified or find out when this fix is backported to .NET 5? |
Reopening for .NET 5 port.
The best way would be to follow this issue (I'll submit a PR with the .NET 5 port of the fix soon) |
The problem was when a VSD interface call returning a multi-byte struct in registers was initially considered to be a tailcall, but the tailcall was abandoned in morph due to not enough stack space to store outgoing arguments, in which case we create a new call return local to store the return struct, and re-morph the call. In doing so, we forget that we had already added VSD non-standard args, and re-add them, leaving the originally added arg as a "normal" arg that shouldn't be there. So, in summary, for a call A->B, to see this failure, we need: 1. The call is considered a potential tailcall (by the importer) 2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call) 3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg is a large struct. We can't reject it earlier, due to things like address exposed locals -- we must get far enough through the checks to have called fgInitArgInfo() to add the extra non-standard arg. 4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI) The fix is to remove the previously-added non-standard VSD argument IR when we are preparing to re-morph a call. There was one other location that was already doing this. I'm a little worried that there are other places where the non-standard added IR is not being considered when we clear out the arg info before remorphing. It seems like there is some risk here. Probably, we should consider a better way of handling the non-standard arg IR given the need in some cases to re-morph calls. I commented out a few cases of: ``` assert(!fgCanFastTailCall(call, nullptr)); ``` because this function can call `fgInitArgInfo` which can alter the IR. That seems dangerous in an assert, which should have any side-effects like modifying the IR. Fixes dotnet#49078 No SPMI asm diffs.
hey! any updates on the backport? :) |
…#49552) The problem was when a VSD interface call returning a multi-byte struct in registers was initially considered to be a tailcall, but the tailcall was abandoned in morph due to not enough stack space to store outgoing arguments, in which case we create a new call return local to store the return struct, and re-morph the call. In doing so, we forget that we had already added VSD non-standard args, and re-add them, leaving the originally added arg as a "normal" arg that shouldn't be there. So, in summary, for a call A->B, to see this failure, we need: 1. The call is considered a potential tailcall (by the importer) 2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call) 3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg is a large struct. We can't reject it earlier, due to things like address exposed locals -- we must get far enough through the checks to have called fgInitArgInfo() to add the extra non-standard arg. 4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI) The fix is to remove the previously-added non-standard VSD argument IR when we are preparing to re-morph a call. There was one other location that was already doing this. I'm a little worried that there are other places where the non-standard added IR is not being considered when we clear out the arg info before remorphing. It seems like there is some risk here. Probably, we should consider a better way of handling the non-standard arg IR given the need in some cases to re-morph calls. I commented out a few cases of: ``` assert(!fgCanFastTailCall(call, nullptr)); ``` because this function can call `fgInitArgInfo` which can alter the IR. That seems dangerous in an assert, which should have any side-effects like modifying the IR. Fixes #49078 No SPMI asm diffs.
@acgaudette It just got merged to the 5.0 branch. |
@BruceForstall awesome. do you know when the next release might appear? |
The previews are monthly, as mentioned in the last release blog post here: https://devblogs.microsoft.com/dotnet/announcing-net-6-preview-2/. So... soon. |
@BruceForstall would that be true even for the backport ? |
oops, sorry, got confused between releases there... However, service patch releases for 5.0 are also release on an approximately monthly cadence, as can be seen from the release page: https://dotnet.microsoft.com/download/dotnet/5.0. One was released today, so this change will be in the next one, in about a month. |
Hello!
We are seeing what looks to be a JIT optimization bug in .Net Core 5.0.200 but only for OSX and Linux, Windows does not present any issues.
Description
We have a function
ScheduleFlushSend(JobHandle dep)
which takes a small struct as an input parameter:Inside
ScheduleFlushSend
we pass along thedep
JobHandle parameter to an implemented interface methodScheduleSend
however depending on optimization levels thedep
jobhandle is corrupted.Release builds can be forced to work by decorating
ScheduleFlushSend
with[MethodImpl(MethodImplOptions.NoInlining)]
, or alternatively running withCOMPlus_JITMinOpts=1
The crash would only happen after calling
ScheduleFlushSend
30+ times, however by running withCOMPlus_TieredCompilation=0
the corruption is immediate further supporting our suspicion this is a JIT optimization bug.Printing out the values of the
JobHandle
members before passing / after entering functions shows what looks to be an issue where inlining the function results in the parameter being read from a memory address that is off by 8 bytes.(in this example RPCSystem.OnUpdate calls ScheduleFlushSend which calls ScheduleSend)
Here you can see the
JobGroup
field's value has shifted into the second member of the struct, and the first member is a random value (it actually looks like a valid, irrelevant, memory address)I can send a working (forcing noinlining on
ScheduleFlushSend
) repo and the crashing repo (removing the forced inlining) if provided an address.Configuration
Other information
The text was updated successfully, but these errors were encountered: