Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] thread suspension on Unix #71187

Merged
merged 29 commits into from
Jul 19, 2022
Merged

[NativeAOT] thread suspension on Unix #71187

merged 29 commits into from
Jul 19, 2022

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Jun 23, 2022

Implements return address hijacking and asynchronous suspension at safe points on Unix.

  • x64
  • arm64 (will be in a separate change)

Contributes to: #67805

@ghost ghost assigned VSadov Jun 23, 2022
@VSadov VSadov force-pushed the hijackUnix branch 2 times, most recently from d5325b2 to e050f9b Compare June 23, 2022 20:56
@VSadov VSadov marked this pull request as ready for review June 30, 2022 21:03
@VSadov VSadov requested review from jkotas and removed request for MichalStrehovsky June 30, 2022 21:04
// enum used by the runtime.
GCRefKind GetGcRefKind(ReturnKind returnKind)
{
static_assert((GCRefKind)ReturnKind::RT_Scalar == GCRK_Scalar, "ReturnKind::RT_Scalar does not match GCRK_Scalar");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unix x64 ABI can return values in two registers, and so there are variants like RT_Scalar_Obj.

Do we need to handle them here? Also, it may be interesting to look into why you have not seen this problem in the tests.

Copy link
Member Author

@VSadov VSadov Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lab testing that is currently enabled for this on Unix is fairly limited, I believe, and I have a limited capacity to stress-test the change locally so we may miss some cases.

The static assert tests some cases, but GCRefKind translation is basically one-to-one, not even sure if the assert helps much here.
I am not changing GC decoding, so it may just work if it is the same data just stored/read at different location.

Copy link
Member Author

@VSadov VSadov Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps more expanded libs testing will provide failures and repro cases.
I am not yet sure what I'd be adding here for multreg returns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread::InternalHijack needs to be fixed to hande the new cases.

You can try to build a targeted repro that has a lot of methods that return struct { int a; object b; }. Also, check the generated code to make sure that this struct is indeed returned in rax and rdx registers (object b should be in rdx).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a look. Thanks!

I kind of assumed that since we can stackwalk in scenarios that already work, we may already be handling all kinds of cases, but this is a new scenario.

@MichalStrehovsky
Copy link
Member

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented Jul 18, 2022

  • Windows x64 failure:
Unhandled Exception: System.PlatformNotSupportedException: Operation is not supported on this platform.
   at System.Threading.Thread.InitializeCom(ApartmentState) + 0xba
   at System.Threading.Thread.StartThread(IntPtr) + 0x5d
   at System.Threading.Thread.ThreadEntryPoint(IntPtr) + 0x1a
  • Windows arm64 failure:
Assert.Equal() Failure
Expected: 2
Actual:   0
   at System.Threading.Threads.Tests.ThreadTests.ApartmentStateTest_ChangeBeforeThreadStarted_Windows(Func`2, Func`3, Int32) + 0x164
   at System.Threading.Thread.Tests!<BaseAddress>+0x878df4
   at System.InvokeUtils.CallDynamicInvokeMethod(Object, IntPtr, IntPtr, IntPtr, Object, Object[], BinderBundle, Boolean, Boolean) + 0xec

I do not think they are related to the change.

@@ -318,7 +320,7 @@ C_FUNC(\Name):
DEFAULT_FRAME_SAVE_FLAGS = PTFF_SAVE_ALL_PRESERVED + PTFF_SAVE_RSP

.macro PUSH_COOP_PINVOKE_FRAME trashReg
push_nonvol_reg rbp // push RBP frame
push_nonvol_reg rbp // push RBP frame // TODO: do we need this? not on windows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the added TODO: I believe this is needed for unwindability. I would think that on Windows, it is needed too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These macros are used to call GC helpers, not the user code, so I assume unwindability here would be useful for crash dumps and such.
The Windows version seems to be using RSP-based frame style, which makes it simpler as RBP is just a nonvol register. I think the pattern is unwindable, at least as much as needed for debugging.

I will keep the TODO here. I will be doing some cleanup of assembly helpers. I think we have a good deal of redundancy and will take a look at this as a part of cleanup.
Ideally we should use the same code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant unwinding when debugging the code (and core dump too.)
This macro was actually changed to use RBP based frame by @jkotas in dotnet/corert#4812

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks! Saves me time to figure if the alternative works with debuggers (I now know it does not).

I will change Windows to use the same code then (in a separate change). Even if Windows is resilient to this, it is nicer to have the same code where possible.


// if we are in an epilogue, there will be at least one instruction left.
int trailingEpilogueInstructions = 1;
uint8_t* NextByte = (uint8_t*)pvAddress;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit - can you please fix the naming convention of some of the locals (start with lowercase)?

@jkotas
Copy link
Member

jkotas commented Jul 18, 2022

I do not think they are related to the change.

Yes, that's #72352

ASSERT(returnKind <= GCRK_LastValid);
return (GCRefKind)(returnKind & (GCRK_Object | GCRK_Byref));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtractReg1ReturnKind can be moved here as well.

{
reinterpret_cast<void*>(RhpGcProbeHijack)
};
#else // TARGET_ARM64 || TARGET_UNIX
EXTERN_C void FASTCALL RhpGcProbeHijackScalar();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to switch the other platforms to the one-and-only hijack target plan (in separate PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It occured to me that using multiple probe targets on win-x64 as a way to pass return kind to the probe is not very rewarding optimization.
Perf of suspension code would be gated by how many syscalls we make and how many iterative attempts we do, so benefits here would be fairly small in comparison.

//
// Such an opcode is an unambiguous epilogue indication.
//

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to check for a breakpoint and give up if you see one.

(It is annoying that setting a breakpoint introduces GC crash.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or print debug output if you run into a breakpoint.

Copy link
Member Author

@VSadov VSadov Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'int 3' in regular code will be naturally handled since it is not recognized as a part of epilogue anyways.

Placing a breakpoint in epilogue might be rare and would be hard to handle reliably, since I would not know what it replaced.
I will have to think about this. I can treat int 3 as a ret if I have seen pops.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just return -1 if i see int 3and treat that as not-hijackable.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo comments. Thank you!

@dotnet dotnet deleted a comment from azure-pipelines bot Jul 18, 2022
@VSadov
Copy link
Member Author

VSadov commented Jul 18, 2022

/azp run runtime-extra-platforms

@dotnet dotnet deleted a comment from azure-pipelines bot Jul 18, 2022
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented Jul 19, 2022

quic failures are #72429

@VSadov VSadov merged commit 62075f1 into dotnet:main Jul 19, 2022
@VSadov VSadov deleted the hijackUnix branch July 19, 2022 03:20
@VSadov
Copy link
Member Author

VSadov commented Jul 19, 2022

Thanks!!!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants