Prevent memory allocation in signal handler #16384

mlabiuk · 2018-02-14T11:40:08Z

If the signal occurs while the heap is in an inconsistent state, we should not
use the heap. We should call only signal-safe functions from the signal handler.

Fixes https://github.com/dotnet/coreclr/issues/16338

dnfclas · 2018-02-14T11:40:20Z

All CLA requirements met.

mlabiuk · 2018-02-14T13:53:03Z

@janvorli @jkotas PTAL
CC @Dmitri-Botcharnikov @alpencolt @kbaladurin

jkotas · 2018-02-14T15:08:42Z

src/pal/src/exception/seh-unwind.cpp

 {
    ExceptionRecords* records;
-    if (posix_memalign((void**)&records, alignof(ExceptionRecords), sizeof(ExceptionRecords)) != 0)
+    if (allocationProhibited ||
+	(posix_memalign((void**)&records, alignof(ExceptionRecords), sizeof(ExceptionRecords)) != 0) )


Do we need to change the implementation below to keep re-trying instead of aborting when we do not have a free context available?

The problem is that we have a limited number of preallocated contexts (64 on 64-bit processors, 32 on 32-bit ones). That means that if more than that number of threads were handling hardware exceptions at the same time, we would abort the process. In fact, even with fewer threads, we can easily exhaust this storage if a hardware exception happens while we are handling another one, and then another, and so on. We need to keep the previous exception until the final one is handled.
So it seems that the solution will need to be more involved than using these fallback contexts, which were really meant just to make us a bit more resilient in out-of-memory cases.
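To illustrate why a pool like this is capped at the word size, here is a minimal sketch of a fixed pool whose free slots are tracked by one atomic 64-bit bitmap. `FallbackPool`, `FallbackRecord`, and the 128-byte payload are hypothetical names and sizes chosen for the example, not the PAL's actual types, and the sketch assumes `std::atomic<uint64_t>` is lock-free on the target platform.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical placeholder for a preallocated exception/context record pair.
struct FallbackRecord { unsigned char bytes[128]; };

// Hypothetical fixed pool: one bit per slot, so the capacity equals the
// number of bits in the bitmap word (64 here).
class FallbackPool {
public:
    static constexpr int kSlots = 64;

    // Claims a free slot with a compare-exchange loop. Returns nullptr when
    // every bit is already set. No locks or heap calls are involved.
    FallbackRecord* TryAcquire() {
        uint64_t used = m_used.load(std::memory_order_relaxed);
        while (used != UINT64_MAX) {
            int index = FirstZeroBit(used);
            uint64_t desired = used | (uint64_t{1} << index);
            if (m_used.compare_exchange_weak(used, desired,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
                return &m_records[index];
            }
            // On failure `used` was reloaded; retry with the fresh value.
        }
        return nullptr;
    }

    // Clears the slot's bit so that it can be claimed again.
    void Release(FallbackRecord* record) {
        std::ptrdiff_t index = record - m_records;
        assert(index >= 0 && index < kSlots);
        m_used.fetch_and(~(uint64_t{1} << index), std::memory_order_release);
    }

private:
    // Only called when at least one bit is clear, so the loop terminates.
    static int FirstZeroBit(uint64_t bits) {
        int index = 0;
        while (bits & 1) { bits >>= 1; ++index; }
        return index;
    }

    std::atomic<uint64_t> m_used{0};
    FallbackRecord m_records[kSlots];
};
```

Once all 64 bits are set, `TryAcquire` can only fail, which corresponds to the abort case described above when too many exceptions are in flight at once.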

@jkotas retrying would not help: the contexts are alive during the whole lifetime of the PAL_SEHException, which means until the exception is fully handled.

I actually wonder if using mmap to allocate memory in the signal handler would be safe. While mmap is not on the list of async-signal-safe functions, I don't see how it could cause a problem. The problems with other functions stem from the fact that the SIGSEGV can happen inside such functions and that they are not reentrant. But mmap is just a syscall, so no SIGSEGV can occur while executing code inside it, and thus it should be safe to call from our SIGSEGV handler.
So maybe we could create a custom allocator based on mmap and use it here.
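A minimal sketch of what the lowest layer of such an allocator might look like, assuming a POSIX system where `MAP_ANONYMOUS` is available. The helper names `RoundUpToPage`, `MapRecordStorage`, and `UnmapRecordStorage` are hypothetical and do not exist in the PAL. The sketch only shows page-granular allocation straight from the kernel and makes no claim about async-signal safety beyond what the comment above argues.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds a byte count up to a whole number of pages.
inline size_t RoundUpToPage(size_t size) {
    size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    return (size + page - 1) / page * page;
}

// Hypothetical helper: obtains zero-filled memory directly from the kernel,
// bypassing malloc and its internal state. Returns nullptr on failure.
inline void* MapRecordStorage(size_t size) {
    assert(size > 0);
    void* p = mmap(nullptr, RoundUpToPage(size), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Hypothetical helper: returns the mapping to the kernel.
inline void UnmapRecordStorage(void* p, size_t size) {
    if (p != nullptr) {
        munmap(p, RoundUpToPage(size));
    }
}
```

Each call consumes at least one full page regardless of the requested size, so a real allocator would need to subdivide pages to avoid wasting memory.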

@janvorli Could you explain why the preallocated context count is so small? Is it for atomic free-slot searching? Why not use N bitmaps?

I can add a check for a recursive handler and use a preallocated context only if the previous signal handler has not completed. But that would defeat the main idea of not using unsafe functions in the signal handler.

Yes, the reason it is so small is so that we can allocate the slot atomically. That's why the count is bitness-specific.
We really need a proper fix that works in all cases, so checks for a recursive handler would not be the right solution (moreover, you would not be able to track the recursion at this level, since we never return to the signal handler if the exception is processed by the runtime).
I can see several possible ways we could solve the issue:

1. Use mmap as a base for a custom allocator for the contexts, either for all of them or just for the ones allocated for the signal handler. As I've said before, I think there should be no problem with using mmap in the handler even though it is not listed as async-signal-safe.

2. Refactor the common_signal_handler, SEHProcessException and related code to not allocate the records until we know that the failing code was executing in code under our control, meaning that g_safeExceptionCheckFunction or CatchHardwareExceptionHolder::IsEnabled() returned true. In such a case, it is safe to use the existing allocator. That will somewhat complicate sharing the SEHProcessException implementation between OSX and other Unixes, since the records are generated differently for OSX and the others and we would need to do the conversion between the platform native context and the

3. Initialize the PAL_SEHException with records allocated on the stack and add a method to the PAL_SEHException that would update the records to heap-allocated ones (allocate the memory and copy the original records there). This method would be called from SEHProcessException before the calls to g_hardwareExceptionHandler and PAL_ThrowExceptionFromContext, where we know it is safe to allocate. We would add a bool flag to the PAL_SEHException indicating whether the records are on the stack or on the heap, to make sure the destructor does not try to free them if they are still on the stack. That would also make it transparent to the OSX case, where the records will always be allocated from the heap, since on OSX we don't use signals for handling hardware exceptions.

I think that we should go with way 3: we could initialize the exception with the signalContextRecord (and change the code in the common_signal_handler to extract the context directly into it). Since we already had a memcpy that copied the context to this structure, we would not be adding an additional copy of the large record (we would just move the place where we do the copy), which is good.
Also, the change will be quite localized and won't do anything that's not safe.
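The third approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PAL code: `SketchRecord`, `SketchContext`, and `SketchException` are hypothetical stand-ins for `EXCEPTION_RECORD`, `CONTEXT`, and `PAL_SEHException`, plain `malloc` stands in for the PAL's record allocator, and failure is reported with a `bool` purely to keep the example small.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for the real record types; the sizes are made up.
struct SketchRecord  { int code; };
struct SketchContext { unsigned long registers[32]; };

// Hypothetical, simplified analogue of the exception object.
struct SketchException {
    SketchRecord*  record;
    SketchContext* context;
    bool recordsOnStack;  // true while the records live in the handler's frame

    SketchException(SketchRecord* r, SketchContext* c, bool onStack)
        : record(r), context(c), recordsOnStack(onStack) {}

    SketchException(const SketchException&) = delete;
    SketchException& operator=(const SketchException&) = delete;

    // Meant to be called only once allocating is known to be safe.
    // Copies stack records to the heap; a no-op if they are already there.
    bool EnsureRecordsOnHeap() {
        if (!recordsOnStack) {
            return true;
        }
        auto* r = static_cast<SketchRecord*>(std::malloc(sizeof(SketchRecord)));
        auto* c = static_cast<SketchContext*>(std::malloc(sizeof(SketchContext)));
        if (r == nullptr || c == nullptr) {
            std::free(r);
            std::free(c);
            return false;
        }
        std::memcpy(r, record, sizeof(SketchRecord));
        std::memcpy(c, context, sizeof(SketchContext));
        record = r;
        context = c;
        recordsOnStack = false;
        return true;
    }

    // Frees the records only if this object owns heap copies of them.
    ~SketchException() {
        if (!recordsOnStack) {
            std::free(record);
            std::free(context);
        }
    }
};
```

The flag is what keeps the destructor from freeing memory that still belongs to the signal handler's stack frame.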

I agree. The only problem may be performance, because mmap is a system call and, moreover, it is serialized process-wide.

Regarding 2 and 3: a signal handler is not a safe place for glibc memory allocation. Some cases are safe for the runtime, but they are not safe for other (native) threads. The original issue #16338 hits a segfault not in memory allocation but when memory is fried. In that case we can't use glibc allocation at all. Also, delayed allocation is useless because it is only an optimization for signals generated from unmanaged code.

So I think we have to use mmap to allocate the exception context in the signal handler.

There is no way to avoid allocations in the signal handler. The whole hardware exception handling is executed from the handler. That means that once we start handling the exception, there is no limit in what we call and execute and we never return from the signal handler.
But what is important is not to start calling async-signal-unsafe code until we know that the code that was executing was managed code or one of the special helpers in our runtime. Once we are sure that was the case, there is no issue with calling anything in the handler, since we know we won't be reentering any platform functions.
That's why I believe we should use the way #3. The way #1 would require writing an allocator on top of mmap so that we don't waste memory (especially on arm64 where page size is 64 kB on some Linux distros). That would make it considerably more complex than the way #3.

This reverts commit fa6087f0c2856215e7141259b56e75b46746f74d.

If the signal occurs in unmanaged code, we cannot use the heap. We should call only signal-safe functions from the signal handler. Create the exception object on the stack to check the source of the signal. If the signal comes from managed code, we can allocate memory to create a persistent exception on the heap as a copy of the volatile exception on the stack. If the signal comes from unmanaged code, we do nothing and call the base signal handler. fix https://github.com/dotnet/coreclr/issues/16338

janvorli

@mlabiuk thank you for making the changes! I have a couple of suggestions.

janvorli · 2018-02-19T22:52:49Z

src/pal/inc/pal.h

@@ -3762,6 +3762,8 @@ PAL_BindResources(IN LPCSTR lpDomain);

 #define EXCEPTION_IS_SIGNAL 0x100

+#define EXCEPTION_ON_STACK 0x400


I would prefer adding a bool member to the PAL_SEHException (and a corresponding argument to the constructor) instead, to indicate whether the exception records are on the stack or allocated by the allocator. It feels a bit better because it relates to both the exception and context records. Also, as for the name, I'd prefer something like "RecordsOnStack", since it refers to the records rather than the exception itself.

janvorli · 2018-02-19T22:58:15Z

src/pal/src/exception/seh.cpp

@@ -202,6 +202,23 @@ void ThrowExceptionHelper(PAL_SEHException* ex)
    throw std::move(*ex);
 }

+static PAL_SEHException copyPAL_SEHException(PAL_SEHException* src)


Rather than creating a new copy, I would prefer updating the existing exception instance. Either directly or by adding a Set method to the PAL_SEHException with the same arguments as the constructor.
Also, the prevalent coding convention in PAL is to use pascal casing for function names without underscores. The PAL_ prefix is an exception, indicating functions and data structures that are exported by PAL, but don't exist on Windows.
How about naming it e.g. "MoveExceptionRecordsToHeap" or "EnsureExceptionRecordsOnHeap"?

janvorli · 2018-02-19T23:10:46Z

src/pal/src/exception/seh.cpp

@@ -249,6 +266,9 @@ SEHProcessException(PAL_SEHException* exception)
                        PROCAbort();
                    }
                }
+
+                if(exceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK)
+                    *exception = copyPAL_SEHException(exception);


Could you please use the braces even for single line body to match the surrounding code style?

janvorli · 2018-02-19T23:33:16Z

src/pal/inc/pal.h

@@ -5683,7 +5685,8 @@ struct PAL_SEHException

    void FreeRecords()
    {
-        if (ExceptionPointers.ExceptionRecord != NULL)
+        if (ExceptionPointers.ExceptionRecord != NULL &&
+            ! (ExceptionPointers.ExceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK) )


This second part of the condition is not correct, it should be & instead of |.
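The difference between the two operators can be shown in isolation. The flag values below are copied from the diff under review; the helper names `IsOnStackBuggy` and `IsOnStack` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Flag values as they appear in the pal.h diff under review.
constexpr uint32_t EXCEPTION_IS_SIGNAL = 0x100;
constexpr uint32_t EXCEPTION_ON_STACK  = 0x400;

// As written in the diff: OR-ing with a non-zero constant always yields a
// non-zero value, so this "test" is true for every input.
constexpr bool IsOnStackBuggy(uint32_t flags) {
    return (flags | EXCEPTION_ON_STACK) != 0;
}

// As the reviewer suggests: AND isolates the bit, so the result depends on
// whether the flag is actually set.
constexpr bool IsOnStack(uint32_t flags) {
    return (flags & EXCEPTION_ON_STACK) != 0;
}

static_assert(IsOnStackBuggy(0), "OR reports 'on stack' even with no flags set");
static_assert(!IsOnStack(EXCEPTION_IS_SIGNAL), "AND ignores unrelated flags");
static_assert(IsOnStack(EXCEPTION_IS_SIGNAL | EXCEPTION_ON_STACK),
              "AND detects the flag when it is set");
```

With `|`, the negated condition in `FreeRecords` is always false, so heap-allocated records would never be freed.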

janvorli · 2018-02-19T23:34:47Z

src/pal/src/exception/signal.cpp

    native_context_t *ucontext;

    ucontext = (native_context_t *)sigcontext;
    g_common_signal_handler_context_locvar_offset = (int)((char*)&signalContextRecord - (char*)__builtin_frame_address(0));

-    AllocateExceptionRecords(&exceptionRecord, &contextRecord);
+    //AllocateExceptionRecords(&exceptionRecord, &contextRecord);


Please remove the commented out code.

janvorli · 2018-02-19T23:38:42Z

src/pal/src/exception/seh.cpp

@@ -262,6 +282,9 @@ SEHProcessException(PAL_SEHException* exception)

        if (CatchHardwareExceptionHolder::IsEnabled())
        {
+            if(exceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK)


To simplify the callers, it would be nice to move the check into the actual function that updates the exception. For an exception that already has records on the heap, the function would become a no-op.

PAL_SEHException::EnsureExceptionRecordsOnHeap() moves exception record to heap if needed. fix https://github.com/dotnet/coreclr/issues/16338

mlabiuk · 2018-02-20T12:37:35Z

@janvorli Thanks for your valuable suggestions.
I have one question about the move constructor and assignment operators in PAL_SEHException.
Should we call PROCAbort() if the source (ex) object has records on the stack?

janvorli · 2018-02-20T13:15:31Z

src/pal/inc/pal.h

@@ -5663,6 +5663,11 @@ PAL_FreeExceptionRecords(
  IN EXCEPTION_RECORD *exceptionRecord, 
  IN CONTEXT *contextRecord);

+VOID
+AllocateExceptionRecords(


I don't like exposing AllocateExceptionRecords from PAL. Nothing outside of PAL needs this functionality. That's why I suggested adding the Set method to the exception, so that you can allocate the records in the PAL code and just set them on the exception (or, without the Set method, just update the members).

The point is that PAL should expose the smallest possible surface, and the record allocation is an internal implementation detail of the PAL.

janvorli · 2018-02-20T13:16:58Z

> I have one question about the move constructor and assignment operators in PAL_SEHException.
> Should we call PROCAbort() if the source (ex) object has records on the stack?

I don't think we need to do that. There is nothing wrong with moving the exception object while the records are still on stack if we ever needed it.

janvorli

LGTM, thank you for your contribution!

This reverts commit d9753f4.

There was a subtle bug. When the hardware exception handler returns to the signal handler, the exception's CONTEXT record may contain modified registers, so the changes need to be propagated back to the signal context. But the recent change dotnet#16384 was restoring the signal context from the originally grabbed context instead of the one pointed to by the exception, which is different. I have also added a small optimization: the contextRecord that was added is not needed, since the signalContextRecord can be used as the initial context record for the exception. So we can drop the contextRecord and also avoid copying from it to the signalContextRecord.
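The bug described here can be illustrated with a small sketch. All names (`RegContext`, `PendingException`, `HandleSketch`, `ResumeIpBuggy`, `ResumeIpFixed`) and the addresses are hypothetical, and the handler below only imitates the two relevant effects: the exception ends up pointing at a different context buffer than the one captured at handler entry, and a register in that buffer is modified.

```cpp
#include <cassert>

// Hypothetical stand-in for CONTEXT, reduced to one register.
struct RegContext { unsigned long ip; };

// Hypothetical stand-in for the exception's pointer to its context record.
struct PendingException { RegContext* context; };

// Imitates exception handling: the context is moved to another buffer
// (as a heap copy would be) and the resume address is changed there.
inline void HandleSketch(PendingException& ex, RegContext& movedCopy) {
    movedCopy = *ex.context;
    ex.context = &movedCopy;
    ex.context->ip = 0x2000;
}

// Buggy restore: reads the context captured when the handler was entered,
// which no longer reflects what the exception handling changed.
inline unsigned long ResumeIpBuggy(const RegContext& captured) {
    return captured.ip;
}

// Fixed restore: reads the context that the exception currently points to.
inline unsigned long ResumeIpFixed(const PendingException& ex) {
    assert(ex.context != nullptr);
    return ex.context->ip;
}
```

Restoring from the exception's own context record is what lets register changes made during handling reach the signal context.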

Prevent memory allocation in signal handler

dfa0a5b

jkotas reviewed Feb 14, 2018

mlabiuk added 2 commits February 19, 2018 19:32

Revert "Prevent memory allocation in signal handler"

ee26488

janvorli suggested changes Feb 19, 2018

Move exception allocation to PAL_SEHException

e7bf54e

janvorli reviewed Feb 20, 2018

Remove exception records allocation from pal.h

5966e0e

janvorli approved these changes Feb 20, 2018

janvorli merged commit d9753f4 into dotnet:master Feb 20, 2018

mlabiuk deleted the fix_crash_in_common_signal_handler branch February 21, 2018 06:43

janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018

Revert "Prevent memory allocation in signal handler (dotnet#16384)"

81addc0

janvorli mentioned this pull request Feb 22, 2018

Fix preventing memory allocation in signal handler #16485

Merged

sdmaclea mentioned this pull request Jan 31, 2020

[Arm64/Ubuntu] GCStress = 0x4/0x8/0xC/0xF Regression dotnet/runtime#9764

Closed
