Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Prevent memory allocation in signal handler #16384

Merged
merged 5 commits into from
Feb 20, 2018

Conversation

mlabiuk
Copy link

@mlabiuk mlabiuk commented Feb 14, 2018

If the signal occurs when heap being inconsistent we should not
use heap. We should call signal-safe functions only from signal handler.

fix https://github.com/dotnet/coreclr/issues/16338

If the signal occurs when heap being inconsistent we should not
use heap. We should call signal-safe functions only from signal handler.

fix https://github.com/dotnet/coreclr/issues/16338
@dnfclas
Copy link

dnfclas commented Feb 14, 2018

CLA assistant check
All CLA requirements met.

@mlabiuk
Copy link
Author

mlabiuk commented Feb 14, 2018

{
ExceptionRecords* records;
if (posix_memalign((void**)&records, alignof(ExceptionRecords), sizeof(ExceptionRecords)) != 0)
if (allocationProhibited ||
(posix_memalign((void**)&records, alignof(ExceptionRecords), sizeof(ExceptionRecords)) != 0) )
Copy link
Member

@jkotas jkotas Feb 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change the implementation below to keep re-trying instead of aborting when we do not have a free context available?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we have a limited number of the preallocated contexts (64 for 64 bit processors, 32 for 32 bit ones). That means that if more than that number of threads were handling hardware exceptions at the same time, we would abort the process. In fact, even with less threads, we can exhaust this storage easily if a hardware exception happens while we are handling another one and then another etc. We need to keep the previous exception until the final one is handled.
So it seems that the solution will need to be more involved than using these fallback contexts which were really meant just to be a bit more resilient in out of memory cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas retrying would not help, the contexts are alive during the whole lifetime of the PAL_SEHException, which means until the exception is fully handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wonder if using mmap to allocate memory in the signal handler would be safe. While the mmap is not listed on the async signal safe functions list, I don't see how it could cause a problem. The problems with other functions are based on the fact that the sigsegv can happen inside of such functions and that such functions are not reentrant. But mmap is just a syscall and so no sigsegv can occur while executing code inside of it, thus it should be safe to call from our sigsegv handler.
So maybe we could create a custom allocator based on mmap and use it here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janvorli Could you explain why preallocated context count is so tiny? For atomic free slot searching? Why not use N bitmaps?

I can add check for recursive handler and using preallocated context only if previous signal handler is not completed. But it ruined main idea do not use unsafe functions in signal handler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the reason why it is so tiny is so that we can allocate the slot atomically. That's why the count is bitness specific.
We really need a right fix that works in all cases, so checks for recursive handler would not be the right solution (moreover, you would not be able to track the recursiveness at this level, since we never return back to the signal handler if the exception is processed by the runtime.
I can see a several possible ways we could solve the issue:

  1. Use mmap as a base for a custom allocator for the contexts. Either for all or just for the ones allocated for the signal handler. As I've said before, I think there should be no problem with using mmap in the handler even though it is not listed as async signal safe.
  2. Refactor the common_signal_handler, SEHProcessException and related code to not to allocate the records until we know that the failing code was executing in code under our control - that means that g_safeExceptionCheckFunction or CatchHardwareExceptionHolder::IsEnabled()returned true. In such case, it is safe to use the existing allocator. That will complicate a bit sharing the SEHProcessException implementation for OSX and other Unixes, since the records are generated differently for OSX and the others and we would need to do the conversion between the platform native context and the
  3. Initialize the PAL_SEHException with records allocated on stack and add a method to the PAL_SEHException that would update the records to heap allocated ones (allocate the memory and copy the original records there). This method would be called from SEHProcessException before the call to the g_hardwareExceptionHandler and PAL_ThrowExceptionFromContext where we know it is safe to allocate. We would add a bool flag into the PAL_SEHException indicating whether the records are on stack or on heap to make sure the destructor does not try to free them if they were still on stack. That would also make it transparent to the OSX case where the records will always be allocated from the heap, since on OSX, we don't use signals for handling hardware exceptions.

I think that we should go with the way 3 - we could initiate the exception with the signalContextRecord (and change the code in the common_signal_handler to extract the context directly into it). Since we already had a memcpy that copied the context to this structure, we would not be adding an additional copy of the large record (just move the place where we do the copy), which is good.
Also, the change will be quite localized and won't do anything that's not safe.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I agree. Only performance problem may occur. Because mmap is a system call moreover it is serialized on process system call.

2 - 3. Signal handler is not a safe place for glibc memory allocation. OK some cases are safe for runtime. But all are not safe for other (native) threads. Original issue #16338 catches segfault not in memory allocation but when memory is fried. In this case we can't use glibc allocation at all. Also delayed allocation is useless because it is only optimization for signals generated from not managed code.

So I think we have to use mmap for allocation exception context in signal handler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to avoid allocations in the signal handler. The whole hardware exception handling is executed from the handler. That means that once we start handling the exception, there is no limit in what we call and execute and we never return from the signal handler.
But what is important is to not to start calling async signal unsafe stuff until we know that the code that was executing was managed code or special helpers in our runtime. Once we are sure it was the case, there is no issue with calling anything on the handler, since we know we won't be reentering any platform functions.
That's why I believe we should use the way #3. The way #1 would require writing an allocator on top of mmap so that we don't waste memory (especially on arm64 where page size is 64 kB on some Linux distros). That would make it considerably more complex than the way #3.

This reverts commit fa6087f0c2856215e7141259b56e75b46746f74d.
If the signal occurs in not managed code we cannot use heap.
We should call signal-safe functions only from signal handler.

Create exception object on stack for checking source of signal.
If signal is from managed code we can use memory allocation to create
persistent exception on heap as copy of volatile exception on stack.

If signal from unmanaged code we do nothing and call base signal handler.

fix https://github.com/dotnet/coreclr/issues/16338
Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mlabiuk thank you for making the changes! I have a couple of suggestions.

@@ -3762,6 +3762,8 @@ PAL_BindResources(IN LPCSTR lpDomain);

#define EXCEPTION_IS_SIGNAL 0x100

#define EXCEPTION_ON_STACK 0x400
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer adding a bool member to the PAL_SEHException (and a corresponding argument to the constructor) instead to indicate whether exception records are on stack or allocated by the allocator. It feels a bit better due to the fact that it is related to both of the exception and context records. Also, as for the name, I'd prefer something like "RecordsOnStack" since it refers to the records instead of the exception itself.

@@ -202,6 +202,23 @@ void ThrowExceptionHelper(PAL_SEHException* ex)
throw std::move(*ex);
}

static PAL_SEHException copyPAL_SEHException(PAL_SEHException* src)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a new copy, I would prefer updating the existing exception instance. Either directly or by adding a Set method to the PAL_SEHException with the same arguments as the constructor.
Also, the prevalent coding convention in PAL is to use pascal casing for function names without underscores. The PAL_ prefix is an exception, indicating functions and data structures that are exported by PAL, but don't exist on Windows.
How about naming it e.g. "MoveExceptionRecordsToHeap" or "EnsureExceptionRecordsOnHeap"?

@@ -249,6 +266,9 @@ SEHProcessException(PAL_SEHException* exception)
PROCAbort();
}
}

if(exceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK)
*exception = copyPAL_SEHException(exception);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use the braces even for single line body to match the surrounding code style?

@@ -5683,7 +5685,8 @@ struct PAL_SEHException

void FreeRecords()
{
if (ExceptionPointers.ExceptionRecord != NULL)
if (ExceptionPointers.ExceptionRecord != NULL &&
! (ExceptionPointers.ExceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK) )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second part of the condition is not correct, it should be & instead of |.

native_context_t *ucontext;

ucontext = (native_context_t *)sigcontext;
g_common_signal_handler_context_locvar_offset = (int)((char*)&signalContextRecord - (char*)__builtin_frame_address(0));

AllocateExceptionRecords(&exceptionRecord, &contextRecord);
//AllocateExceptionRecords(&exceptionRecord, &contextRecord);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the commented out code.

@@ -262,6 +282,9 @@ SEHProcessException(PAL_SEHException* exception)

if (CatchHardwareExceptionHolder::IsEnabled())
{
if(exceptionRecord->ExceptionFlags | EXCEPTION_ON_STACK)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify the callers, it would be nice to move the check into the actual function that updates the exception. For exception that already has records on heap, the function would become a no-op.

PAL_SEHException::EnsureExceptionRecordsOnHeap() moves exception record
to heap if needed.

fix https://github.com/dotnet/coreclr/issues/16338
@mlabiuk
Copy link
Author

mlabiuk commented Feb 20, 2018

@janvorli Thaks for you valuable suggestions.
I have one question about move constructor and assignment operators in PAL_SEHException.
Should we do PROCAbort() if source (ex) object have records on stack?

@@ -5663,6 +5663,11 @@ PAL_FreeExceptionRecords(
IN EXCEPTION_RECORD *exceptionRecord,
IN CONTEXT *contextRecord);

VOID
AllocateExceptionRecords(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like exposing the AllocateExceptionRecords from PAL. Nothing out of PAL needs this functionality. That's why I've suggested adding the Set method to the exception so that you can allocate the records in the PAL code and just set them on the exception (or without the Set, just update the members).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that PAL should expose the minimal possible surface and the record allocation is internal implementation detail of the PAL.

@janvorli
Copy link
Member

I have one question about move constructor and assignment operators in PAL_SEHException.
Should we do PROCAbort() if source (ex) object have records on stack?

I don't think we need to do that. There is nothing wrong with moving the exception object while the records are still on stack if we ever needed it.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for your contribution!

@janvorli janvorli merged commit d9753f4 into dotnet:master Feb 20, 2018
@mlabiuk mlabiuk deleted the fix_crash_in_common_signal_handler branch February 21, 2018 06:43
janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018
janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018
There was a subtle bug. When the hardware exception handler returns back
to the signal handler, the exception's CONTEXT record may contain
modified registers and so the changes need to be propagated back to the
signal context. But the recent change dotnet#16384 was restoring the signal
context from the originally grabbed context instead of the one that's
pointed to by the exception, which is different.

I have also added a little optimization - the contextRecord that was
added is not needed, since the signalContextRecord can be used as the
initial context record for the exception. So we can save the
contextRecord and also copying to the signalContextRecord from it.
janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018
There was a subtle bug. When the hardware exception handler returns back
to the signal handler, the exception's CONTEXT record may contain
modified registers and so the changes need to be propagated back to the
signal context. But the recent change dotnet#16384 was restoring the signal
context from the originally grabbed context instead of the one that's
pointed to by the exception, which is different.

I have also added a little optimization - the contextRecord that was
added is not needed, since the signalContextRecord can be used as the
initial context record for the exception. So we can save the
contextRecord and also copying to the signalContextRecord from it.
janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018
There was a subtle bug. When the hardware exception handler returns back
to the signal handler, the exception's CONTEXT record may contain
modified registers and so the changes need to be propagated back to the
signal context. But the recent change dotnet#16384 was restoring the signal
context from the originally grabbed context instead of the one that's
pointed to by the exception, which is different.

I have also added a little optimization - the contextRecord that was
added is not needed, since the signalContextRecord can be used as the
initial context record for the exception. So we can save the
contextRecord and also copying to the signalContextRecord from it.
janvorli added a commit to janvorli/coreclr that referenced this pull request Feb 22, 2018
There was a subtle bug. When the hardware exception handler returns back
to the signal handler, the exception's CONTEXT record may contain
modified registers and so the changes need to be propagated back to the
signal context. But the recent change dotnet#16384 was restoring the signal
context from the originally grabbed context instead of the one that's
pointed to by the exception, which is different.

I have also added a little optimization - the contextRecord that was
added is not needed, since the signalContextRecord can be used as the
initial context record for the exception. So we can save the
contextRecord and also copying to the signalContextRecord from it.
janvorli added a commit that referenced this pull request Feb 23, 2018
There was a subtle bug. When the hardware exception handler returns back
to the signal handler, the exception's CONTEXT record may contain
modified registers and so the changes need to be propagated back to the
signal context. But the recent change #16384 was restoring the signal
context from the originally grabbed context instead of the one that's
pointed to by the exception, which is different.

I have also added a little optimization - the contextRecord that was
added is not needed, since the signalContextRecord can be used as the
initial context record for the exception. So we can save the
contextRecord and also copying to the signalContextRecord from it.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[UNIX] common_signal_handler calls non-async-signal-safe functions
4 participants