Lazily allocate thread stacks #143
Comments
Just as an example of the kind of savings I am hoping to get from this: right now, even in a small application ("make image=rogue", for example) I see 180 threads. The memory savings will be much larger in applications which use many pthreads, e.g., heavily threaded Java applications.
Linux's mmap() API has a flag, MAP_STACK, which is a no-op on Linux, but was added to the API in case "some architectures require special treatment for stack allocations" (see mmap(2)). Unfortunately, OSv indeed has a special requirement for stack memory: As explained in issue #143, we currently do not support demand-paged stacks (i.e., accessing the stack cannot cause a page fault), so the stack must be allocated pre-faulted and pinned. So if an application did us the courtesy of actually using MAP_STACK when mmap'ing a stack, let's return the favor by mapping pre-faulted memory which will actually work if used as a stack. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
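As a concrete illustration (not code from the patch), an application that wants its stack pre-faulted under OSv can pass MAP_STACK when mmap()ing it. A minimal sketch; `alloc_stack` is a hypothetical helper name, not an OSv API:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: allocate a pthread-style stack region.
// On Linux, MAP_STACK is a no-op; under OSv (per the commit above) it
// additionally requests pre-faulted, pinned memory that is safe to use
// as a stack.
void* alloc_stack(size_t size) {
    void* p = mmap(nullptr, size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK,
                   -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```

An application (or a libc's pthread implementation) would then hand this region to the new thread as its stack.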
In issue #790, Timmons C. Player discovered something dangerous about on-stack sched::thread objects: the application's stack may be mmap()ed, so the scheduler now needs to access an application's mmap()ed memory area. Those accesses can potentially have all sorts of problems (e.g., page faults in the scheduler, although in practice issue #143 precludes them). More concretely, the "lazy TLB flush" code added in commit 7e38453 means that the scheduler may see stale pages for an application's mmap()ed pages, so it cannot work with a sched::thread in an mmap()ed area.

This patch prevents on-stack sched::thread construction by hiding the constructor, and only allowing a thread to be created through the function sched::thread::make(...), which creates the object on the heap.

This patch is long, but it more-or-less contains just the following types of changes:

1. The change to sched.hh to hide the sched::thread constructor, and instead offer a sched::thread::make(...) function.

2. Every class which used a "sched::thread" member now uses a std::unique_ptr<sched::thread> member instead. One "casualty" of this change is that a class that used to hold a sched::thread member will now hold a sched::thread pointer, and pay for an extra allocation and indirection - even if we know for sure that the containing object will always be allocated on the heap (as is the case for pthread.cc's pthread::_thread, for example). This could be worked around by making the exceptional class a friend of sched::thread, so it can use the hidden constructor - but for now I decided the tiny performance benefit isn't important enough to justify the exception.

3. Every place which used "new sched::thread(...)" to create a thread now uses "sched::thread::make(...)" instead. Sadly we no longer allow "new sched::thread(...)", although there is nothing wrong with it, because hiding the constructor also disables the new expression.

4. One test in misc-wake.cc (which wasn't run by default anyway, because it is named "misc-*") was commented out because it relied on creating thread objects in mmap()ed areas.

One of the places modified is rcu_flush(), whose on-stack sched::thread objects are what caused issue #790.

Fixes #790.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
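The constructor-hiding pattern this commit message describes can be sketched in isolation. The class below is a simplified stand-in, not OSv's actual sched::thread:

```cpp
#include <memory>
#include <utility>
#include <cassert>

// Simplified stand-in for the sched.hh change: the constructor is private,
// so both on-stack construction ("thread t(...)") and "new thread(...)"
// fail to compile outside the class; make() is the only way to create an
// instance, and it always places the object on the heap.
class thread {
public:
    template <typename... Args>
    static std::unique_ptr<thread> make(Args&&... args) {
        // make() may call the private constructor because it is a member
        return std::unique_ptr<thread>(new thread(std::forward<Args>(args)...));
    }
    int id() const { return _id; }
private:
    explicit thread(int id) : _id(id) {}  // hidden: blocks on-stack objects
    int _id;
};
```

Classes that previously held a sched::thread member correspondingly switch to a std::unique_ptr member, which is the extra allocation and indirection mentioned above.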
In this mailing-list thread, https://groups.google.com/d/msg/osv-dev/LqUAWjzVpSc/Yf4bTdi0AwAJ, Roberto and Waldek noticed that a Java program with 400 user threads uses over half a gigabyte just for these threads' stacks, while on Linux this doesn't happen. It happens because of this issue (Java's default stack size is 1MB).
This seems like quite an important issue to fix from a memory-utilization standpoint. I wonder what the pros and cons of the 3 options are:
Hi @wkozaczuk, my answers and recommendations: In the single-byte-read page fault approach, the issue is not the application wanting to use more than a single page, but the kernel. It's not even all the kernel code - it's only the preemption-disabled parts of the kernel which will be limited to 4096 bytes of stack. I'm hoping that most of the kernel code does not have preemption disabled, so maybe 4096 bytes will be enough, but I really don't know. I also suggested how we can read more than one byte, at 4096-byte intervals, to pre-fault more than one page (at some performance cost), but I think the best option is perhaps the one I already mentioned - bringing in several adjacent pages on a single page fault. In other words, the code reads just one byte 4096 bytes back (stack grows downward...), but if this one read gets a page fault, the page fault handler will allocate not only this one page, but also the previous few pages, if they are still part of the same mapped range. Or, since this feature is very stack-specific, we can do this only for stacks.

About the think-code idea - I think it's complicated; we should only look into it if we're out of other options.
@nyh So it looks like your proposal is probably the best way to go. As I understand it, it would not only apply to application threads (where the app and OSv share the same stack anyway) but also to pure kernel threads (like the ZFS ones), correct? In theory the same solution could be applied to the syscall stack, which we malloc on first syscall execution.
@wkozaczuk this issue is relevant to mmap()-allocated stacks. Most other internal threads in OSv have smaller stacks which are allocated by malloc(), not by mmap(). See
Some takeaways from discussing the ongoing implementation effort - https://groups.google.com/d/msg/osv-dev/oaB5CHThc7M/2JCIvzgpAQAJ:
@nyh Do you have any comments about these most recent takeaways and the solution proposed on the mailing list?
This is the original patch from Jan 11, 2021, applied manually to resolve conflicts caused by subsequent changes to the relevant files since. Waldemar Kozaczuk
------------------------------------------------------------------
Since commit 695375f, new threads start with preemption disabled, and then immediately (in thread_main_c()) enable preemption. However, this code runs on the thread's stack, which can be supplied by the user. If this stack is lazily allocated (not pre-populated, with pages to be populated only on first use), we will get a page fault when thread_main_c() begins to run with preemption disabled. This can happen if the application creates a thread stack with mmap() without passing the MAP_STACK option.

Note that it is still strongly recommended to pre-populate thread stacks by using MAP_STACK (or MAP_POPULATE), because we still haven't committed a fix for issue #143. However, let's at least not crash immediately and consistently, as we did before this patch.

This patch simply reads from the top byte of the user-supplied stack before starting the thread - causing it to be faulted in (if needed) at that time instead of when the thread starts. The first stack page should be enough to start thread_main_c() and get to the point where it re-enables preemption.

This patch also includes a test reproducing this issue and demonstrating that it is fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
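The fix described in this commit message amounts to a single read. A hedged sketch of the idea, with illustrative names (not OSv's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Before starting a thread on a user-supplied stack, read the byte at the
// top of the stack (stacks grow downward, so the top byte is the first one
// used). If the stack was mmap()ed lazily, this read faults the first page
// in now - while sleeping on memory allocation is still allowed - instead
// of inside thread_main_c(), which runs with preemption disabled.
inline char prefault_stack_top(void* stack_base, size_t stack_size) {
    volatile char* top = static_cast<char*>(stack_base) + stack_size - 1;
    return *top;  // the read itself is the desired side effect
}
```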
This is the 1st of a total of eight patches that implement optional support for the so-called "lazy stack" feature. The lazy stack is well explained by issue #143 and allows saving a substantial amount of memory if an application spawns many pthreads with large stacks, by letting the stack grow dynamically as needed instead of getting pre-populated ahead of time.

The crux of this solution, like the previous versions, is based on the observation that OSv's memory fault handler requires that both interrupts and preemption be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped, we need to make sure that stack page faults NEVER happen in the relatively few places of kernel code that execute with either interrupts or preemption disabled. We satisfy this requirement by "pre-faulting" the stack - reading a byte one page (4096 bytes) down from the stack pointer - just before preemption or interrupts are disabled.

Now, the problem is complicated by the fact that kernel code A that disables preemption or interrupts may nest by calling another kernel function B that also disables preemption or interrupts, in which case function B should NOT try to pre-fault the stack, otherwise the fault handler will abort due to the violated constraint stated before. In short, we cannot "blindly" or unconditionally pre-fault the stack in all places before interrupts or preemption are disabled.

Some of the previous solutions modified both arch::irq_disable() and sched::preempt_disable() to check if both preemption and interrupts are enabled, and only then read a byte at a -4096 offset. Unfortunately, this makes it more costly than envisioned by Nadav Har'El - instead of a single instruction to read from memory, the compiler needs 4-5 instructions to check whether preemption and interrupts are enabled and perform the relevant jump. To make it worse, the implementation of arch::irq_enabled() is pretty expensive, at least on x64, and uses the stack with pushfq.

To avoid this, the previous solutions would add a new thread-local counter and pack the irq-disabling counter together with the preemption one. But even with this optimization, I found this approach to degrade performance quite substantially. For example, the memory allocation logic disables preemption in quite many places (see core/mempool.cc), and the corresponding test - misc-free-perf.cc - would show performance - the number of malloc()/free() calls executed per second - degrade on average by 15-20%.

So this latest version, implemented by this and the next 7 patches, takes a different approach. Instead of putting the conditional pre-faulting of the stack in both irq_disable() and preempt_disable(), we analyze OSv code to find all places where irq_disable() and/or preempt_disable() is called directly (or sometimes indirectly) and pre-fault the stack there if necessary. This is obviously way more laborious and prone to human error (we can miss some places), but it is way more performant (no noticeable performance degradation observed) compared to the earlier versions described above.

As we analyze all call sites, we need to make some observations to help us decide what exactly to do in each case:
- do nothing
- blindly pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.
Rule 2: Do nothing if the call site executes on a populated stack - this includes the above, but also code executing on an interrupt, exception or syscall stack.
Rule 3: Do nothing if the call site executes when we know that either interrupts or preemption are already disabled. A good example is an interrupt handler, or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.
Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts are enabled. In most cases this can only be deduced by analysing where the particular function is called. In general, any such function called by user code through libc would satisfy the condition. But sometimes it is tricky, because the kernel might be calling libc functions, such as malloc().
Rule 5: Otherwise, pre-fault the stack after determining dynamically that both sched::preemptable() and irq::enabled() hold.

One general rule is that all potential stack page faults happen on an application thread's stack when some kernel code gets executed down the call stack.

In general, we identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with an instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with simple grep, but also with an IDE like CLion from JetBrains, which can more efficiently find all direct - and, more importantly, indirect - usages of the call sites identified above.

So this patch lays the groundwork by defining the inline assembly to pre-fault the stack where necessary, and introduces two build parameters - CONF_lazy_stack and CONF_lazy_stack_invariant - that are disabled by default. The first one is used in all places to enable the lazy stack logic, and the second one is used to add code with some related invariants that will help us reason about the code and decide whether to do nothing, pre-fault the stack "blindly", or do so conditionally. The remaining 7 patches mostly add the pre-fault code in the relevant places, but also annotate code with some invariants using assert().

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
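Rule 5's dynamic check can be sketched as follows. Here `preemptable()` and `irq_enabled()` are self-contained stand-ins for OSv's sched::preemptable() and arch::irq_enabled(), and a portable volatile read stands in for the single inline-assembly load the patch defines (which in the real code is compiled in only when CONF_lazy_stack is set):

```cpp
#include <cassert>

// Stand-ins for sched::preemptable() and arch::irq_enabled(); in OSv these
// query per-thread/per-cpu state. Here they always return true so the
// sketch is self-contained.
static bool preemptable() { return true; }
static bool irq_enabled() { return true; }

// Rule 5: pre-fault one page below the current stack position, but only
// when BOTH preemption and interrupts are enabled - otherwise a stack page
// fault here would violate the fault handler's constraints.
// Returns whether the pre-fault read was performed (for illustration only).
inline bool pre_fault_stack_conditionally() {
    if (!(preemptable() && irq_enabled()))
        return false;  // nested disable: must not risk a page fault here
    char probe;
    volatile char* below = &probe - 4096;  // one page (4096 bytes) down
    (void)*below;  // a single read; OSv uses one inline-asm load instead
    return true;
}
```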
This has been implemented by this series of commits - https://github.com/search?q=repo%3Acloudius-systems%2Fosv+%22lazy+stack%22&type=commits. Even though the lazy stack support is not enabled by default (see the lazy stack flag in
We currently allocate for each pthread, by default, a 1 MB stack. This allocation is done with mmap with mmu::mmap_populate, i.e., we force the entire allocation to be done up-front. This is a terrible waste: thread stacks are an excellent candidate for lazy allocation - they cannot use huge-pages anyway (they are less than 2MB, and have a guard page), and they often use much less memory than the full 1MB allotment.
Unfortunately, OSv crashes with lazy-population of stack threads, e.g., in tst-pipe.so, as explained in commit 41efdc1 which made the allocation immediate (with mmu::mmap_populate). The problem is that with lazy allocation of stacks, any code can cause us to need to allocate one more page for the stack, and cause a page fault, which may sleep (waiting on a mutex for memory allocation, for example). The crash happens when this happens in some OSv "kernel" code which has preemption disabled or irq disabled but runs on the user's stack (e.g., RCU lock, wait_until, the scheduler, or numerous other places), and a sleep is not allowed.
On the OSv mailing list, I proposed the following workaround: In preempt_disable() and irq_disable(), before we disable preemption, read a byte 4096 bytes ahead of the current stack top. Usually, this will add a tiny performance penalty (a single read, not even an if()), but if the next page of the stack is not yet allocated, it will cause the page fault to be taken here and now, before we disable preemption.
We also need to read a byte from the stack in thread::init_stack(), because new threads start with preemption disabled so we need them to start with a minimal part of the stack allocated.
If we want more than 4096 bytes of available stack to be guaranteed, we can read several bytes at 4096-byte stride, but this is slower than reading a single byte. Instead, we can consider doing this: read just one byte a few pages from the current stack position, and modify the page-fault handler to allocate more than one adjacent page on each page fault.
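The batched-fault idea in the last sentence boils down to address arithmetic in the page-fault handler. A hedged sketch with hypothetical names (OSv's real handler would then actually populate the chosen pages):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// On a stack page fault, plan to populate not just the faulting page but up
// to `batch` pages below it (stacks grow down, so lower pages are the ones
// needed next), clamped so we never reach below the start of the mapped range.
struct populate_plan {
    uintptr_t first_page;  // lowest page address to populate
    size_t npages;         // number of pages, ending at the faulting page
};

populate_plan plan_stack_fault(uintptr_t fault_addr, uintptr_t vma_start,
                               size_t batch, size_t page = 4096) {
    uintptr_t fault_page = fault_addr & ~(page - 1);
    uintptr_t lowest = fault_page - (batch - 1) * page;
    if (lowest < vma_start || lowest > fault_page)  // clamp (also on underflow)
        lowest = vma_start;
    return { lowest, (fault_page - lowest) / page + 1 };
}
```

A fault in the middle of the range would populate a full batch; a fault near the bottom of the mapping populates only down to its start.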
On the mailing list, Avi Kivity suggested other options to solve our lazily-allocated stack problem: