Lazily allocate thread stacks #143
Comments
Just as an example of the kind of savings I am hoping to get from this: right now, even in a small application ("make image=rogue", for example) I see 180 threads. The memory savings will be much larger in applications which use many pthreads, e.g., heavily threaded Java applications.
Linux's mmap() API has a flag, MAP_STACK, which is a no-op on Linux, but was added to the API in case "some architectures require special treatment for stack allocations" (see mmap(2)). Unfortunately, OSv indeed has a special requirement for stack memory: As explained in issue #143, we currently do not support demand-paged stacks (i.e., accessing the stack cannot cause a page fault), so the stack must be allocated pre-faulted and pinned. So if an application did us the courtesy of actually using MAP_STACK when mmap'ing a stack, let's return the favor by mapping pre-faulted memory which will actually work if used as a stack. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
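As a concrete illustration (not code from the patch), an application that wants its stack pre-faulted under OSv can pass MAP_STACK when mmap()ing it. A minimal sketch; `alloc_stack` is a hypothetical helper name, not an OSv API:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: allocate a pthread-style stack region.
// On Linux, MAP_STACK is a no-op; under OSv (per the commit above) it
// additionally requests pre-faulted, pinned memory that is safe to use
// as a stack.
void* alloc_stack(size_t size) {
    void* p = mmap(nullptr, size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK,
                   -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```

An application (or a libc's pthread implementation) would then hand this region to the new thread as its stack.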
In issue #790, Timmons C. Player discovered something dangerous about on-stack sched::thread objects: the application's stack may be mmap()ed, so the scheduler now needs to access an application's mmap()ed memory area. Those accesses can potentially have all sorts of problems (e.g., page faults in the scheduler, although in practice issue #143 precludes them). More concretely, the "lazy TLB flush" code added in commit 7e38453 means that the scheduler may see stale pages for an application's mmap()ed pages, so it cannot work with a sched::thread in an mmap()ed area.

This patch prevents on-stack sched::thread construction by hiding the constructor, and only allowing a thread to be created through the function sched::thread::make(...), which creates the object on the heap.

This patch is long, but it more-or-less contains just the following types of changes:

1. The change to sched.hh to hide the sched::thread constructor, and instead offer a sched::thread::make(...) function.

2. Every class which used a "sched::thread" member now uses a std::unique_ptr<sched::thread> member instead. One "casualty" of this change is that a class that used to hold a sched::thread member will now hold a sched::thread pointer, and pay for an extra allocation and indirection - even if we know for sure that the containing object will always be allocated on the heap (as is the case for pthread.cc's pthread::_thread, for example). This could be worked around by making the exceptional class a friend of sched::thread, so it can use the hidden constructor - but for now I decided the tiny performance benefit isn't important enough to justify the exception.

3. Every place which used "new sched::thread(...)" to create a thread now uses "sched::thread::make(...)" instead. Sadly we no longer allow "new sched::thread(...)", although there is nothing wrong with it, because hiding the constructor also disables the new expression.

4. One test in misc-wake.cc (which wasn't run by default anyway, because it is named "misc-*") was commented out because it relied on creating thread objects in mmap()ed areas.

One of the places modified is rcu_flush(), whose on-stack sched::thread objects are what caused issue #790.

Fixes #790.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
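The constructor-hiding pattern this commit message describes can be sketched in isolation. The class below is a simplified stand-in, not OSv's actual sched::thread:

```cpp
#include <memory>
#include <utility>
#include <cassert>

// Simplified stand-in for the sched.hh change: the constructor is private,
// so both on-stack construction ("thread t(...)") and "new thread(...)"
// fail to compile outside the class; make() is the only way to create an
// instance, and it always places the object on the heap.
class thread {
public:
    template <typename... Args>
    static std::unique_ptr<thread> make(Args&&... args) {
        // make() may call the private constructor because it is a member
        return std::unique_ptr<thread>(new thread(std::forward<Args>(args)...));
    }
    int id() const { return _id; }
private:
    explicit thread(int id) : _id(id) {}  // hidden: blocks on-stack objects
    int _id;
};
```

Classes that previously held a sched::thread member correspondingly switch to a std::unique_ptr member, which is the extra allocation and indirection mentioned above.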
In this mailing-list thread, https://groups.google.com/d/msg/osv-dev/LqUAWjzVpSc/Yf4bTdi0AwAJ, Roberto and Waldek noticed that a Java program with 400 user threads uses over half a gigabyte just for these threads' stacks, while on Linux this doesn't happen. It happens because of this issue (Java's default stack size is 1MB).
This seems like quite an important issue to fix from a memory-utilization standpoint. I wonder what the pros and cons of the 3 options are:
Hi @wkozaczuk, my answers and recommendations: In the single-byte-read page fault approach, the issue is not the application wanting to use more than a single page, but the kernel. It's not even all the kernel code - it's only the preemption-disabled parts of the kernel which will be limited to 4096 bytes of stack. I'm hoping that most of the kernel code does not have preemption disabled, so maybe 4096 bytes will be enough, but I really don't know. I also suggested how we can read more than one byte, at 4096-byte intervals, to pre-fault more than one page (at some performance cost), but I think the best option is perhaps the one I already mentioned - bringing in several adjacent pages on a single page fault. In other words, the code reads just one byte 4096 bytes back (stack grows downward...), but if this one read gets a page fault, the page fault handler will allocate not only this one page, but also the previous few pages, if they are still part of the same mapped range. Or, since this feature is very stack-specific, we can do this only for stacks.

About the think-code idea - I think it's complicated; we should only look into it if we're out of other options.
@nyh So it looks like your proposal is probably the best way to go. As I understand it, it would not only apply to application threads (where the app and OSv share the same stack anyway) but also to pure kernel threads (like the ZFS ones), correct? In theory the same solution could be applied to the syscall stack, which we malloc on first syscall execution.
@wkozaczuk this issue is relevant to mmap()-allocated stacks. Most other internal threads in OSv have smaller stacks which are allocated by malloc(), not by mmap(). See
Some takeaways from discussing the ongoing implementation effort - https://groups.google.com/d/msg/osv-dev/oaB5CHThc7M/2JCIvzgpAQAJ:
@nyh Do you have any comments about these most recent takeaways and the solution proposed on the mailing list?
This is the original patch from Jan 11, 2021, applied manually to resolve conflicts caused by subsequent changes to the relevant files since. Waldemar Kozaczuk
------------------------------------------------------------------
Since commit 695375f, new threads start with preemption disabled, and then immediately (in thread_main_c()) enable preemption. However, this code runs on the thread's stack, which can be supplied by the user. If this stack is lazily allocated (not pre-populated, with pages to be populated only on first use), we will get a page fault when thread_main_c() begins to run with preemption disabled. This can happen if the application creates a thread stack with mmap() without passing the MAP_STACK option.

Note that it is still strongly recommended to pre-populate thread stacks by using MAP_STACK (or MAP_POPULATE), because we still haven't committed a fix for issue #143. However, let's at least not crash immediately and consistently, as we did before this patch.

This patch simply reads from the top byte of the user-supplied stack before starting the thread - causing it to be faulted in (if needed) at that time instead of when the thread starts. The first stack page should be enough to start thread_main_c() and get to the point where it re-enables preemption.

This patch also includes a test reproducing this issue and demonstrating that it is fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
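The fix described in this commit message amounts to a single read. A hedged sketch of the idea, with illustrative names (not OSv's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Before starting a thread on a user-supplied stack, read the byte at the
// top of the stack (stacks grow downward, so the top byte is the first one
// used). If the stack was mmap()ed lazily, this read faults the first page
// in now - while sleeping on memory allocation is still allowed - instead
// of inside thread_main_c(), which runs with preemption disabled.
inline char prefault_stack_top(void* stack_base, size_t stack_size) {
    volatile char* top = static_cast<char*>(stack_base) + stack_size - 1;
    return *top;  // the read itself is the desired side effect
}
```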
This is the 1st of a total of eight patches that implement optional support for the so-called "lazy stack" feature. The lazy stack is well explained by issue #143 and allows saving a substantial amount of memory if an application spawns many pthreads with large stacks, by letting the stack grow dynamically as needed instead of getting pre-populated ahead of time.

The crux of this solution, like the previous versions, is based on the observation that OSv's memory fault handler requires that both interrupts and preemption be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped, we need to make sure that stack page faults NEVER happen in the relatively few places of kernel code that execute with either interrupts or preemption disabled. We satisfy this requirement by "pre-faulting" the stack - reading a byte one page (4096 bytes) down from the stack pointer - just before preemption or interrupts are disabled.

Now, the problem is complicated by the fact that kernel code A that disables preemption or interrupts may nest by calling another kernel function B that also disables preemption or interrupts, in which case function B should NOT try to pre-fault the stack, otherwise the fault handler will abort due to the violated constraint stated before. In short, we cannot "blindly" or unconditionally pre-fault the stack in all places before interrupts or preemption are disabled.

Some of the previous solutions modified both arch::irq_disable() and sched::preempt_disable() to check if both preemption and interrupts are enabled, and only then read a byte at a -4096 offset. Unfortunately, this makes it more costly than envisioned by Nadav Har'El - instead of a single instruction to read from memory, the compiler needs 4-5 instructions to check whether preemption and interrupts are enabled and perform the relevant jump. To make it worse, the implementation of arch::irq_enabled() is pretty expensive, at least on x64, and uses the stack with pushfq.

To avoid this, the previous solutions would add a new thread-local counter and pack the irq-disabling counter together with the preemption one. But even with this optimization, I found this approach to degrade performance quite substantially. For example, the memory allocation logic disables preemption in quite many places (see core/mempool.cc), and the corresponding test - misc-free-perf.cc - would show performance - the number of malloc()/free() calls executed per second - degrade on average by 15-20%.

So this latest version, implemented by this and the next 7 patches, takes a different approach. Instead of putting the conditional pre-faulting of the stack in both irq_disable() and preempt_disable(), we analyze OSv code to find all places where irq_disable() and/or preempt_disable() is called directly (or sometimes indirectly) and pre-fault the stack there if necessary. This is obviously way more laborious and prone to human error (we can miss some places), but it is way more performant (no noticeable performance degradation observed) compared to the earlier versions described above.

As we analyze all call sites, we need to make some observations to help us decide what exactly to do in each case:
- do nothing
- blindly pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.
Rule 2: Do nothing if the call site executes on a populated stack - this includes the above, but also code executing on an interrupt, exception or syscall stack.
Rule 3: Do nothing if the call site executes when we know that either interrupts or preemption are already disabled. A good example is an interrupt handler, or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.
Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts are enabled. In most cases this can only be deduced by analysing where the particular function is called. In general, any such function called by user code through libc would satisfy the condition. But sometimes it is tricky, because the kernel might be calling libc functions, such as malloc().
Rule 5: Otherwise, pre-fault the stack after determining dynamically that both sched::preemptable() and irq::enabled() hold.

One general rule is that all potential stack page faults happen on an application thread's stack when some kernel code gets executed down the call stack.

In general, we identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with an instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with simple grep, but also with an IDE like CLion from JetBrains, which can more efficiently find all direct - and, more importantly, indirect - usages of the call sites identified above.

So this patch lays the groundwork by defining the inline assembly to pre-fault the stack where necessary, and introduces two build parameters - CONF_lazy_stack and CONF_lazy_stack_invariant - that are disabled by default. The first one is used in all places to enable the lazy stack logic, and the second one is used to add code with some related invariants that will help us reason about the code and decide whether to do nothing, pre-fault the stack "blindly", or do so conditionally. The remaining 7 patches mostly add the pre-fault code in the relevant places, but also annotate code with some invariants using assert().

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
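Rule 5's dynamic check can be sketched as follows. Here `preemptable()` and `irq_enabled()` are self-contained stand-ins for OSv's sched::preemptable() and arch::irq_enabled(), and a portable volatile read stands in for the single inline-assembly load the patch defines (which in the real code is compiled in only when CONF_lazy_stack is set):

```cpp
#include <cassert>

// Stand-ins for sched::preemptable() and arch::irq_enabled(); in OSv these
// query per-thread/per-cpu state. Here they always return true so the
// sketch is self-contained.
static bool preemptable() { return true; }
static bool irq_enabled() { return true; }

// Rule 5: pre-fault one page below the current stack position, but only
// when BOTH preemption and interrupts are enabled - otherwise a stack page
// fault here would violate the fault handler's constraints.
// Returns whether the pre-fault read was performed (for illustration only).
inline bool pre_fault_stack_conditionally() {
    if (!(preemptable() && irq_enabled()))
        return false;  // nested disable: must not risk a page fault here
    char probe;
    volatile char* below = &probe - 4096;  // one page (4096 bytes) down
    (void)*below;  // a single read; OSv uses one inline-asm load instead
    return true;
}
```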
This has been implemented by this series of commits - https://github.com/search?q=repo%3Acloudius-systems%2Fosv+%22lazy+stack%22&type=commits. Even though the lazy stack support is not enabled by default (see the lazy stack flag in
We currently allocate for each pthread, by default, a 1 MB stack. This allocation is done with mmap with mmu::mmap_populate, i.e., we force the entire allocation to be done up-front. This is a terrible waste: thread stacks are an excellent candidate for lazy allocation - they cannot use huge-pages anyway (they are less than 2MB, and have a guard page), and they often use much less memory than the full 1MB allotment.
Unfortunately, OSv crashes with lazy-population of stack threads, e.g., in tst-pipe.so, as explained in commit 41efdc1 which made the allocation immediate (with mmu::mmap_populate). The problem is that with lazy allocation of stacks, any code can cause us to need to allocate one more page for the stack, and cause a page fault, which may sleep (waiting on a mutex for memory allocation, for example). The crash happens when this happens in some OSv "kernel" code which has preemption disabled or irq disabled but runs on the user's stack (e.g., RCU lock, wait_until, the scheduler, or numerous other places), and a sleep is not allowed.
On the OSv mailing list, I proposed the following workaround: In preempt_disable() and irq_disable(), before we disable preemption, read a byte 4096 bytes ahead of the current stack top. Usually, this will add a tiny performance penalty (a single read, not even an if()), but if the next page of the stack is not yet allocated, it will cause the page fault to be taken here and now, before we disable preemption.
We also need to read a byte from the stack in thread::init_stack(), because new threads start with preemption disabled so we need them to start with a minimal part of the stack allocated.
If we want more than 4096 bytes of available stack to be guaranteed, we can read several bytes at 4096-byte stride, but this is slower than reading a single byte. Instead, we can consider doing this: read just one byte a few pages from the current stack position, and modify the page-fault handler to allocate more than one adjacent page on each page fault.
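The batched-fault idea in the last sentence boils down to address arithmetic in the page-fault handler. A hedged sketch with hypothetical names (OSv's real handler would then actually populate the chosen pages):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// On a stack page fault, plan to populate not just the faulting page but up
// to `batch` pages below it (stacks grow down, so lower pages are the ones
// needed next), clamped so we never reach below the start of the mapped range.
struct populate_plan {
    uintptr_t first_page;  // lowest page address to populate
    size_t npages;         // number of pages, ending at the faulting page
};

populate_plan plan_stack_fault(uintptr_t fault_addr, uintptr_t vma_start,
                               size_t batch, size_t page = 4096) {
    uintptr_t fault_page = fault_addr & ~(page - 1);
    uintptr_t lowest = fault_page - (batch - 1) * page;
    if (lowest < vma_start || lowest > fault_page)  // clamp (also on underflow)
        lowest = vma_start;
    return { lowest, (fault_page - lowest) / page + 1 };
}
```

A fault in the middle of the range would populate a full batch; a fault near the bottom of the mapping populates only down to its start.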
On the mailing list, Avi Kivity suggested other options to solve our lazily-allocated stack problem: