Micro-optimize the __morestack fast path #3565

brson · 2012-09-23T22:08:48Z

This is very performance critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.

Here's what happens after calling into __morestack on the fast path:

Set up the frame pointer
Push all possible argument registers of the calling function in case the call to upcall_new_stack clobbers them
Shuffle the argument registers from the __morestack custom calling convention registers to the C calling convention registers used by upcall_new_stack
Call upcall_new_stack, through the indirection of the dynamic linker
Call get_sp_limit, an entire assembly function consisting of movq %fs:112, %rax
Compare the sp_limit to 0 and don't branch to the rust_get_current_task slow path. This branch always makes the same decision during a __morestack call.
Do some math to find the task pointer from the stack limit
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Get the minimum stack size
Do some simple math and pointer indirections to determine if task->stk->next is a big enough stack segment to use
Assert some invariants
memcpy the arguments from the old stack to the new stack
Align the new stack frame
Call reuse_valgrind_stack to give valgrind hints
Call record_stack_limit to execute another single instruction
Return the stack pointer to __morestack
Pop all the saved argument registers
Finally, call the original function
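To make the core of these steps concrete, here is a minimal sketch in C of the "is task->stk->next big enough" check, followed by the argument copy and frame alignment. It is not the runtime's actual code: the `stk_seg` layout, the helper names `seg_new`, `seg_size` and `new_stack_fast`, and the 16-byte alignment are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical segment header; the real runtime's layout may differ. */
struct stk_seg {
    struct stk_seg *prev;
    struct stk_seg *next;
    uintptr_t end;   /* one past the highest usable address */
    uint8_t data[];  /* usable stack memory; the stack grows down from end */
};

/* Allocate a detached segment with `size` usable bytes (illustration only). */
struct stk_seg *seg_new(size_t size) {
    struct stk_seg *seg = malloc(sizeof *seg + size);
    if (seg == NULL) return NULL;
    seg->prev = NULL;
    seg->next = NULL;
    seg->end = (uintptr_t)seg->data + size;
    return seg;
}

/* Usable bytes in a segment. */
size_t seg_size(const struct stk_seg *seg) {
    return (size_t)(seg->end - (uintptr_t)seg->data);
}

/* Fast path: reuse cur->next if it is large enough.
 * Returns the new stack pointer, or 0 when the slow (allocating) path
 * would have to run instead. */
uintptr_t new_stack_fast(struct stk_seg *cur, size_t requested,
                         const void *args, size_t args_size) {
    struct stk_seg *next = cur->next;
    if (next == NULL) return 0;

    /* Room for the requested frame, the copied arguments, and up to
     * 15 bytes of alignment slack. */
    size_t needed = requested + args_size + 15;
    if (seg_size(next) < needed) return 0;

    assert(next->prev == cur);  /* invariant check on the segment list */

    /* Place the arguments at the top of the new segment, 16-byte aligned. */
    uintptr_t sp = (next->end - args_size) & ~(uintptr_t)15;
    memcpy((void *)sp, args, args_size);
    return sp;
}
```

Even in this reduced form, the check itself is only a couple of loads and a compare; most of the steps listed above are overhead around it.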

And returning from the segment:

Call upcall_del_stack through the dynamic linker
Call get_sp_limit, an entire function consisting of movq %fs:112, %rax
Compare the sp_limit to 0, etc.
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Update the current stack pointer in the task
Call record_stack_limit

Potential optimizations:

Don't save the frame pointer - This could be tricky to make work with dwarf unwinding, due to the odd frame shapes around __morestack. Will be easier after rolling our own unwinder (see "Invoke instructions kick us off the FastISel path", #3551).
Inline get_sp_limit, record_stack_limit (see "Inline get_sp_limit, set_sp_limit, get_sp runtime functions", #2521)
Statically link upcall_new_stack and upcall_del_stack, hitting new dynamically linked upcalls for the slow path
Create a new version of rust_get_current_task that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from upcall_new_stack/del_stack.
Consider saving the task pointer between upcall_new_stack/del_stack to avoid calculating it again
Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly (see also "Remove unnecessary logic in new_stack_fast", #3566).
Put all asserts under the compile-time debug flag, including the canary check
Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
Ensure that upcall_new_stack doesn't use xmm registers and remove the xmm saves and restores in __morestack (see "Stop saving floating point registers in __morestack", #2043)
Inline upcall_del_stack into __morestack
Write the entire fast path in assembly
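As a rough illustration of two of these ideas (debug-only asserts, and keeping the needed data in the segment header rather than behind the task pointer), here is a small C sketch. The `RT_DEBUG` flag, the `RT_CHECK` macro, the `seg_header` layout and the canary value are all hypothetical names chosen for the example, not the runtime's real ones.

```c
#include <assert.h>
#include <stdint.h>

/* RT_DEBUG is a hypothetical compile-time flag. When it is not defined,
 * every RT_CHECK compiles to nothing, so the fast path pays no cost. */
#ifdef RT_DEBUG
#define RT_CHECK(cond) assert(cond)
#else
#define RT_CHECK(cond) ((void)0)
#endif

#define CANARY_VALUE 0xABCDABCDu  /* hypothetical canary constant */

/* Hypothetical header stored at the base of each stack segment. Keeping
 * the limit here avoids going through the task pointer to find it. */
struct seg_header {
    uint32_t canary;
    uintptr_t limit;
};

/* Read the stack limit straight from the segment header. The null check
 * and the canary check only exist in RT_DEBUG builds. */
uintptr_t seg_limit(const struct seg_header *h) {
    RT_CHECK(h != 0);
    RT_CHECK(h->canary == CANARY_VALUE);
    return h->limit;
}
```

Built without `-DRT_DEBUG`, `seg_limit` reduces to a single load; built with it, a corrupted canary aborts at the check.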


msullivan · 2013-07-12T19:32:23Z

When all does __morestack get called?

There has also been a bunch of discussion about possibly ditching segmented stacks?

thestinger · 2013-07-12T20:03:13Z

It's added to every single function, and LLVM does accounting of stack space and growth for us through our __morestack implementation. There are other growth/safety strategies we could use, like using guard pages + checks on allocations larger than the guard pages, but I think doing that would require patching LLVM.
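For readers unfamiliar with the guard-page alternative, the following C sketch shows the general idea on a POSIX system: map a stack with an inaccessible page at its low end, and treat any frame larger than that page as needing an explicit check, since it could step over the guard without touching it. The struct and function names are made up for this example, and it says nothing about how LLVM would need to be patched.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1  /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a stack protected by one guard page. */
struct guarded_stack {
    void *base;     /* start of the mapping; the guard page lives here */
    size_t guard;   /* guard size in bytes (one page) */
    size_t usable;  /* usable bytes above the guard */
};

/* Map guard + usable bytes and make the lowest page inaccessible.
 * Returns 0 on success, -1 on failure. */
int guarded_stack_alloc(struct guarded_stack *s, size_t usable) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    size_t guard = (size_t)page;

    /* Round the usable size up to a whole number of pages. */
    usable = (usable + guard - 1) / guard * guard;

    void *base = mmap(NULL, guard + usable, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return -1;

    /* Any access to the guard page now faults instead of silently
     * running off the end of the stack. */
    if (mprotect(base, guard, PROT_NONE) != 0) {
        munmap(base, guard + usable);
        return -1;
    }

    s->base = base;
    s->guard = guard;
    s->usable = usable;
    return 0;
}

/* A frame bigger than the guard could jump past it without touching it,
 * so such frames need an explicit check before use. */
int frame_needs_probe(const struct guarded_stack *s, size_t frame_size) {
    return frame_size > s->guard;
}
```

With this scheme, ordinary small frames need no prologue check at all; only the rare oversized frame pays for one.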

pnkfelix · 2013-09-20T10:11:08Z

visiting for triage, email from 2013-09-09

Right now split-stacks are turned off since they are not supported in the newrt. But I imagine most/all of the suggestions above could be applicable in the next implementation, unless we switch to an entirely new strategy (like using guard pages, as suggested by thestinger).

alexcrichton · 2013-10-29T18:40:09Z

In today's meeting we have decided to jettison segmented stacks.

alexcrichton · 2013-10-29T18:40:33Z

We only use __morestack for detecting stack overflow, and that doesn't need to get micro-optimized.

alexcrichton closed this as completed Oct 29, 2013

bors pushed a commit to rust-lang-ci/rust that referenced this issue May 15, 2021

Implement Serialize on IgnoreList (rust-lang#3565)

a7d4ec9

* Implement Serialize on IgnoreList * Add a test for rust-lang#3536

RalfJung pushed a commit to RalfJung/rust that referenced this issue May 5, 2024

Auto merge of rust-lang#3565 - RalfJung:libc, r=RalfJung

19a5d47

re-organize libc tests And share some more things across unices
