Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustc should use llvm.lifetime intrinsics to reduce stack space #15665

Closed
zwarich opened this issue Jul 14, 2014 · 2 comments · Fixed by #15863
Closed

rustc should use llvm.lifetime intrinsics to reduce stack space #15665

zwarich opened this issue Jul 14, 2014 · 2 comments · Fixed by #15863
Labels
A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@zwarich
Copy link

zwarich commented Jul 14, 2014

It recently came up on dev-servo that Rust-compiled code is using an egregious amount of stack space. One possible solution that has been brought up is that we should use llvm.lifetime.start and llvm.lifetime.end intrinsics to reduce the amount of stack space that LLVM uses for our functions: http://llvm.org/docs/LangRef.html#llvm-lifetime-start-intrinsic.

An incomplete attempt at this has been made in an earlier PR: #12004, and the numbers there were promising.

@emberian
Copy link
Member

cc @dotdash

@dotdash
Copy link
Contributor

dotdash commented Jul 20, 2014

Started working on this again.

dotdash added a commit to dotdash/rust that referenced this issue Jul 22, 2014
…eneral

Lifetime intrinsics help to reduce stack usage, because LLVM can apply
stack coloring to reuse the stack slots of dead allocas for new ones.

For example these functions now both use the same amount of stack, while
previous `bar()` used five times as much as `foo()`:

````rust
fn foo() {
  println("{}", 5);
}

fn bar() {
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
}
````

On top of that, LLVM can also optimize out certain operations when it
knows that memory is dead after a certain point. For example, it can
sometimes remove the zeroing used to cancel the drop glue. This is
possible when the glue drop itself was already removed because the
zeroing dominated the drop glue call. For example in:

````rust
pub fn bar(x: (Box<int>, int)) -> (Box<int>, int) {
    x
}
````

With optimizations, this currently results in:

````llvm
define void @_ZN3bar20h330fa42547df8179niaE({ i64*, i64 }* noalias nocapture nonnull sret, { i64*, i64 }* noalias nocapture nonnull) unnamed_addr #0 {
"_ZN29_$LP$Box$LT$int$GT$$C$int$RP$39glue_drop.$x22glue_drop$x22$LP$1347$RP$17h88cf42702e5a322aE.exit":
  %2 = bitcast { i64*, i64 }* %1 to i8*
  %3 = bitcast { i64*, i64 }* %0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %3, i8* %2, i64 16, i32 8, i1 false)
  tail call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 16, i32 8, i1 false)
  ret void
}
````

But with lifetime intrinsics we get:

````llvm
define void @_ZN3bar20h330fa42547df8179niaE({ i64*, i64 }* noalias nocapture nonnull sret, { i64*, i64 }* noalias nocapture nonnull) unnamed_addr #0 {
"_ZN29_$LP$Box$LT$int$GT$$C$int$RP$39glue_drop.$x22glue_drop$x22$LP$1347$RP$17h88cf42702e5a322aE.exit":
  %2 = bitcast { i64*, i64 }* %1 to i8*
  %3 = bitcast { i64*, i64 }* %0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %3, i8* %2, i64 16, i32 8, i1 false)
  tail call void @llvm.lifetime.end(i64 16, i8* %2)
  ret void
}
````

Fixes rust-lang#15665
bors added a commit that referenced this issue Jul 22, 2014
Lifetime intrinsics help to reduce stack usage, because LLVM can apply
stack coloring to reuse the stack slots of dead allocas for new ones.

For example these functions now both use the same amount of stack, while
previous `bar()` used five times as much as `foo()`:

````rust
fn foo() {
  println("{}", 5);
}

fn bar() {
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
  println("{}", 5);
}
````

On top of that, LLVM can also optimize out certain operations when it
knows that memory is dead after a certain point. For example, it can
sometimes remove the zeroing used to cancel the drop glue. This is
possible when the glue drop itself was already removed because the
zeroing dominated the drop glue call. For example in:

````rust
pub fn bar(x: (Box<int>, int)) -> (Box<int>, int) {
    x
}
````

With optimizations, this currently results in:

````llvm
define void @_ZN3bar20h330fa42547df8179niaE({ i64*, i64 }* noalias nocapture nonnull sret, { i64*, i64 }* noalias nocapture nonnull) unnamed_addr #0 {
"_ZN29_$LP$Box$LT$int$GT$$C$int$RP$39glue_drop.$x22glue_drop$x22$LP$1347$RP$17h88cf42702e5a322aE.exit":
  %2 = bitcast { i64*, i64 }* %1 to i8*
  %3 = bitcast { i64*, i64 }* %0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %3, i8* %2, i64 16, i32 8, i1 false)
  tail call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 16, i32 8, i1 false)
  ret void
}
````

But with lifetime intrinsics we get:

````llvm
define void @_ZN3bar20h330fa42547df8179niaE({ i64*, i64 }* noalias nocapture nonnull sret, { i64*, i64 }* noalias nocapture nonnull) unnamed_addr #0 {
"_ZN29_$LP$Box$LT$int$GT$$C$int$RP$39glue_drop.$x22glue_drop$x22$LP$1347$RP$17h88cf42702e5a322aE.exit":
  %2 = bitcast { i64*, i64 }* %1 to i8*
  %3 = bitcast { i64*, i64 }* %0 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %3, i8* %2, i64 16, i32 8, i1 false)
  tail call void @llvm.lifetime.end(i64 16, i8* %2)
  ret void
}
````

Fixes #15665
bors added a commit that referenced this issue Jul 24, 2014
The allocas used in match expression currently don't get good lifetime
markers, in fact they only get lifetime start markers, because their
lifetimes don't match to cleanup scopes.

While the bindings themselves are bog standard and just need a matching
pair of start and end markers, they might need them twice, once for a
guard clause and once for the match body.

The __llmatch alloca OTOH needs a single lifetime start marker, but
when there's a guard clause, it needs two end markers, because its
lifetime ends either when the guard doesn't match or after the match
body.

With these intrinsics in place, LLVM can now, for example, optimize
code like this:

````rust
enum E {
  A1(int),
  A2(int),
  A3(int),
  A4(int),
}

pub fn variants(x: E) {
  match x {
    A1(m) => bar(&m),
    A2(m) => bar(&m),
    A3(m) => bar(&m),
    A4(m) => bar(&m),
  }
}
````

To a single call to bar, using only a single stack slot. It still fails
to eliminate some of checks.

````gas
.Ltmp5:
	.cfi_def_cfa_offset 16
	movb	(%rdi), %al
	testb	%al, %al
	je	.LBB3_5
	movzbl	%al, %eax
	cmpl	$1, %eax
	je	.LBB3_5
	cmpl	$2, %eax
.LBB3_5:
	movq	8(%rdi), %rax
	movq	%rax, (%rsp)
	leaq	(%rsp), %rdi
	callq	_ZN3bar20hcb7a0d8be8e17e37daaE@PLT
	popq	%rax
	retq
````

Refs #15665
bors added a commit to rust-lang-ci/rust that referenced this issue Nov 13, 2023
…cola

internal: De-`unwrap` `generate_function.rs`

Fixes rust-lang/rust-analyzer#15398 (comment)

cc `@Inicola`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants