[HOLD] chore: lto makes a smaller and/or faster executable #1398

u2 · 2019-08-13T10:26:34Z

before the size of ckb is 407m,
after lto = true it's 307M,
after codegen-units = 1 it's 256M

https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units

Link-time Optimization (LTO)
…except that we can. Enter the world of link-time optimization.

So the story is as follows: We can individually optimize each crate, and in fact all standard libraries ship in the optimized form. Once the compiler produces an optimized binary, it gets assembled to a single executable by a program called “the linker”. But we don’t need the entirety of standard library: a simple “Hello, world” program definitely does not need std::net for example. Yet, the linker is so stupid that it won’t try to remove unused parts of crates; it will just paste them in.

There is actually a good reason that the traditional linker behaves like this. The linker is commonly used in the C and C++ languages among others, and each file is compiled individually. This is a sharp difference from Rust where the entire crate is compiled altogether. Unless required functions are scattered throughout the files, the linker can fairly easily get rid of unused files all at once. It’s not perfect, but reasonably approximates what we want: removing unused functions. One disadvantage is that the compiler is unable to optimize function calls pointing to other files; it simply lacks the required information.

C and C++ folks had been fine with that approximation for decades, but in the recent decades they decided they'd had enough and started to provide an option to enable link-time optimization (LTO). In this scheme the compiler produces optimized binaries from each file without looking at others, and then the linker actively looks at them all and tries to optimize the binary. It is much harder than working with (internally simplified) sources, and it hugely increases the compilation time, but it is worth trying if a smaller and/or faster executable is needed.

So far we have talked about C and C++, but the LTO is much more beneficial for Rust. Cargo.toml has an option to enable LTO:

[profile.release]
lto = true
Did that work? Well, sort of:

$ ls -al target/release/hello
-rwxrwxr-x 1 lifthrasiir 615725 May 31 20:17 target/release/hello*
It had a larger effect than the optimization level, but not much. Maybe it is time to look at the executable itself.

refs: https://lifthrasiir.github.io/rustlog/why-is-a-rust-executable-large.html

Question: after strip the size is sharply down to 24M, but the stack info is not human readable, what about adding sentry_disable cfg and for this special artificial we could strip it before release. The disadvantage is that the user would always prefer the smaller size one.

nervos-bot · 2019-08-13T10:26:38Z

@xxuejie is assigned as the chief reviewer

xxuejie · 2019-08-13T10:36:35Z

Any ideas on how long it takes to build CKB in release mode? I think one major concern we had before was the linking time.

Also, I think the size of CKB's large binary is not due to LTO, but the fact that we chose to enable debug infos in release mode now. In fact I do recall the release binary of CKB without debug infos is around 26MB last time I tried

u2 · 2019-08-13T12:29:14Z

yes, this will increase the build time, but the LTO only enable when make prod. After the strip, the size is about 24M.

xxuejie · 2019-08-14T02:43:56Z

The 24M after strip alone means nothing here, you need to compare that with CKB without LTO after strip as well to see if this actually makes sense.

In addition it's not just using make prod, when debugging contracts, release mode is used quite frequently, I'm afraid saying "this will increase the build time" makes no sense, how much time will LTO bring to the table? With such a change, I think we will need the real numbers.

u2 · 2019-08-14T06:32:05Z

Either -C lto=yes or -C lto=on would cause the error below

error: cannot prefer dynamic linking when performing LTO

note: only 'staticlib', 'bin', and 'cdylib' outputs are supported with LTO

so move lot in the Cargo.toml.

u2 · 2019-08-14T06:35:21Z

The 24M after strip alone means nothing here, you need to compare that with CKB without LTO after strip as well to see if this actually makes sense.

In addition it's not just using make prod, when debugging contracts, release mode is used quite frequently, I'm afraid saying "this will increase the build time" makes no sense, how much time will LTO bring to the table? With such a change, I think we will need the real numbers.

When using the make build, the LTO is off.

u2 · 2019-08-14T07:22:16Z

build time:
lto: 29m 45s
before: 22m 07s

benchmark:
when RUSTFLAGS="-C codegen-units=1" and LTO is on, the benchmark shows that always_success side_branch has improved by about 20%, others have no changes.

strip:
without LTO: 407M
without LTO strip: 26M

zhangsoledad · 2019-08-16T12:45:09Z

https://doc.rust-lang.org/rustc/profile-guided-optimization.html
I incline to try PGO-optimized in 1.37.0 stable

# STEP 1: Compile the binary with instrumentation
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs

# STEP 2: Run the binary a few times, maybe with common sets of args.
#         Each run will create or update `.profraw` files in /tmp/pgo-data
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv

# STEP 3: Merge and post-process all the `.profraw` files in /tmp/pgo-data
llvm-profdata merge -o ./merged.profdata /tmp/pgo-data

# STEP 4: Use the merged `.profdata` file during optimization. All `rustc`
#         flags have to be the same.
rustc -Cprofile-use=./merged.profdata -O ./main.rs

u2 · 2019-08-16T13:13:00Z

https://doc.rust-lang.org/rustc/profile-guided-optimization.html
I incline to try PGO-optimized in 1.37.0 stable

# STEP 1: Compile the binary with instrumentation
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs

# STEP 2: Run the binary a few times, maybe with common sets of args.
#         Each run will create or update `.profraw` files in /tmp/pgo-data
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv

# STEP 3: Merge and post-process all the `.profraw` files in /tmp/pgo-data
llvm-profdata merge -o ./merged.profdata /tmp/pgo-data

# STEP 4: Use the merged `.profdata` file during optimization. All `rustc`
#         flags have to be the same.
rustc -Cprofile-use=./merged.profdata -O ./main.rs

I think they are not mutually exclusive.

zhangsoledad · 2019-08-21T13:03:49Z

It's not urgent! We don't have to chage it immediately.
There are still so many questions, even rust team can't just simply and clearly clarify effects of lto, thinlto and codegen-units, rust-lang/rust#48518.
So hold it.
We need more comprehensive tests and research.

zhangsoledad · 2019-08-21T13:23:58Z

RUSTFLAGS="-C codegen-units=1" cargo bench
always_success main_branch/100 time: [295.99 ms 304.90 ms 311.70 ms]

cargo bench
always_success main_branch/100 time: [272.48 ms 291.38 ms 302.40 ms]

RUSTFLAGS="-C codegen-units=1" get slightly worse on my machine, this may be unrelated volatility, but we can't jump to conclusions.

nervos-bot

Hold as requested by @doitian.

doitian · 2019-08-21T13:25:33Z

Hold since we have to evaluate whether the change will affect performance.

u2 requested a review from a team August 13, 2019 10:26

nervos-bot bot assigned xxuejie Aug 13, 2019

zhangsoledad added s:hold Status: Put this issue on hold. t:research Type: Research topic or direction. labels Aug 16, 2019

doitian added the s:waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. label Aug 20, 2019

chore: lto makes a smaller and/or faster executable

1f96503

u2 added s:waiting-on-reviewers Status: Waiting for Review and removed s:hold Status: Put this issue on hold. s:waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. labels Aug 21, 2019

zhangsoledad added the s:hold Status: Put this issue on hold. label Aug 21, 2019

doitian changed the title ~~chore: lto makes a smaller and/or faster executable~~ [HOLD] chore: lto makes a smaller and/or faster executable Aug 21, 2019

nervos-bot bot suggested changes Aug 21, 2019

View reviewed changes

u2 removed the s:waiting-on-reviewers Status: Waiting for Review label Aug 22, 2019

u2 mentioned this pull request Aug 22, 2019

[HOLD] update rust to 1.37.0 #1450

Closed

doitian added the s:bench-needed Status: Need to run benchmark label Aug 25, 2019

u2 closed this Sep 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HOLD] chore: lto makes a smaller and/or faster executable #1398

[HOLD] chore: lto makes a smaller and/or faster executable #1398

u2 commented Aug 13, 2019 •

edited

Loading

nervos-bot bot commented Aug 13, 2019

xxuejie commented Aug 13, 2019 •

edited

Loading

u2 commented Aug 13, 2019

xxuejie commented Aug 14, 2019

u2 commented Aug 14, 2019

u2 commented Aug 14, 2019

u2 commented Aug 14, 2019 •

edited

Loading

zhangsoledad commented Aug 16, 2019

u2 commented Aug 16, 2019

zhangsoledad commented Aug 21, 2019 •

edited

Loading

zhangsoledad commented Aug 21, 2019

nervos-bot bot left a comment

doitian commented Aug 21, 2019

[HOLD] chore: lto makes a smaller and/or faster executable #1398

[HOLD] chore: lto makes a smaller and/or faster executable #1398

Conversation

u2 commented Aug 13, 2019 • edited Loading

nervos-bot bot commented Aug 13, 2019

xxuejie commented Aug 13, 2019 • edited Loading

u2 commented Aug 13, 2019

xxuejie commented Aug 14, 2019

u2 commented Aug 14, 2019

u2 commented Aug 14, 2019

u2 commented Aug 14, 2019 • edited Loading

zhangsoledad commented Aug 16, 2019

u2 commented Aug 16, 2019

zhangsoledad commented Aug 21, 2019 • edited Loading

zhangsoledad commented Aug 21, 2019

nervos-bot bot left a comment

Choose a reason for hiding this comment

doitian commented Aug 21, 2019

u2 commented Aug 13, 2019 •

edited

Loading

xxuejie commented Aug 13, 2019 •

edited

Loading

u2 commented Aug 14, 2019 •

edited

Loading

zhangsoledad commented Aug 21, 2019 •

edited

Loading