Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HOLD] chore: lto makes a smaller and/or faster executable #1398

Closed
wants to merge 1 commit into from
Closed

Conversation

u2
Copy link
Contributor

@u2 u2 commented Aug 13, 2019

before the size of ckb is 407m,
after lto = true it's 307M,
after codegen-units = 1 it's 256M

https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units

Link-time Optimization (LTO)
…except that we can. Enter the world of link-time optimization.

So the story is as follows: We can individually optimize each crate, and in fact all standard libraries ship in the optimized form. Once the compiler produces an optimized binary, it gets assembled to a single executable by a program called “the linker”. But we don’t need the entirety of standard library: a simple “Hello, world” program definitely does not need std::net for example. Yet, the linker is so stupid that it won’t try to remove unused parts of crates; it will just paste them in.

There is actually a good reason that the traditional linker behaves like this. The linker is commonly used in the C and C++ languages among others, and each file is compiled individually. This is a sharp difference from Rust where the entire crate is compiled altogether. Unless required functions are scattered throughout the files, the linker can fairly easily get rid of unused files all at once. It’s not perfect, but reasonably approximates what we want: removing unused functions. One disadvantage is that the compiler is unable to optimize function calls pointing to other files; it simply lacks the required information.

C and C++ folks had been fine with that approximation for decades, but in the recent decades they decided they'd had enough and started to provide an option to enable link-time optimization (LTO). In this scheme the compiler produces optimized binaries from each file without looking at others, and then the linker actively looks at them all and tries to optimize the binary. It is much harder than working with (internally simplified) sources, and it hugely increases the compilation time, but it is worth trying if a smaller and/or faster executable is needed.

So far we have talked about C and C++, but the LTO is much more beneficial for Rust. Cargo.toml has an option to enable LTO:

[profile.release]
lto = true
Did that work? Well, sort of:

$ ls -al target/release/hello
-rwxrwxr-x 1 lifthrasiir 615725 May 31 20:17 target/release/hello*
It had a larger effect than the optimization level, but not much. Maybe it is time to look at the executable itself.

refs: https://lifthrasiir.github.io/rustlog/why-is-a-rust-executable-large.html

Question: after strip the size is sharply down to 24M, but the stack info is not human readable, what about adding sentry_disable cfg and for this special artificial we could strip it before release. The disadvantage is that the user would always prefer the smaller size one.

@u2 u2 requested a review from a team August 13, 2019 10:26
@nervos-bot
Copy link

nervos-bot bot commented Aug 13, 2019

@xxuejie is assigned as the chief reviewer

@xxuejie
Copy link
Collaborator

xxuejie commented Aug 13, 2019

Any ideas on how long it takes to build CKB in release mode? I think one major concern we had before was the linking time.

Also, I think the size of CKB's large binary is not due to LTO, but the fact that we chose to enable debug infos in release mode now. In fact I do recall the release binary of CKB without debug infos is around 26MB last time I tried

@u2
Copy link
Contributor Author

u2 commented Aug 13, 2019

yes, this will increase the build time, but the LTO only enable when make prod. After the strip, the size is about 24M.

@xxuejie
Copy link
Collaborator

xxuejie commented Aug 14, 2019

The 24M after strip alone means nothing here, you need to compare that with CKB without LTO after strip as well to see if this actually makes sense.

In addition it's not just using make prod, when debugging contracts, release mode is used quite frequently, I'm afraid saying "this will increase the build time" makes no sense, how much time will LTO bring to the table? With such a change, I think we will need the real numbers.

@u2
Copy link
Contributor Author

u2 commented Aug 14, 2019

Either -C lto=yes or -C lto=on would cause the error below

error: cannot prefer dynamic linking when performing LTO

note: only 'staticlib', 'bin', and 'cdylib' outputs are supported with LTO

so move lot in the Cargo.toml.

@u2
Copy link
Contributor Author

u2 commented Aug 14, 2019

The 24M after strip alone means nothing here, you need to compare that with CKB without LTO after strip as well to see if this actually makes sense.

In addition it's not just using make prod, when debugging contracts, release mode is used quite frequently, I'm afraid saying "this will increase the build time" makes no sense, how much time will LTO bring to the table? With such a change, I think we will need the real numbers.

When using the make build, the LTO is off.

@u2
Copy link
Contributor Author

u2 commented Aug 14, 2019

build time:
lto: 29m 45s
before: 22m 07s

benchmark:
when RUSTFLAGS="-C codegen-units=1" and LTO is on, the benchmark shows that always_success side_branch has improved by about 20%, others have no changes.

strip:
without LTO: 407M
without LTO strip: 26M

@zhangsoledad
Copy link
Member

https://doc.rust-lang.org/rustc/profile-guided-optimization.html
I incline to try PGO-optimized in 1.37.0 stable

# STEP 1: Compile the binary with instrumentation
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs

# STEP 2: Run the binary a few times, maybe with common sets of args.
#         Each run will create or update `.profraw` files in /tmp/pgo-data
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv

# STEP 3: Merge and post-process all the `.profraw` files in /tmp/pgo-data
llvm-profdata merge -o ./merged.profdata /tmp/pgo-data

# STEP 4: Use the merged `.profdata` file during optimization. All `rustc`
#         flags have to be the same.
rustc -Cprofile-use=./merged.profdata -O ./main.rs

@zhangsoledad zhangsoledad added s:hold Status: Put this issue on hold. t:research Type: Research topic or direction. labels Aug 16, 2019
@u2
Copy link
Contributor Author

u2 commented Aug 16, 2019

https://doc.rust-lang.org/rustc/profile-guided-optimization.html
I incline to try PGO-optimized in 1.37.0 stable

# STEP 1: Compile the binary with instrumentation
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs

# STEP 2: Run the binary a few times, maybe with common sets of args.
#         Each run will create or update `.profraw` files in /tmp/pgo-data
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv

# STEP 3: Merge and post-process all the `.profraw` files in /tmp/pgo-data
llvm-profdata merge -o ./merged.profdata /tmp/pgo-data

# STEP 4: Use the merged `.profdata` file during optimization. All `rustc`
#         flags have to be the same.
rustc -Cprofile-use=./merged.profdata -O ./main.rs

I think they are not mutually exclusive.

@doitian doitian added the s:waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. label Aug 20, 2019
@u2 u2 added s:waiting-on-reviewers Status: Waiting for Review and removed s:hold Status: Put this issue on hold. s:waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. labels Aug 21, 2019
@zhangsoledad zhangsoledad added the s:hold Status: Put this issue on hold. label Aug 21, 2019
@zhangsoledad
Copy link
Member

zhangsoledad commented Aug 21, 2019

It's not urgent! We don't have to chage it immediately.
There are still so many questions, even rust team can't just simply and clearly clarify effects of lto, thinlto and codegen-units, rust-lang/rust#48518.
So hold it.
We need more comprehensive tests and research.

@zhangsoledad
Copy link
Member

RUSTFLAGS="-C codegen-units=1" cargo bench
always_success main_branch/100 time: [295.99 ms 304.90 ms 311.70 ms]

cargo bench
always_success main_branch/100 time: [272.48 ms 291.38 ms 302.40 ms]

RUSTFLAGS="-C codegen-units=1" get slightly worse on my machine, this may be unrelated volatility, but we can't jump to conclusions.

@doitian doitian changed the title chore: lto makes a smaller and/or faster executable [HOLD] chore: lto makes a smaller and/or faster executable Aug 21, 2019
Copy link

@nervos-bot nervos-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold as requested by @doitian.

@doitian
Copy link
Member

doitian commented Aug 21, 2019

Hold since we have to evaluate whether the change will affect performance.

@u2 u2 removed the s:waiting-on-reviewers Status: Waiting for Review label Aug 22, 2019
@u2 u2 mentioned this pull request Aug 22, 2019
@doitian doitian added the s:bench-needed Status: Need to run benchmark label Aug 25, 2019
@u2 u2 closed this Sep 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
s:bench-needed Status: Need to run benchmark s:hold Status: Put this issue on hold. t:research Type: Research topic or direction.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants