Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use BTreeMap instead of a sorted Vec #5877

Conversation

therealprof
Copy link
Contributor

While looking into the still troublesome binary size I was investigating the noticeable amount of monomorphised sort functions. Looking at the few callers I noticed that they're creating custom Vecs and then sorting them with a custom lambda functions to sort them.

As it turns out, it's a win to insert the entries into a BTreeMap instead of creating a Vec and manually sorting it.

E.g. for the stdio-fixture binary, we're going from:

# cargo bloat --release
    Finished `release` profile [optimized] target(s) in 1.81s
    Analyzing target/release/stdio-fixture

 File  .text     Size        Crate Name
 1.7%   3.0%  15.6KiB    [Unknown] __mh_execute_header
 1.6%   2.8%  14.8KiB clap_builder clap_builder::parser::parser::Parser::get_matches_with
 1.5%   2.7%  14.1KiB clap_builder clap_builder::builder::command::Command::_build_self
 1.0%   1.8%   9.4KiB          std std::backtrace_rs::symbolize::gimli::resolve
 0.9%   1.6%   8.6KiB          std std::backtrace_rs::symbolize::gimli::Context::new
 0.8%   1.4%   7.1KiB          std gimli::read::dwarf::Unit<R>::new
 0.7%   1.3%   7.0KiB clap_builder <clap_builder::error::format::RichFormatter as clap_builder::error::format::ErrorFormatter>::format_error
 0.7%   1.2%   6.4KiB clap_builder clap_builder::parser::validator::Validator::validate
 0.6%   1.1%   5.9KiB          std addr2line::ResUnit<R>::find_function_or_location::{{closure}}
 0.6%   1.1%   5.9KiB clap_builder <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
 0.6%   1.0%   5.4KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_all_args
 0.6%   1.0%   5.2KiB clap_builder clap_builder::output::usage::Usage::write_arg_usage
 0.5%   1.0%   5.1KiB          std addr2line::Lines::parse
 0.5%   1.0%   5.1KiB clap_builder clap_builder::parser::parser::Parser::react
 0.5%   0.9%   4.5KiB clap_builder clap_builder::output::usage::Usage::write_usage_no_title
 0.5%   0.8%   4.4KiB clap_builder clap_builder::parser::parser::Parser::parse_help_subcommand
 0.4%   0.8%   4.1KiB          std addr2line::function::Function<R>::parse_children
 0.4%   0.8%   4.1KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_templated_help
 0.4%   0.7%   3.9KiB          std gimli::read::unit::parse_attribute
 0.4%   0.6%   3.3KiB clap_builder <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
42.9%  76.5% 401.1KiB              And 1147 smaller methods. Use -n N to show more.
56.1% 100.0% 524.1KiB              .text section size, the file size is 934.9KiB

to:

# cargo bloat --release
    Finished `release` profile [optimized] target(s) in 0.48s
    Analyzing target/release/stdio-fixture

 File  .text     Size        Crate Name
 1.6%   2.9%  14.8KiB clap_builder clap_builder::parser::parser::Parser::get_matches_with
 1.5%   2.7%  14.1KiB clap_builder clap_builder::builder::command::Command::_build_self
 1.0%   1.8%   9.4KiB          std std::backtrace_rs::symbolize::gimli::resolve
 1.0%   1.7%   9.0KiB    [Unknown] __mh_execute_header
 0.9%   1.7%   8.6KiB          std std::backtrace_rs::symbolize::gimli::Context::new
 0.8%   1.4%   7.1KiB          std gimli::read::dwarf::Unit<R>::new
 0.8%   1.4%   7.0KiB clap_builder <clap_builder::error::format::RichFormatter as clap_builder::error::format::ErrorFormatter>::format_error
 0.7%   1.2%   6.4KiB clap_builder clap_builder::parser::validator::Validator::validate
 0.6%   1.1%   5.9KiB          std addr2line::ResUnit<R>::find_function_or_location::{{closure}}
 0.6%   1.1%   5.9KiB clap_builder <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
 0.6%   1.0%   5.4KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_all_args
 0.6%   1.0%   5.2KiB clap_builder clap_builder::output::usage::Usage::write_arg_usage
 0.6%   1.0%   5.1KiB          std addr2line::Lines::parse
 0.6%   1.0%   5.1KiB clap_builder clap_builder::parser::parser::Parser::react
 0.5%   0.9%   4.5KiB clap_builder clap_builder::output::usage::Usage::write_usage_no_title
 0.5%   0.8%   4.4KiB clap_builder clap_builder::parser::parser::Parser::parse_help_subcommand
 0.4%   0.8%   4.1KiB          std addr2line::function::Function<R>::parse_children
 0.4%   0.8%   4.1KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_templated_help
 0.4%   0.7%   3.9KiB          std gimli::read::unit::parse_attribute
 0.4%   0.6%   3.3KiB clap_builder <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
42.8%  76.1% 392.7KiB              And 1131 smaller methods. Use -n N to show more.
56.2% 100.0% 515.7KiB              .text section size, the file size is 918.2KiB

Turns out this saves some binary size and makes the code simpler.

Signed-off-by: Daniel Egger <daniel@eggers-club.de>
@epage
Copy link
Member

epage commented Jan 9, 2025

Thanks!

Improvement is improvement. BTreeMaps can be expensive code wise as well and i have wondered about a custom sort algorithm optimized for binary size.

@epage epage merged commit fd5e691 into clap-rs:master Jan 9, 2025
24 of 25 checks passed
@therealprof therealprof deleted the features/use-btreemap-instead-of-sortec-vec branch January 9, 2025 15:14
@therealprof
Copy link
Contributor Author

therealprof commented Jan 9, 2025

The main problem as I see it is the monomorphisation, through the use of generics. Even after disabling the suggestions feature, I still see uses of various sort functions:

 0.3%   0.5%   2.0KiB           std core::slice::sort::stable::quicksort::quicksort
 0.3%   0.5%   2.0KiB           std core::slice::sort::stable::quicksort::quicksort
 0.3%   0.5%   2.0KiB           std core::slice::sort::stable::quicksort::quicksort
 0.3%   0.5%   1.8KiB           std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.4%   1.7KiB           std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.4%   1.7KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.4%   1.7KiB           std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.4%   1.5KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.4%   1.5KiB           std core::slice::sort::unstable::quicksort::quicksort
 0.2%   0.4%   1.4KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.4%   1.4KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.3%   1.3KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.3%   1.2KiB           std core::slice::sort::stable::drift::sort
 0.2%   0.3%   1.1KiB           std core::slice::sort::shared::smallsort::small_sort_general
 0.2%   0.3%   1.1KiB           std core::slice::sort::unstable::quicksort::quicksort
 0.1%   0.3%     992B           std core::slice::sort::shared::smallsort::small_sort_general
 0.1%   0.2%     696B           std core::slice::sort::shared::smallsort::sort8_stable
 0.1%   0.2%     648B           std core::slice::sort::unstable::ipnsort
 0.1%   0.1%     484B           std core::slice::sort::unstable::ipnsort
 0.1%   0.1%     436B           std core::slice::sort::shared::smallsort::sort4_stable
 0.1%   0.1%     384B           std core::slice::sort::shared::smallsort::sort4_stable
 0.0%   0.1%     332B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     332B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     308B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     308B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     308B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     308B           std core::slice::sort::stable::driftsort_main
 0.0%   0.1%     308B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     284B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     256B           std core::slice::sort::unstable::heapsort::sift_down
 0.0%   0.1%     208B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     208B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     208B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     204B           std core::slice::sort::unstable::heapsort::heapsort
 0.0%   0.1%     204B           std core::slice::sort::unstable::heapsort::heapsort
 0.0%   0.1%     204B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     204B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.1%     204B           std core::slice::sort::shared::pivot::median3_rec
 0.0%   0.0%     196B           std core::slice::sort::unstable::heapsort::sift_down
 0.0%   0.0%      56B           std core::slice::sort::shared::smallsort::panic_on_ord_violation
 0.0%   0.0%      56B           std core::slice::sort::stable::drift::sqrt_approx

but I've no idea where they're called since the only callers outside of the suggestion code are in clap_builder/src/builder/debug_asserts.rs and clap_builder/src/util/flat_set.rs.

BTW: The use of clap 4.5.24 reduced the size of my application from 5.814MB to 5.797MB. 🥳

@hanna-kruppe
Copy link

Improvement is improvement. BTreeMaps can be expensive code wise as well and i have wondered about a custom sort algorithm optimized for binary size.

I’ve used https://crates.io/crates/tiny_sort for this in some projects and it worked well. More precisely, I’ve only audited and used the unstable sort, which is sufficient when there are no duplicate keys. The stable sort seems to have more and more subtle unsafe code, so I can’t vouch for it.

@therealprof
Copy link
Contributor Author

@hanna-kruppe The issue here is: I've no idea where those still existing sort functions are called from. The visible sorts are now gone (and if necessary we can switch back to using sort if that's beneficial with a dedicated sorting algorithm). Do you have any idea how to get a calltree showing where those sorts are being dragged in? I've tried putting breakpoints on them but I couldn't get them to be called at all and all the additional data rustc can output also doesn't show them...

@hanna-kruppe
Copy link

I’ve mostly done this sort of thing for webassembly modules with the help of twiggy (not quite a general call graph but often good enough especially when for functions only called from one place). I guess reverse engineering tools like binary ninja make this pretty easy to walk the static call graph if you’re used to them (I’m not). For a short amount of time, grepping for symbols in objdump output may be tolerable. (e.g., pick one of the driftsort copies, search for that symbol including hash to see who calls it, repeat process with a caller’s symbol until satisfied)

@hanna-kruppe
Copy link

You could also try building with -Csymbol-mangling-version=v0 so that the demangled symbols contain the name of the element type being sorted. Might give some useful hints about who could be responsible, worth a shot.

@therealprof
Copy link
Contributor Author

@hanna-kruppe Fantastic advise! Thanks. 😃

Unfortunately it doesn't help with the sorts...

 0.2%   0.4%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.4%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.4%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.3%   1.8KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.3%   1.7KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.3%   1.7KiB          std core::slice::sort::stable::drift::sort
 0.2%   0.3%   1.7KiB          std core::slice::sort::stable::quicksort::quicksort
 0.2%   0.3%   1.5KiB          std core::slice::sort::stable::drift::sort
 0.2%   0.3%   1.5KiB          std core::slice::sort::unstable::quicksort::quicksort
 0.1%   0.3%   1.4KiB          std core::slice::sort::stable::drift::sort
 0.1%   0.3%   1.4KiB          std core::slice::sort::stable::drift::sort
 0.1%   0.2%   1.3KiB          std core::slice::sort::stable::drift::sort
 0.1%   0.2%   1.2KiB          std core::slice::sort::stable::drift::sort
 0.1%   0.2%   1.1KiB          std core::slice::sort::shared::smallsort::small_sort_general
 0.1%   0.2%   1.1KiB          std core::slice::sort::unstable::quicksort::quicksort
 0.1%   0.2%     992B          std core::slice::sort::shared::smallsort::small_sort_general
 0.1%   0.1%     696B          std core::slice::sort::shared::smallsort::sort8_stable
 0.1%   0.1%     668B clap_builder clap_builder::output::help_template::option_sort_key
 0.1%   0.1%     648B          std core::slice::sort::unstable::ipnsort

@hanna-kruppe
Copy link

This is cargo-bloat output? You can try if --full-fn helps. Otherwise, you could try https://github.com/google/bloaty (which has no Rust-specific heuristics) and do demangling separately with rustfilt.

@therealprof
Copy link
Contributor Author

@hanna-kruppe Yes, it's cargo-bloat. I'm aware of --full-fn but no dice:

 0.3%   0.5%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort::h7349ebd2e415d6bb
 0.3%   0.5%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort::h40e4796c1700d77a
 0.3%   0.5%   2.0KiB          std core::slice::sort::stable::quicksort::quicksort::he447651e0b3d3a4f
 0.2%   0.4%   1.8KiB          std core::slice::sort::stable::quicksort::quicksort::hb009cddaa9188fc3
 0.2%   0.4%   1.7KiB          std core::slice::sort::stable::quicksort::quicksort::ha7ac1da61cfaf130
 0.2%   0.4%   1.7KiB          std core::slice::sort::stable::drift::sort::h29e182f5edd15c65
 0.2%   0.4%   1.7KiB          std core::slice::sort::stable::quicksort::quicksort::h224bf966919db889
 0.2%   0.4%   1.5KiB          std core::slice::sort::stable::drift::sort::h2367f69a43799f97
 0.2%   0.4%   1.5KiB          std core::slice::sort::unstable::quicksort::quicksort::h9bd2a31eab2cc185
 0.2%   0.3%   1.4KiB          std core::slice::sort::stable::drift::sort::h0638f2fd0fc9fd56
 0.2%   0.3%   1.4KiB          std core::slice::sort::stable::drift::sort::hc6653c23b9c20251
 0.2%   0.3%   1.3KiB          std core::slice::sort::stable::drift::sort::h819e9e4924e8bf4e

@hanna-kruppe
Copy link

hanna-kruppe commented Jan 11, 2025

I took the hard crude route of grepping through objdump -d target/release/stdio-fixture | rustfilt. Turns out that all these sort calls come from the standard library's copies of addr2line/gimli used for getting symbols and source code locations in backtraces (this would explain why opting into v0 symbol mangling didn't affect them). So there's nothing to be done about them from clap's side. But there might still be a small win on top of this PR by going from BTreeMap to size-optimized sort + Vec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants