-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce stack usage by boxing File
in Dist
, CachePolicy
and large futures
#947
Reduce stack usage by boxing File
in Dist
, CachePolicy
and large futures
#947
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about boxing BuiltDist
and SourceDist
instead? That should bring the size of Dist
down considerably. You can even leave the box on File
.
At least for the boxes inside types like File
, those look like marginal allocations to me. (That is, File
already has a bunch of tiny little allocations in it.) The main other possible impact is the overall effect of an extra pointer dereference, but usually that has to be in some very critical section to have much of an effect. But still a good idea to run benchmarks.
crates/distribution-types/src/lib.rs
Outdated
@@ -201,21 +202,25 @@ pub struct PathSourceDist { | |||
pub editable: bool, | |||
} | |||
|
|||
assert_eq_size!(Dist, [u8; 240]); | |||
assert_eq_size!(BuiltDist, [u8; 240]); | |||
assert_eq_size!(SourceDist, [u8; 168]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth having these? I have a few tests like this in various crates, but I usually regret them because they can change easily. (Whenever the type definitions or changed, or even compiling on different targets with a different pointer size.) What I usually try to do instead is express them as an inequality. That is, we probably don't care if any these types get smaller in size, but maybe we want to be flagged if they get bigger. Still, even then, I'd use these assertions sparingly personally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any recommendations how to express just that Dist
shouldn't become larger, at least not without somebody explicitly raising the limit? static_assertions
doesn't have a macro for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't bother with static_assertions
for this personally. I would write a normal unit test with assert!(std::mem::size_of::<Dist>() <= N)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR summary says Dist was reduced to 420 bytes. Is that a typo in the summary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fixed
On top of #947, we can also box `PrioritizedDistribution`. In a simple benchmark, this seems to slightly improve performance when comparing only this commit to main, even though the benchmark is too noisy to establish significance: ``` $ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent Time (mean ± σ): 83.6 ms ± 2.0 ms [User: 77.7 ms, System: 20.0 ms] Range (min … max): 81.4 ms … 98.2 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent Time (mean ± σ): 80.8 ms ± 2.2 ms [User: 75.4 ms, System: 19.5 ms] Range (min … max): 78.6 ms … 98.6 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary target/profiling/puffin-dev resolve meine_stadt_transparent ran 1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent ``` The effect on type sizes however is considerable ([downstack PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f) vs. [this PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)): ``` --- branch.txt 2024-01-17 14:26:01.826085176 +0100 +++ boxed-prioritized-dist.txt 2024-01-17 14:25:57.101900963 +0100 @@ -1,19 +1,3 @@ -9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8 8960 values 32 header @@ -74,10 +58,23 @@ 40 __tracing_attr_span 64 variant Unresumed, Returned, Panicked +5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8 + 5647 variant Suspend0 + 5576 __awaitee align=8 + 40 __tracing_attr_span ```
On top of #947, we can also box `PrioritizedDistribution`. In a simple benchmark, this seems to slightly improve performance when comparing only this commit to main, even though the benchmark is too noisy to establish significance: ``` $ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent Time (mean ± σ): 83.6 ms ± 2.0 ms [User: 77.7 ms, System: 20.0 ms] Range (min … max): 81.4 ms … 98.2 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent Time (mean ± σ): 80.8 ms ± 2.2 ms [User: 75.4 ms, System: 19.5 ms] Range (min … max): 78.6 ms … 98.6 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary target/profiling/puffin-dev resolve meine_stadt_transparent ran 1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent ``` The effect on type sizes however is considerable ([downstack PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f) vs. [this PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)): ``` --- branch.txt 2024-01-17 14:26:01.826085176 +0100 +++ boxed-prioritized-dist.txt 2024-01-17 14:25:57.101900963 +0100 @@ -1,19 +1,3 @@ -9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8 8960 values 32 header @@ -74,10 +58,23 @@ 40 __tracing_attr_span 64 variant Unresumed, Returned, Panicked +5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8 + 5647 variant Suspend0 + 5576 __awaitee align=8 + 40 __tracing_attr_span ```
This way ... #[derive(Debug, Clone)]
pub enum Dist {
Built(Box<BuiltDist>),
Source(Box<SourceDist>),
} ... or this ways? #[derive(Debug, Clone)]
pub struct Dist(Box<DistInner>);
#[derive(Debug, Clone)]
pub enum DistInner {
Built(BuiltDist),
Source(SourceDist),
} It changes all our |
Hmmm yes I see the conundrum. Given those choices, and given the fact that you can't pattern match through a But... Looking at the types again, I think what I might do instead is:
This way, I think for the most part pattern matching won't be as annoying/awkward. Some might need to change in places where we match on the struct fields, but it should be pretty straight-forward to tweak those. With that said, this is a larger and annoying refactor. So if this PR solves the immediate issue, I'd be happy punting on this for now. |
78f1eba
to
f5f291d
Compare
c3242bb
to
7b80943
Compare
f5f291d
to
4399843
Compare
File
in Dist
and large futuresFile
in Dist
, CachePolicy
and large futures
Yeah i'd prefer to move ahead without refactoring |
7b80943
to
5a63635
Compare
4399843
to
a61a2ab
Compare
d1204de
to
eb83df3
Compare
a61a2ab
to
dbc6bfa
Compare
52878f7
to
1bb4eb7
Compare
`Dist` was standing out when profiling stack sizes with top-type-sizes. Here, we trade an allocation per `Dist` for a more reasonable stack size.
1bb4eb7
to
1dd47b9
Compare
Looking through the stack trace of `allowed_transitive_url_dependency` with a 800KB stack, i found those two to be the major offenders. These changes make `allowed_transitive_url_dependency` pass with a 800KB stack.
dbc6bfa
to
5cf1bc3
Compare
…e futures (#1004) This is #947 again but this time merging into main instead of downstack, sorry for the noise. --- Windows has a default stack size of 1MB, which makes puffin often fail with stack overflows. The PR reduces stack size by three changes: * Boxing `File` in `Dist`, reducing the size from 496 to 240. * Boxing the largest futures. * Boxing `CachePolicy` ## Method Debugging happened on linux using #941 to limit the stack size to 1MB. Used ran the command below. ``` RUSTFLAGS=-Zprint-type-sizes cargo +nightly build -p puffin-cli -j 1 > type-sizes.txt && top-type-sizes -w -s -h 10 < type-sizes.txt > sizes.txt ``` The main drawback is top-type-sizes not saying what the `__awaitee` is, so it requires manually looking up with a future with matching size. When the `brotli` features on `reqwest` is active, a lot of brotli types show up. Toggling this feature however seems to have no effect. I assume they are false positives since the `brotli` crate has elaborate control about allocation. The sizes are therefore shown with the feature off. ## Results The largest future goes from 12208B to 6416B, the largest type (`PrioritizedDistribution`, see also #948) from 17448B to 9264B. Full diff: https://gist.github.com/konstin/62635c0d12110a616a1b2bfcde21304f For the second commit, i iteratively boxed the largest file until the tests passed, then with an 800KB stack limit looked through the backtrace of a failing test and added some more boxing. Quick benchmarking showed no difference: ```console $ hyperfine --warmup 2 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent Time (mean ± σ): 49.2 ms ± 3.0 ms [User: 39.8 ms, System: 24.0 ms] Range (min … max): 46.6 ms … 63.0 ms 55 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent Time (mean ± σ): 47.4 ms ± 3.2 ms [User: 41.3 ms, System: 20.6 ms] Range (min … max): 44.6 ms … 60.5 ms 62 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary target/profiling/puffin-dev resolve meine_stadt_transparent ran 1.04 ± 0.09 times faster than target/profiling/main-dev resolve meine_stadt_transparent ```
On top of #947, we can also box `PrioritizedDistribution`. In a simple benchmark, this seems to slightly improve performance when comparing only this commit to main, even though the benchmark is too noisy to establish significance: ``` $ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent Time (mean ± σ): 83.6 ms ± 2.0 ms [User: 77.7 ms, System: 20.0 ms] Range (min … max): 81.4 ms … 98.2 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent Time (mean ± σ): 80.8 ms ± 2.2 ms [User: 75.4 ms, System: 19.5 ms] Range (min … max): 78.6 ms … 98.6 ms 300 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary target/profiling/puffin-dev resolve meine_stadt_transparent ran 1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent ``` The effect on type sizes however is considerable ([downstack PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f) vs. [this PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)): ```patch --- branch.txt 2024-01-17 14:26:01.826085176 +0100 +++ boxed-prioritized-dist.txt 2024-01-17 14:25:57.101900963 +0100 @@ -1,19 +1,3 @@ -9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9168 data - 96 edges - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - -9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8 - 9064 vals - 88 keys - 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8 8960 values 32 header @@ -74,10 +58,23 @@ 40 __tracing_attr_span 64 variant Unresumed, Returned, Panicked +5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8 + 5647 variant Suspend0 + 5576 __awaitee align=8 + 40 __tracing_attr_span ```
Windows has a default stack size of 1MB, which makes puffin often fail with stack overflows. The PR reduces stack size by three changes:
File
inDist
, reducing the size from 496 to 240.CachePolicy
Method
Debugging happened on linux using #941 to limit the stack size to 1MB. Used ran the command below.
The main drawback is top-type-sizes not saying what the
__awaitee
is, so it requires manually looking up with a future with matching size.When the
brotli
features onreqwest
is active, a lot of brotli types show up. Toggling this feature however seems to have no effect. I assume they are false positives since thebrotli
crate has elaborate control about allocation. The sizes are therefore shown with the feature off.Results
The largest future goes from 12208B to 6416B, the largest type (
PrioritizedDistribution
, see also #948) from 17448B to 9264B. Full diff: https://gist.github.com/konstin/62635c0d12110a616a1b2bfcde21304fFor the second commit, i iteratively boxed the largest file until the tests passed, then with an 800KB stack limit looked through the backtrace of a failing test and added some more boxing.
Quick benchmarking showed no difference: