Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce stack usage by boxing File in Dist, CachePolicy and large futures #947

Merged
merged 4 commits into from
Jan 19, 2024

Conversation

konstin
Copy link
Member

@konstin konstin commented Jan 17, 2024

Windows has a default stack size of 1MB, which makes puffin often fail with stack overflows. The PR reduces stack size by three changes:

  • Boxing File in Dist, reducing the size from 496 to 240.
  • Boxing the largest futures.
  • Boxing CachePolicy

Method

Debugging happened on linux using #941 to limit the stack size to 1MB. Used ran the command below.

RUSTFLAGS=-Zprint-type-sizes cargo +nightly build -p puffin-cli -j 1 > type-sizes.txt && top-type-sizes -w -s -h 10 < type-sizes.txt > sizes.txt

The main drawback is top-type-sizes not saying what the __awaitee is, so it requires manually looking up with a future with matching size.

When the brotli features on reqwest is active, a lot of brotli types show up. Toggling this feature however seems to have no effect. I assume they are false positives since the brotli crate has elaborate control about allocation. The sizes are therefore shown with the feature off.

Results

The largest future goes from 12208B to 6416B, the largest type (PrioritizedDistribution, see also #948) from 17448B to 9264B. Full diff: https://gist.github.com/konstin/62635c0d12110a616a1b2bfcde21304f

For the second commit, i iteratively boxed the largest file until the tests passed, then with an 800KB stack limit looked through the backtrace of a failing test and added some more boxing.

Quick benchmarking showed no difference:

$ hyperfine --warmup 2 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" 
Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent
  Time (mean ± σ):      49.2 ms ±   3.0 ms    [User: 39.8 ms, System: 24.0 ms]
  Range (min … max):    46.6 ms …  63.0 ms    55 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent
  Time (mean ± σ):      47.4 ms ±   3.2 ms    [User: 41.3 ms, System: 20.6 ms]
  Range (min … max):    44.6 ms …  60.5 ms    62 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  target/profiling/puffin-dev resolve meine_stadt_transparent ran
    1.04 ± 0.09 times faster than target/profiling/main-dev resolve meine_stadt_transparent

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about boxing BuiltDist and SourceDist instead? That should bring the size of Dist down considerably. You can even leave the box on File.

At least for the boxes inside types like File, those look like marginal allocations to me. (That is, File already has a bunch of tiny little allocations in it.) The main other possible impact is the overall effect of an extra pointer dereference, but usually that has to be in some very critical section to have much of an effect. But still a good idea to run benchmarks.

@@ -201,21 +202,25 @@ pub struct PathSourceDist {
pub editable: bool,
}

assert_eq_size!(Dist, [u8; 240]);
assert_eq_size!(BuiltDist, [u8; 240]);
assert_eq_size!(SourceDist, [u8; 168]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth having these? I have a few tests like this in various crates, but I usually regret them because they can change easily. (Whenever the type definitions or changed, or even compiling on different targets with a different pointer size.) What I usually try to do instead is express them as an inequality. That is, we probably don't care if any these types get smaller in size, but maybe we want to be flagged if they get bigger. Still, even then, I'd use these assertions sparingly personally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any recommendations how to express just that Dist shouldn't become larger, at least not without somebody explicitly raising the limit? static_assertions doesn't have a macro for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't bother with static_assertions for this personally. I would write a normal unit test with assert!(std::mem::size_of::<Dist>() <= N).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR summary says Dist was reduced to 420 bytes. Is that a typo in the summary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed

konstin added a commit that referenced this pull request Jan 17, 2024
On top of #947, we can also box `PrioritizedDistribution`.

In a simple benchmark, this seems to slightly improve performance when comparing only this commit to main, even though the benchmark is too noisy to establish significance:

```
$ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent"
  Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent
    Time (mean ± σ):      83.6 ms ±   2.0 ms    [User: 77.7 ms, System: 20.0 ms]
    Range (min … max):    81.4 ms …  98.2 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent
    Time (mean ± σ):      80.8 ms ±   2.2 ms    [User: 75.4 ms, System: 19.5 ms]
    Range (min … max):    78.6 ms …  98.6 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    target/profiling/puffin-dev resolve meine_stadt_transparent ran
      1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent
```

The effect on type sizes however is considerable ([downstack PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f) vs. [this PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)):

```
--- branch.txt  2024-01-17 14:26:01.826085176 +0100
+++ boxed-prioritized-dist.txt  2024-01-17 14:25:57.101900963 +0100
@@ -1,19 +1,3 @@
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8
    8960 values
      32 header
@@ -74,10 +58,23 @@
          40 __tracing_attr_span
      64 variant Unresumed, Returned, Panicked

+5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8
+   5647 variant Suspend0
+       5576 __awaitee align=8
+         40 __tracing_attr_span
```
konstin added a commit that referenced this pull request Jan 17, 2024
On top of #947, we can also box `PrioritizedDistribution`.

In a simple benchmark, this seems to slightly improve performance when comparing only this commit to main, even though the benchmark is too noisy to establish significance:

```
$ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent"
  Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent
    Time (mean ± σ):      83.6 ms ±   2.0 ms    [User: 77.7 ms, System: 20.0 ms]
    Range (min … max):    81.4 ms …  98.2 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent
    Time (mean ± σ):      80.8 ms ±   2.2 ms    [User: 75.4 ms, System: 19.5 ms]
    Range (min … max):    78.6 ms …  98.6 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    target/profiling/puffin-dev resolve meine_stadt_transparent ran
      1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent
```

The effect on type sizes however is considerable ([downstack PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f) vs. [this PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)):

```
--- branch.txt  2024-01-17 14:26:01.826085176 +0100
+++ boxed-prioritized-dist.txt  2024-01-17 14:25:57.101900963 +0100
@@ -1,19 +1,3 @@
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8
    8960 values
      32 header
@@ -74,10 +58,23 @@
          40 __tracing_attr_span
      64 variant Unresumed, Returned, Panicked

+5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8
+   5647 variant Suspend0
+       5576 __awaitee align=8
+         40 __tracing_attr_span
```
@konstin
Copy link
Member Author

konstin commented Jan 17, 2024

What do you think about boxing BuiltDist and SourceDist instead? That should bring the size of Dist down considerably. You can even leave the box on File.

This way ...

#[derive(Debug, Clone)]
pub enum Dist {
    Built(Box<BuiltDist>),
    Source(Box<SourceDist>),
}

... or this ways?

#[derive(Debug, Clone)]
pub struct Dist(Box<DistInner>);

#[derive(Debug, Clone)]
pub enum DistInner {
    Built(BuiltDist),
    Source(SourceDist),
}

It changes all our Dist matches, but if we have consensus i'll do it.

@BurntSushi
Copy link
Member

Hmmm yes I see the conundrum. Given those choices, and given the fact that you can't pattern match through a Box (bah), I'd probably go with the DistInner approach since BuiltDist and SourceDist are themselves both enums.

But... Looking at the types again, I think what I might do instead is:

  1. Make all struct types opaque. So that's RegistryBuiltDist, DirectUrlBuiltDist, PathBuiltDist, RegistrySourceDist, DirectUrlSourceDist, GitSourceDist and PathSourceDist.
  2. Expose their fields as public getter methods.
  3. For all the types above, add a box inside of them, probably requiring an Inner type for each.

This way, I think for the most part pattern matching won't be as annoying/awkward. Some might need to change in places where we match on the struct fields, but it should be pretty straight-forward to tweak those.

With that said, this is a larger and annoying refactor. So if this PR solves the immediate issue, I'd be happy punting on this for now.

@konstin konstin force-pushed the konsti/reduce-stack-usage branch from 78f1eba to f5f291d Compare January 18, 2024 11:13
@konstin konstin changed the base branch from main to konsti/set-explicit-thread-size January 18, 2024 11:14
@konstin konstin force-pushed the konsti/set-explicit-thread-size branch from c3242bb to 7b80943 Compare January 18, 2024 11:45
@konstin konstin force-pushed the konsti/reduce-stack-usage branch from f5f291d to 4399843 Compare January 18, 2024 11:46
@konstin konstin changed the title Reduce stack usage by boxing File in Dist and large futures Reduce stack usage by boxing File in Dist, CachePolicy and large futures Jan 18, 2024
@konstin
Copy link
Member Author

konstin commented Jan 18, 2024

Yeah i'd prefer to move ahead without refactoring Dist more, the payoff from that refactor looks much smaller than the other boxing.

@konstin konstin force-pushed the konsti/set-explicit-thread-size branch from 7b80943 to 5a63635 Compare January 18, 2024 14:28
@konstin konstin force-pushed the konsti/reduce-stack-usage branch from 4399843 to a61a2ab Compare January 18, 2024 14:28
@konstin konstin force-pushed the konsti/set-explicit-thread-size branch 3 times, most recently from d1204de to eb83df3 Compare January 18, 2024 16:30
@konstin konstin force-pushed the konsti/reduce-stack-usage branch from a61a2ab to dbc6bfa Compare January 18, 2024 16:30
@konstin konstin force-pushed the konsti/set-explicit-thread-size branch 2 times, most recently from 52878f7 to 1bb4eb7 Compare January 18, 2024 16:38
`Dist` was standing out when profiling stack sizes with top-type-sizes. Here, we trade an allocation per `Dist` for a more reasonable stack size.
@konstin konstin force-pushed the konsti/set-explicit-thread-size branch from 1bb4eb7 to 1dd47b9 Compare January 19, 2024 09:28
Looking through the stack trace of `allowed_transitive_url_dependency` with a 800KB stack, i found those two to be the major offenders. These changes make `allowed_transitive_url_dependency` pass with a 800KB stack.
@konstin konstin force-pushed the konsti/reduce-stack-usage branch from dbc6bfa to 5cf1bc3 Compare January 19, 2024 09:28
@konstin konstin merged commit 6748cf8 into konsti/set-explicit-thread-size Jan 19, 2024
3 checks passed
@konstin konstin deleted the konsti/reduce-stack-usage branch January 19, 2024 09:29
konstin added a commit that referenced this pull request Jan 19, 2024
konstin added a commit that referenced this pull request Jan 19, 2024
…e futures (#1004)

This is #947 again but this time
merging into main instead of downstack, sorry for the noise.

---

Windows has a default stack size of 1MB, which makes puffin often fail
with stack overflows. The PR reduces stack size by three changes:

* Boxing `File` in `Dist`, reducing the size from 496 to 240.
* Boxing the largest futures.
* Boxing `CachePolicy`

## Method

Debugging happened on linux using
#941 to limit the stack size to
1MB. Used ran the command below.

```
RUSTFLAGS=-Zprint-type-sizes cargo +nightly build -p puffin-cli -j 1 > type-sizes.txt && top-type-sizes -w -s -h 10 < type-sizes.txt > sizes.txt
```

The main drawback is top-type-sizes not saying what the `__awaitee` is,
so it requires manually looking up with a future with matching size.

When the `brotli` features on `reqwest` is active, a lot of brotli types
show up. Toggling this feature however seems to have no effect. I assume
they are false positives since the `brotli` crate has elaborate control
about allocation. The sizes are therefore shown with the feature off.

## Results

The largest future goes from 12208B to 6416B, the largest type
(`PrioritizedDistribution`, see also #948) from 17448B to 9264B. Full
diff: https://gist.github.com/konstin/62635c0d12110a616a1b2bfcde21304f

For the second commit, i iteratively boxed the largest file until the
tests passed, then with an 800KB stack limit looked through the
backtrace of a failing test and added some more boxing.

Quick benchmarking showed no difference:

```console
$ hyperfine --warmup 2 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent" 
Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent
  Time (mean ± σ):      49.2 ms ±   3.0 ms    [User: 39.8 ms, System: 24.0 ms]
  Range (min … max):    46.6 ms …  63.0 ms    55 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent
  Time (mean ± σ):      47.4 ms ±   3.2 ms    [User: 41.3 ms, System: 20.6 ms]
  Range (min … max):    44.6 ms …  60.5 ms    62 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  target/profiling/puffin-dev resolve meine_stadt_transparent ran
    1.04 ± 0.09 times faster than target/profiling/main-dev resolve meine_stadt_transparent
```
konstin added a commit that referenced this pull request Jan 19, 2024
On top of #947, we can also box
`PrioritizedDistribution`.

In a simple benchmark, this seems to slightly improve performance when
comparing only this commit to main, even though the benchmark is too
noisy to establish significance:

```
$ hyperfine --warmup 30 --runs 300 "target/profiling/main-dev resolve meine_stadt_transparent" "target/profiling/puffin-dev resolve meine_stadt_transparent"
  Benchmark 1: target/profiling/main-dev resolve meine_stadt_transparent
    Time (mean ± σ):      83.6 ms ±   2.0 ms    [User: 77.7 ms, System: 20.0 ms]
    Range (min … max):    81.4 ms …  98.2 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Benchmark 2: target/profiling/puffin-dev resolve meine_stadt_transparent
    Time (mean ± σ):      80.8 ms ±   2.2 ms    [User: 75.4 ms, System: 19.5 ms]
    Range (min … max):    78.6 ms …  98.6 ms    300 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    target/profiling/puffin-dev resolve meine_stadt_transparent ran
      1.03 ± 0.04 times faster than target/profiling/main-dev resolve meine_stadt_transparent
```

The effect on type sizes however is considerable ([downstack
PR](https://gist.github.com/konstin/38e6c774db541db46d61f1d4ea6b498f)
vs. [this
PR](https://gist.github.com/konstin/003a77fe7d7d246b0d535e3fc843cb36)):

```patch
--- branch.txt  2024-01-17 14:26:01.826085176 +0100
+++ boxed-prioritized-dist.txt  2024-01-17 14:25:57.101900963 +0100
@@ -1,19 +1,3 @@
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9264 alloc::collections::btree::node::InternalNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9168 data
-     96 edges
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::version::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
-9168 alloc::collections::btree::node::LeafNode<pep440_rs::Version, distribution_types::PrioritizedDistribution> align=8
-   9064 vals
-     88 keys
-
 8992 tokio::sync::mpsc::block::Block<hyper::client::dispatch::Envelope<http::request::Request<reqwest::async_impl::body::ImplStream>, http::response::Response<hyper::body::body::Body>>> align=8
    8960 values
      32 header
@@ -74,10 +58,23 @@
          40 __tracing_attr_span
      64 variant Unresumed, Returned, Panicked

+5648 {async fn body@crates/puffin-client/src/registry_client.rs:224:5: 224:30} align=8
+   5647 variant Suspend0
+       5576 __awaitee align=8
+         40 __tracing_attr_span
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants