
feat: ResourceExhausted for memory limit in GroupedHashAggregateStream #4371

Merged: 3 commits merged into apache:master on Nov 28, 2022

Conversation

@crepererum (Contributor) commented Nov 25, 2022

Which issue does this PR close?

For #3940.

Update: This does NOT close the issue. I forgot the no-group version (AggregateStream). Will do that in a follow-up PR. It's a rather simple change though.

Rationale for this change

Ensure that users don't run out of memory while performing group-by operations. This is especially important for servers or multi-tenant systems.

What changes are included in this PR?

This is similar to #4202. It includes an additional type StreamType so we can double-check our test setup (namely: that the stream we request is actually the stream version we want).
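For illustration, the helper is roughly this shape (a sketch; the variant names are assumed, not copied from the PR):

```rust
// Sketch of the `StreamType` test helper described above (variant names
// assumed for illustration). Tests can match on the returned variant to
// assert which stream implementation a plan actually produced.
pub(crate) enum StreamType {
    AggregateStream(AggregateStream),
    GroupedHashAggregateStream(GroupedHashAggregateStream),
    GroupedHashAggregateStreamV2(GroupedHashAggregateStreamV2),
}
```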

Are these changes tested?

Extended test_oom. Also here are the perf results:

cargo bench -p datafusion --bench aggregate_query_sql -- --baseline issue3940d-pre
   Compiling datafusion v14.0.0 (/home/mneumann/src/arrow-datafusion/datafusion/core)
   Compiling parquet-test-utils v0.1.0 (/home/mneumann/src/arrow-datafusion/parquet-test-utils)
    Finished bench [optimized] target(s) in 5m 11s
     Running benches/aggregate_query_sql.rs (target/release/deps/aggregate_query_sql-0659981fac849434)
aggregate_query_no_group_by 15 12
                        time:   [686.71 µs 688.17 µs 689.79 µs]
                        change: [+0.9423% +1.5099% +2.1284%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

aggregate_query_no_group_by_min_max_f64
                        time:   [637.75 µs 640.91 µs 644.19 µs]
                        change: [+0.9089% +1.5396% +2.1740%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

aggregate_query_no_group_by_count_distinct_wide
                        time:   [2.5239 ms 2.5437 ms 2.5641 ms]
                        change: [+1.5365% +2.6581% +3.8220%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [1.7286 ms 1.7392 ms 1.7498 ms]
                        change: [+1.2376% +2.3389% +3.5532%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

aggregate_query_group_by
                        time:   [2.2890 ms 2.3063 ms 2.3241 ms]
                        change: [+1.7896% +2.8160% +3.7350%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter
                        time:   [1.1419 ms 1.1444 ms 1.1472 ms]
                        change: [+1.1563% +1.6416% +2.1195%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by_u64 15 12
                        time:   [2.3083 ms 2.3237 ms 2.3394 ms]
                        change: [+1.7416% +2.7403% +3.7301%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter_u64 15 12
                        time:   [1.1492 ms 1.1530 ms 1.1572 ms]
                        change: [+0.4978% +1.1567% +1.7562%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by_u64_multiple_keys
                        time:   [15.070 ms 15.384 ms 15.706 ms]
                        change: [-2.0263% +0.9477% +4.0997%] (p = 0.55 > 0.05)
                        No change in performance detected.

aggregate_query_approx_percentile_cont_on_u64
                        time:   [3.8600 ms 3.8963 ms 3.9341 ms]
                        change: [-0.1229% +1.2647% +2.8308%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

aggregate_query_approx_percentile_cont_on_f32
                        time:   [3.2601 ms 3.2871 ms 3.3147 ms]
                        change: [-0.5989% +0.7136% +2.0221%] (p = 0.27 > 0.05)
                        No change in performance detected.

I think the regressions (<3%) are within the safety margin for such a crucial feature (and also within what a laptop can reliably reproduce).

Are there any user-facing changes?

The V1 group-by operator can now emit a ResourceExhausted error if it runs out of memory. Note that the error is somewhat nested/wrapped due to #4172.
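For context, this is roughly how a user would hit the new error path, sketched against the DataFusion 14-era memory-manager API (`MemoryManagerConfig` was later replaced by `MemoryPool`; the table `t` is assumed to be registered beforehand):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::execution::memory_manager::MemoryManagerConfig;
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::{SessionConfig, SessionContext};

async fn grouped_query_with_tiny_budget() -> Result<()> {
    // Cap the memory manager at 1 KiB so the group-by state cannot fit.
    let runtime = RuntimeEnv::new(
        RuntimeConfig::new()
            .with_memory_manager(MemoryManagerConfig::try_new_limit(1024, 1.0)?),
    )?;
    let ctx = SessionContext::with_config_rt(SessionConfig::new(), Arc::new(runtime));

    // Assumes a table `t` was registered beforehand.
    let df = ctx.sql("SELECT c, COUNT(*) FROM t GROUP BY c").await?;
    let err = df.collect().await.unwrap_err();

    // The ResourceExhausted error arrives wrapped (see #4172), so match on
    // the rendered message rather than the top-level variant.
    assert!(err.to_string().contains("Resources exhausted"));
    Ok(())
}
```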

@github-actions github-actions bot added the core Core DataFusion crate label Nov 25, 2022
@@ -2295,7 +2295,7 @@ impl ScalarValue {
     /// Estimate size if bytes including `Self`. For values with internal containers such as `String`
     /// includes the allocated size (`capacity`) rather than the current length (`len`)
     pub fn size(&self) -> usize {
-        std::mem::size_of_val(&self)
+        std::mem::size_of_val(self)
@crepererum (author) commented:

The tests panicked due to an integer underflow. Apart from the vector/hashmap calculations being wrong, this spot was also kinda tricky: the size of &ScalarValue is 8 bytes, not 48 🤦.
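The pitfall is easy to reproduce with the standard library alone; inside `fn size(&self)`, `&self` is a `&&ScalarValue`, so `size_of_val` measured the 8-byte reference rather than the 48-byte value:

```rust
use std::mem::size_of_val;

fn main() {
    let v = String::from("hello");
    // `&v` is a `&String`: measures the String struct itself
    // (ptr + len + capacity = 24 bytes on 64-bit targets).
    assert_eq!(size_of_val(&v), 24);
    // `&&v` is a `&&String`: one `&` too many, so we measure the
    // reference instead (8 bytes on 64-bit targets).
    assert_eq!(size_of_val(&&v), 8);
}
```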

@alamb (Contributor) left a comment:

This looks great to me -- thank you @crepererum

I have some minor style suggestions, but nothing that would prevent this PR from being merged.

For other reviewers: I found the changes easier to see with a whitespace-blind diff: https://github.com/apache/arrow-datafusion/pull/4371/files?w=1

I'll plan to merge this PR tomorrow unless anyone else wants more time to review

cc @milenkovicm @yjshen @richox @tustvold

})
};

let stream = futures::stream::unfold(inner, |mut this| async move {
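The stream driver above is built with `futures::stream::unfold`. A self-contained sketch of the pattern (assuming the `futures` and `tokio` crates):

```rust
use futures::StreamExt;

#[tokio::main]
async fn main() {
    // `unfold` threads mutable state through an async closure; yielding
    // `Some((item, next_state))` emits an item, `None` ends the stream.
    let stream = futures::stream::unfold(0u32, |count| async move {
        (count < 3).then(|| (count, count + 1))
    });
    assert_eq!(stream.collect::<Vec<_>>().await, vec![0, 1, 2]);
}
```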

@@ -257,12 +313,32 @@ fn group_aggregate_batch(
                 accumulator_set,
                 indices: vec![row as u32], // 1.3
             };
+            // NOTE: do NOT include the `GroupState` struct size in here because this is captured by
+            // `group_states` (see allocation down below)
+            allocated += group_state
Reviewer (Contributor):

Figuring out how to encapsulate some of this accounting (so it isn't inlined into the code) would make it easier to maintain, I think. But I don't think that is required.

@crepererum (author) replied:

I think as long as the Allocator API in Rust isn't stable, this will remain a bit of a mess. Once it is stable, we could have very elegant memory accounting.
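To illustrate the point: on stable Rust today, the closest built-in tool is a process-wide counting `GlobalAlloc` like the sketch below (type and counter names are made up), which cannot attribute allocations to a single operator; a stable per-collection `Allocator` parameter would make that attribution precise and encapsulated.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

// Process-wide byte counter: every allocation in the program is counted,
// so per-operator accounting still has to be done by hand.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;
```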

@@ -326,10 +402,20 @@ fn group_aggregate_batch(
         )
     })
     .try_for_each(|(accumulator, values)| match mode {
-        AggregateMode::Partial => accumulator.update_batch(&values),
+        AggregateMode::Partial => {
+            let size_pre = accumulator.size();
Reviewer (Contributor):

👍

Reviewer (Contributor):

You might also consider pulling the size accounting to before/after the match to avoid the duplication

@crepererum (author) replied:

done
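The hoisted version looks roughly like this (a sketch of the pattern rather than the literal PR code; the `saturating_sub` guard is an assumption):

```rust
// Measure the accumulator once around the update and account only for
// growth, instead of repeating the bookkeeping in every match arm.
let size_pre = accumulator.size();
let res = match mode {
    AggregateMode::Partial => accumulator.update_batch(&values),
    AggregateMode::Final | AggregateMode::FinalPartitioned => {
        accumulator.merge_batch(&values)
    }
};
allocated += accumulator.size().saturating_sub(size_pre);
res
```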

baseline_metrics,
)?))
}
self.execute_typed(partition, context)
Reviewer (Contributor):

👍

@alamb alamb merged commit be1d376 into apache:master Nov 28, 2022
@alamb (Contributor) commented Nov 28, 2022

Thanks again @crepererum

@ursabot commented Nov 28, 2022

Benchmark runs are scheduled for baseline = fe8aee6 and contender = be1d376. be1d376 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

bors added a commit to rust-lang/rust-clippy that referenced this pull request Dec 24, 2022
Add size_of_ref lint

This addresses #9995, which likely raises a valid point about `std::mem::size_of_val()`: it is [very easy to pass a double-reference as the argument](apache/datafusion#4371 (comment)), which the function will happily accept and give back the size of _the reference_, not the size of the value _behind_ the reference. In the worst case, if the value matches the programmer's expectation, this seems to work, while in fact everything can go horribly wrong, e.g. on a different platform.

The size of a `&T` is independent of what `T` is, and people might want to use `std::mem::size_of_val()` to actually get the size of _any_ reference (e.g. via `&&()`). I would rather argue that this is always bad behavior, though ([instead](https://doc.rust-lang.org/reference/type-layout.html#pointers-and-references-layout), [and](https://doc.rust-lang.org/stable/std/primitive.usize.html#associatedconstant.BITS)). I therefore put this lint into `correctness`.

Since the problem is usually easily fixed by removing extra `&`, I went light on suggesting code.

---

changelog: New lint: [`size_of_ref`] [#10098](#10098)