Return ResourceExhausted errors when memory limit is exceeded in GroupedHashAggregateStreamV2 (Row Hash) #4202

Merged: 4 commits, Nov 18, 2022

Conversation

@crepererum (Contributor):

Which issue does this PR close?

It doesn't close one, but works towards #3940 (V1 still needs to be migrated as well).

Rationale for this change

Ensure that users don't run out of memory while performing group-by operations. This is especially important for servers or multi-tenant systems.

What changes are included in this PR?

  • small cleanup regarding async usage (first commit)
  • use a nested construct (BoxStream) for GroupedHashAggregateStreamV2 so we can call into the async memory manager (I thought about NOT doing this, but I think it's worth considering because in the long run a group-by can gain another spillable operation; see the sketch below)
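
To illustrate why the boxed-stream construct helps, here is a minimal sketch of the pattern with toy types and a made-up `try_grow` function (not the PR's actual code): boxing the stream lets each item be produced by an async block that can await the memory manager while polling.

```rust
use futures::stream::{self, BoxStream, StreamExt};

// Toy stand-in for the async memory manager call (illustrative only).
async fn try_grow(bytes: usize) -> Result<(), String> {
    if bytes > 1024 {
        Err("ResourcesExhausted: cannot allocate".to_string())
    } else {
        Ok(())
    }
}

// Boxing the stream (BoxStream) lets each item be produced by an async
// block, so we can await the memory manager between batches.
fn accounted_stream(batch_sizes: Vec<usize>) -> BoxStream<'static, Result<usize, String>> {
    stream::iter(batch_sizes)
        .then(|size| async move {
            try_grow(size).await?; // may fail with a resource error
            Ok(size)
        })
        .boxed()
}
```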

Are these changes tested?

  • new test (test_oom)
  • perf results (see below)

Perf results:

❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline issue3940a-pre
    Finished bench [optimized] target(s) in 0.08s
     Running benches/aggregate_query_sql.rs (target/release/deps/aggregate_query_sql-e9e315ab7a06a262)
aggregate_query_no_group_by 15 12
                        time:   [654.77 µs 655.49 µs 656.29 µs]
                        change: [-1.6711% -1.2910% -0.8435%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

aggregate_query_no_group_by_min_max_f64
                        time:   [579.93 µs 580.59 µs 581.27 µs]
                        change: [-3.8985% -3.2219% -2.6198%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

aggregate_query_no_group_by_count_distinct_wide
                        time:   [2.4610 ms 2.4801 ms 2.4990 ms]
                        change: [-2.9300% -1.8414% -0.7493%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [1.6578 ms 1.6661 ms 1.6743 ms]
                        change: [-4.5391% -3.5033% -2.5050%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by
                        time:   [2.1767 ms 2.2045 ms 2.2486 ms]
                        change: [-4.1048% -2.5858% -0.3237%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter
                        time:   [1.0916 ms 1.0927 ms 1.0941 ms]
                        change: [-0.8524% -0.4230% -0.0724%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

aggregate_query_group_by_u64 15 12
                        time:   [2.2108 ms 2.2238 ms 2.2368 ms]
                        change: [-4.2142% -3.2743% -2.3523%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.5s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter_u64 15 12
                        time:   [1.0922 ms 1.0931 ms 1.0940 ms]
                        change: [-0.6872% -0.3192% +0.1193%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high severe

aggregate_query_group_by_u64_multiple_keys
                        time:   [14.714 ms 15.023 ms 15.344 ms]
                        change: [-5.8337% -2.7471% +0.2798%] (p = 0.09 > 0.05)
                        No change in performance detected.

aggregate_query_approx_percentile_cont_on_u64
                        time:   [3.7776 ms 3.8049 ms 3.8329 ms]
                        change: [-4.4977% -3.4230% -2.3282%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

aggregate_query_approx_percentile_cont_on_f32
                        time:   [3.1769 ms 3.1997 ms 3.2230 ms]
                        change: [-4.4664% -3.2597% -2.0955%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

I think the mild improvements are either flux or due to the somewhat
manual memory allocation pattern.

Are there any user-facing changes?

The V2 group-by operator can now emit a ResourceExhausted error if it runs out of memory. Note that the error is somewhat nested/wrapped due to #4172.

Most of it is refactoring to allow us to call the async memory subsystem while polling the stream. The actual memory accounting is rather easy, since it only ever grows except when the stream is dropped.

Helps with apache#3940 (not closing yet; V1 also needs to be done).
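
To illustrate the "only ever growing, freed on drop" shape, here is a minimal sketch (illustrative names, not the DataFusion API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Grow-only accounting that releases everything when the stream state drops.
struct AccountedState {
    tracker: Arc<AtomicUsize>, // shared "memory used" counter
    held: usize,               // bytes this consumer has reserved so far
}

impl AccountedState {
    fn grow(&mut self, bytes: usize) {
        self.held += bytes;
        self.tracker.fetch_add(bytes, Ordering::Relaxed);
    }
}

impl Drop for AccountedState {
    fn drop(&mut self) {
        // the only place memory is ever given back
        self.tracker.fetch_sub(self.held, Ordering::Relaxed);
    }
}
```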

@github-actions bot added the core (Core DataFusion crate) label on Nov 14, 2022.
@alamb requested a review from @yjshen on November 15, 2022.
@alamb changed the title from "wire memory management into GroupedHashAggregateStreamV2" to "Return ResourceExhausted errors when memory limit is exceeded in GroupedHashAggregateStreamV2 (Row Hash)" on Nov 15, 2022.
@alamb (Contributor) left a comment:

Thank you @crepererum. I went through this PR carefully and I think it could be merged as is. Thank you for the performance results.

Note to other reviewers: the memory limits are not enabled by default, so the additional accounting will not be used unless the memory manager limits are engaged.

I had some small suggestions, but none that I think are required.

I also found this PR easier to review using the whitespace-blind diff: https://github.com/apache/arrow-datafusion/pull/4202/files?w=1

cc @yjshen and @milenkovicm who I think have been working in this area.

Also cc @Dandandan as I know you are often interested in this type of code.

/// high due to lock contention) and pre-calculating the entire allocation for a whole [`RecordBatch`] is complicated or
/// expensive.
///
/// The pool will try to allocate a whole block of memory and gives back overallocated memory on [drop](Self::drop).
Contributor:

👌

@@ -70,6 +72,16 @@ use hashbrown::raw::RawTable;
/// [Compact]: datafusion_row::layout::RowType::Compact
/// [WordAligned]: datafusion_row::layout::RowType::WordAligned
pub(crate) struct GroupedHashAggregateStreamV2 {
Contributor:

This looks very much like other stream adapters we have in DataFusion -- perhaps we can name it something more general like SendableRecordBatchStreamWrapper or something and put it in

https://github.com/apache/arrow-datafusion/blob/c9361e0210861962074eb10d7e480949bb862b97/datafusion/core/src/physical_plan/stream.rs#L34

We can always do this as a follow-on PR as well.

@crepererum (Contributor Author):

Will do that in a follow-up, since migrating V1 will probably end up with the same helper.

// allocate more

// growth factor: 2, but at least 2 elements
let bump_elements = (group_state.indices.capacity() * 2).max(2);
Contributor:

I wonder if we could somehow encapsulate the memory manager interactions into functions on GroupAggrState rather than treating it like a struct. I don't think that is necessary.

However, encapsulating might:

  1. Keep this code manageable for future readers
  2. Allow the memory allocation routines to be unit tested (e.g., that the memory accounting is incremented correctly when new groups are added)

Contributor:

I tend to agree with @alamb here. IMHO group_aggregate_batch is too busy at the moment, and some kind of separation of concerns would help.

What if group_aggregate_batch returned how much more memory it allocated, and accounting were done after the method call? This would help encapsulate the aggregation algorithm and make it easier to swap out. I'm aware that it might not produce 100% correct results, but as we discussed in #3941 it is OK to have a small discrepancy for a short period of time.

Contributor:

Also, this way the end of the batch would be a "safe point" at which we could trigger a spill.

@crepererum (Contributor Author):

> I wonder if we could somehow encapsulate the memory manager interactions into functions on GroupAggrState rather than treating it like a struct.

That only works if all interactions with GroupState go through methods, not only a few of them, due to how Rust handles borrowing (fn f(&self) and fn f(&mut self) borrow the whole struct, so you cannot mutably borrow any member at the same time).
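
A minimal sketch of the borrow issue (hypothetical struct, not the PR's code):

```rust
struct AggrState {
    indices: Vec<usize>,
    accounting: usize,
}

impl AggrState {
    // `&mut self` borrows the WHOLE struct, not just the fields it touches.
    fn push_indexed(&mut self, i: usize) {
        self.indices.push(i);
        self.accounting += std::mem::size_of::<usize>();
    }
}

fn update(state: &mut AggrState) {
    let acc = &mut state.accounting; // mutably borrows only one field
    state.indices.push(42);          // OK: disjoint fields can be borrowed separately
    // state.push_indexed(7);        // ERROR if uncommented: the method call
    //                               // needs all of `state`, which is already
    //                               // mutably borrowed through `acc`
    *acc += 1;
}
```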

> What if group_aggregate_batch returns how much more memory it allocated, and accounting is done after the method call? ... Also, this way the end of the batch would be a "safe point" at which we could trigger a spill.

Fair, let me try that.

@crepererum (Contributor Author):

done.

Let me know if this looks better. I will pull out + document all the helper structs and traits when I port V1 (I want at least a 2nd consumer so I can make sure the interface makes sense).

if group_state.indices.capacity() == group_state.indices.len() {
// allocate more

// growth factor: 2, but at least 2 elements
Contributor:

Growth factors like this are sometimes capped at some large value (like 1G) to avoid the 2x memory overhead associated with large memory levels.

If we use 2x growth with no cap, you can get into situations where the table would fit in 36GB, but the code is trying to go from 32GB to 64GB and hits the limit even though the query could complete. This could always be handled in a follow-on PR; users can always disable the memory manager, let the allocations happen, and suffer OOMs if they want the current behavior.
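
A hedged sketch of what such a capped policy could look like (function name and cap value are illustrative):

```rust
/// Next capacity bump: double, but at least 2 elements and at most a fixed
/// cap, so e.g. a table that fits in ~36GB doesn't fail trying to jump
/// from 32GB straight to 64GB.
fn next_bump_elements(capacity: usize) -> usize {
    const MAX_BUMP_ELEMENTS: usize = 1 << 20; // illustrative cap
    (capacity * 2).max(2).min(MAX_BUMP_ELEMENTS)
}
```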

timer.done();

match result {
Ok(_) => continue,
Contributor:

IMHO, this would be the place to do something like:

 Ok(_) => {
    let new_data_size = this.aggr_state.get_current_size();
    let acquired = this.memory_manager.can_grow_directly(
        new_data_size - data_size_before_batch,
        data_size_before_batch,
    );
    if !acquired {
        this.aggr_state.spill();
        // free the pre-spill size, then start accounting from zero again
        this.memory_manager.record_free_then_acquire(new_data_size, 0);
    }
    continue;
}

We basically assume that group_aggregate_batch can get all the memory it needs, with no need for per-row interaction with the memory manager.

This would decouple processing and accounting.

@crepererum (Contributor Author):

The interaction is not per row, it's per batch. I can place the accounting here. The code you propose is basically the same as what currently runs, just inlined (it's the default impl of MemoryConsumer::try_grow).
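
For context, the default flow being referred to looks roughly like this sketch (reconstructed from the method names used in this thread; signatures are simplified assumptions, not DataFusion's exact API):

```rust
trait MemoryManagerLike {
    fn can_grow_directly(&self, additional: usize, current: usize) -> bool;
    fn record_free_then_acquire(&self, freed: usize, acquired: usize);
}

/// Try to grow by `additional` bytes; if the manager refuses, spill our own
/// state and re-acquire. (Error handling elided.)
fn try_grow(
    mm: &dyn MemoryManagerLike,
    current: usize,
    additional: usize,
    spill: impl FnOnce() -> usize, // spills state, returns bytes freed
) {
    if mm.can_grow_directly(additional, current) {
        return; // enough headroom, nothing else to do
    }
    let freed = spill();
    mm.record_free_then_acquire(freed, additional);
}
```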

Contributor:

Apologies, you're right @crepererum, it is per batch.

The reason why I believe moving it out makes sense is separation of concerns, but it's up to you.

For example, at line 363:

        // allocate memory
        // This happens AFTER we actually used the memory, but simplifies the whole accounting and we are OK with
        // overshooting a bit. Also this means we either store the whole record batch or not.
        memory_consumer.alloc(allocated).await?;

Can this trigger a spill? Will the state be consistent if a spill is triggered? My guess is it will not. It might be implementation specific, but it's hard to tell without understanding the memory management and store implementations.

fn insert_accounted(
&mut self,
x: Self::T,
hasher: impl Fn(&Self::T) -> u64,
Contributor:

This is coupling with the current implementation. For example, what if we decide to keep the state in a B-tree rather than a hash map (we need it sorted due to spill)?

@crepererum (Contributor Author):

Sure, memory accounting is ALWAYS coupled to the data structures that are used.

Contributor:

My bad @crepererum, ignore my comment, apologies.

@crepererum (Contributor Author):

Long-term, I wish Rust would stabilize the Allocator trait so we could plug it into the data structures and measure their usage directly (no need to guess).
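
Until then, the closest stable-Rust approximation is a process-wide counting allocator; a minimal sketch (global, not per-collection, which is exactly the limitation being described):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Counts every allocation and deallocation going through the global allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;
```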

// allocate memory
// This happens AFTER we actually used the memory, but simplifies the whole accounting and we are OK with
// overshooting a bit. Also this means we either store the whole record batch or not.
memory_consumer.alloc(allocated).await?;
Contributor:

As I mentioned above, should this call go before the return statement? If it triggers a spill, the internal state should be consistent.

@alamb (Contributor) left a comment:

I think this is looking great

https://github.com/apache/arrow-datafusion/pull/4202/files?w=1 shows the diff clearly

What are your thoughts @milenkovicm ?

@@ -418,6 +487,130 @@ impl std::fmt::Debug for AggregationState {
}
}

/// Accounting data structure for memory usage.
struct AggregationStateMemoryConsumer {
Contributor:

❤️

fn push_accounted(&mut self, x: Self::T, accounting: &mut usize);
}

impl<T> VecAllocExt for Vec<T> {
Contributor:

this is very nice
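
For readers following along, the idea can be sketched like this (the body is an assumption reconstructed from the diff fragments above, not copied from the PR):

```rust
/// Vec extension that records capacity growth at the moment it happens.
trait VecAllocExt {
    type T;
    fn push_accounted(&mut self, x: Self::T, accounting: &mut usize);
}

impl<T> VecAllocExt for Vec<T> {
    type T = T;

    fn push_accounted(&mut self, x: T, accounting: &mut usize) {
        if self.capacity() == self.len() {
            // growth factor: 2, but at least 2 elements (mirrors the diff above)
            let bump_elements = (self.capacity() * 2).max(2);
            *accounting += bump_elements * std::mem::size_of::<T>();
            self.reserve(bump_elements);
        }
        self.push(x);
    }
}
```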

@milenkovicm (Contributor):

> I think this is looking great
>
> https://github.com/apache/arrow-datafusion/pull/4202/files?w=1 shows the diff clearly
>
> What are your thoughts @milenkovicm ?

I think @crepererum did a fine job here.

Not sure if he will move memory_consumer.alloc(allocated).await?; to just before the return statement; otherwise it is spot on.

@crepererum (Contributor Author):

I'll move the alloc statement, give me a few minutes...

@crepererum (Contributor Author):

> I'll move the alloc statement, give me a few minutes...

done

@alamb (Contributor) left a comment:

Looks great. Thank you @milenkovicm and @crepererum. I will plan to merge this tomorrow unless I hear otherwise.

FYI @liukun4515 @Dandandan @avantgardnerio @andygrove

@alamb merged commit f3a65c7 into apache:master on Nov 18, 2022.
@ursabot commented on Nov 18, 2022:

Benchmark runs are scheduled for baseline = 09e1c91 and contender = f3a65c7. f3a65c7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: core (Core DataFusion crate)

4 participants