
GH-44052: [C++][Compute] Reduce the complexity of row segmenter #44053

Merged: 17 commits merged into apache:main on Sep 18, 2024

Conversation

zanmato1984 (Collaborator) commented Sep 11, 2024

Rationale for this change

As described in #44052, AnyKeysSegmenter::GetNextSegment currently has O(n*m) complexity, where n is the number of rows in a batch and m is the number of segments in that batch (a "segment" being a group of contiguous rows that share the same segment key; e.g., a segment-key column [1, 1, 2, 2, 1] contains three segments). This is because each invocation of the method computes the group ids of all the remaining rows in the batch, even though only the first group is of interest, so the rest of the computation is wasted.

In this PR I introduce a new API, GetSegments (and deprecate the old GetNextSegment), which computes the group ids only once and iterates over all the segments from that single result, avoiding the duplicated computation. This reduces the complexity from O(n*m) to O(n).
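For illustration, a minimal sketch of the call-site change (HandleSegment and the exact signatures are assumptions for the sketch, not code from this PR):

// Before: each call recomputes group ids for all remaining rows, O(n*m) overall.
int64_t offset = 0;
while (offset < batch.length) {
  ARROW_ASSIGN_OR_RAISE(auto segment, segmenter->GetNextSegment(batch, offset));
  HandleSegment(segment);  // hypothetical per-segment work
  offset = segment.offset + segment.length;
}

// After: group ids are computed once and all segments come back together, O(n).
ARROW_ASSIGN_OR_RAISE(auto segments, segmenter->GetSegments(batch));
for (const auto& segment : segments) {
  HandleSegment(segment);
}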

What changes are included in this PR?

  1. Because grouper.h is a public header, I assume RowSegmenter::GetNextSegment is a public API, so I only deprecate it instead of removing it.
  2. Implement the new API RowSegmenter::GetSegments and update the call-sites.
  3. Some reorganization of the segmenter code (mostly moving free functions into a class).
  4. A new benchmark for segmented aggregation. (The benchmark results are listed in the comments below; they show up to a 50x speedup, close to the expected O(n*m)-to-O(n) complexity reduction.)

Are these changes tested?

The existing tests are sufficient.

Are there any user-facing changes?

Yes.

This PR includes breaking changes to public APIs.

The API RowSegmenter::GetNextSegment is deprecated due to its inefficiency and replaced with a more efficient one, RowSegmenter::GetSegments.

zanmato1984 (Collaborator, Author) commented Sep 11, 2024

The benchmark numbers, run on my Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, show up to a 50x speedup, nearly the full O(n*m)-to-O(n) complexity reduction.

Before:

----------------------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:0        24934 ns        24906 ns        27871 bytes_per_second=9.80236Gi/s items_per_second=1.31565G/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:0        24729 ns        24687 ns        27358 bytes_per_second=9.88957Gi/s items_per_second=1.32736G/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:0       24630 ns        24616 ns        28100 bytes_per_second=9.91805Gi/s items_per_second=1.33118G/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:0       24633 ns        24620 ns        28112 bytes_per_second=9.91645Gi/s items_per_second=1.33096G/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:0      24679 ns        24656 ns        28651 bytes_per_second=9.90192Gi/s items_per_second=1.32901G/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1       199781 ns       199339 ns         3606 bytes_per_second=2.4495Gi/s items_per_second=164.383M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1       222783 ns       220778 ns         2905 bytes_per_second=2.21164Gi/s items_per_second=148.421M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1      254637 ns       254545 ns         2703 bytes_per_second=1.91825Gi/s items_per_second=128.732M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1      452659 ns       452356 ns         1550 bytes_per_second=1.07942Gi/s items_per_second=72.4385M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1    1241185 ns      1240526 ns          572 bytes_per_second=403.055Mi/s items_per_second=26.4146M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:2       351964 ns       351202 ns         1983 bytes_per_second=1.39031Gi/s items_per_second=93.3024M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:2      1076041 ns      1010477 ns          727 bytes_per_second=494.816Mi/s items_per_second=32.4282M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:2     4587149 ns      4584191 ns          152 bytes_per_second=109.071Mi/s items_per_second=7.14804M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:2    25132123 ns     25118536 ns           28 bytes_per_second=19.9056Mi/s items_per_second=1.30453M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:2  102196767 ns    102086286 ns            7 bytes_per_second=4.89782Mi/s items_per_second=320.983k/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:3       424279 ns       423836 ns         1619 bytes_per_second=1.15205Gi/s items_per_second=77.3129M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:3      1180743 ns      1157215 ns          610 bytes_per_second=432.072Mi/s items_per_second=28.3163M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:3     5738171 ns      5729356 ns          118 bytes_per_second=87.2698Mi/s items_per_second=5.71932M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:3    30695547 ns     30656783 ns           23 bytes_per_second=16.3096Mi/s items_per_second=1.06887M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:3  125980210 ns    125906333 ns            6 bytes_per_second=3.97121Mi/s items_per_second=260.257k/s

After:

----------------------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:0        24717 ns        24709 ns        28098 bytes_per_second=9.8808Gi/s items_per_second=1.32618G/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:0        24283 ns        24275 ns        28709 bytes_per_second=10.0574Gi/s items_per_second=1.34989G/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:0       24426 ns        24413 ns        28328 bytes_per_second=10.0006Gi/s items_per_second=1.34226G/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:0       24331 ns        24320 ns        28073 bytes_per_second=10.0385Gi/s items_per_second=1.34735G/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:0      26190 ns        25537 ns        28471 bytes_per_second=9.56034Gi/s items_per_second=1.28317G/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1       197105 ns       196813 ns         3587 bytes_per_second=2.48093Gi/s items_per_second=166.493M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1       207379 ns       207241 ns         3384 bytes_per_second=2.3561Gi/s items_per_second=158.115M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1      254707 ns       254565 ns         2777 bytes_per_second=1.9181Gi/s items_per_second=128.721M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1      439539 ns       439415 ns         1589 bytes_per_second=1.11121Gi/s items_per_second=74.5719M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1    1225127 ns      1224517 ns          576 bytes_per_second=408.324Mi/s items_per_second=26.7599M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:2       355268 ns       355082 ns         1985 bytes_per_second=1.37512Gi/s items_per_second=92.283M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:2       410702 ns       410486 ns         1703 bytes_per_second=1.18952Gi/s items_per_second=79.8273M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:2      667262 ns       667087 ns         1035 bytes_per_second=749.527Mi/s items_per_second=49.121M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:2     1087405 ns      1086757 ns          651 bytes_per_second=460.084Mi/s items_per_second=30.1521M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:2    2133928 ns      2132308 ns          331 bytes_per_second=234.488Mi/s items_per_second=15.3674M/s
BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:3       437444 ns       437048 ns         1592 bytes_per_second=1.11722Gi/s items_per_second=74.9757M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:3       505824 ns       505355 ns         1385 bytes_per_second=989.403Mi/s items_per_second=64.8415M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:3      756347 ns       755999 ns          906 bytes_per_second=661.377Mi/s items_per_second=43.344M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:3     1294590 ns      1293676 ns          546 bytes_per_second=386.496Mi/s items_per_second=25.3294M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:3    2552917 ns      2551420 ns          274 bytes_per_second=195.969Mi/s items_per_second=12.843M/s

zanmato1984 (Collaborator, Author) commented:
Hi friends, @pitrou @felipecrv @mapleFU , would you like to take a look? Thanks.

ARROW_ASSIGN_OR_RAISE(auto segment_exec_batch, batch.SelectValues(ids));
ExecSpan segment_batch(segment_exec_batch);

while (true) {
zanmato1984 (Author):
This is the only call-site in non-testing code.

}
// Assert next is the last (empty) segment.
zanmato1984 (Author):
The new API generates fewer segments than before: there is no meaningless (length == 0) trailing segment.

@@ -629,61 +622,47 @@ TEST(RowSegmenter, Basics) {
auto batch2 = ExecBatchFromJSON(types2, "[[1, 1], [1, 2], [2, 2]]");
auto batch1 = ExecBatchFromJSON(types1, "[[1], [1], [2]]");
ExecBatch batch0({}, 3);
{
zanmato1984 (Author):
The new API doesn't accept an offset argument, so there is no need for such testing.

ExecSpan span0(batch0);
TestSegments(segmenter, span0, {{0, 3, true, true}, {3, 0, true, true}});
zanmato1984 (Author):
Here and below change only because the new API doesn't generate the zero-length ending segment.

@@ -54,13 +55,8 @@ using group_id_t = std::remove_const<decltype(kNoGroupId)>::type;
using GroupIdType = CTypeTraits<group_id_t>::ArrowType;
auto g_group_id_type = std::make_shared<GroupIdType>();

inline const uint8_t* GetValuesAsBytes(const ArraySpan& data, int64_t offset = 0) {
zanmato1984 (Author):
Moved w/o touching the code inside.

@@ -102,21 +109,6 @@ Segment MakeSegment(int64_t batch_length, int64_t offset, int64_t length, bool e
return Segment{offset, length, offset + length >= batch_length, extends};
}

// Used by SimpleKeySegmenter::GetNextSegment to find the match-length of a value within a
// fixed-width buffer
int64_t GetMatchLength(const uint8_t* match_bytes, int64_t match_width,
zanmato1984 (Author):
Moved w/o touching the code inside.

@@ -147,21 +152,15 @@ struct SimpleKeySegmenter : public BaseRowSegmenter {
save_key_data_(static_cast<size_t>(key_type_.type->byte_width())),
extend_was_called_(false) {}

Status CheckType(const DataType& type) {
zanmato1984 (Author):
Moved w/o touching the code inside.

@@ -97,7 +97,12 @@ class ARROW_EXPORT RowSegmenter {
virtual Status Reset() = 0;

/// \brief Get the next segment for the given batch starting from the given offset
/// DEPRECATED: Due to its inefficiency, use GetSegments instead.
ARROW_DEPRECATED("Deprecated in 18.0.0. Use GetSegments instead.")
zanmato1984 (Author):
Deprecated instead of removed because I assume this is a public API (since grouper.h is a public header).

// Runs the grouper on a single row. This is used to determine the group id of the
// first row of a new segment to see if it extends the previous segment.
template <typename Batch>
Result<group_id_t> MapGroupIdAt(const Batch& batch, int64_t offset) {
zanmato1984 (Author):
Moved w/o touching the code inside.

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Sep 11, 2024
//

template <typename... Args>
static void BenchmarkRowSegmenter(benchmark::State& state, Args&&...) {
pitrou (Member):

Can we turn this into a more realistic GroupBy benchmark? For example:

  • M=0 to 1 non-segmented keys
  • N=1 to 2 segmented keys
  • K=key cardinality (2,8,64)

zanmato1984 (Author):

Sure, will do. And the name should include "GroupBy", for sure.

zanmato1984 (Author):

Updated.

Comment on lines 911 to 920
ArrayVector segments(num_segments);
for (int i = 0; i < num_segments; ++i) {
ASSERT_OK_AND_ASSIGN(
segments[i],
Constant(std::make_shared<Int64Scalar>(i))->Generate(num_rows / num_segments));
}
// Concat all segments to form the segment key.
ASSERT_OK_AND_ASSIGN(auto segment_key, Concatenate(segments));
// num_segment_keys copies of the segment key.
ArrayVector segment_keys(num_segment_keys, segment_key);
pitrou (Member):

Less trivially, for each segmented key you can 1) generate a random Int64 array of the right cardinality (using the min and max values), then 2) sort it to make it segmented. Then the multi-column segments will not trivially map to the individual key segments.

zanmato1984 (Author):

> Then the multi-column segments will not trivially map to the individual key segments.

Sorry, I don't follow this part :(

Could you elaborate a bit? Thank you.

pitrou (Member):

I mean that you have the exact same segment lengths in each segment key column. By using random generation the patterns would be slightly different.

zanmato1984 (Author):

Ah, you mean introducing some randomness into the individual segment lengths. That sounds good, given that this particular benchmark needs to be more realistic.

Will do. Thank you for elaborating.

zanmato1984 (Author):

Done.

std::vector<std::vector<int64_t>> row_segmenter_args = {
{32 * 1024}, benchmark::CreateRange(1, 256, 4), benchmark::CreateDenseRange(0, 3, 1)};
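// For reference: benchmark::CreateRange(1, 256, /*multi=*/4) expands to
// {1, 4, 16, 64, 256} and benchmark::CreateDenseRange(0, 3, 1) to {0, 1, 2, 3},
// matching the Segments and SegmentKeys values in the benchmark results above.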

BENCHMARK(BenchmarkRowSegmenter)
pitrou (Member):

This is a GroupBy benchmark, so should probably have "GroupBy" in its name like other benchmarks?

zanmato1984 (Author):

Done.


BenchmarkGroupBy(state, {{"count", ""}}, {arg}, /*keys=*/{}, segment_keys);

state.SetItemsProcessed(num_rows * state.iterations());
pitrou (Member):

Why don't we call SetItemsProcessed in BenchmarkGroupBy instead? It should probably be relevant for the other GroupBy benchmarks.

zanmato1984 (Author):

I guess they should.

Moved into BenchmarkGroupBy. The other benchmarks sharing it will report items_per_second metrics now.
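As a sketch of the idea (hypothetical helper body, not the Arrow code), moving the call into the shared helper means every benchmark routed through it reports the counter:

#include <benchmark/benchmark.h>

#include <cstdint>

// Shared helper: a single SetItemsProcessed call here gives every benchmark
// that goes through it an items_per_second counter.
static void BenchmarkGroupBySketch(benchmark::State& state, int64_t num_rows) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(num_rows);  // the aggregation itself would run here
  }
  state.SetItemsProcessed(num_rows * state.iterations());
}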

Comment on lines 39 to 40
using arrow::Concatenate;
using arrow::ConstantArrayGenerator;
pitrou (Member):

These look unnecessary?

zanmato1984 (Author):

Removed.

}

ARROW_DEPRECATED("Deprecated in 18.0.0. Use GetSegments instead.")
pitrou (Member):

It's already in the declaration, is it useful to duplicate the deprecation here?

zanmato1984 (Author):

I guess it's not that useful. Removed.

pitrou (Member) commented Sep 11, 2024

I'm curious, is the SimpleKeySegmenter useful for performance?

zanmato1984 (Collaborator, Author) commented:

> I'm curious, is the SimpleKeySegmenter useful for performance?

If you are asking whether SimpleKeySegmenter performs better than AnyKeysSegmenter for a "simple key", then yes. The former only has to perform a linear search over the input data, as opposed to the latter, which uses a heavy hash table that introduces all sorts of overhead (memory allocation, insertion, lookup). In short, I think SimpleKeySegmenter is a worthwhile specialization for performance.
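To make that concrete, here is a minimal self-contained sketch of such a linear search (my own illustration, not Arrow's GetMatchLength):

#include <cstdint>
#include <cstring>

// Length of the first segment: the number of leading rows whose fixed-width
// key bytes equal those of row 0. A plain memcmp scan; no hash table, no
// allocation.
int64_t FirstSegmentLength(const uint8_t* values, int64_t width, int64_t num_rows) {
  if (num_rows == 0) return 0;
  int64_t i = 1;
  while (i < num_rows &&
         std::memcmp(values + i * width, values, static_cast<size_t>(width)) == 0) {
    ++i;
  }
  return i;
}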

I hacked the code and got the benchmark numbers.

SimpleKeySegmenter (copied from my previous comment):

BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1       197105 ns       196813 ns         3587 bytes_per_second=2.48093Gi/s items_per_second=166.493M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1       207379 ns       207241 ns         3384 bytes_per_second=2.3561Gi/s items_per_second=158.115M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1      254707 ns       254565 ns         2777 bytes_per_second=1.9181Gi/s items_per_second=128.721M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1      439539 ns       439415 ns         1589 bytes_per_second=1.11121Gi/s items_per_second=74.5719M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1    1225127 ns      1224517 ns          576 bytes_per_second=408.324Mi/s items_per_second=26.7599M/s

AnyKeysSegmenter (by hacking the code to use it even for a "simple key"):

BenchmarkRowSegmenter/Rows:32768/Segments:1/SegmentKeys:1       271931 ns       271563 ns         2571 bytes_per_second=1.79804Gi/s items_per_second=120.664M/s
BenchmarkRowSegmenter/Rows:32768/Segments:4/SegmentKeys:1       335488 ns       330342 ns         2196 bytes_per_second=1.47811Gi/s items_per_second=99.194M/s
BenchmarkRowSegmenter/Rows:32768/Segments:16/SegmentKeys:1      510618 ns       508622 ns         1387 bytes_per_second=983.048Mi/s items_per_second=64.425M/s
BenchmarkRowSegmenter/Rows:32768/Segments:64/SegmentKeys:1      847002 ns       846245 ns          827 bytes_per_second=590.845Mi/s items_per_second=38.7216M/s
BenchmarkRowSegmenter/Rows:32768/Segments:256/SegmentKeys:1    1745628 ns      1744707 ns          409 bytes_per_second=286.581Mi/s items_per_second=18.7814M/s

pitrou (Member) commented Sep 11, 2024

I see, thanks. Are segment keys used often? If so, it might be worth improving SimpleKeySegmenter and also making it more generic (there's no reason it should be limited to no-nulls fixed-width types, for example).

zanmato1984 (Collaborator, Author) commented:

> I see, thanks. Are segment keys used often? If so, it might be worth improving SimpleKeySegmenter and also making it more generic (there's no reason it should be limited to no-nulls fixed-width types, for example).

Not in traditional RDBMSes, but they could be essential in time-series scenarios. (I believe I saw the original author mention that somewhere, but I can't find it at the moment.)

I can help with enhancing the simple key case anyway.

pitrou (Member) commented Sep 11, 2024

That said, it's quite low priority.

Comment on lines +902 to +906
auto segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
std::sort(values, values + num_rows);
// num_segment_keys copies of the segment key.
ArrayVector segment_keys(num_segment_keys, segment_key);
pitrou (Member):

This is not much better than before, is it? I would expect something like:

Suggested change:

ArrayVector segment_keys(num_segment_keys);
for (auto& segment_key : segment_keys) {
  segment_key = rng.Int64(num_rows, /*min=*/0, /*max=*/num_segments - 1);
  int64_t* values = segment_key->data()->GetMutableValues<int64_t>(1);
  std::sort(values, values + num_rows);
}

zanmato1984 (Author):

Ah, I see your point. It's just that I want the number of segments to be exactly as specified. Combining independently random keys, even with the same distribution, would make the number of segments (potentially much) bigger.

pitrou (Member):

But that's also much more realistic, isn't it?

pitrou (Member):

That said, as you prefer.

zanmato1984 (Author):

> But that's also much more realistic, isn't it?

I think having the exact number of segments helps more with diagnosing performance issues than introducing yet another level of randomness would, so I'd rather keep it as is :)

}
BENCHMARK(CountGroupByIntsSegmentedByInts)
->ArgNames({"Keys", "SegmentKeys", "Segments"})
->ArgsProduct({{1, 2}, {0, 1, 2}, benchmark::CreateRange(1, 256, 8)});
pitrou (Member):

I've tried to use 1+ segment keys and 0 non-segment keys to get an idea of the purely-segmented performance, but I get:

Invalid: The provided function (hash_count) is a hash aggregate function.  Since there are no keys to group by, a scalar aggregate function was expected (normally these do not start with hash_)

It's a bit of a bummer, isn't it?

zanmato1984 (Author) commented Sep 17, 2024:

As my other comment explains, an aggregation without any group-by keys (the non-segment keys here) is a scalar aggregation and requires "count" rather than "hash_count".

This is a "group by aggregation" benchmark (the counter-part of the previous "CountScalar" one) so there has to be at least one group by key.

}

template <typename... Args>
static void CountScalarSegmentedByInts(benchmark::State& state, Args&&...) {
pitrou (Member):

So "CountScalar" is actually the case where Keys=0 but there are segment keys?
This is quite misleading.

zanmato1984 (Author):

Please let me explain a bit.

Though both are named "aggregation", most compute engines have two variants of it, depending on whether there are "group by" keys (take SQL select count(*) from t group by c vs. select count(*) from t, for instance). Acero calls them "scalar aggregation" and "group by aggregation", and they work with the aggregation functions "count/sum/..." and "hash_count/hash_sum/..." respectively. This is understandable because without a group by key, the aggregation just needs to hold one "scalar" value (e.g., the current count/sum/...) during the whole computation, whereas a group by key immediately requires structures like a hash table.

Segment keys, on the other hand, are orthogonal to group by keys and apply to both scalar and group by aggregations.

Back to your question: yes, "CountScalar" implies exactly that this is a "scalar aggregation" (i.e., without any group by keys) using the "count" function, and "SegmentedByInts" implies that there are potentially segment keys.

Hope this clears things up a bit.

pitrou (Member):

But select count(*) from t is not what this is doing. It's still doing a "group by", it's just not using a hash table for it.

zanmato1984 (Author):

You are right, what this benchmark does is not simply a select count(*) from t; that example was merely to explain the difference between a scalar agg and a group by agg. But that doesn't mean this benchmark is a group by, at least not in Acero.

Though I'm not entirely sure how a segmented non-group-by agg maps to a SQL basis, in Acero, specifying segment keys doesn't require a group-by's existence. One just specifies segment keys and group by keys independently within

class ARROW_ACERO_EXPORT AggregateNodeOptions : public ExecNodeOptions {
 public:
  /// \brief create an instance from values
  explicit AggregateNodeOptions(std::vector<Aggregate> aggregates,
                                std::vector<FieldRef> keys = {},
                                std::vector<FieldRef> segment_keys = {})
      : aggregates(std::move(aggregates)),
        keys(std::move(keys)),
        segment_keys(std::move(segment_keys)) {}

  // aggregations which will be applied to the targeted fields
  std::vector<Aggregate> aggregates;
  // keys by which aggregations will be grouped (optional)
  std::vector<FieldRef> keys;
  // keys by which aggregations will be segmented (optional)
  std::vector<FieldRef> segment_keys;
};
and calls an "aggregate" node. Plan parsing will generate a scalar agg or a group by agg depending on whether there are group by keys, and assigns the segment keys to whichever is generated.

So I would let the naming reflect the underlying implementation (scalar agg vs. group by agg).

(I understand the argument could be: a segment key still implies group by semantics. I agree, but that's a more end-to-end perspective and doesn't reflect how we handle the segment key.)
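For example, a sketch of the two variants using the options above (the column names are hypothetical and the Aggregate construction is my assumption about its shape, so treat this as illustrative):

// Scalar aggregation: "count" with no group by keys; segment keys still apply.
arrow::acero::AggregateNodeOptions scalar_agg(
    /*aggregates=*/{{"count", /*target=*/"x", /*name=*/"count(x)"}},
    /*keys=*/{},
    /*segment_keys=*/{"ts"});

// Group by aggregation: "hash_count" with at least one group by key.
arrow::acero::AggregateNodeOptions group_by_agg(
    /*aggregates=*/{{"hash_count", /*target=*/"x", /*name=*/"count(x)"}},
    /*keys=*/{"c"},
    /*segment_keys=*/{"ts"});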

int64_t num_keys = state.range(0);
ArrayVector keys(num_keys);
for (auto& key : keys) {
key = rng.Int64(num_rows, /*min=*/0, /*max=*/64);
pitrou (Member):

It's a bit weird that this doesn't use num_segments. For comparison purposes, I would expect a similar cardinality in segmented and non-segmented keys.

zanmato1984 (Author):

Generally the cardinalities of a segment key and a group by key are relatively independent, so I chose not to reuse the segment key's cardinality for the group by keys. If we want to make the group by cardinality variable, we can add another independent parameter.

But my concern is that the performance effects of segment keys and group by keys are also independent; that is why this benchmark uses a fixed group by key cardinality. What do you think?

pitrou (Member):

The point here is to compare performance of segmented vs. non-segmented keys, right?

zanmato1984 (Author):

These two particular benchmarks show how the number of segment keys and the number of segments perform under a predefined group by setup. In other words, the group by portion of the benchmark is merely there to make it more realistic. They are not designed for the comparison you suggested.

IIUC, an apples-to-apples comparison might be: a segmented scalar agg (N segment keys and 0 group by keys) vs. a non-segmented group by agg (0 segment keys and N group by keys), with the same key distribution. That comparison would answer how a group by can be done more efficiently by taking advantage of the segmented nature of the key(s) (i.e., by using segmented agg). If you are suggesting such a comparison, then yes, that makes sense, and I think I can add a new benchmark for it.

What do you think? Thanks.

pitrou (Member):

OK, thanks for the explanation. A new benchmark can be added in a later PR if we want; this one is fine now that I understand the motivation.

pitrou (Member) left a review:

+1, thanks for this

@pitrou merged commit 3d6d581 into apache:main on Sep 18, 2024 (41 checks passed).
@pitrou removed the "awaiting committer review" label Sep 18, 2024
@github-actions bot added the "awaiting committer review" label Sep 18, 2024
Conbench bot commented:
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3d6d581.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 255 possible false positives for unstable benchmarks that are known to sometimes produce them.
