Fix empty cluster handling in tdigest merge #16675

jihoonson · 2024-08-28T00:00:57Z

Description

This PR fixes an edge-case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the merge currently collapses the value column into a single empty cluster. Creating the result table then fails because the key and value columns have different numbers of rows. The bug reproduces only when all values are empty clusters; if some values are empty and others are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug is in merge_tdigests(), which assumes that there are no empty clusters in the merge stage even when there are (has_nulls is fixed to false). It is safer to assume that there may always be empty clusters, so this PR fixes the flag to true instead. has_nulls has also been renamed to the more descriptive may_have_empty_clusters.

The tdigest reduce does not have the same issue as it does not call merge_tdigests().
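
The failure mode described above can be pictured with a small toy model. This is only a sketch and not libcudf code: `ToyDigest`, `merge_collapsing`, and `rows_match` are hypothetical names invented for illustration, and a digest row is reduced to a plain vector of centroid weights.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tdigest row: its list of centroid weights.
// An empty vector plays the role of an "empty cluster" row.
using ToyDigest = std::vector<double>;

// Toy model of the reported behaviour, not the libcudf implementation:
// if every per-key digest is empty, the merged value column collapses to a
// single empty row; otherwise one row per key is kept.
std::vector<ToyDigest> merge_collapsing(std::vector<ToyDigest> const& per_key)
{
  assert(!per_key.empty());  // the toy only models inputs with at least one key
  bool all_empty = true;
  for (auto const& digest : per_key) {
    if (!digest.empty()) { all_empty = false; }
  }
  if (all_empty) { return std::vector<ToyDigest>(1); }  // one row, whatever the key count
  return per_key;
}

// A result table needs the same number of rows in every column.
bool rows_match(std::size_t num_keys, std::vector<ToyDigest> const& values)
{
  return num_keys == values.size();
}
```

With three keys whose digests are all empty, `merge_collapsing` returns one row and `rows_match(3, ...)` is false, which corresponds to the column size mismatch. As soon as one digest is non-empty, the row count is preserved.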

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-08-28T00:01:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mhaseeb123

Minor nits. Looks good otherwise. Not too sure about renaming has_nulls to may_have_empty_clusters. Maybe @davidwendt can weigh in?

cpp/include/cudf/detail/tdigest/tdigest.hpp

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

mhaseeb123 · 2024-08-30T20:25:31Z

/ok to test

mhaseeb123 · 2024-08-30T20:26:55Z

/ok to test

cpp/tests/groupby/tdigest_tests.cu

…nson/cudf into fix-empty-tdigest-cluster-handling

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>

mhaseeb123 · 2024-09-03T18:47:15Z

/ok to test

mhaseeb123 · 2024-09-03T18:47:43Z

/ok to test

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

ttnghia · 2024-09-04T05:56:11Z

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

+  auto merged_weights = merged->get_column(1).view();
+  // If there are no values, we can simply return a column that has only empty tdigests.
+  if (merged_weights.size() == 0) {
+    return cudf::tdigest::detail::make_tdigest_column_of_empty_clusters(num_groups, stream, mr);
+  }


Should be just auto const merged_weights_size = merged->get_column(1).size();. No need to get a view.

There is not even a need to create a temp variable; the size can be used directly in the if condition.

The merged_weights variable already existed and is used in other places as well. I have just moved it from its original line. That being said, I can remove it if it's a problem. I thought creating a column view is cheap. Am I missing something?

A column_view is not a plain type. It has several internal structures such as vector, pointer, counter etc. Creating an instance requires copying/initializing all these fields. That can be "cheap" but still incurs some overhead, so we should avoid doing so if possible.

Thanks @ttnghia. Your explanation aligns with my understanding. In that case, I think this is OK because the shortcut in this if clause will be used only when every key has an empty cluster. This is quite a rare case and won't happen often in production.

ttnghia · 2024-09-04T05:57:11Z

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

+  // We do not know whether there is any empty cluster in the input without actually reading the
+  // data, which could be expensive. So, we just assume that there could be empty clusters.
+  auto const may_have_empty_clusters = true;


So we would never change it?

Before this PR, this flag was fixed to false, which implies that empty clusters should never be found during the merge. This is not always true, which is the root cause of the bug. It is always safe, however, to assume that there might be some empty clusters unless we inspect all clusters, which is expensive. Keeping this flag fixed to true may be suboptimal when there is in fact no empty cluster. We can improve this case later if it turns out to be a problem.

ttnghia · 2024-09-04T05:58:19Z

I'm also confused why has_nulls is not used to track nulls?

…y-tdigest-cluster-handling

jihoonson · 2024-09-09T18:13:37Z

I'm also confused why has_nulls is not used to track nulls?

has_nulls is misleading. It is a flag indicating whether the input column to the tdigest aggregation has nulls. In the tdigest compute, when a null value is found, we create an empty cluster as a placeholder for that value since tdigest doesn't accept nulls. These empty clusters are read and processed in the tdigest merge. So, in the tdigest compute, the has_nulls flag can be set to true if your input data has nulls. However, it is invalid to set it to true for the merge since the input column to the merge cannot have nulls. I suppose this was the reason why it was fixed to false previously.

has_nulls used to be used mainly in generate_cluster_limits_kernel and build_output_column. The former creates an empty cluster when it has to (when the total weight of the cluster is 0), and the latter wipes them out if necessary as a post-processing step. Note that we already know the total weight of each cluster in generate_cluster_limits_kernel. In other words, we already know which clusters are empty when that function is called. The logic to create and process the empty clusters is the same in both the compute and the merge, and is shared between them. Since has_nulls is a valid name only for the compute, I think may_have_empty_clusters is a better name.

Someone might ask if we need the may_have_empty_clusters flag at all. I am not 100% sure, but I think we may be able to get rid of it. However, I have decided to keep it for now to minimize the code change in this PR. We are planning to improve the whole logic for computing the tdigest aggregation anyway.
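
The placeholder mechanism described in this comment can be sketched in plain C++. This is a heavily simplified toy under stated assumptions, not the libcudf kernels: `ToyValue`, `ToyCluster`, `compute_clusters`, and `build_output` are hypothetical names, and each input value becomes its own cluster, which a real tdigest would not do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input element: a value plus a validity bit.
struct ToyValue {
  double value;
  bool is_valid;
};

// Hypothetical cluster: a mean and a total weight. Weight 0 marks an empty cluster.
struct ToyCluster {
  double mean;
  double weight;
};

// "Compute" step of the toy: every input element becomes one cluster, and a
// null becomes a zero-weight placeholder because a tdigest cannot hold nulls.
std::vector<ToyCluster> compute_clusters(std::vector<ToyValue> const& input)
{
  std::vector<ToyCluster> out;
  out.reserve(input.size());
  for (auto const& v : input) {
    if (v.is_valid) {
      out.push_back(ToyCluster{v.value, 1.0});
    } else {
      out.push_back(ToyCluster{0.0, 0.0});  // empty-cluster placeholder
    }
  }
  return out;
}

// Post-processing step of the toy: placeholders are dropped only when the
// flag says they may exist; otherwise the input is passed through untouched.
std::vector<ToyCluster> build_output(std::vector<ToyCluster> const& clusters,
                                     bool may_have_empty_clusters)
{
  if (!may_have_empty_clusters) { return clusters; }
  std::vector<ToyCluster> out;
  for (auto const& c : clusters) {
    assert(c.weight >= 0.0);  // weights are never negative in this model
    if (c.weight > 0.0) { out.push_back(c); }
  }
  return out;
}
```

In this toy, passing `false` for `may_have_empty_clusters` while placeholders are present leaves them in the output, which is the kind of mismatch between the flag and the data that the discussion is about.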

nvdbaranec

This seems good to me. I'm a little bit lukewarm on the name change from 'make_empty_tdigest_column()' to 'make_tdigest_column_of_empty_clusters'. Admittedly, 'make_empty_tdigest_column()' is a bit semantically different from how 'empty' is used in other column type factories. I'm ok with it.

nvdbaranec · 2024-09-10T15:08:11Z

cpp/include/cudf/detail/tdigest/tdigest.hpp

@@ -145,16 +145,17 @@ std::unique_ptr<column> make_tdigest_column(size_type num_rows,
 /**
 * @brief Create an empty tdigest column.
 *
- * An empty tdigest column contains a single row of length 0
+ * An empty tdigest column contains specified number of rows of length 0.


Suggested change

* An empty tdigest column contains specified number of rows of length 0.

* An empty tdigest column contains the specified number of rows of length 0.

Thanks. I fixed this doc to further clarify it.

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

…y-tdigest-cluster-handling

jihoonson · 2024-09-10T17:47:19Z

This seems good to me. I'm a little bit lukewarm on the name change from 'make_empty_tdigest_column()' to 'make_tdigest_column_of_empty_clusters'. Admittedly, 'make_empty_tdigest_column()' is a bit semantically different from how 'empty' is used in other column type factories. I'm ok with it.

Thanks @nvdbaranec for the review. I think it is important to use terms consistently throughout the docs unless the context is very clear, because the docs should be understandable even to someone who is not familiar with the code. I was able to interpret what "empty" means in the context of this function relatively easily, but it was confusing until I looked at the implementation. It could be even more difficult for others to understand if they are not familiar enough with tdigest. I think people should be able to get a rough understanding of a function's behavior by reading its doc without reading the actual implementation. A downside of make_tdigest_column_of_empty_clusters is that it feels a bit verbose, but I think that is better than a confusing name.

ttnghia · 2024-09-10T18:38:22Z

/ok to test

ttnghia · 2024-09-10T18:38:50Z

/merge

This reverts commit 5192b88.

This PR reverts #16675, which has introduced another bug. Our nightly CI pipeline is failing because of this bug (NVIDIA/spark-rapids#11463). I can reproduce the bug within a libcudf unit test. I will make another PR to fix both the original issue and the new bug.

Authors:
- Jihoon Son (https://github.com/jihoonson)

Approvers:
- Alessandro Bellina (https://github.com/abellina)
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #16800

This PR fixes an edge-case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the merge currently collapses the value column into a single empty cluster. Creating the result table then fails because the key and value columns have different numbers of rows. The bug reproduces only when all values are empty clusters; if some values are empty and others are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug is in `merge_tdigests()`, which assumes that there are no empty clusters in the merge stage even when there are (`has_nulls` is fixed to `false`). It is safer to assume that there may always be empty clusters, so this PR fixes the flag to true instead. `has_nulls` has also been renamed to the more descriptive `may_have_empty_clusters`.

The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`.

Authors:
- Jihoon Son (https://github.com/jihoonson)
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- https://github.com/nvdbaranec

URL: rapidsai#16675

…est merge (#16897)

Fixes #16881.

This is a new attempt to fix it. Previously in #16675, I flipped the `has_nulls` flag to true as I thought that empty clusters should be explicitly stored in the offsets and handled properly. It turned out that this was not a good idea. After a long debugging process, I am now convinced that the existing logic is valid and should work well except for one case, where all input tdigests to the tdigest merge are empty. So, I have decided to add a [shortcut to handle that particular edge case](https://github.com/rapidsai/cudf/pull/16897/files#diff-c03df2b421f7a51b28007d575fd32ba2530970351ba7e7e0f7fad8057350870cR1349-R1354) in `group_merge_tdigest()` in this PR. This shortcut is executed only when all clusters are empty in all groups. This PR does not change any other logic.

Other changes in this PR are:
- New unit tests to cover the edge case.
- `make_empty_tdigest_column` has been renamed to `make_tdigest_column_of_empty_clusters` and expanded to take `num_rows`.
- Some new documentation based on my understanding for the `merge_tdigests()` function.

Before making this PR, I ran the spark-rapids integration tests that my first attempt had caused to fail, as reported in NVIDIA/spark-rapids#11463. They all passed with the changes in this PR.

Authors:
- Jihoon Son (https://github.com/jihoonson)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- https://github.com/nvdbaranec

URL: #16897
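
The shape of the shortcut added to `group_merge_tdigest()` can be sketched in plain C++. This is only an illustration under assumptions, not the libcudf code: `ToyDigest`, `toy_column_of_empty_clusters`, and `toy_group_merge` are hypothetical names, and the regular merge path is replaced by a simple concatenation placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tdigest row: parallel lists of means and weights.
struct ToyDigest {
  std::vector<double> means;
  std::vector<double> weights;
};

// Hypothetical analogue of a factory that returns num_rows empty digests.
std::vector<ToyDigest> toy_column_of_empty_clusters(std::size_t num_rows)
{
  return std::vector<ToyDigest>(num_rows);
}

// Toy grouped merge: groups[g] holds the input digests for group g.
std::vector<ToyDigest> toy_group_merge(std::vector<std::vector<ToyDigest>> const& groups)
{
  std::size_t total_centroids = 0;
  for (auto const& group : groups) {
    for (auto const& d : group) {
      assert(d.means.size() == d.weights.size());  // means and weights are parallel
      total_centroids += d.weights.size();
    }
  }
  // Shortcut: nothing to merge in any group, so return one empty digest per group.
  if (total_centroids == 0) { return toy_column_of_empty_clusters(groups.size()); }

  // Placeholder for the regular path: concatenate each group's centroids.
  // A real merge would re-cluster them instead.
  std::vector<ToyDigest> out(groups.size());
  for (std::size_t g = 0; g < groups.size(); ++g) {
    for (auto const& d : groups[g]) {
      out[g].means.insert(out[g].means.end(), d.means.begin(), d.means.end());
      out[g].weights.insert(out[g].weights.end(), d.weights.begin(), d.weights.end());
    }
  }
  return out;
}
```

The point of the sketch is the row count: when every input digest is empty, the result still has one row per group, so it lines up with the key column.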

jihoonson added 3 commits August 27, 2024 16:48

Fix tdigest merge when all clusters are empty

2cb240d

fix nullability of empty cluster tdigest column

99521e1

rename util function that creates a column of empty clusters

c4d968b

jihoonson requested a review from a team as a code owner August 28, 2024 00:00

jihoonson requested review from kingcrimsontianyu and mhaseeb123 August 28, 2024 00:00

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code.) Aug 28, 2024

mhaseeb123 reviewed Aug 30, 2024

cpp/include/cudf/detail/tdigest/tdigest.hpp

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

mhaseeb123 added the bug (Something isn't working), 3 - Ready for Review (Ready for review by team), and breaking (Breaking change) labels Aug 30, 2024

Merge branch 'branch-24.10' into fix-empty-tdigest-cluster-handling

6f53dd2

mhaseeb123 reviewed Aug 30, 2024

cpp/tests/groupby/tdigest_tests.cu

cpp/tests/groupby/tdigest_tests.cu

cpp/tests/groupby/tdigest_tests.cu

jihoonson added 2 commits September 3, 2024 10:59

Improve docs; add consts

47fcf0d

Merge branch 'fix-empty-tdigest-cluster-handling' of github.com:jihoo…

18b6fd5

…nson/cudf into fix-empty-tdigest-cluster-handling

mhaseeb123 reviewed Sep 3, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

Update cpp/src/quantiles/tdigest/tdigest_aggregation.cu

ceaf17b

Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>

mhaseeb123 approved these changes Sep 3, 2024

Merge branch 'branch-24.10' into fix-empty-tdigest-cluster-handling

79cb1c3

ttnghia reviewed Sep 4, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

ttnghia reviewed Sep 4, 2024

jihoonson added 2 commits September 9, 2024 10:29

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into fix-empt…

0ea092e

…y-tdigest-cluster-handling

fix stale doc

aa063cd

revans2 mentioned this pull request Sep 10, 2024

Optimization of tdigest merge aggregation. #16780

Merged

3 tasks

nvdbaranec approved these changes Sep 10, 2024

jihoonson added 2 commits September 10, 2024 10:29

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into fix-empt…

ffa7997

…y-tdigest-cluster-handling

fix docs

3492b9f

rapids-bot bot merged commit 5192b88 into rapidsai:branch-24.10 Sep 10, 2024
97 checks passed

pxLi mentioned this pull request Sep 11, 2024

[BUG] hash_groupby_approx_percentile failed assert is None NVIDIA/spark-rapids#11463

Closed

jihoonson mentioned this pull request Sep 11, 2024

[BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column NVIDIA/spark-rapids#11367

Closed

jihoonson added a commit to jihoonson/cudf that referenced this pull request Sep 11, 2024

Revert "Fix empty cluster handling in tdigest merge (rapidsai#16675)"

4fb48f9

This reverts commit 5192b88.

jihoonson mentioned this pull request Sep 11, 2024

Revert "Fix empty cluster handling in tdigest merge (#16675)" #16800

Merged

3 tasks

jihoonson added a commit to jihoonson/cudf that referenced this pull request Sep 11, 2024

Revert "Fix empty cluster handling in tdigest merge (rapidsai#16675)"

fda9518

This reverts commit 5192b88.

This was referenced Sep 16, 2024

Revert "Skip test_hash_groupby_approx_percentile byte and double tests tempor…" NVIDIA/spark-rapids#11472

Closed

Revert "Skip test_hash_groupby_approx_percentile byte and double test… NVIDIA/spark-rapids#11473

Merged

jihoonson mentioned this pull request Sep 24, 2024

Add a shortcut for when the input clusters are all empty for the tdigest merge #16897

Merged

3 tasks
