Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix empty cluster handling in tdigest merge #16675

Merged

Conversation

jihoonson
Copy link
Contributor

Description

This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug exists in merge_tdigests() as it assumes that there is no empty cluster in the merge stage even when there are (has_nulls are fixed to false). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also, has_nulls has been renamed to a more descriptive name, may_have_empty_clusters.

The tdigest reduce does not have the same issue as it does not call merge_tdigests().

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Aug 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 28, 2024
Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits. Looks good otherwise. Not too sure about renaming has_nulls to may_have_empty_clusters. Maybe @davidwendt can weigh in?

cpp/include/cudf/detail/tdigest/tdigest.hpp Outdated Show resolved Hide resolved
cpp/src/quantiles/tdigest/tdigest_aggregation.cu Outdated Show resolved Hide resolved
cpp/src/quantiles/tdigest/tdigest_aggregation.cu Outdated Show resolved Hide resolved
@mhaseeb123 mhaseeb123 added bug Something isn't working 3 - Ready for Review Ready for review by team breaking Breaking change labels Aug 30, 2024
@mhaseeb123
Copy link
Member

/ok to test

@mhaseeb123
Copy link
Member

/ok to test

cpp/tests/groupby/tdigest_tests.cu Outdated Show resolved Hide resolved
cpp/tests/groupby/tdigest_tests.cu Outdated Show resolved Hide resolved
cpp/tests/groupby/tdigest_tests.cu Show resolved Hide resolved
Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>
@mhaseeb123
Copy link
Member

/ok to test

@mhaseeb123
Copy link
Member

/ok to test

Comment on lines +1156 to +1160
auto merged_weights = merged->get_column(1).view();
// If there are no values, we can simply return a column that has only empty tdigests.
if (merged_weights.size() == 0) {
return cudf::tdigest::detail::make_tdigest_column_of_empty_clusters(num_groups, stream, mr);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be just auto const merged_weights_size = merged->get_column(1).size();. No need to get a view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not even needed to create a temp variable and it can directly be used in if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merged_weights variable is an existing variable and used in other places as well. I just have moved it from its original line. That being said, I can remove it if it's a problem. I thought creating a column view is cheap. Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A column_view is not a plain type. It has several internal structures such as vector, pointer, counter etc. Creating an instance will require copying/initializing all these fields. That can be "cheap" but still incur some overhead thus we should avoid doing so if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ttnghia. Your explanation aligns with my understanding. In that case, I think this is OK because the shortcut in this if clause will be used only when every key has the empty cluster. This is quite a rare case and won't happen often in production.

Comment on lines +1177 to +1179
// We do not know whether there is any empty cluster in the input without actually reading the
// data, which could be expensive. So, we just assume that there could be empty clusters.
auto const may_have_empty_clusters = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we would never change it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, this flag was fixed to false, which implies that empty clusters should never be found during the merge. This is not always true, which is the root cause of the bug. It is rather always true to assume that there might be some clusters until we inspect all clusters, which is expensive. Having this flag fixed to true might be always not optimal when there is indeed no empty cluster. We can make this better later for this case if that turns out to be a problem.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 4, 2024

I'm also confused why has_nulls is not used to track nulls?

@jihoonson
Copy link
Contributor Author

I'm also confused why has_nulls is not used to track nulls?

has_nulls is misleading. It is a flag indicating whether the input column to the tdigest aggregation has nulls. In the tdigest compute, when a null value is found, we create an empty cluster as a place holder for that value since tdigest doesn't accept nulls. These empty clusters are read and processed in the tdigest merge. So, in the tdigest compute, the has_nulls flag can be set to true if your input data has nulls. However, it is invalid to set it to true for the merge since the input column to the merge cannot have nulls. I suppose this was the reason why it was fixed to false previously.

The has_nulls is used to be used mainly in generate_cluster_limits_kernel and build_output_column. The former creates the empty cluster when it has to (when the total weight of the cluster is 0), and the latter wipes them out if necessary as a post-processing. Note that we already know the total weight of each cluster in generate_cluster_limits_kernel. In other words, we already know which cluster is empty when that function is called. The logic to create and process the empty clusters remains the same in both the compute and the merge, and is shared in those. Since has_null is a valid name only for the compute, I think may_have_empty_clusters is a better name.

Someone might ask if we need the may_have_empty_clusters flag at all. I am not 100% sure, but think we may be able to get rid of it. However, I have decided to keep it for now to minimize the code change in this PR. We are planning to improve this whole logic to compute tdigest aggregation anyway.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me. I'm a little bit lukewarm on the name change from 'make_empty_tdigest_column()' to 'make_tdigest_column_of_empty_clusters', Admittedly, 'make_empty_tdigest_column()' is a bit semantically different from how 'empty' is used in other column type factories. I'm ok with it.

@@ -145,16 +145,17 @@ std::unique_ptr<column> make_tdigest_column(size_type num_rows,
/**
* @brief Create an empty tdigest column.
*
* An empty tdigest column contains a single row of length 0
* An empty tdigest column contains specified number of rows of length 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* An empty tdigest column contains specified number of rows of length 0.
* An empty tdigest column contains the specified number of rows of length 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I fixed this doc to further clarify it.

@jihoonson
Copy link
Contributor Author

This seems good to me. I'm a little bit lukewarm on the name change from 'make_empty_tdigest_column()' to 'make_tdigest_column_of_empty_clusters', Admittedly, 'make_empty_tdigest_column()' is a bit semantically different from how 'empty' is used in other column type factories. I'm ok with it.

Thanks @nvdbaranec for the review. I think it is important to use the terms consistently over entire doc unless the context is very clear. Because it should be understandable to anyone who is even not familiar with the code. I was able to interpret relatively easily what "empty" means in the context of this function, but it was confusing until I look at the code implementation. It could be even more difficult for others to understand if they are not familiar enough with tdigest. I think people rather should be able to have a rough understanding of the function behavior by reading its doc without reading actual implementation. A downside of make_tdigest_column_of_empty_clusters I see is that it feels a bit verbose. I think it is better than a confusing name.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 10, 2024

/ok to test

@ttnghia
Copy link
Contributor

ttnghia commented Sep 10, 2024

/merge

@rapids-bot rapids-bot bot merged commit 5192b88 into rapidsai:branch-24.10 Sep 10, 2024
97 checks passed
jihoonson added a commit to jihoonson/cudf that referenced this pull request Sep 11, 2024
jihoonson added a commit to jihoonson/cudf that referenced this pull request Sep 11, 2024
rapids-bot bot pushed a commit that referenced this pull request Sep 12, 2024
This PR reverts #16675, which has introduced another bug. Our nightly CI pipeline is failing because of this bug (NVIDIA/spark-rapids#11463). I can reproduce the bug within a libcudf unit test. I will make another PR to fix both the original issue and the new bug.

Authors:
  - Jihoon Son (https://github.com/jihoonson)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #16800
jihoonson added a commit to jihoonson/cudf that referenced this pull request Sep 24, 2024
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.

The bug exists in `merge_tdigests()` as it assumes that there is no empty cluster in the merge stage even when there are (`has_nulls` are fixed to `false`). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also, `has_nulls` has been renamed to a more descriptive name, `may_have_empty_clusters`. 

The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`.

Authors:
  - Jihoon Son (https://github.com/jihoonson)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - https://github.com/nvdbaranec

URL: rapidsai#16675
rapids-bot bot pushed a commit that referenced this pull request Oct 1, 2024
…est merge (#16897)

Fixes #16881. This is a new attempt to fix it.

Previously in #16675, I flipped the `has_nulls` flag to true as I thought that empty clusters should be explicitly stored in the offsets and handled properly. It turns out that it was not a good idea. After a long debugging process, I am convinced now that the existing logic is valid and should work well except for one case, where all input tdigests to the tdigest merge are empty. So, I have decided to add a [shortcut to handle that particular edge case](https://github.com/rapidsai/cudf/pull/16897/files#diff-c03df2b421f7a51b28007d575fd32ba2530970351ba7e7e0f7fad8057350870cR1349-R1354) in `group_merge_tdigest()` in this PR. This shortcut is executed only when all clusters are empty in all groups. This PR does not change any other logic. Other changes in this PR are:

- New unit tests to cover the edge case.
- `make_empty_tdigest_column` has been renamed to `make_tdigest_column_of_empty_clusters` and expanded to take `num_rows`.
- Some new documentation based on my understanding for the `merge_tdigests()` function.

Before making this PR, I have run the integration tests of the spark-rapids that were previously reported in NVIDIA/spark-rapids#11463 that my first attempt had caused them failing. They have all passed with this PR change.

Authors:
  - Jihoon Son (https://github.com/jihoonson)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - https://github.com/nvdbaranec

URL: #16897
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants