-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix empty cluster handling in tdigest merge #16675
Fix empty cluster handling in tdigest merge #16675
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits. Looks good otherwise. Not too sure about renaming has_nulls
to may_have_empty_clusters
. Maybe @davidwendt can weigh in?
/ok to test |
/ok to test |
…nson/cudf into fix-empty-tdigest-cluster-handling
Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>
/ok to test |
/ok to test |
auto merged_weights = merged->get_column(1).view(); | ||
// If there are no values, we can simply return a column that has only empty tdigests. | ||
if (merged_weights.size() == 0) { | ||
return cudf::tdigest::detail::make_tdigest_column_of_empty_clusters(num_groups, stream, mr); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be just auto const merged_weights_size = merged->get_column(1).size();
. No need to get a view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not even needed to create a temp variable and it can directly be used in if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The merged_weights
variable is an existing variable and used in other places as well. I just have moved it from its original line. That being said, I can remove it if it's a problem. I thought creating a column view is cheap. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A column_view
is not a plain type. It has several internal structures such as vector, pointer, counter etc. Creating an instance will require copying/initializing all these fields. That can be "cheap" but still incur some overhead thus we should avoid doing so if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ttnghia. Your explanation aligns with my understanding. In that case, I think this is OK because the shortcut in this if
clause will be used only when every key has the empty cluster. This is quite a rare case and won't happen often in production.
// We do not know whether there is any empty cluster in the input without actually reading the | ||
// data, which could be expensive. So, we just assume that there could be empty clusters. | ||
auto const may_have_empty_clusters = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we would never change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, this flag was fixed to false, which implies that empty clusters should never be found during the merge. This is not always true, which is the root cause of the bug. It is rather always true to assume that there might be some clusters until we inspect all clusters, which is expensive. Having this flag fixed to true might be always not optimal when there is indeed no empty cluster. We can make this better later for this case if that turns out to be a problem.
I'm also confused why |
…y-tdigest-cluster-handling
The Someone might ask if we need the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me. I'm a little bit lukewarm on the name change from 'make_empty_tdigest_column()' to 'make_tdigest_column_of_empty_clusters', Admittedly, 'make_empty_tdigest_column()' is a bit semantically different from how 'empty' is used in other column type factories. I'm ok with it.
@@ -145,16 +145,17 @@ std::unique_ptr<column> make_tdigest_column(size_type num_rows, | |||
/** | |||
* @brief Create an empty tdigest column. | |||
* | |||
* An empty tdigest column contains a single row of length 0 | |||
* An empty tdigest column contains specified number of rows of length 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* An empty tdigest column contains specified number of rows of length 0. | |
* An empty tdigest column contains the specified number of rows of length 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I fixed this doc to further clarify it.
…y-tdigest-cluster-handling
Thanks @nvdbaranec for the review. I think it is important to use the terms consistently over entire doc unless the context is very clear. Because it should be understandable to anyone who is even not familiar with the code. I was able to interpret relatively easily what "empty" means in the context of this function, but it was confusing until I look at the code implementation. It could be even more difficult for others to understand if they are not familiar enough with tdigest. I think people rather should be able to have a rough understanding of the function behavior by reading its doc without reading actual implementation. A downside of |
/ok to test |
/merge |
This reverts commit 5192b88.
This reverts commit 5192b88.
This PR reverts #16675, which has introduced another bug. Our nightly CI pipeline is failing because of this bug (NVIDIA/spark-rapids#11463). I can reproduce the bug within a libcudf unit test. I will make another PR to fix both the original issue and the new bug. Authors: - Jihoon Son (https://github.com/jihoonson) Approvers: - Alessandro Bellina (https://github.com/abellina) - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #16800
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367. The bug exists in `merge_tdigests()` as it assumes that there is no empty cluster in the merge stage even when there are (`has_nulls` are fixed to `false`). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also, `has_nulls` has been renamed to a more descriptive name, `may_have_empty_clusters`. The tdigest reduce does not have the same issue as it does not call `merge_tdigests()`. Authors: - Jihoon Son (https://github.com/jihoonson) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - https://github.com/nvdbaranec URL: rapidsai#16675
…est merge (#16897) Fixes #16881. This is a new attempt to fix it. Previously in #16675, I flipped the `has_nulls` flag to true as I thought that empty clusters should be explicitly stored in the offsets and handled properly. It turns out that it was not a good idea. After a long debugging process, I am convinced now that the existing logic is valid and should work well except for one case, where all input tdigests to the tdigest merge are empty. So, I have decided to add a [shortcut to handle that particular edge case](https://github.com/rapidsai/cudf/pull/16897/files#diff-c03df2b421f7a51b28007d575fd32ba2530970351ba7e7e0f7fad8057350870cR1349-R1354) in `group_merge_tdigest()` in this PR. This shortcut is executed only when all clusters are empty in all groups. This PR does not change any other logic. Other changes in this PR are: - New unit tests to cover the edge case. - `make_empty_tdigest_column` has been renamed to `make_tdigest_column_of_empty_clusters` and expanded to take `num_rows`. - Some new documentation based on my understanding for the `merge_tdigests()` function. Before making this PR, I have run the integration tests of the spark-rapids that were previously reported in NVIDIA/spark-rapids#11463 that my first attempt had caused them failing. They have all passed with this PR change. Authors: - Jihoon Son (https://github.com/jihoonson) - Yunsong Wang (https://github.com/PointKernel) Approvers: - https://github.com/nvdbaranec URL: #16897
Description
This PR fixes an edge case bug in the tdigest merge. When there are multiple distinct keys but all values are empty clusters, the value column is currently merged into a single empty cluster after merge, which leads to an error while creating a result table because of the mismatching number of rows in the key and value columns. This bug can be reproduced only when all values are empty clusters. If some values are empty but some are not, the current implementation returns a valid result. This bug was originally reported in NVIDIA/spark-rapids#11367.
The bug exists in
merge_tdigests()
as it assumes that there is no empty cluster in the merge stage even when there are (has_nulls
are fixed tofalse
). It is rather safe to assume that always there could be empty clusters. This PR fixes the flag by fixing it to true. Also,has_nulls
has been renamed to a more descriptive name,may_have_empty_clusters
.The tdigest reduce does not have the same issue as it does not call
merge_tdigests()
.Checklist