-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully support lexicographic comparison for nested types in table self-comparator #12879
Conversation
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
This reverts commit 3609edf.
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
I ran a benchmark comparing sorting lists columns (nested lists-of-lists, or LoL) and sorting lists-of-structs columns (LoS). These LoS columns are generated by just putting the children of the LoL columns under a structs column. So under the hood, these LoL and LoS columns contain the same information to compare. The benchmarks show very interesting results:
In summary:
I've filed the issue to leverage this in the future: #12932. |
This reverts commit 1cf20a2.
@vyasr I think I found what's wrong with structs-of-lists: For comparing lists, we need to generate dremel data. However:
So in order to fix this completely, we need both to generate+consider dremel data for nested lists columns as well as fixing the issue with type dispatcher above. I'm not sure about the complexity of such work, so it should be done in a separate follow up PR. Update: I've successfully fixed the dremel issue above. However, that may require changing to the function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting this partial review. I will review the rest at a later time.
* @param flattened_input_aux_data The data structure generated from | ||
* `cudf::structs::detail::flatten_nested_columns` that contains additional information used for | ||
* row comparisons. | ||
* @param transformed_structs_columns The intermediate columns resulted from transforming child |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can name this in a way that indicates what the transformation does. For example, list_of_structs_ranked_columns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rename it to structs_ranked_columns
.
|
||
// Auxiliary data generated from `cudf::structs::detail::flatten_nested_columns` | ||
// that needs to be kept alive. | ||
std::unique_ptr<cudf::structs::detail::flattened_table> _flattened_input_aux_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be a unique_ptr
when all the other owned members are not? Should this be an optional
? Or just a plain member?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is as discussed in #12878 (comment). In particular, using smart pointer allows to forward-declare the class flattened_table
which in turn avoids including the relevant header into this already-huge header.
|
||
// Auxiliary data generated from transforming lists-of-structs into lists-of-integer | ||
// that needs to be kept alive. | ||
std::vector<std::unique_ptr<column>> _transformed_structs_aux_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here -- why does this need to be a vector of unique_ptr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to store the output of cudf::rank
thus must be unique_ptr
.
* | ||
* @param input The input column to transform | ||
* @param stream CUDA stream used for device memory operations and kernel launches | ||
* @return A pair of new column_view representing the transformed input and the generated rank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use make_lists_column
and return the lists column directly, I think. I am guessing that the motivation to return this pair and not call make_lists_column
is to avoid copying the input null mask and offsets, but that's not a great long term solution. For the "concat and rank" approach with two-table comparator, the mask/offsets cannot be reused and we will require a new list column to be constructed (the concatenation of the struct columns, which is then ranked and returned). I'd like have the self-comparator and two-table comparator act the same way, if possible, even though it may introduce an additional copy. The ranked lists of size_type
are going to consume less memory than the input lists-of-structs, in most cases, so it's probably a memory cost we can afford to pay for a simplified implementation and easier lifetime management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't need to call make_lists_column
as that will copy the offsets column of the input lists. For the case of two-table comparator when we concate the child columns, we will split the output ranks into two ranks column and still apply the input null masks and offsets of the two input lists columns.
FYI, I've implemented the code for handling two-table comparator here: https://github.com/rapidsai/cudf/pull/12953/files#diff-c143ae2ecfe63796e604e5854f4a47cf6b831ebe5d68ce7c19a0e1533fceddc7R394-R413
// However, first ranks are computed faster and are good enough for ordering them. Structs | ||
// compared equal always have consecutive rank values (in stable order) thus they are still | ||
// sorted correctly by their ranks. | ||
auto ranks = cudf::detail::rank(child, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some details we need to pay attention to when performing the ranking. In particular, I want to be sure that NaN handling (via the physical element comparator) and nullability are handled correctly with respect to the target comparator. I'm not sure if this has complete/correct support for those yet?
This is what I was talking about in #11222:
Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of
list<struct<...>>
data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).
See also:
cudf/cpp/include/cudf/table/experimental/row_operators.cuh
Lines 287 to 290 in 9313b4c
* @param null_precedence Optional, device array the same length as a row and indicates how null | |
* values compare to all other for every column. If `nullopt`, then null precedence would be | |
* `null_order::BEFORE` for all columns. | |
* @param comparator Physical element relational comparison functor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really tricky. The rank
API calls sort_order
which in turn uses the less<>
comparator with default template argument PhysicalElementComparator = sorting_physical_element_comparator
. We cannot control that element comparator in this preprocessing step, as we have no way to pass in such element comparator into cudf::rank
.
So the only way I can think of now is to enforce using that default element comparator later and nothing else (of course, only if cudf::rank
has been called during preprocessing).
Can you create an issue documenting how to do this? Just to ensure we are tracking that there's possibly a better solution than flattening that we need to explore. |
I'll update the original issue (#11672) with my finding. |
Close this. I will separate it into two separate PRs for structs-of-lists and lists-of-structs. |
This PR implements full support for lexicographic comparison for nested types, for table self-comparator (support for two-table comparator will come in a follow-up PR).
Warning: Supporting structs-of-lists is a temporary workaround, not an optimal long-term solution.
Algorithm:
cudf::structs::detail::flatten_nested_columns
. While waiting for a long-term optimal solution, this workaround is fairly simple and efficient.Depends on:
cudf::structs::detail::flatten_nested_columns
to smart pointer #12878stream
andmr
parameters forstructs::detail::flatten_nested_columns
#12892Closes:
Tested by spark-rapids integration tests. All tests pass. Data generator used to test: