Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully support lexicographic comparison for nested types in table self-comparator #12879

Closed
wants to merge 66 commits into from

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 3, 2023

This PR implements full support for lexicographic comparison for nested types, for table self-comparator (support for two-table comparator will come in a follow-up PR).

Warning: Supporting structs-of-lists is a temporary workaround, not an optimal long-term solution.

Algorithm:

  • If a column contains structs-of-lists: flatten it using cudf::structs::detail::flatten_nested_columns. While waiting for a long-term optimal solution, this workaround is fairly simple and efficient.
  • If a column contains lists-of-structs: follow the algorithm described in Proposal for lexicographic comparators for lists of structs #11222.
  • If the input doesn't contain any lists-of-structs or structs-of-lists, there will not be any change/transformation applied to it, and the performance will not be impacted at all.

Depends on:

Closes:


Tested by spark-rapids integration tests. All tests pass. Data generator used to test:

struct_of_array_gen = [StructGen([['child0', ArrayGen(int_gen, max_length=10)]]),
                       StructGen([['child0', byte_gen],
                                  ['child1', ArrayGen(int_gen, max_length=10)],
                                  ['child2', int_gen]]),
                       StructGen([['child0', ArrayGen(int_gen, max_length=10)],
                                  ['child1', int_gen]]),
                       StructGen([['child0', ArrayGen(int_gen, max_length=10)],
                                  ['child1', ArrayGen(int_gen, max_length=10)],
                                  ['child2', ArrayGen(int_gen, max_length=10)]])]
nested_array_of_struct_gen = [ArrayGen(sub_gen, max_length=10) for sub_gen in struct_of_array_gen]
deeply_nested_array_gen = [ArrayGen(sub_gen, max_length=10) for sub_gen in nested_array_of_struct_gen] + \
                          [ArrayGen(sub_gen, max_length=10) for sub_gen in struct_of_array_gen]
deeply_nested_struct_gen = [StructGen([['child0', sub_gen] for ind, sub_gen in enumerate(deeply_nested_array_gen)])]

# This is used to test:
deeply_nested_gen = struct_of_array_gen + nested_array_of_struct_gen + deeply_nested_array_gen + \
                    deeply_nested_struct_gen

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf blocker libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Mar 3, 2023
@ttnghia ttnghia self-assigned this Mar 3, 2023
@github-actions github-actions bot added the CMake CMake build issue label Mar 3, 2023
@ttnghia ttnghia changed the title Fully support lexicographic comparison for nested types Support lexicographic comparison for structs of lists Mar 4, 2023
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 13, 2023

I ran a benchmark comparing sorting lists columns (nested lists-of-lists, or LoL) and sorting lists-of-structs columns (LoS). These LoS columns are generated by just putting the children of the LoL columns under a structs column. So under the hood, these LoL and LoS columns contain the same information to compare.

The benchmarks show very interesting results:

# sort_list

## [0] Quadro RTX 6000

|  size_bytes  |  depth  |  num_columns  |  null_frequency  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |             Diff |   %Diff |  Status  |
|--------------|---------|---------------|------------------|------------|-------------|------------|-------------|------------------|---------|----------|
|     2^10     |    1    |      10       |        0         |   4.075 us |     204.71% |   2.184 us |     291.37% |        -1.891 us | -46.40% |   PASS   |
|     2^18     |    1    |      10       |        0         |  32.663 ms |       3.13% |  38.994 ms |       2.32% |         6.331 ms |  19.38% |   FAIL   |
|     2^24     |    1    |      10       |        0         |   9.545 ms |       9.76% |  29.418 ms |       5.41% |        19.873 ms | 208.21% |   FAIL   |
|     2^10     |    2    |      10       |        0         |   6.744 us |      38.68% |   2.906 us |     207.01% |        -3.838 us | -56.91% |   FAIL   |
|     2^18     |    2    |      10       |        0         | 901.082 ms |       4.00% |  28.292 ms |       3.16% |   -872790.516 us | -96.86% |   FAIL   |
|     2^24     |    2    |      10       |        0         | 113.982 ms |       0.86% | 115.617 ms |       0.99% |         1.635 ms |   1.43% |   FAIL   |
|     2^10     |    3    |      10       |        0         |   9.589 us |      59.38% |   3.461 us |      85.14% |        -6.128 us | -63.91% |   FAIL   |
|     2^18     |    3    |      10       |        0         |   9.805 us |      42.68% |   3.259 us |      96.03% |        -6.546 us | -66.76% |   FAIL   |
|     2^24     |    3    |      10       |        0         |   55.376 s |        inf% | 154.392 ms |       2.17% | -55221986.520 us | -99.72% |   FAIL   |
|     2^10     |    4    |      10       |        0         |  12.853 us |      42.00% |   4.005 us |     154.35% |        -8.848 us | -68.84% |   FAIL   |
|     2^18     |    4    |      10       |        0         |  12.584 us |      39.01% |   3.753 us |      78.64% |        -8.831 us | -70.18% |   FAIL   |
|     2^24     |    4    |      10       |        0         |  13.187 us |      91.99% |   3.787 us |      85.74% |        -9.400 us | -71.28% |   PASS   |

In summary:

  • For shallow LoS, sorting them is much slower than sorting LoL. This is due to the overhead of computing ranks.
  • However, when the LoL columns have higher depth (just more than 1 depth level) then sorting them becomes slower than sorting LoS due to the iterative mechanism of the comparator when working with multiple levels of lists.

I've filed the issue to leverage this in the future: #12932.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 13, 2023

Before we resort to flattening to support structs of lists, could we try to resolve #11672 properly? We originally expected structs of lists to "just work" tm, and the only reason I didn't investigate further at the time was the urgency imposed on getting the comparator merged. I would like to have a sense for the feasibility of the long-term solution before we resort to a workaround.

@vyasr I think I found what's wrong with structs-of-lists: For comparing lists, we need to generate dremel data. However:

  • Dremel data is generated only for lists columns that are not children of any other column:
    for (auto const& col : table) {
    if (col.type().id() == type_id::LIST) {
    dremel_data.push_back(detail::get_comparator_data(col, {}, false, stream));
    dremel_device_views.push_back(dremel_data.back());
    }
    }
  • On device, dremel data is considered for comparison only in the same situation:
    auto [l_dremel_i, r_dremel_i] = [&]() {
    if (_lhs.column(i).type().id() == type_id::LIST) {
    auto idx = list_column_index++;
    return std::make_tuple(optional_dremel_view(_l_dremel[idx]),
    optional_dremel_view(_r_dremel[idx]));
    } else {
    return std::make_tuple(optional_dremel_view{}, optional_dremel_view{});
    }
    }();
  • Even that, structs column cannot have lists children. Otherwise, we will encounter compiler stack overflow with device code due to type dispatcher. The current code only processes child structs iteratively while ignoring lists (dispatch_void_if_nested):
    return cudf::type_dispatcher<dispatch_void_if_nested>(
    lcol.type(),
    element_comparator{_check_nulls, lcol, rcol, _null_precedence, depth, _comparator},
    lhs_element_index,
    rhs_element_index);

So in order to fix this completely, we need both to generate+consider dremel data for nested lists columns as well as fixing the issue with type dispatcher above. I'm not sure about the complexity of such work, so it should be done in a separate follow up PR.

Update: I've successfully fixed the dremel issue above. However, that may require changing to the function decompose_structs which is also used in equality comparators. Such changes may affect comparison performance. Overall, too much risky for a limited time budget. So the issue needs to be addressed separately and I'll post a PR for it after this one merged whenever I can.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting this partial review. I will review the rest at a later time.

* @param flattened_input_aux_data The data structure generated from
* `cudf::structs::detail::flatten_nested_columns` that contains additional information used for
* row comparisons.
* @param transformed_structs_columns The intermediate columns resulted from transforming child
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can name this in a way that indicates what the transformation does. For example, list_of_structs_ranked_columns.

Copy link
Contributor Author

@ttnghia ttnghia Mar 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rename it to structs_ranked_columns.

cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved

// Auxiliary data generated from `cudf::structs::detail::flatten_nested_columns`
// that needs to be kept alive.
std::unique_ptr<cudf::structs::detail::flattened_table> _flattened_input_aux_data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a unique_ptr when all the other owned members are not? Should this be an optional? Or just a plain member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as discussed in #12878 (comment). In particular, using smart pointer allows to forward-declare the class flattened_table which in turn avoids including the relevant header into this already-huge header.


// Auxiliary data generated from transforming lists-of-structs into lists-of-integer
// that needs to be kept alive.
std::vector<std::unique_ptr<column>> _transformed_structs_aux_data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here -- why does this need to be a vector of unique_ptr?

Copy link
Contributor Author

@ttnghia ttnghia Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to store the output of cudf::rank thus must be unique_ptr.

*
* @param input The input column to transform
* @param stream CUDA stream used for device memory operations and kernel launches
* @return A pair of new column_view representing the transformed input and the generated rank
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use make_lists_column and return the lists column directly, I think. I am guessing that the motivation to return this pair and not call make_lists_column is to avoid copying the input null mask and offsets, but that's not a great long term solution. For the "concat and rank" approach with two-table comparator, the mask/offsets cannot be reused and we will require a new list column to be constructed (the concatenation of the struct columns, which is then ranked and returned). I'd like have the self-comparator and two-table comparator act the same way, if possible, even though it may introduce an additional copy. The ranked lists of size_type are going to consume less memory than the input lists-of-structs, in most cases, so it's probably a memory cost we can afford to pay for a simplified implementation and easier lifetime management.

Copy link
Contributor Author

@ttnghia ttnghia Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually don't need to call make_lists_column as that will copy the offsets column of the input lists. For the case of two-table comparator when we concate the child columns, we will split the output ranks into two ranks column and still apply the input null masks and offsets of the two input lists columns.

FYI, I've implemented the code for handling two-table comparator here: https://github.com/rapidsai/cudf/pull/12953/files#diff-c143ae2ecfe63796e604e5854f4a47cf6b831ebe5d68ce7c19a0e1533fceddc7R394-R413

cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
// However, first ranks are computed faster and are good enough for ordering them. Structs
// compared equal always have consecutive rank values (in stable order) thus they are still
// sorted correctly by their ranks.
auto ranks = cudf::detail::rank(child,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some details we need to pay attention to when performing the ranking. In particular, I want to be sure that NaN handling (via the physical element comparator) and nullability are handled correctly with respect to the target comparator. I'm not sure if this has complete/correct support for those yet?

This is what I was talking about in #11222:

Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of list<struct<...>> data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).

See also:

* @param null_precedence Optional, device array the same length as a row and indicates how null
* values compare to all other for every column. If `nullopt`, then null precedence would be
* `null_order::BEFORE` for all columns.
* @param comparator Physical element relational comparison functor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really tricky. The rank API calls sort_order which in turn uses the less<> comparator with default template argument PhysicalElementComparator = sorting_physical_element_comparator. We cannot control that element comparator in this preprocessing step, as we have no way to pass in such element comparator into cudf::rank.

So the only way I can think of now is to enforce using that default element comparator later and nothing else (of course, only if cudf::rank has been called during preprocessing).

cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
@divyegala
Copy link
Member

Update: I've successfully fixed the dremel issue above. However, that may require changing to the function decompose_structs which is also used in equality comparators. Such changes may affect comparison performance. Overall, too much risky for a limited time budget. So the issue needs to be addressed separately and I'll post a PR for it after this one merged whenever I can.

Can you create an issue documenting how to do this? Just to ensure we are tracking that there's possibly a better solution than flattening that we need to explore.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 20, 2023

Can you create an issue documenting how to do this? Just to ensure we are tracking that there's possibly a better solution than flattening that we need to explore.

I'll update the original issue (#11672) with my finding.

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 4, 2023

Close this. I will separate it into two separate PRs for structs-of-lists and lists-of-structs.

@ttnghia ttnghia closed this Apr 4, 2023
@ttnghia ttnghia deleted the sort_nested_types branch May 1, 2023 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants