Commit
Faster struct row comparator (#10164)
The existing `row_lexicographical_comparator` cannot compare struct columns, so the current solution is to `flatten` a struct column with a pre-order traversal. This involves creating a bool column for each struct level. For example, for a struct of the following shape

```
Struct(1)<int, Struct(2)<float, string>>
```

we would generate columns like this:

[`bool(Struct(1))`, `int`, `bool(Struct(2))`, `float`, `string`]

This is done because struct traversal in the row comparator would require recursion, which is prohibitively expensive on the GPU because the stack size cannot be determined at compile time. An alternative was also explored as part of my current effort.[1]

The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that does not require stack storage, so traversing it needs only a fixed-size state. For the above example struct, the conversion would yield 3 columns:

[`Struct(1)<int>`, `Struct(1)<Struct(2)<float>>`, `Struct(1)<Struct(2)<string>>`]

Using this with the row comparator required adding a loop that traverses down the hierarchy and only checks for nulls at the struct level. Since each struct in a transformed column is guaranteed to have only one child, no stack is required to keep track of the location in the hierarchy.

Further, it can be shown that parents that have appeared once in the transformed columns need not appear again, because in a lexicographical comparison they would already have been compared. Thus the final transformed columns can look like this:

[`Struct(1)<int>`, `Struct(2)<float>`, `string`]

This approach has 2 benefits:

1. The new transformation does not require extra memory. The new views can be constructed from the data and nullmask pointers of the old views.
2. Because less data is read from device memory, sorting is at least 34% faster, and the speedup grows with struct depth.
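The decomposition described above can be illustrated on the schema alone. The following is a minimal host-only sketch, not the code in this PR: `SchemaNode`, `collect_chains`, `render`, and `verticalize` are hypothetical names, and the sketch only tracks type names, whereas the real transformation builds column views that reuse the existing data and nullmask pointers. It assumes a node without children is a leaf. Each leaf produces one column; a struct is attached only to its first descendant leaf, since later columns do not need to repeat a parent that has already been compared.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical schema node: a leaf type, or a struct with child nodes.
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;  // empty means this node is a leaf
};

// One output column, written as the list of names from outermost struct to leaf.
using Chain = std::vector<std::string>;

// Pre-order walk. `pending` holds ancestors that have not yet been emitted.
// Only the first child inherits them; later children start with an empty chain.
void collect_chains(const SchemaNode& node, Chain pending, std::vector<Chain>& out) {
  pending.push_back(node.name);
  if (node.children.empty()) {
    out.push_back(pending);
    return;
  }
  for (std::size_t i = 0; i < node.children.size(); ++i) {
    collect_chains(node.children[i], i == 0 ? pending : Chain{}, out);
  }
}

// Render a chain such as {"Struct(1)", "int"} as "Struct(1)<int>".
std::string render(const Chain& chain) {
  assert(!chain.empty());
  std::string text = chain.back();
  for (std::size_t i = chain.size() - 1; i-- > 0;) {
    text = chain[i] + "<" + text + ">";
  }
  return text;
}

// Produce the single-child columns for a whole schema.
std::vector<std::string> verticalize(const SchemaNode& root) {
  std::vector<Chain> chains;
  collect_chains(root, {}, chains);
  std::vector<std::string> columns;
  for (const Chain& chain : chains) {
    columns.push_back(render(chain));
  }
  return columns;
}
```

For the example schema `Struct(1)<int, Struct(2)<float, string>>`, `verticalize` returns `Struct(1)<int>`, `Struct(2)<float>`, and `string`. The recursion here runs on the host over the schema only; every resulting column has at most one child per level, which is what lets the device-side comparison use a plain loop.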
Benchmark arguments: `num_rows {1<<24, 1<<26}`, `depth {1, 8}`

```
Comparing benchmarks/COMPARE_BENCH to benchmarks/COMPARE_BENCH_new
Benchmark                                           Time       CPU   Time Old   Time New   CPU Old   CPU New
-------------------------------------------------------------------------------------------------------------
Sort<false>/unstable/16777216/1/manual_time      -0.3417   -0.3408         60         39        60        39
Sort<false>/unstable/67108864/1/manual_time      -0.3471   -0.3471        243        159       243       159
Sort<false>/unstable/16777216/8/manual_time      -0.6201   -0.6201        444        169       444       169
Sort<false>/unstable/67108864/8/manual_time      -0.6290   -0.6290       1776        659      1776       659
```

[1] The alternative was to convert recursion to iteration by constructing a manually controlled call stack with stack memory backed storage. This would be limited by the stack memory and was found to be more expensive than the current approach. The code for this is in `row_operators2.cuh`.

### API changes

This PR adds an owning type `self_comparator` that takes a `table_view`, preprocesses it as described above, and stores the device objects needed for comparison. The owning type then provides a functor for use on the device.

Another owning type, `preprocessed_table`, is also added. It can likewise be constructed from a `table_view` and does the same preprocessing. `self_comparator` can also be constructed from a `preprocessed_table`, which is useful when the same preprocessed table is used in different comparators.

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #10164