Faster struct row comparator (#10164)
The existing `row_lexicographical_comparator` cannot compare struct columns, so the current solution is to `flatten` a struct column with a pre-order traversal. This creates a bool column for each struct level. For example, for a struct of the following shape
```
Struct(1)<int, Struct(2)<float, string>>
```
we would generate columns like this:

[`bool(Struct(1))`, `int`, `bool(Struct(2))`, `float`, `string`]
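The pre-order flattening described above can be sketched on the host with a toy type tree. This is only an illustration: `TypeNode`, `flatten`, and `flatten_example` are hypothetical names, not libcudf types or functions, and the sketch tracks column names only, not data.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a nested column type (hypothetical, not a cudf type).
struct TypeNode {
  std::string name;                // e.g. "int" or "Struct(1)"
  std::vector<TypeNode> children;  // empty for leaf (non-struct) columns
};

// Pre-order traversal: emit a bool column for each struct level,
// then visit the struct's children in order.
void flatten(TypeNode const& node, std::vector<std::string>& out)
{
  assert(!node.name.empty());
  if (node.children.empty()) {
    out.push_back(node.name);
    return;
  }
  out.push_back("bool(" + node.name + ")");
  for (auto const& child : node.children) { flatten(child, out); }
}

// Builds Struct(1)<int, Struct(2)<float, string>> and flattens it.
std::vector<std::string> flatten_example()
{
  TypeNode const inner{"Struct(2)", {TypeNode{"float", {}}, TypeNode{"string", {}}}};
  TypeNode const outer{"Struct(1)", {TypeNode{"int", {}}, inner}};
  std::vector<std::string> out;
  flatten(outer, out);
  return out;
}
```

Note that the traversal itself is recursive; that is acceptable here only because the sketch runs on the host.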

Flattening is used because traversing structs inside the row comparator would require recursion, which is prohibitively expensive on the GPU because the stack size cannot be determined at compile time. An alternative was also explored as part of this effort.[1]

The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that can be traversed with a fixed-size state instead of a stack. For the above example struct, the conversion would yield 3 columns:

[`Struct(1)<int>`, `Struct(1)<Struct(2)<float>>`, `Struct(1)<Struct(2)<string>>`]

Using this with the row comparator required adding a loop that traverses down the hierarchy and checks for nulls only at the struct levels. Since every level of the hierarchy is guaranteed to have only one child, no stack is required to keep track of the location in the hierarchy.
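A minimal host-side sketch of such a stack-free loop is shown below. It is not the comparator from this PR: `VerticalColumn`, `Ordering`, and `compare_rows` are hypothetical names, the leaf is assumed to be an int column, and nulls are assumed to sort before non-nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Ordering { Less, Equal, Greater };

// Toy model of one verticalized column: a chain of struct levels, each with
// exactly one child, ending in a leaf column of ints (hypothetical type).
struct VerticalColumn {
  std::vector<std::vector<bool>> struct_validity;  // one mask per level, outermost first
  std::vector<bool> leaf_validity;
  std::vector<int> leaf_values;
};

Ordering compare_rows(VerticalColumn const& col, std::size_t lhs, std::size_t rhs)
{
  assert(lhs < col.leaf_values.size() && rhs < col.leaf_values.size());

  // Settles the comparison when either side is null; nulls sort first here
  // (an assumption of this sketch), and two nulls compare equal.
  auto const null_order = [](bool lhs_valid, bool rhs_valid) {
    if (lhs_valid == rhs_valid) { return Ordering::Equal; }
    return lhs_valid ? Ordering::Greater : Ordering::Less;
  };

  // Walk down the single-child chain. The loop position is the only state.
  for (auto const& mask : col.struct_validity) {
    if (!mask[lhs] || !mask[rhs]) { return null_order(mask[lhs], mask[rhs]); }
  }

  // Every struct level is valid on both sides, so compare the leaf.
  if (!col.leaf_validity[lhs] || !col.leaf_validity[rhs]) {
    return null_order(col.leaf_validity[lhs], col.leaf_validity[rhs]);
  }
  if (col.leaf_values[lhs] < col.leaf_values[rhs]) { return Ordering::Less; }
  if (col.leaf_values[rhs] < col.leaf_values[lhs]) { return Ordering::Greater; }
  return Ordering::Equal;
}
```

A row whose struct is null at some level is decided at that level, so its leaf value is never read.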

Further, it can be shown that parents that have already appeared once in the transformed columns need not appear again, because in a lexicographical comparison they would already have been compared. Thus the final transformed columns can look like this:

[`Struct(1)<int>`, `Struct(2)<float>`, `string`]
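Both the unpruned and the pruned transformation can be sketched with the same kind of toy type tree. Again, `TypeNode`, `wrap`, and `verticalize` are hypothetical names used only for illustration; the sketch produces column names rather than real column views.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a nested column type (hypothetical, not a cudf type).
struct TypeNode {
  std::string name;
  std::vector<TypeNode> children;  // empty for leaf columns
};

// Wraps `leaf` in the given ancestor names, outermost first.
std::string wrap(std::vector<std::string> const& ancestors, std::string const& leaf)
{
  std::string result = leaf;
  for (auto it = ancestors.rbegin(); it != ancestors.rend(); ++it) {
    result = *it + "<" + result + ">";
  }
  return result;
}

// Emits one single-child chain per leaf. `pending` holds the ancestors that
// still have to be emitted. With `prune` set, an ancestor is dropped as soon
// as it has appeared in one output column.
void verticalize(TypeNode const& node,
                 std::vector<std::string>& pending,
                 bool prune,
                 std::vector<std::string>& out)
{
  assert(!node.name.empty());
  if (node.children.empty()) {
    out.push_back(wrap(pending, node.name));
    if (prune) { pending.clear(); }
    return;
  }
  pending.push_back(node.name);
  for (auto const& child : node.children) { verticalize(child, pending, prune, out); }
  // When pruning, the first leaf below this struct already cleared `pending`.
  if (!prune) { pending.pop_back(); }
}

std::vector<std::string> verticalize_example(bool prune)
{
  TypeNode const inner{"Struct(2)", {TypeNode{"float", {}}, TypeNode{"string", {}}}};
  TypeNode const outer{"Struct(1)", {TypeNode{"int", {}}, inner}};
  std::vector<std::string> pending;
  std::vector<std::string> out;
  verticalize(outer, pending, prune, out);
  return out;
}
```

Calling `verticalize_example(false)` reproduces the three fully wrapped columns, and `verticalize_example(true)` reproduces the pruned list.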

This approach has 2 benefits:
1. The new transformation requires no extra memory. The new views can be constructed from the data and null mask pointers of the old views.
2. Because less data is read from device memory, sorting is at least 34% faster, and the improvement grows with struct depth (about 62% at depth 8). Benchmark arguments: `num_rows {1<<24, 1<<26}`, `depth {1, 8}`
```
Comparing benchmarks/COMPARE_BENCH to benchmarks/COMPARE_BENCH_new
Benchmark                                                     Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable/16777216/1/manual_time                -0.3417         -0.3408            60            39            60            39
Sort<false>/unstable/67108864/1/manual_time                -0.3471         -0.3471           243           159           243           159
Sort<false>/unstable/16777216/8/manual_time                -0.6201         -0.6201           444           169           444           169
Sort<false>/unstable/67108864/8/manual_time                -0.6290         -0.6290          1776           659          1776           659
```
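The `Time` column in the comparison above is a relative change, (new - old) / old. A small sketch of that arithmetic follows, with hypothetical helper names. The `Time Old` and `Time New` columns are rounded, so values recomputed from them can differ slightly from the reported ones (for example, 60 and 39 give -0.35 rather than -0.3417).

```cpp
#include <cassert>

// Relative change in run time: negative values mean the new run is faster.
double relative_change(double old_time, double new_time)
{
  assert(old_time > 0.0);
  return (new_time - old_time) / old_time;
}

// Speedup factor: how many times faster the new run is than the old one.
double speedup(double old_time, double new_time)
{
  assert(new_time > 0.0);
  return old_time / new_time;
}
```

For the depth 8 rows, `speedup(444, 169)` and `speedup(1776, 659)` both come out above 2.6.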

[1] The alternative was to convert recursion to iteration by constructing a manually controlled call stack whose storage is backed by stack memory. This would be limited by the available stack memory and was found to be more expensive than the current approach. The code for this is in `row_operators2.cuh`.


### API changes
This PR adds an owning type, `self_comparator`, that takes a `table_view`, preprocesses it as described above, and stores the device objects needed for comparison. The owning type then provides a functor for use on the device.

A second owning type, `preprocessed_table`, is also added. It can be constructed from a `table_view` and does the same preprocessing. A `self_comparator` can also be constructed from a `preprocessed_table`, which is useful when the same preprocessed table is used in several comparators.
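The ownership pattern described here can be illustrated with a host-only toy model. None of the names below are the libcudf API: `ToyTable`, `ToyPreprocessedTable`, and `ToySelfComparator` are hypothetical stand-ins, the "preprocessing" is just a copy, and sharing through `std::shared_ptr` is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a table: a list of int columns of equal length.
using ToyTable = std::vector<std::vector<int>>;

// Owns the result of preprocessing so that it can be shared.
class ToyPreprocessedTable {
 public:
  static std::shared_ptr<ToyPreprocessedTable> create(ToyTable const& table)
  {
    // Real preprocessing would restructure struct columns; the toy copies them.
    return std::shared_ptr<ToyPreprocessedTable>(new ToyPreprocessedTable(table));
  }
  ToyTable const& columns() const { return columns_; }

 private:
  explicit ToyPreprocessedTable(ToyTable table) : columns_(std::move(table)) {}
  ToyTable columns_;
};

// Owning comparator: built from a raw table or from shared preprocessed state.
class ToySelfComparator {
 public:
  explicit ToySelfComparator(ToyTable const& table)
    : table_(ToyPreprocessedTable::create(table))
  {
  }
  explicit ToySelfComparator(std::shared_ptr<ToyPreprocessedTable> table)
    : table_(std::move(table))
  {
    assert(table_ != nullptr);
  }

  // Returns a lightweight functor; it is valid only while this comparator
  // (and hence the preprocessed table) is alive.
  auto less() const
  {
    return [cols = &table_->columns()](std::size_t lhs, std::size_t rhs) {
      for (auto const& col : *cols) {
        if (col[lhs] != col[rhs]) { return col[lhs] < col[rhs]; }
      }
      return false;
    };
  }

 private:
  std::shared_ptr<ToyPreprocessedTable> table_;
};
```

Two comparators constructed from the same `ToyPreprocessedTable` pointer share one copy of the preprocessed data, which mirrors the reuse case mentioned above.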

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10164
devavret committed Mar 22, 2022
1 parent 76c772e commit e7dba35
Showing 9 changed files with 856 additions and 21 deletions.
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -470,6 +470,7 @@ add_library(
src/structs/structs_column_factories.cu
src/structs/structs_column_view.cpp
src/structs/utilities.cpp
src/table/row_operators.cu
src/table/table.cpp
src/table/table_device_view.cu
src/table/table_view.cpp
1 change: 1 addition & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -164,6 +164,7 @@ ConfigureBench(SEARCH_BENCH search/search.cpp)
# ##################################################################################################
# * sort benchmark --------------------------------------------------------------------------------
ConfigureBench(SORT_BENCH sort/rank.cpp sort/sort.cpp sort/sort_strings.cpp)
ConfigureNVBench(SORT_NVBENCH sort/sort_structs.cpp)

# ##################################################################################################
# * quantiles benchmark
84 changes: 84 additions & 0 deletions cpp/benchmarks/sort/sort_structs.cpp
@@ -0,0 +1,84 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/fixture/rmm_pool_raii.hpp>

#include <cudf/column/column.hpp>
#include <cudf/detail/sorting.hpp>
#include <cudf/table/table.hpp>

#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <rmm/cuda_stream_view.hpp>

#include <nvbench/nvbench.cuh>

#include <algorithm>
#include <iterator>
#include <memory>
#include <random>
#include <vector>

void nvbench_sort_struct(nvbench::state& state)
{
  cudf::rmm_pool_raii pool_raii;

  using Type           = int;
  using column_wrapper = cudf::test::fixed_width_column_wrapper<Type>;
  std::default_random_engine generator;
  std::uniform_int_distribution<int> distribution(0, 100);

  const cudf::size_type n_rows{static_cast<cudf::size_type>(state.get_int64("NumRows"))};
  const cudf::size_type n_cols{1};
  const cudf::size_type depth{static_cast<cudf::size_type>(state.get_int64("Depth"))};
  const bool nulls{static_cast<bool>(state.get_int64("Nulls"))};

  // Create leaf columns with random values in the inclusive range [0, 100]
  std::vector<column_wrapper> columns;
  columns.reserve(n_cols);
  std::generate_n(std::back_inserter(columns), n_cols, [&]() {
    auto const elements = cudf::detail::make_counting_transform_iterator(
      0, [&](auto row) { return distribution(generator); });
    if (!nulls) return column_wrapper(elements, elements + n_rows);
    // With nulls enabled, every 10th row of the leaf column is null
    auto valids =
      cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 10 != 0; });
    return column_wrapper(elements, elements + n_rows, valids);
  });

  std::vector<std::unique_ptr<cudf::column>> cols;
  std::transform(columns.begin(), columns.end(), std::back_inserter(cols), [](column_wrapper& col) {
    return col.release();
  });

  std::vector<std::unique_ptr<cudf::column>> child_cols = std::move(cols);
  // Wrap the columns in `depth` nested struct layers
  for (int i = 0; i < depth; i++) {
    std::vector<bool> struct_validity;
    // A struct row is null only when the draw is 0, so nulls get rarer with each layer
    std::uniform_int_distribution<int> bool_distribution(0, 100 * (i + 1));
    std::generate_n(
      std::back_inserter(struct_validity), n_rows, [&]() { return bool_distribution(generator); });
    cudf::test::structs_column_wrapper struct_col(std::move(child_cols), struct_validity);
    child_cols = std::vector<std::unique_ptr<cudf::column>>{};
    child_cols.push_back(struct_col.release());
  }

  // Create the input table from the outermost struct column
  auto const input = cudf::table(std::move(child_cols));

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    rmm::cuda_stream_view stream_view{launch.get_stream()};
    cudf::detail::sorted_order(input, {}, {}, stream_view, rmm::mr::get_current_device_resource());
  });
}

NVBENCH_BENCH(nvbench_sort_struct)
  .set_name("sort_struct")
  .add_int64_power_of_two_axis("NumRows", {10, 18, 26})
  .add_int64_axis("Depth", {1, 8})
  .add_int64_axis("Nulls", {0, 1});