Faster struct row comparator (#10164)

The existing `row_lexicographical_comparator` cannot compare struct columns, so the current solution is to `flatten` a struct column with pre-order traversal. This involves creating a bool column for each struct level. For example, for a struct of the following shape

```
Struct(1)<int, Struct(2)<float, string>>
```

we would generate columns like this:

[`bool(Struct(1))`, `int`, `bool(Struct(2))`, `float`, `string`]

This is done because struct traversal in the row comparator would require recursion, which is prohibitively expensive on the GPU because stack size cannot be determined at compile time. An alternative was also explored as part of my current effort.[1]

The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that does not require stack storage, so traversing it only requires state with fixed storage. For the above example struct, the conversion would yield 3 columns:

[`Struct(1)<int>`, `Struct(1)<Struct(2)<float>>`, `Struct(1)<Struct(2)<string>>`]

Using this with the row comparator required adding a loop that traverses down the hierarchy and only checks for nulls at the struct level. Since each level of the hierarchy is guaranteed to have only one child, no stack is required to keep track of the location in the hierarchy.

Further, it can be shown that parents that have appeared once in the transformed columns need not appear again because, in a lexicographical comparison, they would already have been compared. Thus the final transformed columns can look like this:

[`Struct(1)<int>`, `Struct(2)<float>`, `string`]

This approach has 2 benefits:

1. The new transformation does not require the use of extra memory. The new views can be constructed from the data and nullmask pointers of the old views.
2. Because less data is read from device memory, sorting is at least 34% faster, and the gain grows with struct depth.
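To make the two transformations concrete, here is a minimal host-side sketch on a toy schema tree. It is not the libcudf implementation: `Node`, `flatten`, `verticalize` and `example_schema` are hypothetical names introduced only for illustration, and the functions manipulate type names as strings rather than column views.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy schema node: a leaf has no children, a struct has some.
struct Node {
  std::string name;
  std::vector<Node> children;
  bool is_struct() const { return !children.empty(); }
};

// Existing approach: pre-order traversal that emits one bool column
// per struct level, followed by that struct's children.
void flatten(Node const& n, std::vector<std::string>& out)
{
  if (!n.is_struct()) {
    out.push_back(n.name);
    return;
  }
  out.push_back("bool(" + n.name + ")");
  for (auto const& child : n.children) flatten(child, out);
}

// Proposed approach: one single-child chain per leaf. A parent is only
// carried along its first child, because later columns need not repeat it.
void verticalize(Node const& n,
                 std::string const& prefix,
                 std::size_t open,
                 std::vector<std::string>& out)
{
  if (!n.is_struct()) {
    out.push_back(prefix + n.name + std::string(open, '>'));
    return;
  }
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i == 0) {
      verticalize(n.children[i], prefix + n.name + "<", open + 1, out);
    } else {
      verticalize(n.children[i], "", 0, out);
    }
  }
}

// The example from the commit message: Struct(1)<int, Struct(2)<float, string>>
Node example_schema()
{
  Node inner{"Struct(2)", {Node{"float", {}}, Node{"string", {}}}};
  return Node{"Struct(1)", {Node{"int", {}}, inner}};
}
```

Applied to `example_schema()`, `flatten` produces the five flattened column names and `verticalize` produces the three final column names listed above. The sketch uses host recursion for brevity; that is acceptable here only because it models the one-time preprocessing step, not the per-row device comparison.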

Benchmark arguments: `num_rows {1<<24, 1<<26}`, `depth {1, 8}`

```
Comparing benchmarks/COMPARE_BENCH to benchmarks/COMPARE_BENCH_new
Benchmark                                        Time       CPU   Time Old   Time New   CPU Old   CPU New
----------------------------------------------------------------------------------------------------------
Sort<false>/unstable/16777216/1/manual_time   -0.3417   -0.3408         60         39        60        39
Sort<false>/unstable/67108864/1/manual_time   -0.3471   -0.3471        243        159       243       159
Sort<false>/unstable/16777216/8/manual_time   -0.6201   -0.6201        444        169       444       169
Sort<false>/unstable/67108864/8/manual_time   -0.6290   -0.6290       1776        659      1776       659
```

[1] The alternative was to convert recursion to iteration by constructing a manually controlled call stack with stack memory backed storage. This would be limited by the stack memory and was found to be more expensive than the current approach. The code for this is in row_operators2.cuh

### API changes

This PR adds an owning type `self_comparator` that takes a `table_view`, preprocesses it as described above, and stores the device objects needed for comparison. The owning type then provides a functor for use on the device.

Another owning type, `preprocessed_table`, is also added. It can be constructed from a `table_view` and does the same preprocessing. `self_comparator` can also be constructed from a `preprocessed_table`, which is useful when the same preprocessed table is used in different comparators.

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10164
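The stack-free traversal described in the commit message can be sketched on the host as follows. This is an illustrative model under stated assumptions, not the device functor added by this PR: `Level`, `ChainColumn`, `Order` and `compare_rows` are hypothetical names, the leaf holds plain `int` values, and nulls are assumed to sort before non-nulls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one level in a single-child chain:
// a per-row validity mask (true = valid, false = null).
struct Level {
  std::vector<bool> valid;
};

// Hypothetical model of one transformed column.
struct ChainColumn {
  std::vector<Level> levels;  // outermost struct first, leaf last
  std::vector<int> values;    // leaf data, one entry per row
};

enum class Order { Less, Equal, Greater };

// Compare two rows of one chain column.
// Assumption: a null sorts before any non-null value.
Order compare_rows(ChainColumn const& col, std::size_t lhs, std::size_t rhs)
{
  // Each level has exactly one child, so walking down needs only the
  // loop position as state; no stack of pending siblings is required.
  for (auto const& level : col.levels) {
    bool const l = level.valid[lhs];
    bool const r = level.valid[rhs];
    if (!l || !r) {
      if (l == r) return Order::Equal;          // both null at this level
      return l ? Order::Greater : Order::Less;  // the null side sorts first
    }
  }
  // Every level is valid for both rows, so compare the leaf values.
  if (col.values[lhs] < col.values[rhs]) return Order::Less;
  if (col.values[lhs] > col.values[rhs]) return Order::Greater;
  return Order::Equal;
}

// Lexicographic "less than" across columns: the first column that
// differs decides the result.
bool row_less(std::vector<ChainColumn> const& cols, std::size_t lhs, std::size_t rhs)
{
  for (auto const& col : cols) {
    Order const o = compare_rows(col, lhs, rhs);
    if (o != Order::Equal) return o == Order::Less;
  }
  return false;
}
```

The point of the sketch is the shape of the loop: nulls are checked once per level while descending, and the descent stops at the first level where either row is null.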
rapidsai · Mar 22, 2022 · e7dba35
1 parent 76c772e
commit e7dba35
Showing 9 changed files with 856 additions and 21 deletions.
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
@@ -470,6 +470,7 @@ add_library(
   src/structs/structs_column_factories.cu
   src/structs/structs_column_view.cpp
   src/structs/utilities.cpp
+  src/table/row_operators.cu
   src/table/table.cpp
   src/table/table_device_view.cu
   src/table/table_view.cpp

diff --git a/cpp/benchmarks/CMakeLists.txt b/cpp/benchmarks/CMakeLists.txt
@@ -164,6 +164,7 @@ ConfigureBench(SEARCH_BENCH search/search.cpp)
 # ##################################################################################################
 # * sort benchmark --------------------------------------------------------------------------------
 ConfigureBench(SORT_BENCH sort/rank.cpp sort/sort.cpp sort/sort_strings.cpp)
+ConfigureNVBench(SORT_NVBENCH sort/sort_structs.cpp)
 
 # ##################################################################################################
 # * quantiles benchmark

diff --git a/cpp/benchmarks/sort/sort_structs.cpp b/cpp/benchmarks/sort/sort_structs.cpp
@@ -0,0 +1,84 @@
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <benchmarks/fixture/rmm_pool_raii.hpp>
+
+#include <cudf/detail/sorting.hpp>
+
+#include <cudf_test/column_utilities.hpp>
+#include <cudf_test/column_wrapper.hpp>
+
+#include <nvbench/nvbench.cuh>
+
+#include <random>
+
+void nvbench_sort_struct(nvbench::state& state)
+{
+  cudf::rmm_pool_raii pool_raii;
+
+  using Type           = int;
+  using column_wrapper = cudf::test::fixed_width_column_wrapper<Type>;
+  std::default_random_engine generator;
+  std::uniform_int_distribution<int> distribution(0, 100);
+
+  const cudf::size_type n_rows{static_cast<cudf::size_type>(state.get_int64("NumRows"))};
+  const cudf::size_type n_cols{1};
+  const cudf::size_type depth{static_cast<cudf::size_type>(state.get_int64("Depth"))};
+  const bool nulls{static_cast<bool>(state.get_int64("Nulls"))};
+
+  // Create columns with values in the range [0, 100] (both bounds inclusive)
+  std::vector<column_wrapper> columns;
+  columns.reserve(n_cols);
+  std::generate_n(std::back_inserter(columns), n_cols, [&]() {
+    auto const elements = cudf::detail::make_counting_transform_iterator(
+      0, [&](auto row) { return distribution(generator); });
+    if (!nulls) return column_wrapper(elements, elements + n_rows);
+    auto valids =
+      cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 10 != 0; });
+    return column_wrapper(elements, elements + n_rows, valids);
+  });
+
+  std::vector<std::unique_ptr<cudf::column>> cols;
+  std::transform(columns.begin(), columns.end(), std::back_inserter(cols), [](column_wrapper& col) {
+    return col.release();
+  });
+
+  std::vector<std::unique_ptr<cudf::column>> child_cols = std::move(cols);
+  // Let's add some layers; a struct row is null only when its validity draw below is 0
+  for (int i = 0; i < depth; i++) {
+    std::vector<bool> struct_validity;
+    std::uniform_int_distribution<int> bool_distribution(0, 100 * (i + 1));
+    std::generate_n(
+      std::back_inserter(struct_validity), n_rows, [&]() { return bool_distribution(generator); });
+    cudf::test::structs_column_wrapper struct_col(std::move(child_cols), struct_validity);
+    child_cols = std::vector<std::unique_ptr<cudf::column>>{};
+    child_cols.push_back(struct_col.release());
+  }
+
+  // Create the (owning) input table; it converts implicitly to a table_view below
+  auto const input = cudf::table(std::move(child_cols));
+
+  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
+    rmm::cuda_stream_view stream_view{launch.get_stream()};
+    cudf::detail::sorted_order(input, {}, {}, stream_view, rmm::mr::get_current_device_resource());
+  });
+}
+
+NVBENCH_BENCH(nvbench_sort_struct)
+  .set_name("sort_struct")
+  .add_int64_power_of_two_axis("NumRows", {10, 18, 26})
+  .add_int64_axis("Depth", {1, 8})
+  .add_int64_axis("Nulls", {0, 1});