
Faster struct row comparator #10164

Merged · 58 commits merged into rapidsai:branch-22.04 on Mar 22, 2022
Conversation

@devavret (Contributor) commented Jan 28, 2022

The existing row_lexicographical_comparator cannot compare struct columns, so the current solution is to flatten a struct column with a pre-order traversal. This involves creating a bool column for each struct level. For example, for a struct of the following shape

Struct(1)<int, Struct(2)<float, string>>

we would generate columns like this:

[bool(Struct(1)), int, bool(Struct(2)), float, string]

This is done because struct traversal in the row comparator would otherwise require recursion, which is prohibitively expensive on the GPU because the stack size cannot be determined at compile time. An alternative was also explored as part of this effort [1].

The proposed solution is to "verticalize" (please suggest a better name) the struct columns. This means the struct columns are converted into a format that does not require stack storage, and traversing it needs only a fixed amount of state. For the above example struct, the conversion would yield 3 columns:

[Struct(1)<int>, Struct(1)<Struct(2)<float>>, Struct(1)<Struct(2)<string>>]
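A small, self-contained sketch of this transformation (a toy model for illustration only, not the cudf implementation; `type_node` and the other names are made up here): every root-to-leaf path of the type tree becomes its own single-child chain, reproducing the bracketed list above.

```cpp
// Toy model of "verticalization": each leaf of the nested type tree yields a
// single-child chain from the root down to that leaf.
#include <iostream>
#include <string>
#include <vector>

struct type_node {
  std::string name;                 // e.g. "Struct(1)", "int"
  std::vector<type_node> children;  // empty for non-nested (leaf) types
};

// Host-side recursion is fine here; the point is that the resulting chains can
// later be walked on the GPU without any stack.
void verticalize(type_node const& n, std::vector<std::string> path, std::vector<std::string>& out)
{
  path.push_back(n.name);
  if (n.children.empty()) {
    std::string chain = path.back();
    for (auto it = path.rbegin() + 1; it != path.rend(); ++it) { chain = *it + "<" + chain + ">"; }
    out.push_back(chain);
  }
  for (auto const& child : n.children) { verticalize(child, path, out); }
}

int main()
{
  type_node col{"Struct(1)", {{"int", {}}, {"Struct(2)", {{"float", {}}, {"string", {}}}}}};
  std::vector<std::string> chains;
  verticalize(col, {}, chains);
  for (auto const& c : chains) { std::cout << c << "\n"; }
  // Prints:
  //   Struct(1)<int>
  //   Struct(1)<Struct(2)<float>>
  //   Struct(1)<Struct(2)<string>>
}
```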

Using this with the row comparator required adding a loop that traverses down the hierarchy and checks only for nulls at the struct levels. Since each level of the hierarchy is guaranteed to have exactly one child, no stack is required to keep track of the location in the hierarchy.
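A minimal host-side sketch of that loop (illustrative only; `chain_node`, `null_at`, and `compare_leaf` are hypothetical names, not cudf APIs). Because every verticalized column has exactly one child per level, a plain loop with constant state replaces recursion.

```cpp
#include <cstdint>
#include <functional>

enum class ordering : int8_t { less = -1, equivalent = 0, greater = 1 };

// One level of a verticalized column: either a struct level with exactly one
// child, or the leaf holding the actual values.
struct chain_node {
  bool is_struct;
  chain_node const* child;                         // nullptr at the leaf
  std::function<bool(int)> null_at;                // row -> is this row null at this level?
  std::function<ordering(int, int)> compare_leaf;  // valid only at the leaf
};

ordering compare_rows(chain_node const* col, int lhs_row, int rhs_row)
{
  // Walk straight down the single-child chain; no stack is needed.
  while (col->is_struct) {
    bool const lhs_null = col->null_at(lhs_row);
    bool const rhs_null = col->null_at(rhs_row);
    if (lhs_null or rhs_null) {
      // Equivalent for this column if both rows are null; a lexicographic caller
      // then moves on to the next verticalized column. (null_order handling omitted.)
      if (lhs_null and rhs_null) return ordering::equivalent;
      return lhs_null ? ordering::less : ordering::greater;
    }
    col = col->child;
  }
  return col->compare_leaf(lhs_row, rhs_row);
}
```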

Further, it can be shown that parents that have already appeared once in the transformed columns need not appear again, because in a lexicographic comparison they would already have been compared. Thus the final transformed columns can look like this:

[Struct(1)<int>, Struct(2)<float>, string]

This approach has 2 benefits:

  1. The new transformation does not require extra memory: the new views can be constructed from the data and null mask pointers of the old views.
  2. Because less data is read from device memory, sorting is at least 34% faster, and the gain grows with struct depth. Benchmark arguments: num_rows {1<<24, 1<<26}, depth {1, 8}. Times below are in ms.
Comparing benchmarks/COMPARE_BENCH to benchmarks/COMPARE_BENCH_new
Benchmark                                                     Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable/16777216/1/manual_time                -0.3417         -0.3408            60            39            60            39
Sort<false>/unstable/67108864/1/manual_time                -0.3471         -0.3471           243           159           243           159
Sort<false>/unstable/16777216/8/manual_time                -0.6201         -0.6201           444           169           444           169
Sort<false>/unstable/67108864/8/manual_time                -0.6290         -0.6290          1776           659          1776           659

[1] The alternative was to convert the recursion to iteration by constructing a manually controlled call stack with stack-memory-backed storage. This would be limited by the available stack memory and was found to be more expensive than the current approach. The code for this is in row_operators2.cuh.
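For reference, a rough sketch of what the rejected alternative in [1] generally looks like (the node layout and `MAX_DEPTH` here are illustrative, not the actual contents of row_operators2.cuh): recursion is replaced by a manually managed, fixed-capacity stack, so the maximum nesting depth must be chosen up front.

```cpp
#include <cstdio>

constexpr int MAX_DEPTH    = 8;  // capacity fixed at compile time, which limits nesting depth
constexpr int MAX_CHILDREN = 4;

struct node {
  int value;
  int num_children;
  node const* children[MAX_CHILDREN];
};

// Iterative pre-order traversal using an explicit stack of (node, next-child) frames.
// Assumes the tree is no deeper than MAX_DEPTH.
void visit_preorder(node const* root)
{
  struct frame { node const* n; int next_child; };
  frame stack[MAX_DEPTH];
  int top    = 0;
  stack[top] = {root, 0};
  std::printf("%d\n", root->value);
  while (top >= 0) {
    frame& f = stack[top];
    if (f.next_child < f.n->num_children) {
      node const* child = f.n->children[f.next_child++];
      std::printf("%d\n", child->value);
      stack[++top] = {child, 0};  // "push" instead of a recursive call
    } else {
      --top;                      // "pop" instead of returning
    }
  }
}
```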

API changes

This PR adds an owning type, self_comparator, that takes a table_view, preprocesses it as described above, and stores the device objects needed for comparison. The owning type then provides a functor for use on the device.

Another owning type, preprocessed_table, is added; it can also be constructed from a table_view and performs the same preprocessing. A self_comparator can also be constructed from a preprocessed_table, which is useful when the same preprocessed table is to be shared across multiple comparators.
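A hedged usage sketch of the new types. The `preprocessed_table::create` signature matches the diff discussed below; the namespace and the name of the member that hands back the device functor (`device_comparator` here) are assumptions for illustration, not confirmed API.

```cpp
#include <cudf/table/experimental/row_operators.cuh>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>

void sort_prep_example(cudf::table_view const& input,
                       cudf::host_span<cudf::order const> column_order,
                       cudf::host_span<cudf::null_order const> null_precedence,
                       rmm::cuda_stream_view stream)
{
  namespace lex = cudf::experimental::row::lexicographic;  // assumed namespace

  // Preprocess once; the shared_ptr lets several comparators reuse the same work.
  auto preprocessed = lex::preprocessed_table::create(input, column_order, null_precedence, stream);

  // A self_comparator can be built from the table_view directly or, as here, from
  // an already preprocessed table.
  lex::self_comparator comparator{preprocessed};

  // Obtain the device functor (accessor name assumed) and hand it to a device
  // algorithm, e.g. thrust::sort over a row-index sequence. DYNAMIC{true}
  // conservatively assumes nulls may be present.
  auto d_less = comparator.device_comparator(cudf::nullate::DYNAMIC{true});
  (void)d_less;
}
```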

github-actions bot added the CMake (CMake build issue) and libcudf (Affects libcudf (C++/CUDA) code) labels on Jan 28, 2022
@jrhemstad (Contributor)

> The transform still requires new memory for the purpose of holding the result of superimpose_parent_nulls()

Why do we need superimpose_parent_nulls in verticalization? I recall from when you first showed me this that it wouldn't require allocating new memory. Is this because of stripping off parents that have already appeared so you need to push the nulls down?

@devavret (Contributor, Author)

> Is this because of stripping off parents that have already appeared so you need to push the nulls down?

Yes.

@jrhemstad (Contributor)

> > Is this because of stripping off parents that have already appeared so you need to push the nulls down?
>
> Yes.

What's the overhead of just keeping those parents around?

@devavret (Contributor, Author)

> > > Is this because of stripping off parents that have already appeared so you need to push the nulls down?
> >
> > Yes.
>
> What's the overhead of just keeping those parents around?

It's hard to generalize, but for a struct of depth 8 (S<S<S<S<S<S<S<S<int, int>>>>>>>>) the runtime increases from 770 ms to 1040 ms, compared to 2223 ms for the existing flattening method. (I narrowed the value distribution to create more cases where checking the second column would be required.)

@jrhemstad (Contributor)

> It's hard to generalize, but for a struct of depth 8 (S<S<S<S<S<S<S<S<int, int>>>>>>>>) the runtime increases from 770 ms to 1040 ms.

Does that include the superimpose part of the process? And what's the memory overhead?

codecov bot commented Jan 29, 2022

Codecov Report

Merging #10164 (be37199) into branch-22.04 (0be0b00) will increase coverage by 0.03%.
The diff coverage is 96.06%.

❗ Current head be37199 differs from pull request most recent head 9bfd08e. Consider uploading reports for the commit 9bfd08e to get more accurate results

@@               Coverage Diff                @@
##           branch-22.04   #10164      +/-   ##
================================================
+ Coverage         86.13%   86.17%   +0.03%     
================================================
  Files               139      141       +2     
  Lines             22438    22501      +63     
================================================
+ Hits              19328    19391      +63     
  Misses             3110     3110              
| Impacted Files | Coverage Δ |
|---|---|
| python/dask_cudf/dask_cudf/tests/test_accessor.py | 98.41% <ø> (ø) |
| python/cudf/cudf/core/_base_index.py | 85.92% <50.00%> (-0.51%) ⬇️ |
| python/cudf/cudf/core/column/decimal.py | 91.30% <73.68%> (-1.01%) ⬇️ |
| python/cudf/cudf/core/column/categorical.py | 89.63% <84.61%> (-0.29%) ⬇️ |
| python/dask_cudf/dask_cudf/io/text.py | 85.29% <85.29%> (ø) |
| python/cudf/cudf/core/column/string.py | 88.91% <94.44%> (+0.64%) ⬆️ |
| python/cudf/cudf/core/column/numerical.py | 95.62% <95.83%> (+0.64%) ⬆️ |
| python/cudf/cudf/core/frame.py | 91.84% <96.42%> (+0.12%) ⬆️ |
| python/cudf/cudf/_typing.py | 94.11% <100.00%> (+0.78%) ⬆️ |
| python/cudf/cudf/api/types.py | 89.79% <100.00%> (ø) |
| ... and 32 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61772d8...9bfd08e.

@devavret (Contributor, Author)

> Does that include the superimpose part of the process?

All prep work (superimpose, null mask to bool in the old method, etc.) is a very small part of the total time taken in this sorting use case: 5 ms out of the total 1776 ms for the old method (superimpose + null mask to bool) and 1 ms out of 659 ms for the new method (superimpose only). This may matter more in lighter, non-sort use cases.

> And what's the memory overhead?

It is just the byte size of the struct's null masks. For the same 8-level-deep struct column of length 1<<26, whose size is 320 MB (256 MB of data + 8×8 MB of null masks), we allocate an additional 64 MB, compared to 512 MB for the additional nulls-to-bools columns. But this depends heavily on the struct's shape; essentially, all of the null masks get duplicated.
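For reference, a rough breakdown of where those numbers come from (assuming one validity bit per row per struct level and one byte per row for each generated bool column):

```
null mask per struct level :  2^26 rows / 8 bits per byte =   8 MB
superimposed masks (new)   :  8 levels x 8 MB             =  64 MB extra
bool column per level (old):  2^26 rows x 1 byte          =  64 MB
bool columns total (old)   :  8 levels x 64 MB            = 512 MB extra
```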

I'm thinking about a method that can avoid superimposing nulls entirely, but it would also require a device allocation for a small array of size num_columns.

Resolved (outdated) review threads: cpp/benchmarks/sort/sort_structs.cpp (3), cpp/include/cudf/table/experimental/row_operators.cuh (4)
@devavret requested a review from vyasr on March 21, 2022 08:12
Comment on lines +280 to +283
static std::shared_ptr<preprocessed_table> create(table_view const& table,
host_span<order const> column_order,
host_span<null_order const> null_precedence,
rmm::cuda_stream_view stream);
Contributor

Do we really want to force people to always construct as a shared_ptr?

I suppose if someone is going through the trouble of constructing a preprocessed_table themselves, then that means they are intending to use it in more than one comparator, in which case it will already need to be a shared_ptr.

Contributor Author

Exactly. Why not return a shared_ptr if everything that can accept it needs a shared_ptr?

Comment on lines +186 to +187
return std::shared_ptr<preprocessed_table>(new preprocessed_table(
std::move(d_t), std::move(d_column_order), std::move(d_null_precedence), std::move(d_depths)));
Contributor

make_shared

Contributor Author

Can't do it, for the same reason table_device_view couldn't: make_shared needs a public ctor.
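A minimal illustration of that constraint (using a hypothetical class, not the cudf one): std::make_shared constructs the object outside the class, so it cannot reach a private constructor, whereas a static factory can.

```cpp
#include <memory>

class widget {  // hypothetical stand-in for preprocessed_table
 public:
  static std::shared_ptr<widget> create(int data)
  {
    // OK: a member function can use the private constructor.
    return std::shared_ptr<widget>(new widget(data));
    // return std::make_shared<widget>(data);  // would not compile: make_shared
    // constructs the object itself, outside the class, where the ctor is private.
  }

 private:
  explicit widget(int data) : data_{data} {}
  int data_;
};
```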

* @tparam Nullate A cudf::nullate type describing how to check for nulls.
*/
template <typename Nullate>
class device_row_comparator {
Contributor

Adding element_comparator as a nested type made me realize that device_row_comparator isn't publicly constructible either. That makes me think we should nest it inside of self_comparator, especially since we'll need to add a different version for the non-self comparator.

Contributor Author

I purposely left it out so that it can be used by both the one-table comparator and the two-table comparator. The one-table comparator will use it directly, while the two-table comparator can wrap it in another, index-side-aware class. That wrapper class can live inside the two-table comparator.
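A hedged toy sketch of that index-side-aware wrapper idea (the names and the way indices are routed are illustrative guesses, not the eventual cudf design): strong index types keep callers from mixing up which table a row index belongs to, and the wrapper is the only place that unwraps them before calling the shared comparator.

```cpp
// Strong types so a left-table row index cannot be passed where a right-table
// index is expected, and vice versa.
struct lhs_index_type { int value; };
struct rhs_index_type { int value; };

template <typename BaseComparator>
struct side_aware_comparator {
  BaseComparator base;  // a device_row_comparator-style functor built over both tables

  // In cudf this would be a __device__ call operator; shown here as plain C++.
  bool operator()(lhs_index_type lhs, rhs_index_type rhs) const
  {
    return base(lhs.value, rhs.value);
  }
};
```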

@vyasr (Contributor) left a comment

A couple of minor suggestions; otherwise this LGTM now! Since it's still experimental and verticalization is an internal implementation detail, I am fine merging it as is and revisiting naming later in order to get ahead of burndown.

Resolved (outdated) review thread: cpp/include/cudf/table/experimental/row_operators.cuh
Comment on lines +162 to +175
CUDF_EXPECTS(c.type().id() != type_id::LIST,
"Cannot lexicographic compare a table with a LIST column");
if (not is_nested(c.type())) {
CUDF_EXPECTS(is_relationally_comparable(c.type()),
"Cannot lexicographic compare a table with a column of type " +
jit::get_type_name(c.type()));
}
for (auto child = c.child_begin(); child < c.child_end(); ++child) {
check_column(*child);
}
};
for (column_view const& c : input) {
check_column(c);
}
Contributor

Minor nit: it might be cleaner to implement check_column as a function returning a bool (maybe renamed to is_lex_compatible or so) instead of one that throws. Then you could use std::all_of (both for the main invocation and for the recursive invocations inside the function) and call CUDF_EXPECTS if the function returns false. Not a blocker for merge, but it might be nice for future work.

Contributor Author

I did consider it. But I wanted to have descriptive exceptions for exactly why this table is incompatible.
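For reference, a rough sketch of the refactor suggested above (assuming a standard algorithm such as std::all_of was meant); as the reply notes, it trades the descriptive per-column error messages for a simpler structure.

```cpp
#include <algorithm>

#include <cudf/column/column_view.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>

// True if the column (and all of its children) can participate in a lexicographic compare.
bool is_lex_compatible(cudf::column_view const& c)
{
  if (c.type().id() == cudf::type_id::LIST) return false;
  if (not cudf::is_nested(c.type()) and not cudf::is_relationally_comparable(c.type())) return false;
  return std::all_of(c.child_begin(), c.child_end(), [](cudf::column_view const& child) {
    return is_lex_compatible(child);
  });
}

void check_lex_compatibility(cudf::table_view const& input)
{
  CUDF_EXPECTS(std::all_of(input.begin(),
                           input.end(),
                           [](cudf::column_view const& c) { return is_lex_compatible(c); }),
               "Table contains a column that cannot be lexicographically compared");
}
```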

@devavret requested a review from jrhemstad on March 21, 2022 19:37
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@devavret (Contributor, Author)

@gpucibot merge

rapids-bot merged commit e7dba35 into rapidsai:branch-22.04 on Mar 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
…0483)

Includes `<cstddef>` for `ptrdiff_t` in `parquet/compact_protocol_reader.hpp`. Compilation fails on GCC 11 without this include. Targeting 22.04 since this was broken yesterday in #10063.

Error output:
```
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:51:17: error: 'ptrdiff_t' does not name a type
   51 |   [[nodiscard]] ptrdiff_t bytecount() const noexcept { return m_cur - m_base; }
      |
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:22:1: note: 'ptrdiff_t' is defined in header '<cstddef>'; did you forget to '#include <cstddef>'?
```

Also includes `<optional>` in `cpp/include/cudf/table/experimental/row_operators.cuh`, which was broken by #10164.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10483
Labels: 2 - In Progress, CMake, improvement, libcudf, non-breaking, Performance