Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor contains_table with cuco::static_set #14064

Merged
merged 28 commits into from
Sep 26, 2023

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Sep 8, 2023

Description

Contributes to #12261

This PR refactors contains_table to use the new cuco::static_set data structure. It also adds a contains_table benchmark to track the performance before and after this work.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. tech debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 8, 2023
@PointKernel PointKernel self-assigned this Sep 8, 2023
@PointKernel PointKernel marked this pull request as ready for review September 13, 2023 22:29
@PointKernel PointKernel requested a review from a team as a code owner September 13, 2023 22:29
@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 13, 2023
Comment on lines 131 to 132
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
needles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
needles
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
needles

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it's worth the effort. I assume we shouldn't rely on any manual alignment for code formatting/linting.

Copy link
Contributor

@ttnghia ttnghia Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual alignment is for dev reading, to enhance aesthetics/readability, not for the generated docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of alignment, I think all lines should start with a single space and an asterisk.
https://github.com/rapidsai/cudf/blob/branch-23.10/cpp/doxygen/developer_guide/DOCUMENTATION.md#block-comments
I think this should apply to blank lines within the block comment as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all lines should start with a single space and an asterisk.

Oh, good catch. Definitely yes, something went wrong with my IDE formatting. Fixed

@github-actions github-actions bot added the CMake CMake build issue label Sep 19, 2023
@PointKernel
Copy link
Member Author

PointKernel commented Sep 19, 2023

The peak memory usage drops about 50% after this PR in all test cases.

As for performance, there is a trade-off between nested and flat types. The legacy map uses a hardcoded CG size = 4 and there is no way to change it on the user side. With the new set data structure and CG size = 4, the new code outperforms the legacy implementation for both int column and list column:

type null_probability table_size Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 0 10000 38.075 us 14.80% 26.917 us 15.44% -11.158 us -29.31% FAIL
I32 0.1 10000 35.490 us 6.37% 25.740 us 7.00% -9.750 us -27.47% FAIL
I32 0 100000 38.219 us 16.96% 28.168 us 6.23% -10.051 us -26.30% FAIL
I32 0.1 100000 40.250 us 10.13% 30.142 us 6.66% -10.108 us -25.11% FAIL
I32 0 1000000 89.140 us 2.51% 67.346 us 2.73% -21.794 us -24.45% FAIL
I32 0.1 1000000 105.566 us 2.34% 83.149 us 4.47% -22.417 us -21.23% FAIL
I32 0 10000000 613.279 us 0.93% 478.395 us 0.83% -134.883 us -21.99% FAIL
I32 0.1 10000000 756.407 us 0.70% 634.450 us 0.68% -121.957 us -16.12% FAIL
cudf::list_view 0 10000 500.887 us 0.51% 488.441 us 0.65% -12.447 us -2.48% FAIL
cudf::list_view 0.1 10000 624.067 us 0.44% 619.780 us 0.36% -4.287 us -0.69% FAIL
cudf::list_view 0 100000 527.369 us 0.54% 513.879 us 0.34% -13.490 us -2.56% FAIL
cudf::list_view 0.1 100000 557.109 us 0.66% 539.606 us 0.60% -17.503 us -3.14% FAIL
cudf::list_view 0 1000000 3.363 ms 0.14% 3.348 ms 0.06% -15.632 us -0.46% FAIL
cudf::list_view 0.1 1000000 4.276 ms 0.15% 4.317 ms 0.07% 41.135 us 0.96% FAIL
cudf::list_view 0 10000000 2.257 ms 0.20% 2.249 ms 0.13% -8.445 us -0.37% FAIL
cudf::list_view 0.1 10000000 2.862 ms 0.14% 2.886 ms 0.12% 24.101 us 0.84% FAIL

If we reduce the CG size to 1, the performance for int column can be further improved to about 3.5x speedups while the performance for list column can be as much as 30% slower than the current implementation:

type null_probability table_size Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 0 10000 37.817 us 15.00% 26.301 us 9.96% -11.516 us -30.45% FAIL
I32 0.1 10000 35.636 us 6.63% 25.913 us 6.84% -9.722 us -27.28% FAIL
I32 0 100000 36.421 us 6.37% 26.731 us 6.48% -9.690 us -26.61% FAIL
I32 0.1 100000 38.886 us 8.68% 28.775 us 6.04% -10.111 us -26.00% FAIL
I32 0 1000000 89.063 us 2.69% 38.619 us 4.26% -50.444 us -56.64% FAIL
I32 0.1 1000000 105.316 us 2.32% 46.594 us 3.75% -58.723 us -55.76% FAIL
I32 0 10000000 612.065 us 0.75% 168.778 us 1.13% -443.287 us -72.42% FAIL
I32 0.1 10000000 757.799 us 0.77% 217.097 us 0.81% -540.702 us -71.35% FAIL
cudf::list_view 0 10000 487.529 us 0.59% 488.617 us 0.56% 1.087 us 0.22% PASS
cudf::list_view 0.1 10000 608.179 us 0.48% 621.207 us 0.32% 13.028 us 2.14% FAIL
cudf::list_view 0 100000 526.683 us 0.52% 630.362 us 0.32% 103.680 us 19.69% FAIL
cudf::list_view 0.1 100000 557.078 us 0.69% 745.262 us 0.50% 188.184 us 33.78% FAIL
cudf::list_view 0 1000000 3.355 ms 0.08% 3.605 ms 0.06% 250.828 us 7.48% FAIL
cudf::list_view 0.1 1000000 4.267 ms 0.08% 4.548 ms 0.33% 281.113 us 6.59% FAIL
cudf::list_view 0 10000000 2.252 ms 0.17% 2.426 ms 0.10% 174.065 us 7.73% FAIL
cudf::list_view 0.1 10000000 2.861 ms 0.19% 3.014 ms 0.10% 153.529 us 5.37% FAIL

@ttnghia
Copy link
Contributor

ttnghia commented Sep 19, 2023

If we reduce the CG size to 1 or 2, the performance for int column can be further improved to about 3.5x speedups while the performance for list column can be as much as 30% slower than the current implementation:

Can we selectively create a map (using separate code path) with different CG if the input is flat/nested?

@PointKernel
Copy link
Member Author

Can we selectively create a map (using separate code path) with different CG if the input is flat/nested?

That's exactly what we are doing in contains_column where it uses cudf multiset for flat types and invokes contains_table if it's a nested type. The cudf multiset doesn't have the CG concept and the performance should be similar to the CG size = 1 implementation with new cuco set.

@PointKernel
Copy link
Member Author

The latest performance with CG tuning based on haystack data type:

type null_probability table_size Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 0 10000 38.075 us 14.80% 26.776 us 14.09% -11.299 us -29.68% FAIL
I32 0.1 10000 35.490 us 6.37% 25.960 us 6.55% -9.530 us -26.85% FAIL
I32 0 100000 38.219 us 16.96% 26.678 us 6.56% -11.541 us -30.20% FAIL
I32 0.1 100000 40.250 us 10.13% 28.822 us 6.07% -11.427 us -28.39% FAIL
I32 0 1000000 89.140 us 2.51% 38.517 us 4.52% -50.623 us -56.79% FAIL
I32 0.1 1000000 105.566 us 2.34% 46.443 us 3.72% -59.123 us -56.01% FAIL
I32 0 10000000 613.279 us 0.93% 169.362 us 2.67% -443.916 us -72.38% FAIL
I32 0.1 10000000 756.407 us 0.70% 215.990 us 0.82% -540.417 us -71.45% FAIL
cudf::list_view 0 10000 500.887 us 0.51% 487.216 us 0.46% -13.671 us -2.73% FAIL
cudf::list_view 0.1 10000 624.067 us 0.44% 618.258 us 0.26% -5.809 us -0.93% FAIL
cudf::list_view 0 100000 527.369 us 0.54% 498.654 us 0.44% -28.715 us -5.44% FAIL
cudf::list_view 0.1 100000 557.109 us 0.66% 539.606 us 0.59% -17.503 us -3.14% FAIL
cudf::list_view 0 1000000 3.363 ms 0.14% 3.340 ms 0.07% -23.658 us -0.70% FAIL
cudf::list_view 0.1 1000000 4.276 ms 0.15% 4.307 ms 0.06% 31.480 us 0.74% FAIL
cudf::list_view 0 10000000 2.257 ms 0.20% 2.233 ms 0.09% -24.332 us -1.08% FAIL
cudf::list_view 0.1 10000000 2.862 ms 0.14% 2.867 ms 0.08% 4.870 us 0.17% FAIL

cpp/src/search/contains_table.cu Outdated Show resolved Hide resolved
helper_func);
} else {
if (cudf::detail::has_nested_columns(needles)) {
dispatch_nan_comparator<false, true>(compare_nulls,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case I'm interested in investigating in the other thread. Is the <false, true> dispatch providing special functionality or performance advantages that can't be achieved with 2 dispatches (<true, true> or <false, false>)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see your point.

Ideally, users should always check nested against nested. Having the <false, true> dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case), and the haystack set insertion would be faster than the <true, true> dispatch.

For my education, in this particular case, I assume <true, true> would work fine but less efficiently since we do redundant nested-related checks for a flat column. But would <false, false> produce proper results?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But would <false, false> produce proper results?

Checked with @divyegala, this will cause runtime errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the <false, true> dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case)

Maybe I'm misunderstanding the problem -- but it seems like a nested needle could never be found in a haystack without nesting. Shouldn't that fail because the types of the needle and haystack don't match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that fail because the types of the needle and haystack don't match?

I'm not sure. The current contains_table allows this combination and returns 0 in this case. I'm trying to align with the existing behavior. Maybe @ttnghia knows more about what is the expected behavior here.

Copy link
Contributor

@ttnghia ttnghia Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the thing: when hashing the column with nulls, the validity (true/false) of the column will be hashed into the output hash:

if (_check_nulls) {
          auto validity_it = detail::make_validity_iterator<true>(curr_col);
          hash             = detail::accumulate(..., hash,...);

When using hash table (map/set), we must ensure that the hash values for rows in both tables are consistent. That means, if one side table has nulls and hashed with nulls, the other side must also be hashed with the same has_nulls. Otherwise, one table is hashed differently than the other and you will not find the correct "contains" output. That's why you see the hashers are constructed using has_any_nulls.

And that is only relevant to hashing.

For equality comparator only (without hashing), we can use different has_nulls for tables.

Copy link
Member Author

@PointKernel PointKernel Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a different issue. Null handling is properly propagated to both self and two-table comparators.

The focus was whether it's legit to have mismatched column types between haystack and needles. After checking the public contains API, I think the answer is clearer now: I've updated the code so now it will throw if column types mismatch. The implementation is therefore simplified as @bdice suggested. Nice catch, thanks! 🙏

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the dispatch logic here. I knew that seemed fishy! Everything looks good here to me. I haven't looked at benchmarks or compile times yet. Anything noteworthy there to consider before merging?

edit: compile time for this PR is 9:10. I don't know how this compares to previous implementations.
image

@davidwendt
Copy link
Contributor

A snapshot of the compile times near the beginning of 23.10 shows src/search/contains_table.cu.o at between 10.5 to 12 minute range when doing a full build. So I think the build time here would be about on par or a little better.

@PointKernel
Copy link
Member Author

PointKernel commented Sep 26, 2023

I haven't looked at benchmarks or compile times yet. Anything noteworthy there to consider before merging?

@davidwendt Thanks for double-checking the build time.

Otherwise, as for benchmark results, we get about 30% to 3.5x speedups for flat columns and about the same performance (slightly better in most cases) for nested ones.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMake changes.

@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 030c0f4 into rapidsai:branch-23.10 Sep 26, 2023
58 checks passed
@PointKernel PointKernel deleted the cuco-contains-table branch September 26, 2023 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants