Adding support for equi-join on struct #7720

Merged on Apr 1, 2021 (196 commits)
Changes from 193 commits
4a4b4af
Merge branch 'branch-0.17' into branch-0.18
shwina Dec 11, 2020
223f2b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 15, 2020
abd6ad2
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 17, 2020
18863b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 4, 2021
0fbdd31
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
dc9b943
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
d586aa7
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 7, 2021
996fda8
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 8, 2021
2808a5c
Add a compute_hash_join_indices that returns just the join indices
shwina Jan 11, 2021
ef0baee
Don't need common_columns stuff for join that returns a gathermap
shwina Jan 11, 2021
18f3074
Add hash_join_impl methods that return gathermaps
shwina Jan 11, 2021
70abf48
Add overloads to public hash_join class
shwina Jan 11, 2021
13dff67
Add top-level join APIs that return gathermaps
shwina Jan 11, 2021
3300fe1
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 12, 2021
7ed694c
Use device_uvector instead of device_vector in join
shwina Jan 12, 2021
636c2ea
Undo some API changes
shwina Jan 12, 2021
b79da68
Add join_result
shwina Jan 13, 2021
380aa59
Add APIs that return join_result
shwina Jan 13, 2021
3cbb2b4
Remove column_in_common
shwina Jan 13, 2021
53ae7c9
Add an inner join API that returns gathermaps
shwina Jan 14, 2021
fde172b
Add remaining APIs to return gathermaps
shwina Jan 14, 2021
4a286dd
Add gathermap join test
shwina Jan 18, 2021
c756db9
Replace -1 with INT_MIN
shwina Jan 18, 2021
6a3d23e
Make join_result columns instead of column_views
shwina Jan 20, 2021
5dfc2a0
Replace join_result with a pair of columns
shwina Jan 20, 2021
362829b
Add gathermap test for outer join
shwina Jan 20, 2021
4e4380c
Add and pass full join gathermap test
shwina Jan 20, 2021
339a13d
Begin Python-side refactor
shwina Jan 21, 2021
2b07802
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 25, 2021
0d5a19c
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 28, 2021
fdbdc12
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 1, 2021
5dd5d29
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 5, 2021
6b20429
Merge branch 'branch-0.19' into gathermap-based-join-apis
shwina Feb 8, 2021
044eac1
Add left_semi and left_anti join APIs that return gathermaps
shwina Feb 8, 2021
555d5ec
Add Cython bindings
shwina Feb 8, 2021
56ae616
full -> outer
shwina Feb 9, 2021
dd05121
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 9, 2021
d447924
Progress
shwina Feb 9, 2021
484512e
More progress on py refactor
shwina Feb 9, 2021
5227582
Remove breakpoint
shwina Feb 10, 2021
9cd870e
Fix neg index handling
shwina Feb 10, 2021
8e4f193
Use nullify gather in join
shwina Feb 10, 2021
29fe140
Handle outer joins better
shwina Feb 10, 2021
b634055
Fix index construction
shwina Feb 10, 2021
cd53d6c
Fix sorting behaviour
shwina Feb 10, 2021
75f1efd
Fix Index.join
shwina Feb 10, 2021
1f5d6ad
Progress on semi/anti joins
shwina Feb 10, 2021
de30520
Add simple join test
shwina Feb 10, 2021
66a0de5
Semi-join fix
shwina Feb 11, 2021
ca72295
Only combine key columns in outer join if they have the same name
shwina Feb 11, 2021
ee2242d
Handle when both _on and _index are provided
shwina Feb 11, 2021
e531725
Fix sorting join result
shwina Feb 11, 2021
c8b4948
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 11, 2021
674095c
whitespace
shwina Feb 12, 2021
cbd9dc3
Make construct_join_output_df work with column views
shwina Feb 12, 2021
3f3c3cb
Get rid of hash_join::left_join
shwina Feb 12, 2021
01415fc
More join C++ cleanup
shwina Feb 12, 2021
6185492
Even more cleaning
shwina Feb 17, 2021
d736d1c
More join tests
shwina Feb 18, 2021
b58591d
Fix all join tests
shwina Feb 18, 2021
be560bb
Python regressions
shwina Feb 18, 2021
efb60d6
Revert
shwina Feb 18, 2021
fe6d0b8
Invalid -> Unkown
shwina Feb 18, 2021
547027c
Don't mutate lhs/rhs
shwina Feb 18, 2021
5f93d23
Fix join tests
shwina Feb 19, 2021
b7bf821
Fix semi/anti join trivial cases
shwina Feb 19, 2021
50a2fb2
When testing join results, use a helper that sorts values
shwina Feb 19, 2021
ff0ae79
Totally broken commit
shwina Feb 19, 2021
07cd052
Cleanup
shwina Feb 20, 2021
bd6bf77
Warnings
shwina Feb 20, 2021
a40063e
Cleanup
shwina Feb 22, 2021
ccef9d0
Cleanup
shwina Feb 22, 2021
210244b
Cleanup
shwina Feb 22, 2021
b57348c
Add typing for join helpers
shwina Feb 22, 2021
e19c30c
add struct_compare at row_lexicographic_comparator
karthikeyann Feb 22, 2021
19310e5
add unit test to sort with struct column
karthikeyann Feb 22, 2021
5c2c9b3
Typing for Join class
shwina Feb 22, 2021
558aa15
Simplify joiner API
shwina Feb 22, 2021
3184896
Example doc
shwina Feb 22, 2021
d3535dc
Refactor join APIs to return a device_uvector
shwina Feb 25, 2021
f2e55a7
remove nullable in is_null condition
karthikeyann Feb 26, 2021
0d30d8a
Add null_compare (DRY)
karthikeyann Feb 26, 2021
3b0a2a5
Merge tag 'branch-0.19-latest' of https://github.com/rapidsai/cudf in…
shwina Mar 1, 2021
b82181d
docs
shwina Mar 3, 2021
77d2bfd
Finish up docs?
shwina Mar 3, 2021
bb64b06
remove stale comment
karthikeyann Mar 4, 2021
a79a92a
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
karthikeyann Mar 4, 2021
0bf34e8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 4, 2021
26a3fb0
Fix join tests
shwina Mar 4, 2021
8a60d62
Refactor join APIs to work with unique_ptr<rmm::device_uvector>>
shwina Mar 5, 2021
387a953
Update join Cython
shwina Mar 5, 2021
6cd6433
Need to resize the gathermap
shwina Mar 5, 2021
c67dcce
Doc
shwina Mar 5, 2021
30c22ed
Changelog
shwina Mar 5, 2021
2723ffb
bypass single nested type column to normal sort
karthikeyann Mar 8, 2021
69f37d0
add support for nested struct columns with depth>1
karthikeyann Mar 8, 2021
a8ad736
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
karthikeyann Mar 9, 2021
245116b
add struct sort tests
karthikeyann Mar 9, 2021
f73199d
Add helper to convert gather_map_type->Column
shwina Mar 9, 2021
393c06a
Update python/cudf/cudf/core/frame.py
shwina Mar 9, 2021
e91f554
Cannot specify both column and index
shwina Mar 9, 2021
0185896
Vaildate how
shwina Mar 9, 2021
b232f85
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 9, 2021
1eb495d
Can't use a set
shwina Mar 9, 2021
4f1f072
Avoid function local import
shwina Mar 10, 2021
4aa8fec
False -> NotImplementedError
shwina Mar 10, 2021
ae0e5f9
Update cpp/include/cudf/join.hpp
shwina Mar 10, 2021
f47cf7e
Reuse some join logic
shwina Mar 10, 2021
2a201c3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 10, 2021
230ca08
Formatting
shwina Mar 10, 2021
498a621
Update cpp/include/cudf/join.hpp
shwina Mar 11, 2021
2de26f3
Docs?
shwina Mar 11, 2021
d6f128c
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 11, 2021
b7d8d8a
Use mr
shwina Mar 11, 2021
9efc761
Docs
shwina Mar 15, 2021
bc598a1
remove struct comparator support, add flatten_table
karthikeyann Mar 16, 2021
b1add6a
review comments update
karthikeyann Mar 16, 2021
8779bc7
Simplify suffix handling
shwina Mar 16, 2021
8116c49
fix sliced struct column sort, add unit tests
karthikeyann Mar 16, 2021
496c9ca
move flatten_talbe to structs/utilities.hpp
karthikeyann Mar 16, 2021
aebf4ec
style fixes
karthikeyann Mar 16, 2021
4c651ac
Simplify joiner requirements
shwina Mar 17, 2021
b4f4d7c
Do less work in SemiJoin._merge_results
shwina Mar 17, 2021
d353c92
Doc
shwina Mar 17, 2021
580a346
Doc
shwina Mar 17, 2021
328dafd
Return None from semi_join
shwina Mar 17, 2021
297d20a
Init common_type
shwina Mar 17, 2021
7621d5e
add is_sorted struct support, reorder tuple returned
karthikeyann Mar 19, 2021
320c94b
add is_sorted struct unit tests
karthikeyann Mar 19, 2021
50b862f
add logic_error in row_lexicographic_comparator constructor if columns
karthikeyann Mar 22, 2021
4ce4bae
update copyright year
karthikeyann Mar 23, 2021
76894e1
Partial clean up of ORC writer (#7324)
vuule Mar 4, 2021
cc73af6
Java cleaner synchronization (#7474)
abellina Mar 4, 2021
213f1ad
Change jit launch to safe_launch (#7510)
devavret Mar 4, 2021
d77a393
Add cython for converting strings/fixed-point functions (#7429)
davidwendt Mar 4, 2021
94dd756
JNI: Support skipping nulls for collect aggregation (#7457)
firestarman Mar 5, 2021
520d92c
Rename ARROW_STATIC_LIB because it conflicts with one in FindArrow.cm…
trxcllnt Mar 6, 2021
6f6b5ab
Statistics cleanup (#7439)
kaatish Mar 6, 2021
b7d9d7d
Update JNI build to use CUDF_USE_ARROW_STATIC (#7526)
jlowe Mar 6, 2021
ec90eff
FIX Remove random build directory generation for ccache (#7508)
dillon-cullinan Mar 8, 2021
7b02cb1
Java support for casting of nested child columns (#7417)
razajafri Mar 8, 2021
e6659ca
bitmask_or implementation with bitmask refactor (#7406)
rwlee Mar 8, 2021
6b5b477
Resolving unlinked type shorthands in cudf doc (#7416)
isVoid Mar 8, 2021
c8a8669
Change dask and distributed branch to main (#7532)
dantegd Mar 8, 2021
4f702a0
Add gbenchmarks for strings extract function (#7522)
davidwendt Mar 9, 2021
2d5d5d5
Reduce compile time/size for scan.cu (#7516)
davidwendt Mar 9, 2021
8e76075
fix missing renames of dask git branches from master to main (#7535)
Mar 9, 2021
0784a31
Remove detail from device_span (#7533)
rwlee Mar 9, 2021
75f4db8
Make sure rmm::rmm CMake target is visibile to cudf users (#7524)
robertmaynard Mar 9, 2021
1f89144
FIX Retry conda output location (#7540)
dillon-cullinan Mar 9, 2021
b301977
Decimal32 Build Fix (#7544)
razajafri Mar 9, 2021
22604d7
Update missing docstring examples in python public APIs (#7546)
galipremsagar Mar 10, 2021
af91aca
Enable type conversion from float to decimal type (#7450)
ChrisJar Mar 10, 2021
2f6e019
Update Changelog Link (#7550)
ajschmidt8 Mar 10, 2021
34f6de8
Fix contiguous_split not properly handling output partitions > 2 GB. …
nvdbaranec Mar 10, 2021
4a0be16
FIX Revert gpuci_conda_retry on conda file output locations (#7552)
dillon-cullinan Mar 10, 2021
2818928
Add `Series.drop` api (#7304)
isVoid Mar 10, 2021
0e2736a
Support `Series.__setitem__` with key to a new row (#7443)
isVoid Mar 10, 2021
4d1812f
Fix offset_end iterator for lists_column_view, which was not correctl…
ttnghia Mar 10, 2021
e10fc49
Fix no such file dlpack.h error when build libcudf (#7549)
chenrui17 Mar 10, 2021
8ea0a7f
Fix index mismatch issue in equality related APIs (#7555)
galipremsagar Mar 10, 2021
0d8db61
FIX Fix Anaconda upload args (#7558)
dillon-cullinan Mar 11, 2021
fa66823
Change device_vector to device_uvector in nvtext source files (#7512)
davidwendt Mar 11, 2021
d0f6c3c
Removed unneeded includes from traits.hpp (#7509)
davidwendt Mar 11, 2021
b7dd2cd
Fix cudf::lists::sort_lists failing for sliced column (#7564)
ttnghia Mar 11, 2021
5752ad3
Remove unneeded step parameter from strings::detail::copy_slice (#7525)
davidwendt Mar 11, 2021
e9e70c1
Another fix for offsets_end() iterator in lists_column_view (#7575)
ttnghia Mar 12, 2021
e373a68
Implement drop_list_duplicates (#7528)
ttnghia Mar 12, 2021
5c5beb1
Fix ORC writer output corruption with string columns (#7565)
vuule Mar 12, 2021
66beb63
Fix missing Dask imports (#7580)
Mar 12, 2021
7d3420a
Add `__repr__` for Column and ColumnAccessor (#7531)
shwina Mar 12, 2021
9db65f8
`fixed_point` + `cudf::binary_operation` API Changes (#7435)
codereport Mar 12, 2021
29f7092
CMAKE_CUDA_ARCHITECTURES doesn't change when build-system invokes cma…
robertmaynard Mar 13, 2021
89377dc
ENH Fix stale GHA and prevent duplicates (#7594)
mike-wendt Mar 14, 2021
cebc67e
Revert "ENH Fix stale GHA and prevent duplicates (#7594)" (#7595)
mike-wendt Mar 14, 2021
a7ff744
Use device_uvector, device_span in sort groupby (#7523)
karthikeyann Mar 15, 2021
6564581
Fix ORC issue with incorrect timestamp nanosecond values (#7581)
vuule Mar 15, 2021
234c562
review comments (code_report)
karthikeyann Mar 23, 2021
106f13b
style fix
karthikeyann Mar 23, 2021
f7ab0d2
name with non-ASCII UTF-8 char
karthikeyann Mar 23, 2021
a83c5d5
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
karthikeyann Mar 23, 2021
fcac9ef
merged in join rework
hyperbolic2346 Mar 23, 2021
11b4e8d
add is_relationally_comparable(table_view), review comments
karthikeyann Mar 24, 2021
e76a338
copyright year update
karthikeyann Mar 24, 2021
728ea01
Merge remote-tracking branch 'karthik/fea-sort_struct' into mwilson/s…
hyperbolic2346 Mar 25, 2021
541c38c
Adding support for equijoin on structs
hyperbolic2346 Mar 25, 2021
8f0f2c2
Updating test
hyperbolic2346 Mar 25, 2021
afc4925
Merge branch 'branch-0.19' into mwilson/struct_join
hyperbolic2346 Mar 26, 2021
5154d4e
adding some more struct join tests
hyperbolic2346 Mar 29, 2021
65fcc0e
Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/stru…
hyperbolic2346 Mar 29, 2021
b6fe885
Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/stru…
hyperbolic2346 Mar 30, 2021
2dae5a4
reverting files from branch merging
hyperbolic2346 Mar 30, 2021
3fe20de
fixing merge issue with tests
hyperbolic2346 Mar 30, 2021
681b7af
moving flatten into the hash_join object
hyperbolic2346 Mar 31, 2021
d9a7f52
Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/stru…
hyperbolic2346 Mar 31, 2021
b946217
removing semi_join special code
hyperbolic2346 Mar 31, 2021
22 changes: 19 additions & 3 deletions cpp/src/join/join.cu
@@ -20,6 +20,7 @@
 #include <cudf/dictionary/detail/update_keys.hpp>
 #include <cudf/join.hpp>
 #include <cudf/table/table.hpp>
+#include <structs/utilities.hpp>

 #include <rmm/cuda_stream_view.hpp>

@@ -34,10 +35,15 @@ inner_join(table_view const& left_input,
 rmm::cuda_stream_view stream,
 rmm::mr::device_memory_resource* mr)
 {
+// flatten any structs out. Note this happens before dictionary matching because
+// structs can contain dictionaries.
+auto const flattened_left = structs::detail::flatten_nested_columns(left_input, {}, {});
+auto const flattened_right = structs::detail::flatten_nested_columns(right_input, {}, {});
+
 // Make sure any dictionary columns have matched key sets.
 // This will return any new dictionary columns created as well as updated table_views.
 auto matched = cudf::dictionary::detail::match_dictionaries(
-{left_input, right_input},
+{std::get<0>(flattened_left), std::get<0>(flattened_right)},
 stream,
 rmm::mr::get_current_device_resource()); // temporary objects returned

@@ -101,10 +107,15 @@ left_join(table_view const& left_input,
 rmm::cuda_stream_view stream,
 rmm::mr::device_memory_resource* mr)
 {
+// flatten any structs out. Note this happens before dictionary matching because
+// structs can contain dictionaries.
+auto const flattened_left = structs::detail::flatten_nested_columns(left_input, {}, {});
+auto const flattened_right = structs::detail::flatten_nested_columns(right_input, {}, {});
+
 // Make sure any dictionary columns have matched key sets.
 // This will return any new dictionary columns created as well as updated table_views.
 auto matched = cudf::dictionary::detail::match_dictionaries(
-{left_input, right_input}, // these should match
+{std::get<0>(flattened_left), std::get<0>(flattened_right)}, // these should match
 stream,
 rmm::mr::get_current_device_resource()); // temporary objects returned
 // now rebuild the table views with the updated ones
@@ -164,10 +175,15 @@ full_join(table_view const& left_input,
 rmm::cuda_stream_view stream,
 rmm::mr::device_memory_resource* mr)
 {
+// flatten any structs out. Note this happens before dictionary matching because
+// structs can contain dictionaries.
+auto const flattened_left = structs::detail::flatten_nested_columns(left_input, {}, {});
+auto const flattened_right = structs::detail::flatten_nested_columns(right_input, {}, {});
+
 // Make sure any dictionary columns have matched key sets.
 // This will return any new dictionary columns created as well as updated table_views.
 auto matched = cudf::dictionary::detail::match_dictionaries(
-{left_input, right_input}, // these should match
+{std::get<0>(flattened_left), std::get<0>(flattened_right)}, // these should match
 stream,
 rmm::mr::get_current_device_resource()); // temporary objects returned
 // now rebuild the table views with the updated ones
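
The hunks above flatten struct columns into their child columns before the join keys are compared. As a rough illustration of that idea only, here is a minimal host-side sketch in standard C++. It does not use libcudf: `Row`, `Flattened`, `flatten`, and `inner_join_flattened` are hypothetical names invented for this example, and the choice to compare the validity flag like any other key column (so two null structs match) is an assumption of the sketch, not a statement about cudf's null handling.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical struct key with two fields plus a validity flag (false = null struct).
struct Row {
  bool valid;
  int id;
  std::string tag;
};

// Flattened form: one plain column per field, plus a validity column.
struct Flattened {
  std::vector<bool> valid;
  std::vector<int> id;
  std::vector<std::string> tag;
};

// Split a column of structs into per-field columns.
// Field values of null rows are normalized so that they cannot affect comparison.
Flattened flatten(std::vector<Row> const& col)
{
  Flattened out;
  for (Row const& row : col) {
    out.valid.push_back(row.valid);
    out.id.push_back(row.valid ? row.id : 0);
    out.tag.push_back(row.valid ? row.tag : std::string());
  }
  return out;
}

using RowKey = std::tuple<bool, int, std::string>;

// Build the composite key for row i from the flattened columns.
RowKey make_key(Flattened const& f, std::size_t i)
{
  return RowKey(f.valid[i], f.id[i], f.tag[i]);
}

// Inner equi-join on the flattened columns.
// Returns (left_row, right_row) index pairs, ordered by left row index.
std::vector<std::pair<std::size_t, std::size_t>> inner_join_flattened(
  std::vector<Row> const& left, std::vector<Row> const& right)
{
  Flattened const l = flatten(left);
  Flattened const r = flatten(right);
  assert(l.valid.size() == left.size() && r.valid.size() == right.size());

  // Build side: index every right row by its flattened key.
  std::multimap<RowKey, std::size_t> build;
  for (std::size_t j = 0; j < right.size(); ++j) {
    build.insert(std::make_pair(make_key(r, j), j));
  }

  // Probe side: look up each left row's flattened key.
  std::vector<std::pair<std::size_t, std::size_t>> result;
  for (std::size_t i = 0; i < left.size(); ++i) {
    auto const range = build.equal_range(make_key(l, i));
    for (auto it = range.first; it != range.second; ++it) {
      result.push_back(std::make_pair(i, it->second));
    }
  }
  return result;
}
```

Once the struct is split into ordinary columns, the join itself needs no struct-specific comparison logic, which is the property the flattening step in the diff relies on.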
9 changes: 8 additions & 1 deletion cpp/src/join/semi_join.cu
@@ -28,6 +28,7 @@
 #include <cudf/scalar/scalar_factories.hpp>
 #include <cudf/table/table.hpp>
 #include <cudf/utilities/error.hpp>
+#include <structs/utilities.hpp>

 #include <rmm/cuda_stream_view.hpp>
 #include <rmm/device_uvector.hpp>
@@ -173,10 +174,16 @@ std::unique_ptr<cudf::table> left_semi_anti_join(
 return std::make_unique<table>(left, stream, mr);
 }

+// flatten any structs out. Note this happens before dictionary matching because
+// structs can contain dictionaries.
+auto const flattened_left = structs::detail::flatten_nested_columns(left.select(left_on), {}, {});
+auto const flattened_right =
+structs::detail::flatten_nested_columns(right.select(right_on), {}, {});
+
 // Make sure any dictionary columns have matched key sets.
 // This will return any new dictionary columns created as well as updated table_views.
 auto matched = cudf::dictionary::detail::match_dictionaries(
-{left.select(left_on), right.select(right_on)},
+{std::get<0>(flattened_left), std::get<0>(flattened_right)},
 stream,
 rmm::mr::get_current_device_resource()); // temporary objects returned
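
In the semi/anti join hunk, only the key columns picked by `left_on` and `right_on` are selected before flattening and matching. The sketch below shows the general shape of a left semi or left anti join over selected key columns in plain C++. It is an illustration under stated assumptions, not the libcudf implementation: `Table`, `project`, `JoinKind`, and `left_semi_anti_indices` are hypothetical names, the tables are row-major vectors of `int`, and nulls are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical row-major table of int columns: table[row][column].
using Table = std::vector<std::vector<int>>;

// Project one row onto the requested key columns (loosely analogous to select()).
std::vector<int> project(std::vector<int> const& row, std::vector<std::size_t> const& on)
{
  std::vector<int> key;
  for (std::size_t c : on) { key.push_back(row.at(c)); }
  return key;
}

enum class JoinKind { LeftSemi, LeftAnti };

// Returns the indices of the left rows to keep:
//   LeftSemi: rows whose key appears in the right table
//   LeftAnti: rows whose key does not appear in the right table
std::vector<std::size_t> left_semi_anti_indices(Table const& left,
                                                Table const& right,
                                                std::vector<std::size_t> const& left_on,
                                                std::vector<std::size_t> const& right_on,
                                                JoinKind kind)
{
  assert(left_on.size() == right_on.size());  // key columns must pair up

  // Collect the distinct keys of the right table.
  std::set<std::vector<int>> right_keys;
  for (auto const& row : right) { right_keys.insert(project(row, right_on)); }

  // Keep or drop each left row depending on whether its key was found.
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < left.size(); ++i) {
    bool const found = right_keys.count(project(left[i], left_on)) > 0;
    if (found == (kind == JoinKind::LeftSemi)) { kept.push_back(i); }
  }
  return kept;
}
```

Because a semi or anti join only ever returns left rows, the right table is needed just for key membership, so projecting both sides down to the key columns first keeps the matching step small.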
