ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP] #9213

Dandandan · 2021-01-16T10:16:23Z

This is a WIP PR for implementing a similar approach to hashing as used in the hash join.
For the hash-aggregate heavy query TCPH query 1 this speeds it up (without collision checking, although that should not significantly impact performance) by ~30%.

TODO:

Implement collision checking
Add test for collisions
Move some code to hash utils
Implement hashing for f32 / f64 (a PR has been merged recently with group by f32/f64 support)

Benchmark results
PR

Query 1 iteration 0 took 457.0 ms
Query 1 iteration 1 took 459.7 ms
Query 1 iteration 2 took 459.3 ms
Query 1 iteration 3 took 461.1 ms
Query 1 iteration 4 took 456.8 ms
Query 1 iteration 5 took 460.6 ms
Query 1 iteration 6 took 462.0 ms
Query 1 iteration 7 took 462.3 ms
Query 1 iteration 8 took 461.0 ms
Query 1 iteration 9 took 466.4 ms
Query 1 avg time: 460.63 ms

Vectorized hashing branch:

Query 1 iteration 0 took 650.0 ms
Query 1 iteration 1 took 648.5 ms
Query 1 iteration 2 took 646.8 ms
Query 1 iteration 3 took 646.2 ms
Query 1 iteration 4 took 645.7 ms
Query 1 iteration 5 took 643.0 ms
Query 1 iteration 6 took 649.5 ms
Query 1 iteration 7 took 649.5 ms
Query 1 iteration 8 took 643.4 ms
Query 1 iteration 9 took 643.6 ms
Query 1 avg time: 646.63 ms

Use unique_ptr to hold FunctionOptions derived classes instances to fix `invalid-offsetof` python build warnings. ~/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: warning: offsetof within non-standard-layout type ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof] x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__); Closes apache#9274 from cyb70289/offsetof Authored-by: Yibo Cai <yibo.cai@arm.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

@mbrubeck

… -97% for length) # Rational Rust forbids safely accessing uninitialized memory because it is undefined behavior. However, when building `Buffer`s, it is important to be able to _write_ to uninitialized memory regions, thereby avoiding the need to write _something_ to it before using it. Currently, all our initializations are zeroed, which is expensive. apache#9076 modifies our allocator to allocate uninitialized regions. However, by itself, this is not useful if we do not offer any methods to write to those (uninitialized) regions. # This PR This PR is built on top of apache#9076 and introduces methods to extend a `MutableBuffer` from an iterator, thereby offering an API to efficiently grow `MutableBuffer` without having to initialize memory regions with zeros (i.e. without `with_bitset` and the like). This PR also introduces methods to create a `Buffer` from an iterator and a `trusted_len` iterator. The design is inspired in `Vec`, with the catch that we use stable Rust (i.e. no trait specialization, no `TrustedLen`), and thus have to expose a bit more methods than what `Vec` exposes. This means that we can't use that (nicer) API for trustedlen iterators based on `collect()`. Note that, as before, there are still `unsafe` uses of the `MutableBuffer` derived from the fact that it is not a generic over a type `T` (and thus people can mix types and grow the buffer in unsound ways). Special thanks to @mbrubeck for all the help on this, originally discussed [here](https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/6). ```bash git checkout master cargo bench --bench arithmetic_kernels git checkout length_faster cargo bench --bench arithmetic_kernels git checkout 16bc720 cargo bench --bench length_kernel git checkout length_faster ``` ``` Switched to branch 'length_faster' (use "git pull" to merge the remote branch into yours) Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow) Finished bench [optimized] target(s) in 1m 02s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-ec2cc20ce07d9b83 Gnuplot not found, using plotters backend add 512 time: [522.24 ns 523.67 ns 525.26 ns] change: [-21.738% -20.960% -20.233%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 9 (9.00%) high mild 3 (3.00%) high severe subtract 512 time: [503.18 ns 504.93 ns 506.81 ns] change: [-21.741% -21.075% -20.308%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe multiply 512 time: [508.25 ns 512.04 ns 516.06 ns] change: [-22.569% -21.946% -21.305%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild divide 512 time: [1.4711 us 1.4753 us 1.4799 us] change: [-24.783% -23.305% -22.176%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) high mild 3 (3.00%) high severe limit 512, 512 time: [373.47 ns 377.76 ns 382.21 ns] change: [+3.3055% +4.4193% +5.5923%] (p = 0.00 < 0.05) Performance has regressed. add_nulls_512 time: [502.94 ns 504.51 ns 506.28 ns] change: [-24.876% -24.299% -23.709%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high severe divide_nulls_512 time: [1.4843 us 1.4931 us 1.5053 us] change: [-22.968% -22.243% -21.420%] (p = 0.00 < 0.05) Performance has improved. Found 24 outliers among 100 measurements (24.00%) 15 (15.00%) low mild 1 (1.00%) high mild 8 (8.00%) high severe ``` Length (against the commit that fixes the bench, `16bc7200f3baa6e526aea7135c60dcc949c9b592`, not master): ``` length time: [1.5379 us 1.5408 us 1.5437 us] change: [-97.311% -97.295% -97.278%] (p = 0.00 < 0.05) Performance has improved. Found 12 outliers among 100 measurements (12.00%) 1 (1.00%) low severe 4 (4.00%) low mild 3 (3.00%) high mild 4 (4.00%) high severe ``` Closes apache#9235 from jorgecarleitao/length_faster Lead-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

…atch, remove test::format_batch The `test::format_batch` function does not have wide range of type support (e.g. it doesn't support dictionaries) and its output makes tests hard to read / update, in my opinion. This PR consolidates the datafusion tests to use `arrow::util::pretty::pretty_format_batches` both to reduce code duplication as well as increase type support This PR removes the `test::format_batch(&batch);` function and replaces it with `arrow::util::pretty::pretty_format_batches` and some macros. It has no code changes. This change the following benefits: 1. Better type support (I immediately can compare RecordBatches with `Dictionary` types in tests without having to update `format_batch` and apache#9233 gets simpler) 2. Better readability and error reporting (at least I find the code and diffs easier to understand) 3. Easier test update / review: it is easier to update the diffs (you can copy/paste the test output into the source code) and to review them This is a variant of a strategy that I been using with success in IOx [source link](https://github.com/influxdata/influxdb_iox/blob/main/arrow_deps/src/test_util.rs#L15) and I wanted to contribute it back. An example failure with this PR: ``` ---- physical_plan::hash_join::tests::join_left_one stdout ---- thread 'physical_plan::hash_join::tests::join_left_one' panicked at 'assertion failed: `(left == right)` left: `["+----+----+----+----+", "| a1 | b2 | c1 | c2 |", "+----+----+----+----+", "| 1 | 1 | 7 | 70 |", "| 2 | 2 | 8 | 80 |", "| 2 | 2 | 9 | 80 |", "+----+----+----+----+"]`, right: `["+----+----+----+----+----+", "| a1 | b1 | c1 | a2 | c2 |", "+----+----+----+----+----+", "| 1 | 4 | 7 | 10 | 70 |", "| 2 | 5 | 8 | 20 | 80 |", "| 3 | 7 | 9 | | |", "+----+----+----+----+----+"]`: expected: [ "+----+----+----+----+", "| a1 | b2 | c1 | c2 |", "+----+----+----+----+", "| 1 | 1 | 7 | 70 |", "| 2 | 2 | 8 | 80 |", "| 2 | 2 | 9 | 80 |", "+----+----+----+----+", ] actual: [ "+----+----+----+----+----+", "| a1 | b1 | c1 | a2 | c2 |", "+----+----+----+----+----+", "| 1 | 4 | 7 | 10 | 70 |", "| 2 | 5 | 8 | 20 | 80 |", "| 3 | 7 | 9 | | |", "+----+----+----+----+----+", ] ``` You can copy/paste the output of `actual` directly into the test code for an update. Closes apache#9264 from alamb/remove_test_format_batch Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

…null

@paddyhoran

This PR changes the Rust CI job to include the prettyprint feature. (so we have CI coverage of that feature) As @paddyhoran noted, passing in feature flags to the root workspace doesn't do anything, and as it turns out we already invoke `cargo test` for the `arrow` crate separately. Prior to this PR I think that just ran the same tests again with the same features. After this PR, the second run of the arrow tests are run with the `prettyprint` feature Here is evidence that the tests are running twice. I picked a PR (randomly) https://github.com/apache/arrow/pull/9258/checks?check_run_id=1725967358 If you look at the logs for the run-tests action ![Screen Shot 2021-01-19 at 7 42 13 AM](https://user-images.githubusercontent.com/490673/105036642-c4188680-5a2a-11eb-8968-c570d855724e.png) You can see that the same test is run twice: ``` 2021-01-19T07:02:49.9476111Z test compute::kernels::boolean::tests::test_nonnull_array_is_not_null ... ok 2021-01-19T07:02:49.9476854Z test compute::kernels::boolean::tests::test_nonnull_array_is_null ... ok 2021-01-19T07:02:49.9477616Z test compute::kernels::boolean::tests::test_nonnull_array_with_offset_is_not_null ... ok 2021-01-19T07:02:49.9478427Z test compute::kernels::boolean::tests::test_nonnull_array_with_offset_is_null ... ok 2021-01-19T07:02:49.9479207Z test compute::kernels::boolean::tests::test_nullable_array_is_null ... ok 2021-01-19T07:02:49.9479948Z test compute::kernels::boolean::tests::test_nullable_array_is_not_null ... ok 2021-01-19T07:02:49.9480744Z test compute::kernels::boolean::tests::test_nullable_array_with_offset_is_not_null ... ok 2021-01-19T07:02:49.9481640Z test compute::kernels::boolean::tests::test_nullable_array_with_offset_is_null ... ok 2021-01-19T07:02:49.9487019Z test compute::kernels::boolean::tests::test_nullif_int_array_offset ... ok 2021-01-19T07:02:49.9487773Z test compute::kernels::boolean::tests::test_nullif_int_array ... ok ... 2021-01-19T07:07:23.4568865Z test compute::kernels::boolean::tests::test_nonnull_array_is_not_null ... ok 2021-01-19T07:07:23.4569576Z test compute::kernels::boolean::tests::test_nonnull_array_is_null ... ok 2021-01-19T07:07:23.4570337Z test compute::kernels::boolean::tests::test_nonnull_array_with_offset_is_not_null ... ok 2021-01-19T07:07:23.4571133Z test compute::kernels::boolean::tests::test_nonnull_array_with_offset_is_null ... ok 2021-01-19T07:07:23.4571885Z test compute::kernels::boolean::tests::test_nullable_array_is_not_null ... ok 2021-01-19T07:07:23.4572614Z test compute::kernels::boolean::tests::test_nullable_array_is_null ... ok 2021-01-19T07:07:23.4573383Z test compute::kernels::boolean::tests::test_nullable_array_with_offset_is_not_null ... ok 2021-01-19T07:07:23.4574183Z test compute::kernels::boolean::tests::test_nullable_array_with_offset_is_null ... ok 2021-01-19T07:07:23.4574929Z test compute::kernels::boolean::tests::test_nullif_int_array ... ok 2021-01-19T07:07:23.4575636Z test compute::kernels::boolean::tests::test_nullif_int_array_offset ... ok ``` Closes apache#9262 from alamb/alamb/run_prettyprint_ci Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

…ull) values The idea of this PR is to have a function `from_iter_values` that (just like `from_iter`) creates an array based on an iterator, but from `T` instead of `Option<T>`. I have seen some places in DataFusion (especially `to_array_of_size`) where an `Array` is generated from a `Vec` of items, which could be replaced by this. The other iterators have some memory / time overhead in both creating and manipulating the null buffer (and in the case of `Vec` for allocating / dropping the Vec) Closes apache#9293 from Dandandan/array_iter_non_null Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

…perf

…ing_agg

alamb · 2021-03-03T11:27:40Z

@Dandandan I am closing this PR for the time being to clean up the Rust/Arrow PR backlog. Please let me know if this is a mistake

Dandandan · 2021-03-03T12:03:36Z

@alamb thanks, probably will open this or a new one if/when I continue with it

alamb · 2021-03-03T13:58:58Z

Closing this PR for now; We can reopen it if need be

Dandandan added 30 commits January 1, 2021 12:48

Calculate column indices upfront

19c84d7

Make function part of HashJoinExec

45fa819

Small refactor by using struct

baa5dce

Add comments

5b6bf83

Experiment with single batch

38d4651

Use index as offset

54b9c3a

Use rows

2db270c

Refactor with single batch

5842c73

Refactor

1511f0e

Small simplification

a976ace

Merge

bc5bdfa

Add comment

797ffb6

Small simplifiction

b865b8a

Merge

2418113

Reuse num rows

0efb5a7

Rename as offset

bbec070

Use u64 for left indices

6d182c9

Use appending by slice, small refactor

d58abf6

Style tweak

69d3286

Style

c96ba9f

Doc updates

33b96f3

Start vectorized hashing

c3bdde1

Fmt

9cf8023

Function to combine hash values

dc0c036

Fix test

a2537ef

Remove prints

703438f

fmt

24537da

Merge branch 'master' into vectorized_hashing

de17003

Merge master

ac0de44

Clippy

28cd876

cyb70289 and others added 3 commits January 22, 2021 13:20

Merge

c4e7e94

Dandandan force-pushed the vectorized_hashing_agg branch from d68f99e to c4e7e94 Compare January 23, 2021 11:30

alamb and others added 17 commits January 23, 2021 06:56

Merge branch 'master' of github.com:apache/arrow into array_iter_non_…

79c92aa

…null

Use extend

941ee5d

Use .collect() api

a37941c

Use iterators in to_array_of_size

e448fcc

Add benchmark

69c298e

Use none

d612b0f

Use None for Microsecond as well

6144a23

Use None

f2c4e26

Merge remote-tracking branch 'upstream/master' into to_array_of_size_…

e07f7e5

…perf

Merge branch 'master' of github.com:apache/arrow into vectorized_hash…

8ea696b

…ing_agg

Merge branch 'to_array_of_size_perf' into vectorized_hashing_agg

28ea738

Use None for strings

eddf021

fmt

555eb1d

Merge branch 'to_array_of_size_perf' into vectorized_hashing_agg

ed6b68a

Dandandan force-pushed the vectorized_hashing_agg branch from 5e1be03 to ed6b68a Compare January 25, 2021 19:06

kszucs force-pushed the master branch from d238279 to 437c8c9 Compare January 25, 2021 21:53

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

gangliao mentioned this pull request Feb 24, 2021

Todo: Distributed Dataflow model flock-lab/flock#188

Closed

alamb closed this Mar 3, 2021

Dandandan mentioned this pull request Apr 21, 2021

Vectorized hashing for hash aggregation code apache/datafusion#26

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP] #9213

ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP] #9213

Dandandan commented Jan 16, 2021 •

edited

Loading

alamb commented Mar 3, 2021

Dandandan commented Mar 3, 2021

alamb commented Mar 3, 2021

ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP] #9213

ARROW-11266: [Rust][DataFusion] Implement vectorized hashing for hash aggregate [WIP] #9213

Conversation

Dandandan commented Jan 16, 2021 • edited Loading

alamb commented Mar 3, 2021

Dandandan commented Mar 3, 2021

alamb commented Mar 3, 2021

Dandandan commented Jan 16, 2021 •

edited

Loading