ARROW-12170: [Rust][DataFusion] Introduce repartition optimization #9865
Conversation
I like where this is heading @Dandandan 🚀
This is looking pretty cool @Dandandan
```rust
// wrap operators in CoalesceBatches to avoid lots of tiny batches when we have
// highly selective filters
let children = plan
    .children()
    .iter()
    .map(|child| self.optimize(child.clone(), config))
    .collect::<Result<Vec<_>>>()?;
```
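For context, a hedged sketch of how the node might then be rebuilt and wrapped (the `wrap_in_coalesce` helper and its `target_batch_size` parameter are illustrative, based on DataFusion's `CoalesceBatchesExec` constructor):

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::coalesce_batches::CoalesceBatchesExec;
use datafusion::physical_plan::ExecutionPlan;

// Rebuild the node with its optimized children, then wrap it so that many
// small filtered batches are combined into batches of ~target_batch_size rows.
fn wrap_in_coalesce(
    plan: Arc<dyn ExecutionPlan>,
    children: Vec<Arc<dyn ExecutionPlan>>,
    target_batch_size: usize,
) -> Result<Arc<dyn ExecutionPlan>> {
    Ok(Arc::new(CoalesceBatchesExec::new(
        plan.with_new_children(children)?,
        target_batch_size,
    )))
}
```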
I realize you are just moving code around, so this comment is outside the context of this PR....
However, I wonder if it would be more performant to do the coalescing directly in the filter kernel code -- the way coalesce is written today requires copying the (filtered) output into a different (coalesced) array.
I think @ritchie46 had some code that allowed incrementally building up output in several chunks as part of polars, which may be relevant.
I think this code is good, but I wanted to plant a seed 🌱 for future optimizations
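To make that copy concrete, here is a rough sketch of concatenation-based coalescing, not DataFusion's actual implementation: every column of every small filtered batch is copied again into a freshly allocated array via arrow's `concat` kernel. The incremental approach mentioned above would instead append into mutable buffers and build each array once.

```rust
use arrow::array::{Array, ArrayRef};
use arrow::compute::concat;
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Coalesce small (already filtered) batches into one larger batch.
/// Assumes `batches` is non-empty and all batches share a schema.
fn coalesce(batches: &[RecordBatch]) -> Result<RecordBatch> {
    let schema = batches[0].schema();
    let columns = (0..schema.fields().len())
        .map(|i| {
            let arrays: Vec<&dyn Array> =
                batches.iter().map(|b| b.column(i).as_ref()).collect();
            concat(&arrays) // allocates a new array and copies every input
        })
        .collect::<Result<Vec<ArrayRef>>>()?;
    RecordBatch::try_new(schema, columns)
}
```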
I think that might be a useful direction indeed!
I think it can indeed be more efficient in some cases for nodes to write to mutable buffers rather than produce smaller batches and concatenate them afterwards, although based on what I saw in the profiling info it currently does not seem like it would be an enormous performance improvement.
Probably not something in the scope of this PR indeed, as it's already getting pretty big.
Some other notes:
- In this PR I think I had to create the physical optimizer abstraction (see the sketch below), as otherwise I felt the planner would become unmaintainable. The planning and optimization are now separated and not in the same pass like before (I was actually a bit confused about how it worked before!)
- Currently I added the AddMergeExec as an optimization pass, as that was how the code was structured before, but it feels a bit off as an optimization pass. I will probably keep it like that for now.
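A minimal sketch of that separation follows; the trait shape is illustrative (assuming DataFusion's `ExecutionPlan`, `ExecutionConfig`, and `Result` types) rather than the exact API this PR adds:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::execution::context::ExecutionConfig;
use datafusion::physical_plan::ExecutionPlan;

/// One self-contained rewrite of the physical plan; passes such as
/// Repartition, CoalesceBatches, or AddMergeExec would each implement this.
trait PhysicalOptimizerRule {
    /// Rewrite `plan`, e.g. inserting RepartitionExec nodes where the
    /// partition count drops below the configured concurrency.
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        config: &ExecutionConfig,
    ) -> Result<Arc<dyn ExecutionPlan>>;

    /// A human-readable name for debugging output.
    fn name(&self) -> &str;
}
```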
```rust
// Recurse into children bottom-up (added nodes should be as deep as possible)
let new_plan = if plan.children().is_empty() {
    // leaf node - don't replace children
```
it seems like `new_with_children` could handle the case of zero children as well, FWIW
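In other words, assuming `with_new_children` also accepts an empty vector for leaf nodes, the recursion could collapse into one path (a sketch with a hypothetical `optimize_bottom_up` helper):

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

// Children are optimized first (an empty vector for leaf nodes), then the
// node is rebuilt unconditionally, relying on `with_new_children` to handle
// the zero-children case instead of a separate leaf-node branch.
fn optimize_bottom_up(
    plan: Arc<dyn ExecutionPlan>,
    optimize_child: &dyn Fn(Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>>,
) -> Result<Arc<dyn ExecutionPlan>> {
    let children = plan
        .children()
        .iter()
        .map(|child| optimize_child(child.clone()))
        .collect::<Result<Vec<_>>>()?;
    plan.with_new_children(children)
}
```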
```rust
        .collect(),
    ),
}
let optimizers = &ctx_state.config.physical_optimizers;
```
👍
@Dandandan great that someone is working through these optimisations!
No worries @alamb
Codecov Report
```
@@            Coverage Diff             @@
##           master    #9865      +/-   ##
==========================================
+ Coverage   82.70%   82.77%   +0.07%
==========================================
  Files         257      260       +3
  Lines       60486    60625     +139
==========================================
+ Hits        50027    50185     +158
+ Misses      10459    10440      -19
```
Continue to review the full report at Codecov.
I spent some time looking at this. It seems as if at least one problem is that the […]
However, that is still not enough to get the tests to pass. My debugging of […]
Thanks @alamb, that gives some more pointers to continue with. I am also interested in what the eventual (optimized) physical plan looks like. At least not having the single partition will avoid having the MergeExec and/or allow the repartitioning to be added as a direct child.
Somehow this is what the plan looks like - already before optimization(?) - it seems it doesn't have children: […]
It seems something in the […]
I think this is due to the […]
Yeah, I realized that later. I don't know how it is handled in others though; it seems formatter implementations only need to include themselves, not their children?
Ah, it is only defined like that on the logical plan...
This is the optimized plan, which looks ok, there is a […]
Something is happening with polling here and marking the stream as done. When I remove that part, the TopKExec will get results at a later moment. So the fold/aggregate is not getting all batches somehow?
I think this looks great @Dandandan - thanks!
```
@@ -471,12 +471,10 @@ impl Stream for TopKReader {
        return Poll::Ready(None);
    }
    // this aggregates and thus returns a single RecordBatch.
    self.done = true;
```
this is probably a real bug -- the topK isn't actually done until it produces output -- and the way this code is written, `top_values.poll_next_unpin()` may not return `Poll::Pending`? Maybe?
Anyhow, looks great
That was my intuition too, yeah. This seems to be more correct and fixes the issue.
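For illustration, a minimal self-contained sketch of the corrected pattern; the `SingleBatchStream` type is hypothetical, not the PR's actual `TopKReader`, but it shows why `done` must only be set once the inner stream actually yields:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::stream::{Stream, StreamExt};

/// Hypothetical single-result stream: `done` is set when the inner stream
/// yields, not before it is polled.
struct SingleBatchStream<S: Stream + Unpin> {
    inner: S,
    done: bool,
}

impl<S: Stream + Unpin> Stream for SingleBatchStream<S> {
    type Item = S::Item;

    fn poll_next(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<S::Item>> {
        if self.done {
            return Poll::Ready(None);
        }
        // Setting `self.done = true` *before* this poll was the bug: if the
        // inner stream returned `Poll::Pending`, the next call would hit the
        // early return above and the final batch would never be emitted.
        match self.inner.poll_next_unpin(cx) {
            Poll::Ready(item) => {
                self.done = true;
                Poll::Ready(item)
            }
            Poll::Pending => Poll::Pending,
        }
    }
}
```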
I'll plan to merge this once CI is done (it looks like there is a bit of a backup now)
This introduces an optimization pass that adds a repartition whenever the number of partitions in the plan drops below the configured concurrency, increasing the achievable parallelism. This PR separates the optimizations into a `PhysicalOptimizer`, so this can be extended and built upon later.

The performance benefit is clear when loading data into memory as a single partition, which models the case where a single file or in-memory data has high enough throughput but the single partition allows too little parallelism. This has a similar performance benefit to pre-partitioning the data and loading it in memory for those queries.

```
cargo run --release --bin tpch --features "snmalloc" -- benchmark --iterations 30 --path [path --format parquet --query 1 --batch-size 8192 --concurrency 16 -m -n 1
```

Master

```
Query 1 avg time: 411.57 ms
Query 3 avg time: 147.32 ms
Query 5 avg time: 237.62 ms
Query 6 avg time: 46.00 ms
Query 12 avg time: 124.02 ms
```

PR

```
Query 1 avg time: 76.37 ms
Query 3 avg time: 67.51 ms
Query 5 avg time: 134.14 ms
Query 6 avg time: 9.58 ms
Query 12 avg time: 20.60 ms
```

All in all, looking good: we observe speedups of up to 6x in this test!

Closes apache#9865 from Dandandan/reparition-opt

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
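The core decision such a pass makes could be sketched as follows (illustrative rather than the exact PR code, though `RepartitionExec` and `Partitioning::RoundRobinBatch` are existing DataFusion types):

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::repartition::RepartitionExec;
use datafusion::physical_plan::{ExecutionPlan, Partitioning};

/// If a node exposes fewer partitions than the configured concurrency,
/// wrap it in a round-robin RepartitionExec so downstream operators can
/// run one task per target partition.
fn maybe_repartition(
    plan: Arc<dyn ExecutionPlan>,
    concurrency: usize,
) -> Result<Arc<dyn ExecutionPlan>> {
    if plan.output_partitioning().partition_count() < concurrency {
        Ok(Arc::new(RepartitionExec::try_new(
            plan,
            Partitioning::RoundRobinBatch(concurrency),
        )?))
    } else {
        Ok(plan)
    }
}
```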