Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup TreeNode implementations #8672

Merged
merged 9 commits into from
Jan 1, 2024
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented Dec 29, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

This is motivated by recent several refactoring on TreeNode. Currently we have many duplicate code (e.g., apply_children) around several TreeNode implementations. This patch cleans them up by moving duplicate function implementations back to TreeNode.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Dec 29, 2023
@@ -33,6 +33,9 @@ use crate::Result;
/// [`LogicalPlan`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/logical_plan/enum.LogicalPlan.html
/// [`Expr`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/expr/enum.Expr.html
pub trait TreeNode: Sized {
/// Returns all children of the TreeNode
fn children_nodes(&self) -> Vec<Self>;
Copy link
Member Author

@viirya viirya Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First tried with Vec<&Self>: https://github.com/apache/arrow-datafusion/compare/main...viirya:arrow-datafusion:refactor_treenode?expand=1.

Most cases it works well. But one trouble is the TreeNode implementation for Arc<T>:

impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
    fn children_nodes(&self) -> Vec<&Arc<T>> {
        // DynTreeNode.arc_children returns Vec<Arc<Self>> and this function cannot return reference of temporary object. The implementations of `DynTreeNode.arc_children` have same issue. It cannot be changed to return Vec<&Arc<Self>> too
        unimplemented!("Call arc_children instead")
    }

So changed to Vec<Self>.

@ozankabak
Copy link
Contributor

I like this cleanup. @berkaysynnada is working on an approach that builds on this idea and #8664 to further simplify things. I am hopeful we will arrive at a good refactor very soon

{
let children = match self {
Expr::Alias(Alias{expr,..})
fn children_nodes(&self) -> Vec<Self> {
Copy link
Contributor

@peter-toth peter-toth Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is a nice refactor but I'm not sure it is needed at all (1.) and I'm also not sure cloning a node's children into a Vec is good direction (2.). Let me share my thoughts on these 2.

  1. I think in DataFusion there are only 4 real tree types: LogicalPlan, Expr, ExecutionPlan and PhysicalExpr and there 7 special tree types based on the above 4: SortPushDown, PlanWithRequitements, ExprOrdering, OrderPreservationContext, PipelineStatePropagator, PlanWithCorrespondingCoalescePartitions and PlanWithCorrespondingSort.
    The only reason why DataFusion currently has these 7 special trees is to be able to propagate down/up some additional information during tree transformations. But I think these special trees are simply not needed and can be removed. If we add some transformation helper functions like TreeNode.transform_down_with_payload() and TreeNode.transform_up_with_payload() we can easily remove these 7 special types completely.
    Get rid of special TreeNodes #8663 is about this issue and I removed SortPushDown, PlanWithRequitements and ExprOrdering as exaples in this PR: Transform with payload #8664.

  2. If we just focus on the 4 base tree types we can distingsuish 2 different types: Expr uses Boxes but the other 3 (LogicalPlan, ExecutionPlan and PhysicalExpr) use Arcs for the links between the nodes. I think this is very important in their behaviour. E.g. a big issue for Exprs when the nodes are cloned as it basicaly means cloning the whole subtree under the node. But cloning is not an issue for the other 3 as cloning Arcs is cheap. So I think the problem with the current Expr.apply_children() and the Expr.children_nodes() in this PR that it doesn't move the code into the a direction where cloning is not necessary on trees made of Boxes.
    What I would suggest long term is in Refactor TreeNode recursions #7942. Instead of the old TreeNode.apply_children() (or collecting the children into a Vec) we could use TreeNode.visit_children() (https://github.com/apache/arrow-datafusion/pull/7942/files#diff-6515fda3c67b9d487ab491fd21c27e32c411a8cbee24725d737290d19c10c199R31). IMO we need to implement visit_children() for the 4 base tree types only (see 1.).
    (Unfortunately I didn't noted the visit related changes in the Refactor TreeNode recursions #7942 PR description, but it is a big PR with many different ideas and focusing on transform seemed more important.)

Copy link
Member Author

@viirya viirya Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in above comment, the first try I did is to return Vec<&Self> for children_nodes. I think it makes more sense to return reference of children nodes instead of owned types (although for other implementations it is smart pointers, but I don't want to assume that the TreeNode impls must link with children nodes with Arc etc.).

But there is one trouble when implement TreeNode for Arc<T>: #8672 (comment). I've tried some ideas locally to get rid of it, but in the end they don't work so I changed children_nodes to return Vec<Self>. (still thinking if I can remove it but it may be more change. 🤔 )

This can be seen as an incremental cleanup patch as it doesn't change current API and keep current behavior but clean up the code as its purpose. In the long term I'd like to see more fundamental change to TreeNode and this can be seen as a short term effort to make the code easier to look/work on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to return Vec<Cow<Self>> that would contain borrowed elements for Expr but owned for the others?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes there may be cases where one needs to do multiple passes over the same tree (e.g. doing one top down and one bottom up pass consecutively). Or we may want to run different logic on the same tree.

So it may be a little premature to conclude we can do away with "derived" trees and just keep the base four.

I suggest we take small steps and progressively discover the requirements/use cases. I think refactoring to get rid of duplicate apply_children and map_children is a good first step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's what I thought so this restrains to be limited change (mostly apply_children).

Copy link
Contributor

@peter-toth peter-toth Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes there may be cases where one needs to do multiple passes over the same tree (e.g. doing one top down and one bottom up pass consecutively). Or we may want to run different logic on the same tree.

This is also supported with transform_with_payload() (https://github.com/apache/arrow-datafusion/pull/8664/files#diff-1148652cb6868dfd5d45d595a1013232c2407f3c89d3850a4ac8e48b8a0884e1R127-R151) that does a top-down and a bottom-up transforms too and capable to propagate different payloads in the 2 directions.

So it may be a little premature to conclude we can do away with "derived" trees and just keep the base four.

I'm happy to try refactoring the remaining 4 (OrderPreservationContext, PipelineStatePropagator, PlanWithCorrespondingCoalescePartitions and PlanWithCorrespondingSort) in scope of #8664 next week and check the above statement.
(I thought a smaller PR that shows 1-2 examples for both up and down directions would be easier to review, but probably we can do it in 1 big PR if that's prefered.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated the Vec<Cow<Self>> idea (#8672 (comment)) a bit further and it seems possible to avoid cloning at many places. This could be especially useful for Exprs where cloning is very expensive. I've opened a PR to this PR to here: viirya#132

fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
where
F: FnMut(&Self) -> Result<VisitRecursion>,
{
for child in self.arc_children() {
match op(&child)? {
for child in &self.arc_children() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this 2nd apply_children() implementation here? The base implementation in TreeNode might be enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, missed this. This can be removed too.

Avoid cloning in `TreeNode.children_nodes()` implementations where possible using `Cow`
Copy link
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks very good now.

@@ -368,7 +371,7 @@ impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
let arc_self = Arc::clone(&self);
self.with_new_arc_children(arc_self, new_children?)
} else {
Ok(self)
Ok(self.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one minor note, that this clone seems to be unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, probably I added it during trying different ideas locally. Removed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @viirya and the reviews @ozankabak and @peter-toth .

This PR seems like an improvement to me, but I do also wonder how much overlap it has with #8664 🤔

@peter-toth
Copy link
Contributor

peter-toth commented Dec 31, 2023

This PR seems like an improvement to me, but I do also wonder how much overlap it has with #8664 🤔

This looks like a nice improvement to me too and now I think the 2 PRs are orthogonal.

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good step towards simplifying TreeNode-related code. Thanks @viirya

@viirya
Copy link
Member Author

viirya commented Dec 31, 2023

Thank you @alamb @ozankabak @peter-toth

@viirya viirya merged commit bf3bd92 into apache:main Jan 1, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants