Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide documentation of expose APIs to enable handling of type coercion at UNION plan construction. #12142

Merged
merged 3 commits into from
Aug 26, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Aug 23, 2024

Which issue does this PR close?

Closes #12105

Rationale for this change

We construct our own logical plans for a SQL-derivative languages (e.g. InfluxQL). The construction of logic plans using UNION requires that upon construction of the union plan node, we type-coerce the expressions prior to building the rest of the plan. If we don't perform this expression rewrite, then our consuming LIMIT, GROUP BY, ORDER BY, and aggregates will not be built properly.

We do not think our use case is unique, and therefore we are (a) exposing the APIs need to perform this coercion and (b) providing docs.

What changes are included in this PR?

  • Expose the expression rewriter.
  • expose helper method coerce_union_schema
  • provide docs delineating the different type-coercion interfaces:
    • TypeCoercion
    • vs TypeCoercionRewriter
    • vs helper coerce_union_schema.
  • update the union() API, which does the logical plan construction , as to the how/when/why to use type coercion.

Are these changes tested?

No code changes.

Are there any user-facing changes?

No. Only more exposed interfaces to use.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Aug 23, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Aug 23, 2024

Note to @alamb -- exposing the rewriter api (as suggested here), enables the union type-coercion with less of the schema shenangians and overhead (vs the full analyzer). Gives the user options.

@wiedld wiedld marked this pull request as ready for review August 23, 2024 21:43
/// Union two [`LogicalPlan`]s.
///
/// Constructs the UNION plan, but does not perform type-coercion. Therefore the
/// subtree expressions will not be properly typed until the optimizer pass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, does that mean if there is no optimizer phase the query will crash?

Copy link
Contributor Author

@wiedld wiedld Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe this depends on how users construct their own logical plans? And if they are already ensuring type correctness in their own code?

Our logical plan construction relied upon the previous behavior of the union() api, which was immediate type coercion. IMO - this was not an explicit api contract, rather an implementation detail. Regardless, we use the typed-coerced union schema & expressions when adding other logical nodes (e.g. limit, sort, group by). Once the behavior of the api changed, we started building incorrect plans -- which produced incorrect results and in one case errored.

I filed this ticket as a doc enhancement, not a bug, since I didn't think(?) type coercion was part of the api contract. My hope was to (a) make it clear when/how unions can be type coerced, and (b) make public the APIs to do so. In other words, make it easier for users to ensure type correctness before (or without) the optimizer pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I have phrased the docs differently? Or is there another approach we should take?

Copy link
Contributor Author

@wiedld wiedld Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to clarify. The "incorrect plans" is because the new union() behavior doesn't type-coerce the expressions and it takes the left node's schema. When we built logical plans with union + gap filling (adding casted scalar nulls), we started having missing fields etc when inspecting the union before constructing the next node.

Maybe the latter should be a bug?

/// or alternatively, merge the union input schema using [`coerce_union_schema`] and
/// apply the expression rewrite with [`coerce_plan_expr_for_schema`].
///
/// [`TypeCoercionRewriter::coerce_union`]: https://docs.rs/datafusion-optimizer/latest/datafusion_optimizer/analyzer/type_coercion/struct.TypeCoercionRewriter.html#method.coerce_union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @wiedld
There are couple of projects using DF without optimizer, so I'm thinking should we add the coerced union recommendation to the error message....

@wiedld
Copy link
Contributor Author

wiedld commented Aug 26, 2024

I'm thinking should we add the coerced union recommendation to the error message....

@comphead -- I'm not sure which error message you are referring to? 😅 🙏🏼

@comphead
Copy link
Contributor

I'm thinking should we add the coerced union recommendation to the error message....

@comphead -- I'm not sure which error message you are referring to? 😅 🙏🏼

Ignore it, for downstream projects not using the optimizer it would be difficult to understand why UNION crashed, but I believe the documentation is enough

@comphead comphead merged commit 533ddcb into apache:main Aug 26, 2024
29 checks passed
@alamb alamb deleted the 12105/unions-coerce-at-construction branch August 26, 2024 19:59
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld and @comphead

@@ -56,6 +56,8 @@ use datafusion_expr::{
Projection, ScalarUDF, Union, WindowFrame, WindowFrameBound, WindowFrameUnits,
};

/// Performs type coercion by determining the schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle downstream impacts to union's behavioral changes.
3 participants