Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle downstream impacts to union's behavioral changes. #12105

Closed
wiedld opened this issue Aug 21, 2024 · 6 comments · Fixed by #12142
Closed

Handle downstream impacts to union's behavioral changes. #12105

wiedld opened this issue Aug 21, 2024 · 6 comments · Fixed by #12142
Labels
enhancement New feature or request

Comments

@wiedld
Copy link
Contributor

wiedld commented Aug 21, 2024

Is your feature request related to a problem or challenge?

With this upstream change, an existing union() API had a behavioral change. This API is used to construct the UNION logical plan. Previously, it had been constructing the UNION plan node and performing type coercion. After the upstream change, it no longer perform the type coercion at logical plan construction (and instead relies upon a later logical plan optimizer pass).

This upstream change does not cause any failures in the upstream datafusion tests. However, for those constructing their own logical plans with UNION it does have significant downstream impacts. Explicitly:

  1. we have a UNION logical plan construction using union() upstream API
  2. we consume the schema & expr from the union, when deciding how to construct other logical plan nodes (e.g. sort, grouping, aggregates, limits)
  3. then afterwards => we hand the logical plan to datafusion for optimization (including type coercion)

When the upstream change removed the UNION type coercion from step 1 above, our logical plan construction code (step 2) started creating incorrect logical plans.

Describe the solution you'd like

Update the union() API docs with how to handle the changed behavior if the user still requires UNION type coercion at logical plan construction. Something like:

let union_schema = coerce_union_schema(vec![prev, next])?;
let prev = coerce_plan_expr_for_schema(prev, &union_schema)?;
let next = coerce_plan_expr_for_schema(next, &union_schema)?;
union(prev, next)

The above code relies upon the already public coerce_plan_expr_for_schema. The coerce_union_schemaI is not yet public, therefore we recommend making it public.

Describe alternatives you've considered

The original change was put in by @jonahgao , whom may have some alternative ideas?

Additional context

We have made our own public (patched) version of the coerce_union_schema as a temporary measure to fix the issue, and can confirm it's sufficiently. Also open to other ideas.

@wiedld wiedld added the enhancement New feature or request label Aug 21, 2024
@jonahgao
Copy link
Member

Sorry for introducing this behavior change. There are two reasons we can't perform coercion when building unions.

  1. We can't get a correct common schema without performing type coercion for expressions, just like the case in Incorrect result returned by UNION ALL (SQLancer-TLP) #11742.
  2. Wildcard expansion was moved to the analyzer recently. When union the two following projections, the number of their expressions will be different, resulting in an inability to perform type coercion.
create table t1(a int, b int);
create table t2(c int, d int);
select * from t1 union select c, d from t2;

I think it's reasonable to make coerce_union_schema public, allowing users to coerce unions if they are not affected by the aforementioned scenarios.

Another potential solution might be to manually run the necessary Analyzer rules, such as ExpandWildcardRule and TypeCoercion, after step 1.

@alamb
Copy link
Contributor

alamb commented Aug 22, 2024

Thank you for the explanation

I think for our use (which is a planner for the InfluxQL query language) we don't need to handle wildcards

I do think it would be interesting to try calling analyze rather than the lower level type coercion

I believe @wiedld said she tried this and the additional coercion performed by TypeCoercion caused some other problem

@wiedld
Copy link
Contributor Author

wiedld commented Aug 23, 2024

Thank you @jonahgao . Agreed that the change was needed.

I was hoping for an update to the docs on how to handle the change (if the user needs to) when using the union() API. Such as adding an explanation & the example code snippet in the method docs.

If the user needs the exact same logical plan as before the change, then this is roughly how we did it:

let union_schema = coerce_union_schema(vec![prev, next])?;
let prev = coerce_plan_expr_for_schema(prev, &union_schema)?;
let next = coerce_plan_expr_for_schema(next, &union_schema)?;
let plan = union(prev, next)?;

If the user is ok with added CAST nodes (including timezone cast), then they could use the analyzers:

let mut plan = union(prev, next)?;
let wildcard_rewriter = ExpandWildcardRule::new();
plan = wildcard_rewriter.analyze(plan, context.inner().copied_config().options())?;
let type_rewriter = TypeCoercion::new();
plan = type_rewriter.analyze(plan, context.inner().copied_config().options())?;

@alamb -- we have passing InfluxQL e2es with the second (analyzer) approach, diffing a lot of added CAST nodes. I'm not sure about the use case for other users whom build their own logical plans (a.k.a. how the casts may impact them).

@wiedld
Copy link
Contributor Author

wiedld commented Aug 23, 2024

I was hoping for an update to the docs on how to handle the change (if the user needs to) when using the union() API. Such as adding an explanation & the example code snippet in the method docs.

I can make this change. Just wanted input (such as the wildcard expansion explanation) before doing so. Are we good for this updated doc change? Or should we do anything else?

@jonahgao
Copy link
Member

I can make this change. Just wanted input (such as the wildcard expansion explanation) before doing so. Are we good for this updated doc change? Or should we do anything else?

Thank you @wiedld 👍 .

I think we can remind users about wildcard expansion in the document.

Another thing is that I wonder whether it is necessary to expose coerce_union, allowing users to try using it first, as it is somewhat a high-level API.

@wiedld
Copy link
Contributor Author

wiedld commented Aug 23, 2024

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants