Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For SqlColumnPrunerOptimizer - add support for CTEs in sub-queries #1613

Merged
merged 4 commits into from
Jan 22, 2025

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Jan 18, 2025

Previously, the SqlColumnPrunerOptimizer only accounted for CTEs defined at the top level as in current use, the DataflowToSqlPlanConverter only generated SQL plans in that way. To allow for generation of SQL plans with CTEs defined in sub-queries (at any depth), this PR updates SqlColumnPrunerOptimizer to keep track of the CTE alias mapping / CTE context when traversing the SQL plan.

@cla-bot cla-bot bot added the cla:yes label Jan 18, 2025
@plypaul plypaul force-pushed the p--sql-updates--02 branch from 684c3c6 to bde7869 Compare January 18, 2025 04:17
@plypaul plypaul changed the title For SqlColumnPrunerOptimizer - Add support CTEs in subqueries For SqlColumnPrunerOptimizer - Add support for CTEs in subqueries Jan 18, 2025
@plypaul plypaul changed the title For SqlColumnPrunerOptimizer - Add support for CTEs in subqueries For SqlColumnPrunerOptimizer - add support for CTEs in sub-queries Jan 18, 2025
@plypaul plypaul marked this pull request as ready for review January 18, 2025 04:21
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good assuming we do expect alias collisions in valid plans - but I'm wondering if the collision-handling is necessary if we can build unique aliases appropriately.

class SqlCteAliasMappingLookup:
"""A mutable lookup that stores the CTE-alias mapping at a given node.

In cases with nested CTEs in a SELECT, it's possible that a CTE defined in an inner SELECT has an alias that is the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would they be defined with the same alias? Shouldn't we use something similar to _next_unique_table_alias() to build CTE aliases? Or is the problem that both the DataflowToSqlQueryPlanConverter and the optimizer won't have access to that function to ensure unique aliases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If something like _next_unique_table_alias is used to generate the CTE aliases, then there would be no issue. However, it would be something that you would have to remember to do in DataflowToSqlQueryPlanConverter so handling this case is more defensive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I definitely was planning to use something similar to _next_unique_table_alias() in DataflowToSqlQueryPlanConverter for CTE aliases. If you have a strong opinion about handling this defensively, that's fine and we can leave it here. Just might be nice to simplify the logic if it's not necessary!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My aim is to try to maintain the contract: if the optimizer is fed in valid SQL, then it should output valid SQL. Otherwise, it becomes harder to use / makes a trap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh ok I think I was misunderstanding. I was imagining the CTEs would always be bubbled up to the top level, even if there were defined when visiting a subquery. I didn't realize that you could have a CTE that lives inside a subquery. In that case, your logic makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about your use case and thought CTE that lives inside a subquery was going to be a better fit since it was only locally needed? There's flexibility though, so let me merge this so that you can try it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, either should be fine from an execution standpoint, but keeping the CTEs at the top level might be most readable!


Raises an exception if a mapping already exists.
"""
if select_node in self._select_node_to_cte_alias_mapping:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any scenario where the select_node won't be unique in the dataflow plan at this point, and a valid plan will error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a call to cte_alias_mapping_exists before it's added, so it shouldn't throw an error. But a non-unique select_node also means that the generated SQL plan is no longer a tree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looks like some caching in DataflowNodeToSqlCteVisitor might cause select_node won't be unique. It doesn't cause issues at the moment, but let me taking a look on how to avoid that.

if cte_node is not None:
self._current_required_column_alias_mapping.add_aliases(cte_node, column_aliases)
# `visit_cte_node` will handle propagating the required aliases to all CTEs that this CTE node depends on.
cte_node.accept(self)

def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None:
"""Based on required column aliases for this SELECT, figure out required column aliases in parents."""
# If this SELECT node defines any CTEs, it should override ones that were defined in the outer SELECT in case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question here - why do we expect collisions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we do not. But once we start adding CTE nodes within DataflowToSqlQueryPlanConverter, there's the potential if unique aliases aren't enforced.

if cte_node is not None:
self._current_required_column_alias_mapping.add_aliases(cte_node, column_aliases)
# `visit_cte_node` will handle propagating the required aliases to all CTEs that this CTE node depends on.
cte_node.accept(self)

def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None:
"""Based on required column aliases for this SELECT, figure out required column aliases in parents."""
# If this SELECT node defines any CTEs, it should override ones that were defined in the outer SELECT in case
# of CTE alias collisions.
if not self._cte_node_lookup.cte_alias_mapping_exists(node):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking my understanding - this works because the CTE nodes from the inner query will be added to the mapping before the nodes from the outer query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually the other way around e.g. for

WITH cte_in_outer_query AS (
  SELECT 1
)

SELECT * FROM (
  WITH cte_in_inner_query AS (
    SELECT 1
  )
  SELECT * FROM cte_in_inner_query JOIN cte_in_outer_query
)

We add the mapping {"cte_in_outer_query": ..} then {cte_in_outer_query": ..., "cte_in_inner_query": ...}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused then 🤔 With this conditional, won't we only add the mapping if the node isn't in the mapping yet? In which case, the outer node will override the inner node, which is the opposite of what the comment says?

Copy link
Contributor Author

@plypaul plypaul Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, the mapping refers to the whole {"cte_in_outer_query": ..., "cte_in_inner_query": ...} where as the lookup is like {"inner_select": {"cte_in_outer_query": ..., "cte_in_inner_query": ...}}. The conditional check is seeing if we have figured out the CTE-alias mapping for this SELECT statement. So for:

WITH cte_in_outer_query AS (
  SELECT 1
)

# outer_select
SELECT * FROM (
  WITH cte_in_inner_query AS (
    SELECT 1
  )
  # inner_select
  SELECT * FROM cte_in_inner_query JOIN cte_in_outer_query
)

We first visit outer_select. cte_alias_mapping_exists(outer_select) == False so we add the mapping to the lookup which then looks like:

{
  "outer_select": {"cte_in_outer_query": ...}`
}

Then we visit inner_select. cte_alias_mapping_exists(inner_select) == False as well so we add the mapping "outer_select": {"cte_in_outer_query": ...}. The lookup now looks like:

{
  "outer_select": {"cte_in_outer_query": ...},
  "inner_select": {"cte_in_outer_query": ..., "cte_in_inner_query": ...}
}
  • To make this easier to follow, I've parted out the logic for building a SqlCteAliasMappingLookup to a separate class (see last commit).
  • There is a issue with the way that a cache in DataflowNodeToSqlCteVisitor works right now that is related to this, but it doesn't cause errors (will have to address in a separate PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh helpful, thank you!!

Base automatically changed from p--sql-updates--01 to main January 21, 2025 21:26
@plypaul plypaul force-pushed the p--sql-updates--02 branch from bde7869 to dee1010 Compare January 21, 2025 21:27
@plypaul plypaul merged commit 532d9f8 into main Jan 22, 2025
11 checks passed
@plypaul plypaul deleted the p--sql-updates--02 branch January 22, 2025 01:00
plypaul added a commit that referenced this pull request Jan 22, 2025
…b-queries (#1614)

Similar to #1613, but for
`SqlRewritingSubQueryReducer`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants