-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For SqlColumnPrunerOptimizer
- add support for CTEs in sub-queries
#1613
Conversation
684c3c6
to
bde7869
Compare
SqlColumnPrunerOptimizer
- Add support CTEs in subqueriesSqlColumnPrunerOptimizer
- Add support for CTEs in subqueries
SqlColumnPrunerOptimizer
- Add support for CTEs in subqueriesSqlColumnPrunerOptimizer
- add support for CTEs in sub-queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good assuming we do expect alias collisions in valid plans - but I'm wondering if the collision-handling is necessary if we can build unique aliases appropriately.
class SqlCteAliasMappingLookup: | ||
"""A mutable lookup that stores the CTE-alias mapping at a given node. | ||
|
||
In cases with nested CTEs in a SELECT, it's possible that a CTE defined in an inner SELECT has an alias that is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would they be defined with the same alias? Shouldn't we use something similar to _next_unique_table_alias()
to build CTE aliases? Or is the problem that both the DataflowToSqlQueryPlanConverter
and the optimizer won't have access to that function to ensure unique aliases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something like _next_unique_table_alias
is used to generate the CTE aliases, then there would be no issue. However, it would be something that you would have to remember to do in DataflowToSqlQueryPlanConverter
so handling this case is more defensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I definitely was planning to use something similar to _next_unique_table_alias()
in DataflowToSqlQueryPlanConverter
for CTE aliases. If you have a strong opinion about handling this defensively, that's fine and we can leave it here. Just might be nice to simplify the logic if it's not necessary!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My aim is to try to maintain the contract: if the optimizer is fed in valid SQL, then it should output valid SQL. Otherwise, it becomes harder to use / makes a trap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh ok I think I was misunderstanding. I was imagining the CTEs would always be bubbled up to the top level, even if there were defined when visiting a subquery. I didn't realize that you could have a CTE that lives inside a subquery. In that case, your logic makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about your use case and thought CTE that lives inside a subquery
was going to be a better fit since it was only locally needed? There's flexibility though, so let me merge this so that you can try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, either should be fine from an execution standpoint, but keeping the CTEs at the top level might be most readable!
|
||
Raises an exception if a mapping already exists. | ||
""" | ||
if select_node in self._select_node_to_cte_alias_mapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any scenario where the select_node
won't be unique in the dataflow plan at this point, and a valid plan will error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a call to cte_alias_mapping_exists
before it's added, so it shouldn't throw an error. But a non-unique select_node
also means that the generated SQL plan is no longer a tree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looks like some caching in DataflowNodeToSqlCteVisitor
might cause select_node won't be unique
. It doesn't cause issues at the moment, but let me taking a look on how to avoid that.
if cte_node is not None: | ||
self._current_required_column_alias_mapping.add_aliases(cte_node, column_aliases) | ||
# `visit_cte_node` will handle propagating the required aliases to all CTEs that this CTE node depends on. | ||
cte_node.accept(self) | ||
|
||
def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None: | ||
"""Based on required column aliases for this SELECT, figure out required column aliases in parents.""" | ||
# If this SELECT node defines any CTEs, it should override ones that were defined in the outer SELECT in case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question here - why do we expect collisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we do not. But once we start adding CTE nodes within DataflowToSqlQueryPlanConverter
, there's the potential if unique aliases aren't enforced.
if cte_node is not None: | ||
self._current_required_column_alias_mapping.add_aliases(cte_node, column_aliases) | ||
# `visit_cte_node` will handle propagating the required aliases to all CTEs that this CTE node depends on. | ||
cte_node.accept(self) | ||
|
||
def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None: | ||
"""Based on required column aliases for this SELECT, figure out required column aliases in parents.""" | ||
# If this SELECT node defines any CTEs, it should override ones that were defined in the outer SELECT in case | ||
# of CTE alias collisions. | ||
if not self._cte_node_lookup.cte_alias_mapping_exists(node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking my understanding - this works because the CTE nodes from the inner query will be added to the mapping before the nodes from the outer query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually the other way around e.g. for
WITH cte_in_outer_query AS (
SELECT 1
)
SELECT * FROM (
WITH cte_in_inner_query AS (
SELECT 1
)
SELECT * FROM cte_in_inner_query JOIN cte_in_outer_query
)
We add the mapping {"cte_in_outer_query": ..}
then {cte_in_outer_query": ..., "cte_in_inner_query": ...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused then 🤔 With this conditional, won't we only add the mapping if the node isn't in the mapping yet? In which case, the outer node will override the inner node, which is the opposite of what the comment says?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, the mapping refers to the whole {"cte_in_outer_query": ..., "cte_in_inner_query": ...}
where as the lookup is like {"inner_select": {"cte_in_outer_query": ..., "cte_in_inner_query": ...}}
. The conditional check is seeing if we have figured out the CTE-alias mapping for this SELECT statement. So for:
WITH cte_in_outer_query AS (
SELECT 1
)
# outer_select
SELECT * FROM (
WITH cte_in_inner_query AS (
SELECT 1
)
# inner_select
SELECT * FROM cte_in_inner_query JOIN cte_in_outer_query
)
We first visit outer_select
. cte_alias_mapping_exists(outer_select) == False
so we add the mapping to the lookup which then looks like:
{
"outer_select": {"cte_in_outer_query": ...}`
}
Then we visit inner_select
. cte_alias_mapping_exists(inner_select) == False
as well so we add the mapping "outer_select": {"cte_in_outer_query": ...}
. The lookup now looks like:
{
"outer_select": {"cte_in_outer_query": ...},
"inner_select": {"cte_in_outer_query": ..., "cte_in_inner_query": ...}
}
- To make this easier to follow, I've parted out the logic for building a
SqlCteAliasMappingLookup
to a separate class (see last commit). - There is a issue with the way that a cache in
DataflowNodeToSqlCteVisitor
works right now that is related to this, but it doesn't cause errors (will have to address in a separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh helpful, thank you!!
…e supplied for SQL optimizer tests.
bde7869
to
dee1010
Compare
Previously, the
SqlColumnPrunerOptimizer
only accounted for CTEs defined at the top level as in current use, theDataflowToSqlPlanConverter
only generated SQL plans in that way. To allow for generation of SQL plans with CTEs defined in sub-queries (at any depth), this PR updatesSqlColumnPrunerOptimizer
to keep track of the CTE alias mapping / CTE context when traversing the SQL plan.