-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For SqlColumnPrunerOptimizer
- add support for CTEs in sub-queries
#1613
Changes from all commits
4294701
5f94aee
dee1010
a201900
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
from __future__ import annotations | ||
|
||
import logging | ||
from typing import Dict | ||
|
||
from metricflow_semantics.mf_logging.lazy_formattable import LazyFormat | ||
|
||
from metricflow.sql.sql_plan import SqlCteAliasMapping, SqlSelectStatementNode | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
class SqlCteAliasMappingLookup: | ||
"""A mutable lookup that stores the CTE-alias mapping at a given node. | ||
|
||
In cases with nested CTEs in a SELECT, it's possible that a CTE defined in an inner SELECT has an alias that is the | ||
same as a CTE defined in an outer SELECT. e.g. | ||
|
||
# outer_cte | ||
WITH cte_0 AS ( | ||
SELECT 1 AS col_0 | ||
) | ||
|
||
# outer_select | ||
SELECT col_0 | ||
FROM ( | ||
|
||
# inner_cte | ||
WITH cte_0 AS ( | ||
SELECT 2 AS col_0 | ||
) | ||
# inner_select | ||
SELECT col_0 FROM cte_0 | ||
) | ||
... | ||
|
||
In this case, `outer_cte` and `inner_cte` both have the same alias `cte_0`. When `cte_0` is referenced from | ||
`inner_select`, it is referring to the `inner_cte`. For column pruning, it is necessary to figure out which CTE | ||
a given alias is referencing, so this class helps to keep track of that mapping. | ||
""" | ||
|
||
def __init__(self) -> None: # noqa: D107 | ||
self._select_node_to_cte_alias_mapping: Dict[SqlSelectStatementNode, SqlCteAliasMapping] = {} | ||
|
||
def cte_alias_mapping_exists(self, select_node: SqlSelectStatementNode) -> bool: | ||
"""Returns true if the CTE-alias mapping for the given node has been recorded.""" | ||
return select_node in self._select_node_to_cte_alias_mapping | ||
|
||
def add_cte_alias_mapping( | ||
self, | ||
select_node: SqlSelectStatementNode, | ||
cte_alias_mapping: SqlCteAliasMapping, | ||
) -> None: | ||
"""Associate the given CTE-alias mapping with the given node. | ||
|
||
Raises an exception if a mapping already exists. | ||
""" | ||
if select_node in self._select_node_to_cte_alias_mapping: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any scenario where the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a call to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, looks like some caching in |
||
raise RuntimeError( | ||
str( | ||
LazyFormat( | ||
"`select_node` node has already been added,", | ||
# child_select_node=child_select_node, | ||
select_node=select_node, | ||
current_mapping=self._select_node_to_cte_alias_mapping, | ||
) | ||
) | ||
) | ||
|
||
self._select_node_to_cte_alias_mapping[select_node] = cte_alias_mapping | ||
|
||
def get_cte_alias_mapping(self, select_node: SqlSelectStatementNode) -> SqlCteAliasMapping: | ||
"""Return the CTE-alias mapping for the given node. | ||
|
||
Raises an exception if a mapping was not previously added for the given node. | ||
""" | ||
cte_alias_mapping = self._select_node_to_cte_alias_mapping.get(select_node) | ||
if cte_alias_mapping is None: | ||
raise RuntimeError( | ||
str(LazyFormat("CTE alias mapping does not exist for the given `select_node`", select_node=select_node)) | ||
) | ||
return cte_alias_mapping |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
from __future__ import annotations | ||
|
||
import logging | ||
from contextlib import contextmanager | ||
from typing import Iterator | ||
|
||
from metricflow_semantics.mf_logging.lazy_formattable import LazyFormat | ||
from typing_extensions import override | ||
|
||
from metricflow.sql.optimizer.cte_alias_to_cte_node_mapping import SqlCteAliasMappingLookup | ||
from metricflow.sql.sql_plan import ( | ||
SqlCreateTableAsNode, | ||
SqlCteAliasMapping, | ||
SqlCteNode, | ||
SqlPlanNode, | ||
SqlPlanNodeVisitor, | ||
SqlSelectQueryFromClauseNode, | ||
SqlSelectStatementNode, | ||
SqlTableNode, | ||
) | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
class SqlCteAliasMappingLookupBuilderVisitor(SqlPlanNodeVisitor[None]): | ||
"""Traverses the SQL plan and builds the associated `SqlCteAliasMappingLookup`. | ||
|
||
Please see `SqlCteAliasMappingLookup` for more details. | ||
""" | ||
|
||
def __init__(self) -> None: # noqa: D107 | ||
self._current_cte_alias_mapping = SqlCteAliasMapping() | ||
self._cte_alias_mapping_lookup = SqlCteAliasMappingLookup() | ||
|
||
@contextmanager | ||
def _save_current_cte_alias_mapping(self) -> Iterator[None]: | ||
previous_cte_alias_mapping = self._current_cte_alias_mapping | ||
yield | ||
self._current_cte_alias_mapping = previous_cte_alias_mapping | ||
|
||
def _default_handler(self, node: SqlPlanNode) -> None: | ||
"""Default recursive handler to visit the parents of the given node.""" | ||
for parent_node in node.parent_nodes: | ||
with self._save_current_cte_alias_mapping(): | ||
parent_node.accept(self) | ||
return | ||
|
||
@override | ||
def visit_cte_node(self, node: SqlCteNode) -> None: | ||
return self._default_handler(node) | ||
|
||
@override | ||
def visit_select_statement_node(self, node: SqlSelectStatementNode) -> None: | ||
"""Based on required column aliases for this SELECT, figure out required column aliases in parents.""" | ||
logger.debug( | ||
LazyFormat( | ||
"Starting visit of SELECT statement node with CTE alias mapping", | ||
node=node, | ||
current_cte_alias_mapping=self._current_cte_alias_mapping, | ||
) | ||
) | ||
|
||
if self._cte_alias_mapping_lookup.cte_alias_mapping_exists(node): | ||
return self._default_handler(node) | ||
|
||
# Record that can see the CTEs defined in this SELECT node and outer SELECT statements. | ||
# CTEs defined in this select node should override ones that were defined in the outer SELECT in case | ||
# of CTE alias collisions. | ||
self._current_cte_alias_mapping = self._current_cte_alias_mapping.merge( | ||
SqlCteAliasMapping.create({cte_node.cte_alias: cte_node for cte_node in node.cte_sources}) | ||
) | ||
self._cte_alias_mapping_lookup.add_cte_alias_mapping( | ||
select_node=node, | ||
cte_alias_mapping=self._current_cte_alias_mapping, | ||
) | ||
|
||
return self._default_handler(node) | ||
|
||
@override | ||
def visit_table_node(self, node: SqlTableNode) -> None: | ||
self._default_handler(node) | ||
|
||
@override | ||
def visit_query_from_clause_node(self, node: SqlSelectQueryFromClauseNode) -> None: | ||
self._default_handler(node) | ||
|
||
@override | ||
def visit_create_table_as_node(self, node: SqlCreateTableAsNode) -> None: # noqa: D102 | ||
self._default_handler(node) | ||
|
||
@property | ||
def cte_alias_mapping_lookup(self) -> SqlCteAliasMappingLookup: | ||
"""Returns the lookup created after traversal.""" | ||
return self._cte_alias_mapping_lookup |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
test_name: test_common_cte_aliases_in_nested_query | ||
test_filename: test_cte_column_pruner.py | ||
docstring: | ||
Test the case where a CTE defined in the top-level SELECT has the same name as a CTE in a sub-query . | ||
expectation_description: | ||
In the `from_sub_query`, there is a reference to `cte_source__col_0` in a CTE named `cte_source`. Since | ||
`from_sub_query` redefines `cte_source`, the column pruner should retain that column in the CTE defined | ||
in `from_sub_query` but remove the column from the CTE defined in `top_level_select`. | ||
--- | ||
optimizer: | ||
SqlColumnPrunerOptimizer | ||
|
||
sql_before_optimizing: | ||
-- top_level_select | ||
WITH cte_source AS ( | ||
-- CTE source | ||
SELECT | ||
test_table_alias.col_0 AS cte_source__col_0 | ||
, test_table_alias.col_1 AS cte_source__col_1 | ||
FROM test_schema.test_table test_table_alias | ||
) | ||
|
||
SELECT | ||
from_source_alias.from_source__col_0 AS top_level__col_0 | ||
, right_source_alias.right_source__col_1 AS top_level__col_1 | ||
FROM ( | ||
-- from_sub_query | ||
WITH cte_source AS ( | ||
-- CTE source | ||
SELECT | ||
test_table_alias.col_0 AS cte_source__col_0 | ||
, test_table_alias.col_1 AS cte_source__col_1 | ||
FROM test_schema.test_table test_table_alias | ||
) | ||
|
||
SELECT | ||
from_source_alias.cte_source__col_0 AS from_source__col_0 | ||
FROM cte_source from_source_alias | ||
) from_source_alias | ||
INNER JOIN ( | ||
-- joined_sub_query | ||
SELECT | ||
from_source_alias.cte_source__col_1 AS right_source__col_1 | ||
FROM cte_source from_source_alias | ||
) right_source_alias | ||
ON | ||
from_source_alias.from_source__col_0 = right_source_alias.right_source__col_1 | ||
|
||
sql_after_optimizing: | ||
-- top_level_select | ||
WITH cte_source AS ( | ||
-- CTE source | ||
SELECT | ||
test_table_alias.col_1 AS cte_source__col_1 | ||
FROM test_schema.test_table test_table_alias | ||
) | ||
|
||
SELECT | ||
from_source_alias.from_source__col_0 AS top_level__col_0 | ||
, right_source_alias.right_source__col_1 AS top_level__col_1 | ||
FROM ( | ||
-- from_sub_query | ||
WITH cte_source AS ( | ||
-- CTE source | ||
SELECT | ||
test_table_alias.col_0 AS cte_source__col_0 | ||
FROM test_schema.test_table test_table_alias | ||
) | ||
|
||
SELECT | ||
from_source_alias.cte_source__col_0 AS from_source__col_0 | ||
FROM cte_source from_source_alias | ||
) from_source_alias | ||
INNER JOIN ( | ||
-- joined_sub_query | ||
SELECT | ||
from_source_alias.cte_source__col_1 AS right_source__col_1 | ||
FROM cte_source from_source_alias | ||
) right_source_alias | ||
ON | ||
from_source_alias.from_source__col_0 = right_source_alias.right_source__col_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would they be defined with the same alias? Shouldn't we use something similar to
_next_unique_table_alias()
to build CTE aliases? Or is the problem that both theDataflowToSqlQueryPlanConverter
and the optimizer won't have access to that function to ensure unique aliases?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something like
_next_unique_table_alias
is used to generate the CTE aliases, then there would be no issue. However, it would be something that you would have to remember to do inDataflowToSqlQueryPlanConverter
so handling this case is more defensive.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I definitely was planning to use something similar to
_next_unique_table_alias()
inDataflowToSqlQueryPlanConverter
for CTE aliases. If you have a strong opinion about handling this defensively, that's fine and we can leave it here. Just might be nice to simplify the logic if it's not necessary!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My aim is to try to maintain the contract: if the optimizer is fed in valid SQL, then it should output valid SQL. Otherwise, it becomes harder to use / makes a trap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh ok I think I was misunderstanding. I was imagining the CTEs would always be bubbled up to the top level, even if there were defined when visiting a subquery. I didn't realize that you could have a CTE that lives inside a subquery. In that case, your logic makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about your use case and thought
CTE that lives inside a subquery
was going to be a better fit since it was only locally needed? There's flexibility though, so let me merge this so that you can try it out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, either should be fine from an execution standpoint, but keeping the CTEs at the top level might be most readable!