-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new get_catalog_relations macro, Supporting Changes #8648
Conversation
…lations in a schema the adapter should return data about
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #8648 +/- ##
==========================================
- Coverage 86.65% 86.50% -0.15%
==========================================
Files 176 176
Lines 25712 25804 +92
==========================================
+ Hits 22280 22323 +43
- Misses 3432 3481 +49
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through your draft and have some comments. Most of them are just thoughts that I'm throwing against the wall, so take them with a grain of salt; I'm not strongly for or against anything that I said.
* Use profile specified in --profile with dbt init * Update .changes/unreleased/Fixes-20230424-161642.yaml Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> * Refactor run() method into functions, replace exit() calls with exceptions * Update help text for profile option --------- Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is the lines that show up as untested, which looks like it mainly happens because none of our test cases contain more than 100 relations. I assume that part was hand tested? I wonder if we should create a large test case... but I don't think we need to hold this up for that.
Otherwise looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a bunch of comments; let me know if you'd like to discuss any of them. Also, this PR lacks a changelog, do we need one?
@@ -415,6 +418,29 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap: | |||
lowercase strings. | |||
""" | |||
info_schema_name_map = SchemaSearchMap() | |||
relations = self._get_catalog_relations(manifest) | |||
for relation in relations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't put my finger on why, but I feel like this piece should be a method on SchemaSearchMap
, something like SchemaSearchMap.extend(relations)
. Then this becomes:
info_schema_name_map = SchemaSearchMap()
info_schema_name_map.extend(self._get_catalog_relations(manifest))
return info_schema_name_map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, @mikealfare, following this up, since I didn't incorporate every one of your requested changes.
My thinking at the time was that the code I had was tested and working, so I didn't want continue tweaking relatively small details after we had already done a lot of collaboration and gone multiple rounds of review. I should have at least left a note to that effect.
However, if you think this or any of the other remaining issues ought to be addressed before 1.7.0, let me know and I am happy to open a follow up issue.
|
||
|
||
{% macro postgres__get_catalog(information_schema, schemas) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason I thought schemas
here was a list of strings (the schema names), and not a list of BaseRelation
objects that act as schemas (by not defining identifier
). If that's the case, the where clause produced below should return nothing (relation.identifier is "falsy" because of jinja things, upper(sch.nspname) = upper('{{ relation.schema }}')
fails, I think because it becomes upper(sch.nspname) = upper('')
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schemas
parameter to this macro is indeed a list of strings. The loop below converts it to relations: List[obj] where the obj is acting like a BaseRelation (specifically a "schema relation") for the limited purpose of calling get_catalog_where_clause
.
So when this macro is called with schemas = [ "schema_a", "schema_b", "schema_c" ], it will in turn call get_catalog_where_clause
with relations = [ { "schema": "schema_a" }, { "schema": "schema_b" }, { "schema": "schema_c" } ].
This is just to make it easier to reuse get_catalog_where_clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, does jinja treat relation.schema
the same as relation.get("schema")
or relation["schema"]
?
upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%} | ||
{%- endfor -%} | ||
) | ||
{{ postgres__get_catalog_where_clause(relations) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the where clause is not reused, I would keep it in line here.
I made it a separate macro because I needed it twice (old implementation an new implementation). However, I think I like your approach at the adapter level of simply rerouting the existing macro to the new macro, avoiding the need for maintaining two versions of the query.
core/dbt/adapters/base/impl.py
Outdated
@@ -222,6 +223,8 @@ class BaseAdapter(metaclass=AdapterMeta): | |||
ConstraintType.foreign_key: ConstraintSupport.ENFORCED, | |||
} | |||
|
|||
CATALOG_BY_RELATION_SUPPORT = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting a placeholder here to discuss "feature flag-like things".
# databases | ||
return info_schema_name_map | ||
|
||
def _get_catalog_relations_by_info_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be more readable with a defaultdict
. Untested pseudo code:
from collections import defaultdict
def _get_catalog_relations_by_info_schema(self, manifest):
relations_by_info_schema = defaultdict(list)
for relation in self._get_catalog_relations(manifext):
relations_by_info_schema[relation.information_schema_only()].append(relation)
return dict(relations_by_info_schema)
futures.append(fut) | ||
relation_count = len(self._get_catalog_relations(manifest)) | ||
if relation_count <= 100 and self.CATALOG_BY_RELATION_SUPPORT: | ||
relations_by_schema = self._get_catalog_relations_by_info_schema(manifest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few observations:
- We're calling
self._get_catalog_relations()
twice, once for the length and again inside ofself._get_catalog_relations_by_info_schema
self._get_catalog_relations_by_info_schema
is doing two things, getting the relations and reformatting the resulting list into a dictionary- The new method
self._get_catalog_relations_by_info_schema
is only used once, here relations_by_schema
is grouping by the information_schema, notrelation.schema
, which is misleadingrelations_by_schema
is created to filter a list of relations
Perhaps we could simplify this by reusing the results of self._get_catalog_relations(manifest)
(let's refer to this as relations
) and filtering relations
by info_schema
while looping through the info_schema
values?
Some pseudo code to add context (and to organize my own thoughts):
def get_catalog(self, manifest):
with executor(self.config) as tpe:
futures = []
relations = self._get_catalog_relations(manifest)
if self.CATALOG_BY_RELATION_SUPPORT and len(relations) <= 100:
info_schemas = {relation.information_schema_only() for relation in relations}
for info_schema in info_schemas:
name = ".".join([str(info_schema.database), "information_schema"])
relations_in_this_info_schema = [
r for r in relations if r.information_schema_only() == info_schema
],
fut = tpe.submit_connected(
self,
name,
self._get_one_catalog_by_relations,
info_schema,
relations_in_this_info_schema,
manifest,
)
futures.append(fut)
else:
# old way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this function, it may be worth reviewing the state of the code today if you still have concerns. Gerda refactored it a bit, and made changes along the lines you are suggesting.
@@ -1093,20 +1114,57 @@ def _get_one_catalog( | |||
results = self._catalog_filter_table(table, manifest) # type: ignore[arg-type] | |||
return results | |||
|
|||
def _get_one_catalog_by_relations( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the name of this macro mean? Put another way, how do I know what this should return? It looks like this runs the new macro, but also limits it to objects in the manifest; is that right? This only gets called once, and in the branch of code where we're passing in an explicit list of relations that we got from the manifest. Is the filter still necessary?
I think there's value in wrapping the macro itself in an adapter method. But if we do need to filter it for some reason, then we should do that after this is called.
* Add new get_catalog_relations macro, allowing dbt to specify which relations in a schema the adapter should return data about * Implement postgres adapter support for relation filtering on catalog queries * Code review changes adding feature flag for catalog-by-relation-list support * Use profile specified in --profile with dbt init (#7450) * Use profile specified in --profile with dbt init * Update .changes/unreleased/Fixes-20230424-161642.yaml Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> * Refactor run() method into functions, replace exit() calls with exceptions * Update help text for profile option --------- Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> * add TestLargeEphemeralCompilation (#8376) * Fix a couple of issues in the postgres implementation of get_catalog_relations * Add relation count limit at which to fall back to batch retrieval * Better feature detection mechanism for adapters. * Code review changes to get_catalog_relations and adapter feature checking * Add changelog entry --------- Co-authored-by: ezraerb <ezraerb@alum.mit.edu> Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com>
resolves #8521
Checklist