Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new get_catalog_relations macro, Supporting Changes #8648

Merged
merged 12 commits into from
Sep 29, 2023

Conversation

peterallenwebb
Copy link
Contributor

@peterallenwebb peterallenwebb commented Sep 13, 2023

resolves #8521

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

…lations in a schema the adapter should return data about
@cla-bot cla-bot bot added the cla:yes label Sep 13, 2023
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (f17c1f3) 86.65% compared to head (d10e802) 86.50%.
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8648      +/-   ##
==========================================
- Coverage   86.65%   86.50%   -0.15%     
==========================================
  Files         176      176              
  Lines       25712    25804      +92     
==========================================
+ Hits        22280    22323      +43     
- Misses       3432     3481      +49     
Flag Coverage Δ
integration 83.27% <69.56%> (-0.14%) ⬇️
unit 65.11% <52.17%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
core/dbt/adapters/base/relation.py 86.02% <33.33%> (-7.87%) ⬇️
core/dbt/adapters/base/impl.py 81.94% <72.09%> (-1.64%) ⬇️

... and 15 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@mikealfare mikealfare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through your draft and have some comments. Most of them are just thoughts that I'm throwing against the wall, so take them with a grain of salt; I'm not strongly for or against anything that I said.

core/dbt/adapters/base/impl.py Show resolved Hide resolved
core/dbt/adapters/base/impl.py Outdated Show resolved Hide resolved
@peterallenwebb peterallenwebb marked this pull request as ready for review September 19, 2023 17:57
@peterallenwebb peterallenwebb requested review from a team as code owners September 19, 2023 17:57
ezraerb and others added 5 commits September 25, 2023 09:22
* Use profile specified in --profile with dbt init

* Update .changes/unreleased/Fixes-20230424-161642.yaml

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>

* Refactor run() method into functions, replace exit() calls with exceptions

* Update help text for profile option

---------

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is the lines that show up as untested, which looks like it mainly happens because none of our test cases contain more than 100 relations. I assume that part was hand tested? I wonder if we should create a large test case... but I don't think we need to hold this up for that.

Otherwise looks good.

Copy link
Contributor

@mikealfare mikealfare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a bunch of comments; let me know if you'd like to discuss any of them. Also, this PR lacks a changelog, do we need one?

@@ -415,6 +418,29 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap:
lowercase strings.
"""
info_schema_name_map = SchemaSearchMap()
relations = self._get_catalog_relations(manifest)
for relation in relations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't put my finger on why, but I feel like this piece should be a method on SchemaSearchMap, something like SchemaSearchMap.extend(relations). Then this becomes:

info_schema_name_map = SchemaSearchMap()
info_schema_name_map.extend(self._get_catalog_relations(manifest))
return info_schema_name_map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @mikealfare, following this up, since I didn't incorporate every one of your requested changes.

My thinking at the time was that the code I had was tested and working, so I didn't want continue tweaking relatively small details after we had already done a lot of collaboration and gone multiple rounds of review. I should have at least left a note to that effect.

However, if you think this or any of the other remaining issues ought to be addressed before 1.7.0, let me know and I am happy to open a follow up issue.

core/dbt/adapters/base/relation.py Show resolved Hide resolved


{% macro postgres__get_catalog(information_schema, schemas) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason I thought schemas here was a list of strings (the schema names), and not a list of BaseRelation objects that act as schemas (by not defining identifier). If that's the case, the where clause produced below should return nothing (relation.identifier is "falsy" because of jinja things, upper(sch.nspname) = upper('{{ relation.schema }}') fails, I think because it becomes upper(sch.nspname) = upper('')).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schemas parameter to this macro is indeed a list of strings. The loop below converts it to relations: List[obj] where the obj is acting like a BaseRelation (specifically a "schema relation") for the limited purpose of calling get_catalog_where_clause.

So when this macro is called with schemas = [ "schema_a", "schema_b", "schema_c" ], it will in turn call get_catalog_where_clause with relations = [ { "schema": "schema_a" }, { "schema": "schema_b" }, { "schema": "schema_c" } ].

This is just to make it easier to reuse get_catalog_where_clause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, does jinja treat relation.schema the same as relation.get("schema") or relation["schema"]?

upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%}
{%- endfor -%}
)
{{ postgres__get_catalog_where_clause(relations) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the where clause is not reused, I would keep it in line here.

I made it a separate macro because I needed it twice (old implementation an new implementation). However, I think I like your approach at the adapter level of simply rerouting the existing macro to the new macro, avoiding the need for maintaining two versions of the query.

@@ -222,6 +223,8 @@ class BaseAdapter(metaclass=AdapterMeta):
ConstraintType.foreign_key: ConstraintSupport.ENFORCED,
}

CATALOG_BY_RELATION_SUPPORT = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting a placeholder here to discuss "feature flag-like things".

# databases
return info_schema_name_map

def _get_catalog_relations_by_info_schema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more readable with a defaultdict. Untested pseudo code:

from collections import defaultdict

def _get_catalog_relations_by_info_schema(self, manifest):
    relations_by_info_schema = defaultdict(list)
    for relation in self._get_catalog_relations(manifext):
        relations_by_info_schema[relation.information_schema_only()].append(relation)
    return dict(relations_by_info_schema)

futures.append(fut)
relation_count = len(self._get_catalog_relations(manifest))
if relation_count <= 100 and self.CATALOG_BY_RELATION_SUPPORT:
relations_by_schema = self._get_catalog_relations_by_info_schema(manifest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few observations:

  • We're calling self._get_catalog_relations() twice, once for the length and again inside of self._get_catalog_relations_by_info_schema
  • self._get_catalog_relations_by_info_schema is doing two things, getting the relations and reformatting the resulting list into a dictionary
  • The new method self._get_catalog_relations_by_info_schema is only used once, here
  • relations_by_schema is grouping by the information_schema, not relation.schema, which is misleading
  • relations_by_schema is created to filter a list of relations

Perhaps we could simplify this by reusing the results of self._get_catalog_relations(manifest) (let's refer to this as relations) and filtering relations by info_schema while looping through the info_schema values?

Some pseudo code to add context (and to organize my own thoughts):

def get_catalog(self, manifest):
    with executor(self.config) as tpe:
        futures = []
        relations = self._get_catalog_relations(manifest)
        if self.CATALOG_BY_RELATION_SUPPORT and len(relations) <= 100:
            info_schemas = {relation.information_schema_only() for relation in relations}
            for info_schema in info_schemas:
                name = ".".join([str(info_schema.database), "information_schema"])
                relations_in_this_info_schema = [
                    r for r in relations if r.information_schema_only() == info_schema
                ],
                fut = tpe.submit_connected(
                    self,
                    name,
                    self._get_one_catalog_by_relations,
                    info_schema,
                    relations_in_this_info_schema,
                    manifest,
                )
                futures.append(fut)
        else:
            # old way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this function, it may be worth reviewing the state of the code today if you still have concerns. Gerda refactored it a bit, and made changes along the lines you are suggesting.

@@ -1093,20 +1114,57 @@ def _get_one_catalog(
results = self._catalog_filter_table(table, manifest) # type: ignore[arg-type]
return results

def _get_one_catalog_by_relations(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the name of this macro mean? Put another way, how do I know what this should return? It looks like this runs the new macro, but also limits it to objects in the manifest; is that right? This only gets called once, and in the branch of code where we're passing in an explicit list of relations that we got from the manifest. Is the filter still necessary?

I think there's value in wrapping the macro itself in an adapter method. But if we do need to filter it for some reason, then we should do that after this is called.

QMalcolm pushed a commit that referenced this pull request Oct 9, 2023
* Add new get_catalog_relations macro, allowing dbt to specify which relations in a schema the adapter should return data about

* Implement postgres adapter support for relation filtering on catalog queries

* Code review changes adding feature flag for catalog-by-relation-list support

* Use profile specified in --profile with dbt init (#7450)

* Use profile specified in --profile with dbt init

* Update .changes/unreleased/Fixes-20230424-161642.yaml

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>

* Refactor run() method into functions, replace exit() calls with exceptions

* Update help text for profile option

---------

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>

* add TestLargeEphemeralCompilation (#8376)

* Fix a couple of issues in the postgres implementation of get_catalog_relations

* Add relation count limit at which to fall back to batch retrieval

* Better feature detection mechanism for adapters.

* Code review changes to get_catalog_relations and adapter feature checking

* Add changelog entry

---------

Co-authored-by: ezraerb <ezraerb@alum.mit.edu>
Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-3050] [Applied State] Implement Core Changes for Filtered Catalog Queries
5 participants