
[CT-1769] [Spike] Parsing sans adapter #6549

Closed
jtcohen6 opened this issue Jan 8, 2023 · 3 comments
Labels
enhancement (New feature or request) · python_api (Issues related to dbtRunner Python entry point) · spike

Comments

jtcohen6 commented Jan 8, 2023

Big idea: It should be possible to dbt parse and produce a Manifest without requiring a specific database adapter to be installed. During parsing, dbt isn't actually connecting to any databases (or the Internet at all) — so it feels silly that the adapter plugin is needed.

In this way, parsing is different from compilation (dbt compile), wherein dbt actually does need to run some introspective queries in order to properly template model code, especially if that model Jinja-SQL is dynamically templated based on the result of a query (e.g. dbt_utils.get_column_values).

This is a spike because I'm actually not convinced that it's possible! If it is, though, it would unlock two big things for us:

  • Further separation of parsing + execution (compilation/runtime): only the latter requires an adapter
  • Start loosening the baked-in assumption that an adapter is a singleton registered globally & available at any time

Potential gotchas

I can think of at least five types of information that are used at parse time and require the adapter today. There might be more. I've started with the ones that are most concerning.

  1. Resolving macro dependencies for adapter.dispatch
  2. Parsing user-space Jinja code containing calls to adapter.custom_method()
  3. QuotePolicy + quoting config
  4. Validation of Profile
  5. Validation of adapter-specific configs

1. Calls to adapter.dispatch

This requires a full registry of all the macros defined in all installed adapters.

Two use cases:

  1. Resolving these macros at parse time, if required as input to ref/source/config
  2. Determining which macros will be called at runtime, to properly populate each node's depends_on.macros → support custom generic tests + state:modified.macros

Example:

-- macros/some_macro.sql
{% macro postgres__some_macro() %}
    {{ return(true) }}
{% endmacro %}

{% macro default__some_macro() %}
    {{ return(false) }}
{% endmacro %}

-- models/my_model.sql
{{ config(
    enabled = adapter.dispatch("some_macro")()
) }}

select 1 as id

$ dbt -q ls -s my_model --output json --output-keys depends_on config --profile garage-postgres
{"config": {... "materialized": "table", ...}, "depends_on": {"macros": ["macro.my_dbt_project.postgres__some_macro"], "nodes": []}}

$ dbt -q ls -s my_model --output json --output-keys depends_on config --profile sandbox-bigquery
{"config": {... "materialized": "view", ...}, "depends_on": {"macros": ["macro.my_dbt_project.default__some_macro"], "nodes": []}}
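To make the search order implied by the example above concrete, here is a minimal Python sketch of how `adapter.dispatch` resolves a macro name: try the adapter-prefixed name first, then fall back to `default__`. The registry and function names are illustrative, not dbt's internal API.

```python
def dispatch(macro_registry: dict, adapter_type: str, name: str):
    """Return the first matching macro for this adapter type, or raise."""
    for candidate in (f"{adapter_type}__{name}", f"default__{name}"):
        if candidate in macro_registry:
            return macro_registry[candidate]
    raise KeyError(f"no macro found for {name!r} on adapter {adapter_type!r}")

# Stand-ins for the two macros defined in macros/some_macro.sql above
registry = {
    "postgres__some_macro": lambda: True,
    "default__some_macro": lambda: False,
}

assert dispatch(registry, "postgres", "some_macro")() is True
assert dispatch(registry, "bigquery", "some_macro")() is False
```

This is why the full macro registry is needed even at parse time: without knowing which adapters are installed, the first branch of the search can't be resolved, and `depends_on.macros` comes out wrong.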

2. Parsing user-space Jinja code containing calls to adapter.custom_method()

{% set some_argument = 'value' %}
{% set something = adapter.custom_method(some_argument) %}
select {{ something }} as id

Because we actually render complex Jinja at parse time, in order to capture calls to ref/source/config (+ macro dependencies!), we'd need a way to bypass those adapter.* calls. (A "dummy" adapter class that has any method attribute, accepting any arguments?) So long as those methods aren't actually used to resolve ref/source/config, we should be in okay shape.

I still think we could land ourselves in type hell, though. What if custom_method returns a specific type, which is then important for subsequent logic?

{% set something = adapter.custom_method(some_argument) %}
{# 'something' needs to be a str — how would we know that? #}
{% set something_else = something.startswith('some_prefix') %}

The way we've solved for this to date is via the @available.parse decorator on adapter methods, which stubs out an empty return value with the correct type. But how could/would we get those types, if not by having access to the adapter / those method definitions at parse time?
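A hedged sketch of how a parse-replacement decorator along those lines can stub a typed return value. The decorator name and mechanics here are illustrative, not dbt's actual `@available.parse` implementation.

```python
def available_parse(parse_replacement):
    """Attach a parse-time stub to an adapter method (illustrative)."""
    def decorator(func):
        func._parse_replacement = parse_replacement
        return func
    return decorator

class Adapter:
    @available_parse(lambda *args, **kwargs: "")  # stub returns an empty str
    def custom_method(self, arg):
        raise NotImplementedError("would run an introspective query at runtime")

# At parse time, call the typed stub instead of the real method:
something = Adapter.custom_method._parse_replacement("some_argument")
assert isinstance(something, str)
assert something.startswith("some_prefix") is False  # str methods just work
```

The open question in the text stands, though: these typed stubs live on the adapter's method definitions, so without the adapter installed there is nowhere to get them from.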

3. quoting config

https://docs.getdbt.com/reference/project-configs/quoting

To figure out whether the database.schema.identifier for a given resource should be quoted, and how to do it if so, we look to a combination of user-supplied configuration (in dbt_project.yml), and the quoting_policy + quote_character defined in the adapter plugin's Relation class. (For example, dbt-snowflake has quoting disabled by default, and dbt-bigquery uses a backtick instead of " as its quoting character.)

We could move the resolution of quoting later, after parsing and during compilation, when we'd need to have the adapter plugin available. E.g. The relation_name property has been resolved at compile time in the past, although we just moved it up to parse time in #6427.
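As a concrete illustration of the quoting inputs described above, here is a sketch that combines a per-part quote policy with an adapter-specific quote character. The dataclass and function names are made up for illustration, not dbt's `Relation` API.

```python
from dataclasses import dataclass

@dataclass
class QuotePolicy:
    database: bool = True
    schema: bool = True
    identifier: bool = True

def render_relation(parts, policy, quote_char='"'):
    """Render database.schema.identifier, quoting each part per the policy."""
    flags = (policy.database, policy.schema, policy.identifier)
    return ".".join(
        f"{quote_char}{part}{quote_char}" if quoted else part
        for part, quoted in zip(parts, flags)
    )

# Postgres-style: quote everything with double quotes
assert render_relation(("db", "sch", "tbl"), QuotePolicy()) == '"db"."sch"."tbl"'
# Snowflake-style: quoting disabled by default
assert render_relation(("db", "sch", "tbl"), QuotePolicy(False, False, False)) == "db.sch.tbl"
# BigQuery-style: backtick as the quote character
assert render_relation(("proj", "ds", "tbl"), QuotePolicy(), "`") == "`proj`.`ds`.`tbl`"
```

Both inputs on the right-hand side of this function (`policy` defaults and `quote_char`) come from the adapter plugin, which is why rendering `relation_name` at parse time forces the adapter to be present.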

4. Validation of target info

We need to load the relevant profile, and build the target context variable, in order to resolve logic like this at parse time, which may change the shape of the DAG:

{{ config(
    enabled = (target.type == 'bigquery')
)}}

However, the dataclass that properly validates the credentials lives in the adapter plugin. It also includes default values for target attributes which the user may not have specified in profiles.yml, and which are exposed to the {{ target }} context var via explicit inclusion in the _connection_keys method.

For example, take the client_session_keepalive boolean config in dbt-snowflake, which is False by default. If the user doesn't specify it in profiles.yml, how can we successfully parse this without access to the adapter?

{{ config(
    enabled = target.client_session_keepalive
)}}

(This is a silly example: there's no reason for my model to be enabled/disabled depending on that config in particular, but I include it to prove a point about what's technically possible.)
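A sketch of why this is hard: the default values live on the credentials dataclass inside the plugin. The class below is illustrative (it borrows the config name from the example above), not dbt-snowflake's actual dataclass.

```python
from dataclasses import dataclass, asdict

@dataclass
class SnowflakeCredentialsSketch:
    """Illustrative stand-in for an adapter plugin's credentials dataclass."""
    account: str
    user: str
    client_session_keepalive: bool = False  # default the user may omit

# What the user actually wrote in profiles.yml: no keepalive key at all
profile_data = {"account": "abc123", "user": "jerco"}

creds = SnowflakeCredentialsSketch(**profile_data)
target = asdict(creds)  # roughly what _connection_keys-style logic exposes

assert target["client_session_keepalive"] is False  # default filled in
assert target["account"] == "abc123"
```

Without the plugin installed, the parse-time `{{ target }}` dict has no way to know that the missing key should default to `False` rather than being absent entirely.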

5. Validation of adapter-specific configs

There are "adapter-specific" model configs that, in theory, use a dataclass defined in the adapter plugin for their validation (+ default values if not specified). I'm pretty sure these don't work at all today. From #5236:

Adapter-specific configs (sort of discussed in #4622): I'm actually not sure if AdapterConfig works at all today. My sense is that their keys don't show up in the manifest unless set, their types aren't really validated until used (at runtime), and their default values don't get passed through. The code here is kinda jank, to put it nicely.
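For reference, a minimal sketch of what working adapter-specific config validation could look like: reject unknown keys against a dataclass and fill in defaults. The dataclass here is illustrative, not dbt-bigquery's real AdapterConfig.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional

@dataclass
class BigQueryConfigSketch:
    """Illustrative adapter-specific config dataclass (not dbt's real one)."""
    cluster_by: Optional[str] = None
    hours_to_expiration: Optional[int] = None

def resolve_adapter_config(user_config: dict, config_cls) -> dict:
    """Reject unknown keys, then let the dataclass fill in defaults."""
    allowed = {f.name for f in fields(config_cls)}
    unknown = set(user_config) - allowed
    if unknown:
        raise ValueError(f"unknown adapter config keys: {sorted(unknown)}")
    return asdict(config_cls(**user_config))

resolved = resolve_adapter_config({"cluster_by": "id"}, BigQueryConfigSketch)
assert resolved == {"cluster_by": "id", "hours_to_expiration": None}
```

Even in this toy form, the validation dataclass has to come from somewhere, and today that somewhere is the installed adapter plugin.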

@jtcohen6 added the enhancement, python_api, spike, and Team:Execution labels on Jan 8, 2023
@github-actions github-actions bot changed the title [Spike] Parsing sans adapter [CT-1769] [Spike] Parsing sans adapter Jan 8, 2023
jtcohen6 commented Jan 9, 2023

notes from talking to @gshank

Right now the adapter is "all one big ball of wax" :)

Longer-term — how could we clearly lock down the set of adapter-specific information that's available at parse time? Is it desirable to move toward a world where parsing is completely adapter-agnostic, if it means losing things too (e.g. the parse-time availability of relation_name)?

As a step on the way there — could we imagine splitting the adapter up into "parse-time" adapter and "run-time" adapter? The "parse-time" adapter should be installable without third-party dependencies. It would contain just:

  • Static configuration / information that's available at parse time (relation/quoting, credentials dataclass)
  • Methods that are @available.parse (but these could call methods from third-party dependencies...)
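One way to sketch that split, assuming `typing.Protocol` as the mechanism (all names here are illustrative): a parse-time interface exposing only static information, and a run-time interface that extends it with connection-backed methods.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ParseTimeAdapter(Protocol):
    """Static info available without any third-party driver installed."""
    def quote_character(self) -> str: ...
    def quote_policy(self) -> dict: ...

@runtime_checkable
class RunTimeAdapter(ParseTimeAdapter, Protocol):
    """Extends the parse-time surface with connection-backed methods."""
    def execute(self, sql: str) -> list: ...

class StaticPostgresInfo:
    """Parse-time half: pure Python, no database driver imports needed."""
    def quote_character(self) -> str:
        return '"'
    def quote_policy(self) -> dict:
        return {"database": True, "schema": True, "identifier": True}

# The parse-time half satisfies the parse-time protocol, but not the
# run-time one, since it has no execute() and no connection.
assert isinstance(StaticPostgresInfo(), ParseTimeAdapter)
assert not isinstance(StaticPostgresInfo(), RunTimeAdapter)
```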

nathaniel-may commented

Goal of this spike is to estimate the level of effort for actually doing this work, and to identify potential gotchas.

ChenyuLInx commented

As a step on the way there — could we imagine splitting the adapter up into "parse-time" adapter and "run-time" adapter? The "parse-time" adapter should be installable without third-party dependencies. It would contain just:

  • Static configuration / information that's available at parse time (relation/quoting, credentials dataclass)
  • Methods that are @available.parse (but these could call methods from third-party dependencies...)

IMHO it is okay to make that step on the way the final destination. The idea: instead of importing those methods from the actual third-party dependency, we can just create a fake one for parse time.

@jtcohen6 jtcohen6 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 3, 2023