
[CT-1769] [Spike] Parsing sans adapter #6549

Closed
jtcohen6 opened this issue Jan 8, 2023 · 3 comments
Labels
enhancement (New feature or request) · python_api (Issues related to dbtRunner Python entry point) · spike

Comments

jtcohen6 commented Jan 8, 2023

Big idea: It should be possible to dbt parse and produce a Manifest without requiring a specific database adapter to be installed. During parsing, dbt isn't actually connecting to any databases (or the Internet at all) — so it feels silly that the adapter plugin is needed.

In this way, parsing is different from compilation (dbt compile), wherein dbt actually does need to run some introspective queries in order to properly template model code, especially if that model Jinja-SQL is dynamically templated based on the result of a query (e.g. dbt_utils.get_column_values).

This is a spike because I'm actually not convinced that it's possible! If it is, though, it would unlock two big things for us:

  • Further separation of parsing + execution (compilation/runtime): only the latter requires an adapter
  • Start loosening the baked-in assumption that an adapter is a singleton registered globally & available at any time

Potential gotchas

I can think of at least five types of information that are used at parse time and require the adapter today. There might be more. I've started with the ones that are most concerning.

  1. Resolving macro dependencies for adapter.dispatch
  2. Parsing user-space Jinja code containing calls to adapter.custom_method()
  3. QuotePolicy + quoting config
  4. Validation of Profile
  5. Validation of adapter-specific configs

1. Calls to adapter.dispatch

This requires a full registry of all the macros defined in all installed adapters.

Two use cases:

  1. Resolving these macros at parse time, if required as input to ref/source/config
  2. Determining which macros will be called at runtime, to properly populate each node's depends_on.macros → support custom generic tests + state:modified.macros

Example:

-- macros/some_macro.sql
{% macro postgres__some_macro() %}
    {{ return(true) }}
{% endmacro %}

{% macro default__some_macro() %}
    {{ return(false) }}
{% endmacro %}

-- models/my_model.sql
{{ config(
    enabled = adapter.dispatch("some_macro")()
) }}

select 1 as id

$ dbt -q ls -s my_model --output json --output-keys depends_on config --profile garage-postgres
{"config": {... "materialized": "table", ...}, "depends_on": {"macros": ["macro.my_dbt_project.postgres__some_macro"], "nodes": []}}

$ dbt -q ls -s my_model --output json --output-keys depends_on config --profile sandbox-bigquery
{"config": {... "materialized": "view", ...}, "depends_on": {"macros": ["macro.my_dbt_project.default__some_macro"], "nodes": []}}
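To make the search order implied by the example above concrete, here is a minimal Python sketch of how `adapter.dispatch` resolves a macro name: try the adapter-prefixed name first, then fall back to `default__`. The registry and function names are illustrative, not dbt's internal API.

```python
def dispatch(macro_registry: dict, adapter_type: str, name: str):
    """Return the first matching macro for this adapter type, or raise."""
    for candidate in (f"{adapter_type}__{name}", f"default__{name}"):
        if candidate in macro_registry:
            return macro_registry[candidate]
    raise KeyError(f"no macro found for {name!r} on adapter {adapter_type!r}")

# Stand-ins for the two macros defined in macros/some_macro.sql above
registry = {
    "postgres__some_macro": lambda: True,
    "default__some_macro": lambda: False,
}

assert dispatch(registry, "postgres", "some_macro")() is True
assert dispatch(registry, "bigquery", "some_macro")() is False
```

This is why the full macro registry is needed even at parse time: without knowing which adapters are installed, the first branch of the search can't be resolved, and `depends_on.macros` comes out wrong.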

2. Parsing user-space Jinja code containing calls to adapter.custom_method()

{% set some_argument = 'value' %}
{% set something = adapter.custom_method(some_argument) %}
select {{ something }} as id

Because we actually render complex Jinja at parse time, in order to capture calls to ref/source/config (+ macro dependencies!), we'd need a way to bypass those adapter.* calls. (A "dummy" adapter class that has any method attribute, accepting any arguments?) So long as those methods aren't actually used to resolve ref/source/config, we should be in okay shape.

I still think we could land ourselves in type hell, though. What if custom_method returns a specific type, which is then important for subsequent logic?

{% set something = adapter.custom_method(some_argument) %}
{# 'something' needs to be a str — how would we know that? #}
{% set something_else = something.startswith('some_prefix') %}

The way we've solved for this to date is via the @available.parse decorator on adapter methods, which stubs out an empty return value with the correct type. But how could/would we get those types, if not by having access to the adapter / those method definitions at parse time?
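A hedged sketch of how a parse-replacement decorator along those lines can stub a typed return value. The decorator name and mechanics here are illustrative, not dbt's actual `@available.parse` implementation.

```python
def available_parse(parse_replacement):
    """Attach a parse-time stub to an adapter method (illustrative)."""
    def decorator(func):
        func._parse_replacement = parse_replacement
        return func
    return decorator

class Adapter:
    @available_parse(lambda *args, **kwargs: "")  # stub returns an empty str
    def custom_method(self, arg):
        raise NotImplementedError("would run an introspective query at runtime")

# At parse time, call the typed stub instead of the real method:
something = Adapter.custom_method._parse_replacement("some_argument")
assert isinstance(something, str)
assert something.startswith("some_prefix") is False  # str methods just work
```

The open question in the text stands, though: these typed stubs live on the adapter's method definitions, so without the adapter installed there is nowhere to get them from.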

3. quoting config

https://docs.getdbt.com/reference/project-configs/quoting

To figure out whether the database.schema.identifier for a given resource should be quoted, and how to do it if so, we look to a combination of user-supplied configuration (in dbt_project.yml), and the quoting_policy + quote_character defined in the adapter plugin's Relation class. (For example, dbt-snowflake has quoting disabled by default, and dbt-bigquery uses a backtick instead of " as its quoting character.)

We could move the resolution of quoting later, after parsing and during compilation, when we'd need to have the adapter plugin available. E.g. The relation_name property has been resolved at compile time in the past, although we just moved it up to parse time in #6427.
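As a concrete illustration of the quoting inputs described above, here is a sketch that combines a per-part quote policy with an adapter-specific quote character. The dataclass and function names are made up for illustration, not dbt's `Relation` API.

```python
from dataclasses import dataclass

@dataclass
class QuotePolicy:
    database: bool = True
    schema: bool = True
    identifier: bool = True

def render_relation(parts, policy, quote_char='"'):
    """Render database.schema.identifier, quoting each part per the policy."""
    flags = (policy.database, policy.schema, policy.identifier)
    return ".".join(
        f"{quote_char}{part}{quote_char}" if quoted else part
        for part, quoted in zip(parts, flags)
    )

# Postgres-style: quote everything with double quotes
assert render_relation(("db", "sch", "tbl"), QuotePolicy()) == '"db"."sch"."tbl"'
# Snowflake-style: quoting disabled by default
assert render_relation(("db", "sch", "tbl"), QuotePolicy(False, False, False)) == "db.sch.tbl"
# BigQuery-style: backtick as the quote character
assert render_relation(("proj", "ds", "tbl"), QuotePolicy(), "`") == "`proj`.`ds`.`tbl`"
```

Both inputs on the right-hand side of this function (`policy` defaults and `quote_char`) come from the adapter plugin, which is why rendering `relation_name` at parse time forces the adapter to be present.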

4. Validation of target info

We need to load the relevant profile, and build the target context variable, in order to resolve logic like this at parse time, which may change the shape of the DAG:

{{ config(
    enabled = (target.type == 'bigquery')
)}}

However, the dataclass that properly validates the credentials lives in the adapter plugin. It also includes default values for target attributes which the user may not have specified in profiles.yml, and which are exposed to the {{ target }} context var via explicit inclusion in the _connection_keys method.

For example, take the client_session_keepalive boolean config in dbt-snowflake, which is False by default. If the user doesn't specify it in profiles.yml, how can we successfully parse this without access to the adapter?

{{ config(
    enabled = target.client_session_keepalive
)}}

(This is a silly example: there's no reason for my model to be enabled/disabled depending on that config in particular, but I include it to prove a point about what's technically possible.)
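A sketch of why this is hard: the default values live on the credentials dataclass inside the plugin. The class below is illustrative (it borrows the config name from the example above), not dbt-snowflake's actual dataclass.

```python
from dataclasses import dataclass, asdict

@dataclass
class SnowflakeCredentialsSketch:
    """Illustrative stand-in for an adapter plugin's credentials dataclass."""
    account: str
    user: str
    client_session_keepalive: bool = False  # default the user may omit

# What the user actually wrote in profiles.yml: no keepalive key at all
profile_data = {"account": "abc123", "user": "jerco"}

creds = SnowflakeCredentialsSketch(**profile_data)
target = asdict(creds)  # roughly what _connection_keys-style logic exposes

assert target["client_session_keepalive"] is False  # default filled in
assert target["account"] == "abc123"
```

Without the plugin installed, the parse-time `{{ target }}` dict has no way to know that the missing key should default to `False` rather than being absent entirely.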

5. Validation of adapter-specific configs

There are "adapter-specific" model configs that, in theory, use a dataclass defined in the adapter plugin for their validation (+ default values if not specified). I'm pretty sure these don't work at all today. From #5236:

Adapter-specific configs (sort of discussed in #4622): I'm actually not sure if AdapterConfig works at all today. My sense is that their keys don't show up in the manifest unless set, their types aren't really validated until used (at runtime), and their default values don't get passed through. The code here is kinda jank, to put it nicely.
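For reference, a minimal sketch of what working adapter-specific config validation could look like: reject unknown keys against a dataclass and fill in defaults. The dataclass here is illustrative, not dbt-bigquery's real AdapterConfig.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional

@dataclass
class BigQueryConfigSketch:
    """Illustrative adapter-specific config dataclass (not dbt's real one)."""
    cluster_by: Optional[str] = None
    hours_to_expiration: Optional[int] = None

def resolve_adapter_config(user_config: dict, config_cls) -> dict:
    """Reject unknown keys, then let the dataclass fill in defaults."""
    allowed = {f.name for f in fields(config_cls)}
    unknown = set(user_config) - allowed
    if unknown:
        raise ValueError(f"unknown adapter config keys: {sorted(unknown)}")
    return asdict(config_cls(**user_config))

resolved = resolve_adapter_config({"cluster_by": "id"}, BigQueryConfigSketch)
assert resolved == {"cluster_by": "id", "hours_to_expiration": None}
```

Even in this toy form, the validation dataclass has to come from somewhere, and today that somewhere is the installed adapter plugin.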

@jtcohen6 added the enhancement, python_api, spike, and Team:Execution labels on Jan 8, 2023
@github-actions github-actions bot changed the title [Spike] Parsing sans adapter [CT-1769] [Spike] Parsing sans adapter Jan 8, 2023
jtcohen6 commented Jan 9, 2023

notes from talking to @gshank

Right now the adapter is "all one big ball of wax" :)

Longer-term — how could we clearly lock down the set of adapter-specific information that's available at parse time? Is it desirable to move toward a world where parsing is completely adapter-agnostic, if it means losing things too (e.g. the parse-time availability of relation_name)?

As a step on the way there — could we imagine splitting the adapter up into "parse-time" adapter and "run-time" adapter? The "parse-time" adapter should be installable without third-party dependencies. It would contain just:

  • Static configuration / information that's available at parse time (relation/quoting, credentials dataclass)
  • Methods that are @available.parse (but these could call methods from third-party dependencies...)
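One way to sketch that split, assuming `typing.Protocol` as the mechanism (all names here are illustrative): a parse-time interface exposing only static information, and a run-time interface that extends it with connection-backed methods.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ParseTimeAdapter(Protocol):
    """Static info available without any third-party driver installed."""
    def quote_character(self) -> str: ...
    def quote_policy(self) -> dict: ...

@runtime_checkable
class RunTimeAdapter(ParseTimeAdapter, Protocol):
    """Extends the parse-time surface with connection-backed methods."""
    def execute(self, sql: str) -> list: ...

class StaticPostgresInfo:
    """Parse-time half: pure Python, no database driver imports needed."""
    def quote_character(self) -> str:
        return '"'
    def quote_policy(self) -> dict:
        return {"database": True, "schema": True, "identifier": True}

# The parse-time half satisfies the parse-time protocol, but not the
# run-time one, since it has no execute() and no connection.
assert isinstance(StaticPostgresInfo(), ParseTimeAdapter)
assert not isinstance(StaticPostgresInfo(), RunTimeAdapter)
```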

nathaniel-may commented

Goal of this spike is to estimate the level of effort for actually doing this work, and to identify potential gotchas.

ChenyuLInx commented

As a step on the way there — could we imagine splitting the adapter up into "parse-time" adapter and "run-time" adapter? The "parse-time" adapter should be installable without third-party dependencies. It would contain just:

  • Static configuration / information that's available at parse time (relation/quoting, credentials dataclass)
  • Methods that are @available.parse (but these could call methods from third-party dependencies...)

IMHO it is okay to make that step on the way the final destination. The idea: instead of importing those methods from the actual third-party dependency, we can just create a fake one for parse time.

@jtcohen6 jtcohen6 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 3, 2023