
General project considerations #1

Closed · dcereijodo opened this issue Oct 21, 2019 · 11 comments

@dcereijodo
Contributor

As I mentioned here, I have worked with Spectrum tables in Redshift from DBT for a while now, and I am very interested in having some sort of support for that in a DBT package. I can see the project is already quite advanced, but here are a few thoughts you might consider.

Development flow

From the README I understand the user will be responsible for triggering the update of external tables by running dbt run-operation stage_external_sources. That's OK, but it adds a step to the development flow that it seems shouldn't be necessary (define your source + dbt run-operation stage_external_sources + write a select statement + dbt run). It would be interesting if this could happen automatically, prior to any downstream model that selects from that external table.
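
For reference, a minimal sketch of that flow as I understand it (the staging model name here is made up):

# 1. define or modify the external table under sources in the YML file
# 2. create or update the external tables in the warehouse
dbt run-operation stage_external_sources
# 3. build the staging model that selects from that external source
dbt run -m stg_my_events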

Complex fields

One of the key capabilities of spectrum from my point of view is the ability to explode nested data with a simple query. I understand that when using this package, this will happen when selecting from the source into a staging model, but can you define struct, array or map fields in the YAML?

Jinja

The source DDL definition in the source YAML file could probably benefit from Jinja injection (I am thinking about shared fields across many tables that can be defined once and reused, columns that follow some pattern...). Is it possible to squeeze that into the YAML somehow?

Because of these issues (and because we had no idea sources could be extended as they are in this project :)) my original approach to handling external tables with DBT was 1) creating a specific DBT project for handling spectrum tables and 2) within that project, creating a custom materialization for executing the DDL and loading the data with a Spectrum select. So, defining a model like this:

-- models/spectrum_table.sql

{%- set external_fields -%}

  {{ some_common_fields_across_different_models() }},
  field1 varchar(128),
  field2 struct<
   field21:integer,
   field22:array<integer>
  >

{%- endset -%}

-- the 'spectrum_incremental' materialization simply makes sure to execute the external table creation and adding the partitions before the select is executed ('create table as' or 'insert into', it depends on whether it runs in incremental mode or not)
{{
  config(
    materialized='spectrum_incremental',
    external_fields_ddl=external_fields,
    s3_path='s3://datalake-files/my-table',
    partition_range=['2019-01-01', '2019-08-27'],
    serde='org.apache.hadoop.hive.serde2.OpenCSVSerde'
  )
}}

select 
  field1,
  field2.field21

from {{ external_table(this) }}  -- simply selects from the external table created by the materialization

{% if is_incremental() %}
-- filter to the partitions of interest
{% endif %}

and materializing the table with dbt run -m spectrum_table. However, defining the external tables in the sources sounds good too!

@jtcohen6
Collaborator

@dcereijodo Thank you for your thoughts here! I've been playing around with external tables for several months now, and I'm really grateful for your perspective.

Model or source?

Running through your comment is a significant thread: how can we best fit external tables into existing dbt constructs? Your approach treats external tables as a special kind of model, with a separate materialization strategy. This was something I thought about a lot before deciding that, to my mind, external tables are sources with special properties. Here's some of my rationale:

  • Syntax: In the SQL syntaxes I've looked at, external table creation is create table ([columns]), whereas dbt models have all to this point been create table as (select ...) (a minimal contrast is sketched just after this list). The addition of column data_type as a YML metadata field may make it possible to create dbt models that are a combination of create table ([columns]) + insert ([model SQL]), which can be useful in specific cases (e.g. user-supplied column compression on Redshift) but not in the general case.
  • Function: To my mind, external tables are configuration, not transformation. While there is some opinionated knowledge baked into external table config, creating an external table doesn't really involve any business logic. The questions and trip-ups—source data types, file storage organization, structural consistency—feel to me like extraction/ingest/load concerns, rather than analytics concerns per se. I often think of raw source data in the warehouse as the handoff point between a data engineer and a data analyst. In that organizational model, dbt sources are the middle where they meet—if a source is not fresh, or an external table is improperly defined, that's a problem they need to solve together. When a dbt model fails, in my experience, that is an error owned by analysts.
  • Prior art: Continuing that thread, there is precedent for dbt knowing a little bit about ETL, both from user-supplied YML and user-triggered queries: source freshness. Of all preexisting dbt features, this felt closest to what I envisioned for external tables. dbt source snapshot-freshness runs separately from the main dbt run command, and for good reason—its failure concerns systems and processes apart from in-warehouse transformation, and often other people as well.
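
As a minimal illustration of that syntax difference (Redshift-flavored DDL; the table names, columns, and bucket are made up):

-- external table: a column list plus file/location metadata, no select involved
create external table spectrum.events (
    event_id varchar(36),
    collector_date date
)
stored as parquet
location 's3://my-bucket/events/';

-- dbt model: always a transformation materialized from a select
create table analytics.stg_events as (
    select event_id, collector_date
    from spectrum.events
);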

That said, there are a few weaknesses to this source-based approach where a model-based one currently has the advantage:

  • Environments: At this moment, dbt sources cannot change the values of database, schema, or identifier based on the target, in the way that models do naturally. This makes sense—all users should be using the same source data, really—but it makes the process of adding, testing, and especially modifying external table configuration a bit harder between dev and prod. I believe that dbt will someday make the target context available to source definitions; for now, it's an issue with the order in which things are parsed. I think this is a tricky question, though, and I'm interested in thinking more about it.
  • Jinja / YML / DRYness: You raise a great point. We talk about keeping our SQL as DRY (Don't Repeat Yourself) as possible, but when it comes to YML configuration, verbosity is the only way to go. This is already an issue with needing to define the same column sets across many models that are built atop one another—ameliorated, though not resolved, by sticking descriptions in docs blocks. Long term, I think a solution to this might look like reusable YML blocks (a generic YAML-anchor version is sketched below), or the ability to "extend"/"inherit" metadata configuration from another source or model.
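
On the reusable-blocks idea: plain YAML anchors and aliases can get partway there today, at least in principle. This is a generic YAML feature rather than anything dbt-specific, and the second table here is made up purely for illustration:

sources:
  - name: spectrum
    tables:
      - name: customers
        columns: &customer_columns
          - name: id
            data_type: int
          - name: name
            data_type: "struct<given:varchar(20), family:varchar(20)>"
      - name: customers_history
        columns: *customer_columns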

Development flow

the user will be responsible for triggering the update of external tables

The need to regularly run/update external tables depends on the database the user is on:

  • Redshift/Spectrum: External tables without partitions do not need updating unless the source data changes structure or S3 location. Partitions, however, need to be defined manually, with one alter table ... add partition statement for each discrete slice of source data (sketched just below this list). I've tried to solve this programmatically, using dbt macros, but it requires running on a more constant basis for tables whose partition columns are dates/timestamps.
  • Snowflake: External tables without partitions do not need updating unless the source data changes or moves. External tables with partitions require running alter external table ... refresh after creation and any time the metadata is updated (i.e. a new partition's worth of data is added to the storage location), unless you enable auto-refresh, which requires configuring an AWS SQS notification or Azure Blob event to trigger the metadata update.
  • Spark: Once external tables are created, even ones with partitions, they never need updating unless the source data changes or moves.
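
For concreteness, the statements the Redshift and Snowflake bullets refer to look roughly like this (the table names, partition column, and bucket are made up):

-- Redshift/Spectrum: one statement per discrete partition value
alter table spectrum.events
    add if not exists partition (collector_date = '2019-08-27')
    location 's3://my-bucket/events/collector_date=2019-08-27/';

-- Snowflake: refresh the external table's file-level metadata
alter external table events_external refresh;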

Even with the regular need to update partitions in Spectrum, I don't envision this figuring into the development process. I think it could be handled by a separate job in dbt Cloud/Airflow/Jenkins/wherever dbt is deployed that just runs dbt run-operation stage_external_sources to keep production data up to date, on whatever cadence is necessary (daily? hourly?). In the eventual implementation of this, I definitely think there's a compelling case for refreshing only those models which need it:

dbt source stage-external --select source:snowplow.event

Complex fields

You should be able to define complex fields in YML. The data type definition may require characters that are unfriendly to YML, in which case you can enclose the value in quotes. This example from the Redshift docs would look like:

sources:
  - name: spectrum
    tables:
      - name: customers
        external:
          location: "s3://awssampledbuswest2/nested_example/customers/"
          file_format: parquet
        columns:
          - name: id
            data_type: int
          - name: name
            data_type: "struct<given:varchar(20), family:varchar(20)>"
          - name: phones
            data_type: "array<varchar(20)>"
          - name: orders
            data_type: "array<struct<shipdate:timestamp, price:double precision>>"

@dcereijodo
Contributor Author

@jtcohen6, thanks a lot for your explanations :)

About external table configuration updates: my concern about adding an unnecessary step to the development process was more about changes to the external table schema, and not so much about adding partitions. If I want to add a new field to the external table, it's probably because I want that new data to propagate to the staging and subsequent layers. It is in that context that I pictured adding a new field in the source and in the staging model as part of the same "development", and I might expect dbt to provide selection semantics to capture and resolve those dependencies.

For example: if table_b depends on table_a, I know that, during development, I can run dbt run -m table_a+ and DBT will make sure to align table_a, along with everything downstream of it, with whatever the code reflects at that specific time. This has great value to me, and it does not hold when table_a is a source (unless there is a way to "run" sources which I am not aware of).

All in all, I totally buy the approach. Having the external tables defined as sources has the inconveniences you mentioned (no awareness of the target profile, and Jinja being out of scope), but having a full DBT repo for dealing with external sources has some inconveniences too, so count me in for the idea 👍

How do you work around the unavailable target on the source context?

@davehowell

davehowell commented Jan 23, 2020

@dcereijodo

In your example

{{
  config(
    materialized='spectrum_incremental',
    external_fields_ddl=external_fields,
    s3_path='s3://datalake-files/my-table',
    partition_range=['2019-01-01', '2019-08-27'],
    serde='org.apache.hadoop.hive.serde2.OpenCSVSerde'
  )
}}

select 
  field1,
  field2.field21

from {{ external_table(this) }}  -- simply selects from the external table created by the materialization

This code is within a single model called spectrum_table, so what would get created here for the spectrum_incremental materialization?

I assume there would be an external table, but then what is the select statement used for? An external table doesn't need a select statement; it is just DDL. Does it also create a table, view, or ephemeral model? If that's the intention, then I think the this reference is circular, unless there are two things being created with some systematic naming convention?

I agree with jtcohen6 that external table creation is just a metadata operation: it is not possible to wrap it in a CTAS, and it doesn't make sense to create a view or inline CTE (ephemeral) with it, but a view or ephemeral would certainly be appropriate in a base model.

How do you work around the unavailable target on the source context?

Well, you can't, right? You just have to hard-code the schema into the source definition. I don't think that's a big deal.

You made a really interesting point about dbt run -m table_a+ and if the dependency is a source.

It would be nice to grab the source definition from the YAML, compare it with the actual table metadata (e.g. information_schema columns and types), and warn if the columns in the source don't match the actual table columns. You could also generate your sources YAML in the first place (barebones, i.e. without helpful descriptions for the dbt docs) from metadata queries. I did that to get started but didn't think about how to keep it in sync.

Going further, it would be nice to have some way to check that the sources match metadata from the data warehouse, then follow the dependency chain to check if anything downstream is going to break, perhaps with a macro that introspects the graph nodes. It's probably overkill though, because if a column is removed then dbt will error and let you know about it pretty quickly.
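
As a rough Redshift-flavored sketch of the first step of that comparison (the schema and table names are made up; svv_external_columns is the catalog view that covers Spectrum external tables, while information_schema.columns covers ordinary tables):

-- what the warehouse thinks the external table looks like,
-- to diff against the columns declared in the sources yaml
select columnname, external_type
from svv_external_columns
where schemaname = 'spectrum'
  and tablename = 'customers';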

...
" reusable YML blocks" are awful, the syntax is confusing and maintainability is lost. I think yaml is data not code so DRYness doesn't apply the same way. Even with code if readability is lost then DRY is not worth it.

@jtcohen6
Collaborator

Thanks for adding your thoughts on this @davehowell!

How do you work around the unavailable target on the source context?

Well you can't right? You just have to hard-code the schema into the source definition. I don't think that's a big deal.

I totally see the desire to test a fix or update to an external table definition in a scratch schema first, or test changes through a CI process, without impacting anything in production. The way dbt handles this environment control for model materializations is based on target.schema. While this is not possible yet, I could see a version of this that looks like:

sources:
  - name: spectrum
    schema: {{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}

For all dbt development that doesn't require changes to external table definitions, however, it makes sense that I want to select from the same sources as in production. For now, my approach has been to manually change the schema property in development only when I'm testing updates to external table definitions, and to hope that I remember to change it back before I open a PR :)

I'm really interested in the conversation around running external tables as model dependencies. For example, should dbt run -m source:snowplow_external implicitly resolve to something like dbt run-operation stage_external_sources --select snowplow? Should that step be implicitly included in dbt run -m +snowplow? It's not something we need to worry about too much for all the reasons mentioned—if the dependencies misalign, you'll see a database error pretty quickly. I think this adds to an ongoing line of thinking around how we can best define and orchestrate more complex jobs, comprised of several dbt commands.

@davehowell

Hmm, I suppose there is some precedent with model aliases and the overridable macro generate_alias_name.
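
For reference, the default macro is roughly the following (a project can ship its own generate_alias_name to override the behavior):

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {%- if custom_alias_name is none -%}
        {{ node.name }}
    {%- else -%}
        {{ custom_alias_name | trim }}
    {%- endif -%}
{%- endmacro %}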

@dcereijodo
Contributor Author

@davehowell

This code is within a single model called spectrum_table so what would get created here for the spectrum_incremental materialization?

That materialization currently does two things: 1) dropping and re-creating an external table with whatever external_fields I had defined, and 2) selecting from that external table to create a physical table in a raw schema in Redshift. My spectrum models map 1 to 1 with the event types of a JSON events system, so it's pretty convenient if, for example, a new field is added to the event and I want to load it into the DWH: I just need to add a couple of lines to the spectrum_table file and run it with --full-refresh.

If that's the intention then I think the this reference is circular, unless there are two things being created with some systematic naming convention?

this is used by the external_table macro to render the full name of the external table created by the spectrum_incremental materialization. I could use hardcoded table names, i.e. from external_schema.spectrum_table instead of from {{ external_table(this) }}, but then I would need to remember to change the from clause if I rename the model.

I think this is what you meant by two things being created with some systematic naming convention anyways 😄
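
For illustration, the helper is little more than this kind of sketch (external_schema is hardcoded here just for the example, and this is not the exact macro from my project):

-- macros/external_table.sql
{% macro external_table(model) -%}
    external_schema.{{ model.identifier }}
{%- endmacro %}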

@davehowell

Oh wow it's doing both, that makes sense now :D

@jtcohen6
Collaborator

I totally see the desire to test a fix or update to an external table definition in a scratch schema first, or test changes through a CI process, without impacting anything in production. The way dbt handles this environment control for model materializations is based on target.schema. While this is not possible yet, I could see a version of this that looks like:

sources:
  - name: spectrum
    schema: {{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}

I just want to clarify that this is now possible. dbt 0.16.0 (released last week) makes target variables accessible in YML config.

@dcereijodo Thanks for fully explaining your approach to external tables as a custom materialization, that makes a lot of sense! Given that you're heading in a somewhat different direction from this package, I'll leave it up to you if you want to keep this issue/conversation open.

jtcohen6 pushed a commit that referenced this issue Dec 3, 2020
jtcohen6 pushed a commit that referenced this issue Jun 25, 2021
@pgoslatara
Contributor

schema: {{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}

Note for anyone else that encounters this issue in future, I had to use surrounding double quotes for this to work with dbt 0.20.1:

schema: "{{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}"

@github-actions

github-actions bot commented Aug 7, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label on Aug 7, 2023
@github-actions

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned on Aug 15, 2023