General project considerations #1
Comments
@dcereijodo Thank you for your thoughts here! I've been playing around with external tables for several months now, and I'm really grateful for your perspective.

Model or source?
Running through your comment is a significant thread: how can we best fit external tables into existing dbt constructs? Your approach treats external tables as a special kind of model, with a separate materialization strategy. This was something I thought about a lot before deciding that, to my mind, external tables are sources with special properties. Here's some of my rationale:
That said, there are a few weaknesses to this source-based approach where a model-based one currently has the advantage:
Development flow
The need to regularly run/update external tables depends on the database the user is on:
Even with the regular need to update partitions in Spectrum, I don't envision this figuring into the development process. I think it could be handled by a separate job in dbt Cloud/Airflow/Jenkins/wherever dbt is deployed that just runs `dbt run-operation stage_external_sources`.
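As one hypothetical illustration of such a job (GitHub Actions and the `dbt-redshift` package are my own example choices here, not anything prescribed by this package or thread), a scheduled workflow could look roughly like this:

```yml
# Hypothetical scheduled job that refreshes external tables.
# The schedule, adapter package, and profile handling are illustrative only.
name: refresh-external-tables
on:
  schedule:
    - cron: "0 * * * *"  # hourly, as an example
jobs:
  stage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-redshift
      - name: Stage external sources
        run: dbt run-operation stage_external_sources
        env:
          DBT_PROFILES_DIR: .  # assumes a profiles.yml available to CI
```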
Complex fields
You should be able to define complex fields in YML. The data type definition may require characters that are unfriendly to YML, in which case you can enclose the value in quotes. This example from the Redshift docs would look like:

```yml
sources:
  - name: spectrum
    tables:
      - name: customers
        external:
          location: "s3://awssampledbuswest2/nested_example/customers/"
          file_format: parquet
        columns:
          - name: id
            data_type: int
          - name: name
            data_type: "struct<given:varchar(20), family:varchar(20)>"
          - name: phones
            data_type: "array<varchar(20)>"
          - name: orders
            data_type: "array<struct<shipdate:timestamp, price:double precision>>"
```
@jtcohen6, thanks a lot for your explanations :)

About external table configuration updates, my concern about adding an unnecessary step in the development process was more about changes to the external table schema, and not so much about adding partitions. If I want to add a new field to the external table, it's probably because I want that new data to propagate to the staging and subsequent layers. It was in that context that I pictured adding a new field in the source and in the staging model as part of the same "development", and I might expect dbt to provide selection semantics to capture and resolve those dependencies. For example: if I have that

All in all, I totally buy the approach. Having the external tables defined as sources has the inconveniences you mentioned (unawareness of the target profile and being out of scope for jinja), but having a full DBT repo for dealing with external sources has some inconveniences too, so count me in for the idea 👍

How do you work around the unavailable `target` on the source context?
In your example
This code is within a single model called , I assume. There would be an external table, but then what is the select statement used for? An external table doesn't need a select statement; it is just DDL. Does it also create a table, view, or ephemeral model? If that's the intention then I think the

I agree with jtcohen6 that external table creation is just a metadata operation: it is not possible to wrap it in CTAS, and it doesn't make sense to create a view or inline CTE (ephemeral) with it, but a view or ephemeral would certainly be appropriate in a base model.

How do you work around the unavailable target on the source context? Well, you can't, right? You just have to hard-code the schema into the source definition. I don't think that's a big deal.

You made a really interesting point about

It would be nice to grab the source definition from yaml, and then compare it with the actual table metadata (e.g. information_schema columns & types stuff) and warn if the columns in the source don't match the actual table columns. You could in the first place also generate your sources yaml (barebones, i.e. without helpful comments for the dbt docs) from metadata queries. I did that to get started, but didn't think about how to manage keeping it in sync.

Going further, it would be nice to have some way to check that the sources match metadata from the data warehouse, then follow the dependency chain to check if anything downstream is going to break, perhaps with a macro that introspects the graph nodes. It's probably overkill though, because if a column is removed then dbt will error and let you know about it pretty quickly. ...
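(As an aside, not something worked out in this thread: on Redshift the metadata for that kind of check is exposed through the `SVV_EXTERNAL_TABLES` and `SVV_EXTERNAL_COLUMNS` system views, so a macro could presumably diff the YAML column definitions against those.)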
Thanks for adding your thoughts on this @davehowell!
I totally see the desire to test a fix or update to an external table definition in a scratch schema first, or test changes through a CI process, without impacting anything in production. The way dbt handles this environment control for model materializations is based on

```yml
sources:
  - name: spectrum
    schema: {{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}
```

(E.g. if the `dev` target's schema were, hypothetically, `dbt_alice`, the source would resolve to `dbt_alice_spectrum`, while every other target would read from the shared `spectrum` schema.)

For all dbt development that doesn't require changes to external table definitions, however, it makes sense that I want to select from the same `spectrum` schema that production uses.

I'm really interested in the conversation around running external tables as model dependencies. For example, should
Hmm, I suppose there is some precedent with model aliases and the overridable `generate_alias_name` macro.
Such a materialization is currently doing two things: 1) dropping and re-creating an external table with whatever
I think this is what you meant by two things being created with some systematic naming convention anyways 😄
Oh wow, it's doing both, that makes sense now :D
I just want to clarify that this is now possible. dbt 0.16.0 (released last week) makes

@dcereijodo Thanks for fully explaining your approach to external tables as a custom materialization, that makes a lot of sense! Given that you're heading in a somewhat different direction from this package, I'll leave it up to you if you want to keep this issue/conversation open.
Note for anyone else that encounters this issue in future, I had to use surrounding double quotes for this to work with dbt 0.20.1:
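The exact snippet isn't preserved above, but presumably it was the Jinja-templated schema from earlier in the thread with the value wrapped in double quotes so the YAML parser doesn't trip over the leading braces, something like:

```yml
sources:
  - name: spectrum
    schema: "{{ target.schema ~ '_spectrum' if target.name == 'dev' else 'spectrum' }}"
```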
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
As I mentioned here, I have worked with spectrum tables in RS from DBT for a while now, and I am very interested in having some sort of support for that in a DBT package. I can see the status of the project is quite advanced, but here are a few thoughts you might consider.
Development flow
From the README I understand the user will be responsible for triggering the update of external tables by running `dbt run-operation stage_external_sources`. That's ok, but it adds a new step to the development flow that it seems shouldn't be necessary (define your source + `dbt run-operation stage_external_sources` + write a `select` statement + `dbt run`). It would be interesting to support that this happens automatically, prior to any downstream `select` from that external table.

Complex fields
One of the key capabilities of spectrum, from my point of view, is the ability to explode nested data with a simple query. I understand that when using this package, this will happen when selecting from the `source` into a `staging` model, but can you define `struct`, `array` or `map` fields in the YAML?

Jinja
The source (DDL) definition in the source YAML file could probably benefit from jinja injection (I am thinking about shared fields across many tables that can be defined once and used in different tables, columns that follow some pattern...). Is it possible to squeeze that somehow into the YAML?
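(Plain YAML anchors and aliases might get part of the way there even without Jinja; this is just a sketch with made-up table and column names, and I haven't checked how dbt's parser handles it in every case. The `external:` details are omitted for brevity.)

```yml
# hypothetical tables/columns, for illustration of YAML anchors only
sources:
  - name: spectrum
    tables:
      - name: orders
        columns: &shared_columns  # shared column list, defined once
          - name: id
            data_type: int
          - name: loaded_at
            data_type: timestamp
      - name: returns
        columns: *shared_columns  # reused via a YAML alias
```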
Because of these issues (and because we had no idea sources could be extended as they are in this project :)) my original approach to handling external tables with DBT was 1) creating a specific DBT project for handling spectrum tables and 2) within that project, creating a custom materialization for executing the DDL and loading data with a select with spectrum. So, defining a model like this:
and materializing the table with `dbt run -m spectrum_table`. However, defining the external tables in the `sources` sounds good too!