Spark External Table Bugs #53
Comments
Thanks for checking this out @bdelamotte!
Here's what I'd hope would work:
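The snippet itself isn't captured above. As a sketch, reusing the names from this issue and the Spark syntax the package README documents, the declaration might look like:

```yaml
version: 2

sources:
  - name: dbt_databricks
    tables:
      - name: my_source_table
        external:
          # wildcard path from the issue; Spark resolves this at creation time
          location: "s3://my-data-bucket/bla/*/invoice_items.json"
          using: json
```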
For the first issue I ran into, I will redouble my testing effort this week and see if I still hit it.
I think I found a different potential bug. If you create an external table and then change a column's data type, the refreshed table will still have the old data type. For example, I had an external table with a column of type Decimal and changed it to Double, but after re-running, the column still came through as Decimal. I propose that either the external tables get dropped and re-created each time, or we add a --full-refresh flag to the command to trigger this.
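For illustration, the kind of change being described might look like this in the source declaration (the column name `amount` is hypothetical; column syntax per the package README):

```yaml
        columns:
          - name: amount          # hypothetical column name
            data_type: double     # was decimal; the refresh keeps the old type
```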
@bdelamotte Right on, I've sorta-hacked a full-refresh mechanism for exactly this.
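The package documents a full-refresh toggle passed as a command-line var, which is presumably the hack being referenced here:

```sh
# drop and re-create the external tables instead of refreshing them in place
dbt run-operation stage_external_sources --vars "ext_full_refresh: true"
```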
Good to know I can do the full refresh with the above way you provided. Thank you @jtcohen6! I think we found one more bug last week: if we run the command against a schema that doesn't exist yet, it errors out. I think the way to reproduce it is to drop your schema entirely and run this command, and you'll run into the issue.
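A repro sketch under that reading, assuming the schema matches the source name from the issue:

```sql
-- in Spark: drop the schema the external tables live in
DROP SCHEMA IF EXISTS dbt_databricks CASCADE;
```

Running `dbt run-operation stage_external_sources` at that point errors out, since the operation assumes the target schema already exists.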
Good point, that's definitely true. Should dbt create the schema if it doesn't exist? What do you think? This is one of the wrinkles that exists because `stage_external_sources` is a run-operation, rather than a "real" built-in task. It would be very cool to support it as one someday (dbt-labs/dbt-core#2381).
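A minimal sketch of the guard under discussion, assuming the schema name from the issue; the operation could issue this before creating each external table:

```sql
CREATE SCHEMA IF NOT EXISTS dbt_databricks;
```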
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the bug
I tested out #51 (review) today by using the following declarations in the `packages.yml` file. I came across two issues.

First: after you create an external table with the `dbt run-operation stage_external_sources` command, you cannot reference it with the normal source function, like so: `SELECT * FROM {{ source('dbt_databricks', 'my_source_table') }}`. It throws an error saying that the relation is required, so it seems dbt doesn't have knowledge of the external table in its cache. What worked as a workaround was creating a dbt Relation directly and then referencing that. Something like the sketch below.
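A sketch of that workaround, assuming dbt's `api.Relation.create` Jinja helper and the names from the issue (the exact code from the report isn't shown):

```sql
-- build the Relation directly instead of calling source()
{% set my_source_table = api.Relation.create(
    schema='dbt_databricks',
    identifier='my_source_table',
    type='table'
) %}

SELECT * FROM {{ my_source_table }}
```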
"s3://my-data-bucket/bla/*/invoice_items.json" where * is a date_range, the external table only picks up the paths that are created. For example if I have:
Load data in s3://my-data-bucket/bla/Jan/invoice_items.json
Load data in s3://my-data-bucket/bla/Feb/invoice_items.json
Load data in s3://my-data-bucket/bla/March/invoice_items.json
Run
dbt run-operation stage_external_sources
Load data in s3://my-data-bucket/bla/April/invoice_items.json
Query external table now
In the last query, I won't see April data at all. In order to get the April data to show up, I had to do a
drop table <source_name
and then re-rundbt run-operation stage_external_sources
My only guess as to why this is happening is that, behind the covers, the Hive metastore only evaluates the location path once, at creation time, so a full drop table needs to happen before recreating the external table definition in order for new data to show up.
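That guess is consistent with Spark external tables resolving the location glob at creation time. A sketch of the drop-and-recreate workaround, assuming the schema matches the source name from the issue:

```sql
-- run against Spark; dropping an EXTERNAL table removes only metastore
-- metadata, the files in S3 are untouched
DROP TABLE IF EXISTS dbt_databricks.my_source_table;
```

Re-running `dbt run-operation stage_external_sources` afterwards recreates the table against the paths that exist now, so the April data appears.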
Steps to reproduce
Expected results
Actual results
System information
The contents of your `packages.yml` file:
Which database are you using dbt with?
The output of `dbt --version`:
The operating system you're using:
The output of `python --version`:
Additional context