
Spark External Table Bugs #53

Closed · 1 of 3 tasks
bdelamotte opened this issue Nov 25, 2020 · 8 comments
Labels: bug, spark, Stale

@bdelamotte

Describe the bug

I tested out #51 (review) today by using the following declarations in the packages.yml file:

  # So we can use https://github.com/fishtown-analytics/dbt-external-tables/pull/51
  - git: "https://github.com/fishtown-analytics/dbt-external-tables"
    revision: master

I came across two issues:

  1. After declaring a Spark external source and creating it with the dbt run-operation stage_external_sources command, you cannot reference it with the normal source() function like so:
    SELECT * FROM {{ source('dbt_databricks', 'my_source_table') }}

It throws an error saying that the relation is required, so it seems dbt doesn't have knowledge of the external table in its cache.

What worked as a workaround was creating a dbt Relation directly and then referencing that, something like:

{% set relation = api.Relation.create(schema='dbt_databricks', identifier='my_source_table') %}
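In full, a minimal sketch of that workaround in a model (the schema and identifier are just the ones from my example above; api.Relation.create also takes a database kwarg if you need one):

    {# create a Relation by hand, bypassing dbt's source cache #}
    {% set relation = api.Relation.create(schema='dbt_databricks', identifier='my_source_table') %}

    select * from {{ relation }}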
  2. The second thing I noticed: if you give an S3 path like
    "s3://my-data-bucket/bla/*/invoice_items.json", where * is a date range, the external table only picks up the paths that exist at creation time. For example, if I:
    1. Load data in s3://my-data-bucket/bla/Jan/invoice_items.json
    2. Load data in s3://my-data-bucket/bla/Feb/invoice_items.json
    3. Load data in s3://my-data-bucket/bla/March/invoice_items.json
    4. Run dbt run-operation stage_external_sources
    5. Load data in s3://my-data-bucket/bla/April/invoice_items.json
    6. Query the external table now

In the last step, I won't see April data at all. In order to get the April data to show up, I had to do a drop table <source_name> and then re-run dbt run-operation stage_external_sources.

My only guess as to why this is happening: behind the covers, the Hive metastore evaluates the path only once, upon creation, so to get new data to show up, a full drop table needs to happen before recreating the external table definition.
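Roughly, the manual workaround corresponds to something like this in Spark SQL (table name, file format, and path are illustrative; the actual DDL the package generates may differ):

    -- drop so the metastore re-evaluates the glob path on recreate
    drop table if exists dbt_databricks.my_source_table;

    create table dbt_databricks.my_source_table
    using json
    location 's3://my-data-bucket/bla/*/invoice_items.json';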

Steps to reproduce

Expected results

Actual results

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • redshift
  • snowflake
  • other (specify: Spark on Databricks)

The output of dbt --version:

<output goes here>

The operating system you're using:

The output of python --version:

Additional context

@bdelamotte added the bug label Nov 25, 2020
@jtcohen6 added the spark label Nov 25, 2020
@jtcohen6 (Collaborator)

Thanks for checking this out @bdelamotte!

  1. I haven't had this experience while developing locally. Using the example here:
    https://github.com/fishtown-analytics/dbt-external-tables/blob/f2924dfb3b9772843d8c44ac5fad9629cb0d2795/integration_tests/models/spark.yml#L3-L10
    I can write this in a model, and it works just fine:
    select * from {{ source('spark_external', 'people_csv_unpartitioned_using') }}
  2. I hope this is what refresh table is here for. If I understand it right, you can refresh the metadata without needing to drop and recreate the table.
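That is, once new files land under the table's existing path, a metadata refresh should be all Spark needs (table name is illustrative):

    -- re-scan the table's location for new files/partitions
    refresh table dbt_databricks.my_source_table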

Here's what I'd hope would work:

  1. Load data in s3://my-data-bucket/bla/Jan/invoice_items.json
  2. Load data in s3://my-data-bucket/bla/Feb/invoice_items.json
  3. Load data in s3://my-data-bucket/bla/March/invoice_items.json
  4. Run dbt run-operation stage_external_sources --vars 'ext_full_refresh: true': this does a drop + recreate
  5. Load data in s3://my-data-bucket/bla/April/invoice_items.json
  6. Run dbt run-operation stage_external_sources: this does a refresh on all existing tables

@bdelamotte (Author) commented Nov 30, 2020

I will redouble my testing effort this week and see if I still run into the first issue.
I totally didn't know about refresh table! I will try that too! Thanks @jtcohen6

@bdelamotte (Author)

I think I found a different potential bug. If you create an external table and then change a column's data type, refresh table will still show the old data type. For example, I had an external table with a column of type Decimal and changed it to Double, but after re-running dbt run-operation stage_external_sources, the table just gets refreshed and still has the old data type.

I propose that either the external tables get dropped and re-created each time, or we add a --full-refresh flag to the command to trigger this.

@jtcohen6 (Collaborator)

@bdelamotte Right on, I've sorta-hacked the --full-refresh flag (since it isn't supported by run-operation otherwise) using a CLI var. Try running: dbt run-operation stage_external_sources --vars 'ext_full_refresh: true'
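Under the hood, the package just reads that CLI var; conceptually it's something like this sketch (not the literal package code, and the relation name is illustrative):

    {# full refresh requested via --vars, since run-operation has no --full-refresh flag #}
    {% if var('ext_full_refresh', false) %}
        drop table if exists {{ relation }}
    {% endif %}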

@bdelamotte (Author)

Good to know I can do the full refresh the way you provided above. Thank you @jtcohen6!

I think we found one more bug last week. If we run dbt run-operation stage_external_sources for a schema that doesn't exist yet, the command errors out. Since I always ran dbt seed before this command, I never hit it. Then a co-worker of mine got slim CI working, and since there were no changes to the seed files, nothing ran that would fire off the creation of the new schema, and I ran into this error.

To reproduce: drop your schema entirely, run the command, and you'll run into the issue.

@jtcohen6 (Collaborator)

Good point, that's definitely true. What do you think: should dbt run create schema if not exists, for each source, before creating the external table?
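Concretely, that would mean emitting something like this ahead of the create table (schema name is illustrative):

    -- ensure the target schema exists before staging the external table
    create schema if not exists dbt_databricks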

This is one of the wrinkles that exists because stage_external_sources is a run-operation rather than a "real" builtin task. It would be very cool to support it as one someday (dbt-labs/dbt-core#2381).

github-actions bot commented Aug 5, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label Aug 5, 2023
github-actions bot

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned Aug 13, 2023