[CT-2294] [Feature] Seed JSON and Parquet #7155

Closed
3 tasks done
jpmmcneill opened this issue Mar 10, 2023 · 3 comments
Labels: duplicate, enhancement

Comments

@jpmmcneill
Contributor

jpmmcneill commented Mar 10, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Right now (as far as I am aware) seed files have to be CSVs.

Using something like duckdb (which, critically, is dependency-less), seeds could easily be extended to JSON or Parquet as well, by having something like:

seeds/
  foo.csv
  bar.csv
  foobar.json

->

seeds/
  foo.csv
  bar.csv
  foobar.json
  __dbt_foobar.csv (temporary file, created by duckdb)

From that temporary CSV, the usual seed method could run. Finally, dbt could delete the temp file.
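To make this concrete, here's a rough sketch of the kind of conversion dbt could run through DuckDB's SQL interface. The file names match the hypothetical layout above (plus a made-up Parquet file), and read_json_auto is just one of the ways DuckDB can read the source file:

-- Read the non-CSV seed with DuckDB, then write a temporary CSV
-- that the existing seed machinery can load as usual.
COPY (SELECT * FROM read_json_auto('seeds/foobar.json'))
  TO 'seeds/__dbt_foobar.csv' (HEADER, DELIMITER ',');

-- The same pattern would cover Parquet, e.g. a hypothetical seeds/baz.parquet:
COPY (SELECT * FROM 'seeds/baz.parquet')
  TO 'seeds/__dbt_baz.csv' (HEADER, DELIMITER ',');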

Possibly it's silly to widen the net of things that could be seeded 🤷

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

Yup, if I'm pointed in the right direction! (I think it's probably core/dbt/task/seed.py.)

Anything else?

No response

@jpmmcneill jpmmcneill added the enhancement and triage labels Mar 10, 2023
@github-actions github-actions bot changed the title [Feature] Seed JSON and Parquet [CT-2294] [Feature] Seed JSON and Parquet Mar 10, 2023
@dbeatty10 dbeatty10 self-assigned this Mar 11, 2023
@dbeatty10
Contributor

Hey @jpmmcneill, always good to see you again!

JSON seeds

There's actually a pre-existing issue for the JSON part: #2365. Specifically, the NDJSON format was proposed with one valid JSON value per line (which will most commonly be an object or array).
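For reference (illustrative rows only), an NDJSON seed would look like this, with one standalone JSON document per line:

{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}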

Since the issue you submitted is primarily concerned with JSON seeds, I'm going to close this as a duplicate of #2365.

Parquet seeds

But it sounds like you might have a stand-alone (but related) request to be able to seed Parquet files too.

My initial (and not fully refined) thought on adding official support for Parquet as a seed format is: not at this time*.

The primary reasons:

  1. Supporting Parquet would beg the question of supporting other formats like ORC, Avro, etc.
  2. There are warehouse-specific workarounds to ingest Parquet today (see below for dbt-duckdb).

*With that being said, I would absolutely welcome and encourage you to open a new feature request for Parquet seeds if you want to discuss further!

Parquet in DuckDB

Let's assume you have a valid parquet file located at seeds/my_parquet_seed.parquet. For DuckDB users (using dbt-duckdb), I think parquet "seeds" are possible as-is via syntax like this:
models/my_test.sql

select * from 'seeds/my_parquet_seed.parquet'

This syntax obviously isn't a true seed, because it wouldn't support references like ref("my_parquet_seed"). But you could just do ref("my_test") instead when that is needed/desired.
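For example, a downstream model (hypothetical file name) would select from that wrapper model rather than from a seed:
models/uses_my_parquet_seed.sql

select * from {{ ref('my_test') }}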

@dbeatty10 dbeatty10 closed this as not planned Mar 13, 2023
@dbeatty10 dbeatty10 added the duplicate label and removed the triage label Mar 13, 2023
@dbeatty10 dbeatty10 removed their assignment Mar 13, 2023
@jpmmcneill
Contributor Author

jpmmcneill commented Mar 13, 2023

Hey @dbeatty10, thanks for this! No problem to close.

I agree that supporting n != 1 formats always begs the question of "can you support X?". I also agree that there are warehouse-specific workarounds, but they're sometimes complete pains (I'm thinking specifically of Snowflake here). Really, what attracted me to duckdb for this issue is that it provides a quite headless, lightweight way to turn other data types into CSVs.

So I wasn't really talking about the duckdb adapter specifically! Indeed, adding duckdb as a requirement for dbt core would give josh a tonne of dependency headaches most likely! 😂

Nice that the JSON issue exists. Sorry that I missed it; I'll give it a read.

@dbeatty10
Contributor

Yeah, that's a clever idea, using DuckDB as a lightweight format converter 🧠

Two things that would improve that approach:

  1. CSV seeds in dbt rely on a combination of type inference + column_types configuration. It would be nice if converting from Parquet to a CSV seed also emitted the appropriate column_types configuration (either database-specific or database-agnostic); see the sketch after this list.
  2. Currently, Parquet is one of only a few file formats DuckDB supports for importing. This approach would benefit if DuckDB supported importing more formats.
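On point 1, here's a rough sketch of the kind of column_types configuration such a converter might emit. The seed name and column types are hypothetical; the shape follows dbt's existing seed properties YAML:

# seeds/properties.yml (hypothetical path)
seeds:
  - name: foobar
    config:
      column_types:
        id: integer
        name: varchar
        created_at: timestamp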
