[CT-2294] [Feature] Seed JSON and Parquet #7155

Closed
3 tasks done
jpmmcneill opened this issue Mar 10, 2023 · 3 comments
Labels: duplicate, enhancement

Comments

@jpmmcneill
Contributor

jpmmcneill commented Mar 10, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Right now (as far as I am aware) seed files have to be CSVs.

Using something like duckdb (which, critically, is dependency-less), seeds could easily be extended to JSON or Parquet as well, by having something like:

seeds/
  foo.csv
  bar.csv
  foobar.json

->

seeds/
  foo.csv
  bar.csv
  foobar.json
  __dbt_foobar.csv (temporary file, created by duckdb)

From that temporary CSV, the usual seed method could run. Finally, dbt could delete the temp file.
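To make this concrete, here's a rough sketch of the kind of conversion dbt could run through DuckDB's SQL interface. The file names match the hypothetical layout above (plus a made-up Parquet file), and read_json_auto is just one of the ways DuckDB can read the source file:

-- Read the non-CSV seed with DuckDB, then write a temporary CSV
-- that the existing seed machinery can load as usual.
COPY (SELECT * FROM read_json_auto('seeds/foobar.json'))
  TO 'seeds/__dbt_foobar.csv' (HEADER, DELIMITER ',');

-- The same pattern would cover Parquet, e.g. a hypothetical seeds/baz.parquet:
COPY (SELECT * FROM 'seeds/baz.parquet')
  TO 'seeds/__dbt_baz.csv' (HEADER, DELIMITER ',');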

Possibly it's silly to widen the net of things that could be seeded 🤷

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

Yup, if I'm pointed in the right direction! (I think it's probably core/dbt/task/seed.py.)

Anything else?

No response

@jpmmcneill jpmmcneill added the enhancement and triage labels Mar 10, 2023
@github-actions github-actions bot changed the title [Feature] Seed JSON and Parquet [CT-2294] [Feature] Seed JSON and Parquet Mar 10, 2023
@dbeatty10 dbeatty10 self-assigned this Mar 11, 2023
@dbeatty10
Contributor

Hey @jpmmcneill, always good to see you again!

JSON seeds

There's actually a pre-existing issue for the JSON part: #2365. Specifically, the NDJSON format was proposed with one valid JSON value per line (which will most commonly be an object or array).
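For reference (illustrative rows only), an NDJSON seed would look like this, with one standalone JSON document per line:

{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}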

Since the issue you submitted is primarily concerned with JSON seeds, I'm going to close this as a duplicate of #2365.

Parquet seeds

But it sounds like you might have a stand-alone (but related) request to be able to seed Parquet files too.

My initial (and not fully refined) thought on adding official support for Parquet as a seed format is: not at this time*.

The primary reasons:

  1. Supporting Parquet would beg the question of supporting other formats like ORC, Avro, etc.
  2. There are warehouse-specific workarounds to ingest Parquet today (see below for dbt-duckdb).

*With that being said, I would absolutely welcome and encourage you to open a new feature request for Parquet seeds if you want to discuss further!

Parquet in DuckDB

Let's assume you have a valid parquet file located at seeds/my_parquet_seed.parquet. For DuckDB users (using dbt-duckdb), I think parquet "seeds" are possible as-is via syntax like this:
models/my_test.sql

select * from 'seeds/my_parquet_seed.parquet'

This syntax obviously isn't a true seed, because it wouldn't support references like ref("my_parquet_seed"). But you could just do ref("my_test") instead when that is needed/desired.
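For example, a downstream model (hypothetical file name) would select from that wrapper model rather than from a seed:
models/uses_my_parquet_seed.sql

select * from {{ ref('my_test') }}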

@dbeatty10 dbeatty10 closed this as not planned Mar 13, 2023
@dbeatty10 dbeatty10 added the duplicate label and removed the triage label Mar 13, 2023
@dbeatty10 dbeatty10 removed their assignment Mar 13, 2023
@jpmmcneill
Contributor Author

jpmmcneill commented Mar 13, 2023

Hey @dbeatty10, thanks for this! No problem to close.

I agree that supporting n != 1 formats always begs the question of "can you support X?". I also agree that there are warehouse-specific workarounds, but they're sometimes complete pains (I'm thinking specifically of Snowflake here). Really, what attracted me to duckdb for this issue is that it provides a quite headless, lightweight way to turn other data types into CSVs.

So I wasn't really talking about the duckdb adapter specifically! Indeed, adding duckdb as a requirement for dbt core would give josh a tonne of dependency headaches most likely! 😂

Nice that the JSON issue exists. Sorry that I missed it; I'll give it a read.

@dbeatty10
Contributor

Yeah, that's a clever idea, using DuckDB as a lightweight format converter 🧠

Two things that would improve that approach:

  1. CSV seeds in dbt rely on a combination of type inference + column_types configuration. It would be nice if converting from Parquet to a CSV seed also emitted the appropriate column_types configuration (either database-specific or database-agnostic); see the sketch after this list.
  2. Currently, Parquet is one of only a few file formats DuckDB supports for importing. This approach would benefit if DuckDB supported importing more formats.
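On point 1, here's a rough sketch of the kind of column_types configuration such a converter might emit. The seed name and column types are hypothetical; the shape follows dbt's existing seed properties YAML:

# seeds/properties.yml (hypothetical path)
seeds:
  - name: foobar
    config:
      column_types:
        id: integer
        name: varchar
        created_at: timestamp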
