Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

Closed
sgoley opened this issue Oct 1, 2021 · 10 comments · Fixed by #7186
Closed

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

sgoley opened this issue Oct 1, 2021 · 10 comments · Fixed by #7186
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors seeds Issues related to dbt's seed functionality

Comments

@sgoley
Copy link

sgoley commented Oct 1, 2021

Describe the feature

Use alternative delimiters for seeds besides "," (comma)

Describe alternatives you've considered

Currently, the only option is to load a non-csv seed into database as a temp table and then reference that as a "source".

Additional context

No, not database specific.

Who will this benefit?

Anyone who uses seeds.

@sgoley sgoley added enhancement New feature or request triage labels Oct 1, 2021
@jtcohen6 jtcohen6 removed the triage label Oct 4, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 4, 2021

@sgoley When you say "non-csv file types," are you talking only about tabular data with non-comma delimiters (e.g. TSVs)? Or are you also thinking about other formats for structured data (JSON, XML, ...)?

If it's just a question of delimiters, could you provide a bit more background on the difficulty you're encountering? Is there a blocker to pre-processing your files, so as to switch from tab/semicolon/other delimiter to commas?

We've got another issue already open for newline-delimited JSON support: #2365

@sgoley
Copy link
Author

sgoley commented Oct 5, 2021

Yes, in my case I am specifically talking about non-comma delimiters like TSVs, semi-colon delimited (common in some european countries given the numerical comma standard), or "|" ( 'VERTICAL LINE' U+007C ) delimited files (infrequent but used within US financial systems).

I do completely understand that pre-processing is the only current workable solution, just opening the issue here since I searched and no truly similar issue has been raised or closed.

After reading more into the agate csv reader function, it looks like that particular api does not support a "sep" / "delimiter" parameter which is why I assume this was not supported natively?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 5, 2021

@sgoley You're right, we just call agate's from_csv method for this:
https://github.com/dbt-labs/dbt/blob/f7680379fca80f653e8e7e2d45d7165a0fd864da/core/dbt/clients/agate_helper.py#L146

The good news: in that method, agate itself is happy to pass through any/all keyword arguments supported by python's built-in reader. It accepts **kwargs and passes them into csv.reader().

So I think this could be as simple as:

  • adding a delimiter arg to agate_helper.from_csv
  • adding a delimiter config to seeds (right around here), with default value either , or None (i.e. comma)
  • adjusting the load_agate_table context method
        column_types = self.model.config.column_types
        delimiter = self.model.config.delimiter
        try:
            table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)

So, I'm pretty close to tagging this as a good first issue. I have just a few more questions:

  • What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?
  • Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.
  • Would we be better off saying "no" to all of the above, keep seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing?

@jtcohen6 jtcohen6 added the seeds Issues related to dbt's seed functionality label Oct 5, 2021
@jameseon
Copy link

jameseon commented Jan 17, 2022

Hello @jtcohen6

Following this issue along with JSON related (following #2365 )
As a dbt community member, If I may answer your questions to @sgoley

I understand the poor use cases outlined for dbt here https://docs.getdbt.com/docs/building-a-dbt-project/seeds - for valid/good use cases seeding small amounts of data as part of a build process (sampling ) - writing tests, documentation etc

What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?

-- I believe the community would benefit from adding generalized support for csv - COMMA is a very common character and shows up everywhere. If the data is not escaped or quoted, the data may shift even for simple files.

  • support for different delimiters, text enclosure and escape character at a minimum

Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.

-- Another community benefit, sometimes we need to work with small datasets from different formats. I believe support for CSV (if it has the configurable delimiter with text enclosure and escape character; this covers any character delimited file .txt,.tsv, pipe delimited etc - csv.reader() kwargs should handle this beautifully), json , xml, parquet and ORC: these are becoming common.

Would we be better off saying "no" to all of the above, keep seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing? keeping it simple is great but this limits the entry point to dbt as only comma separated files. Going through

---- As the community grows this is going to be a growing need. Now, dbt seed is still not ideal for loading data into a warehouse, as part of the build process for a data project, sampling (maybe find size limitations to avoid functionality abuse) needs to be recommended as part of best practice to keep the functionality light as is. Yes (dbt is not a a data loader), but can it be used to build an ELT pipeline based on small samples of different file formats? Mocking the (EL) process for common data formats, build pipeline based on a small amount of data. Let's not say "no", say yes 😄 - the community needs these functionalities. The preprocess of a file before entry to dbt is a little hectic and expensive for a modern day tool.

Thank you for considering this. Keep up the great work.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jul 17, 2022
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

@ramonvermeulen
Copy link
Contributor

I opened a pull request that adds this feature into dbt-core: #7186

@jtcohen6
Copy link
Contributor

Thanks @ramonvermeulen! Reopening, given you've picked this up

@jtcohen6 jtcohen6 reopened this Mar 20, 2023
@jtcohen6 jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Language and removed stale Issues that have gone stale Team:Execution labels Mar 20, 2023
QMalcolm pushed a commit that referenced this issue Aug 1, 2023
… (#7186)

* Support configurable delimiter for seed files, default to comma (#3990)

* Update Features-20230317-144957.yaml

* Moved "delimiter" to seed config instead of node config

* Update core/dbt/clients/agate_helper.py

Co-authored-by: Cor <jczuurmond@protonmail.com>

* Update test_contracts_graph_parsed.py

* fixed integration tests

* Added functional tests for seed files with a unique delimiter

* Added docstrings

* Added a test for an empty string configured delimiter value

* whitespace

* ran black

* updated changie entry

* Update Features-20230317-144957.yaml

---------

Co-authored-by: Cor <jczuurmond@protonmail.com>
@Ram-Dev7
Copy link

Ram-Dev7 commented Nov 3, 2024

Hello team, when this issue will be fixed? we have both csv and tsv files in the seeds, seems like tsv is not working in seeds - I'm in version dbt-databricks 1.8.0, Thanks

@ramonvermeulen
Copy link
Contributor

ramonvermeulen commented Nov 4, 2024

Hello team, when this issue will be fixed? we have both csv and tsv files in the seeds, seems like tsv is not working in seeds - I'm in version dbt-databricks 1.8.0, Thanks

@Ram-Dev7
So for my understanding, the issue is that tab separator still does not work as delimiter after my delimiter support contribution last year? Could you provide me with some more context, e.g. the configuration that you currently use? I assume you should configure with a \t as delimiter. Most likely if \t does not work, this has to do with the underlaying library (agate) that is being used and is not supporting this.

UPDATE: I did some tests, and this seems to be the case.

See call with delimiter being passed as argument:

table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)

And function definition in dbt-common:
https://github.com/dbt-labs/dbt-common/blob/65d7ba5d71f4ebea80605408824708bf0e5a99af/dbt_common/clients/agate_helper.py#L152

In that case I think it might be a bit harder to fix, e.g. this needs to be implemented in agate, or a very big change has to happen in how dbt is handling seeds (e.g. using python internals or another library instead of agate?).

But let's first see if we can have a reproducable example, to confirm that this is really the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors seeds Issues related to dbt's seed functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants