Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

Closed
multimeric opened this issue Nov 2, 2023 · 6 comments
Closed

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

multimeric opened this issue Nov 2, 2023 · 6 comments

Comments

@multimeric
Copy link

DuckDB can execute SELECT statements directly on files (CSV, Parquet etc) without needing to use a function. In this case, the file name is treated as the table name. This is explained in the docs: https://duckdb.org/docs/archive/0.9.1/data/parquet/overview.

Previously this functionality worked fine with dbplyr, up until 2.3.4. However with the upgrade to 2.4.0, it fails. With the below example I'm querying a specific file called metadata.0.2.3.parquet, but you can replace it with any parquet file and the same issue will occur:

duckdb::duckdb() |>
    DBI::dbConnect(drv = _, read_only = TRUE) |>
    dplyr::tbl("metadata.0.2.3.parquet")
It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"
LIMIT 11
Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found!
Candidate tables: "metadata"
Run `rlang::last_trace()` to see where the error occurred.

This may be related to #1388, but the error messages are somewhat different.

@mgirlich
Copy link
Collaborator

mgirlich commented Nov 2, 2023

While the update to 2.4.0 is the reason this failed, I think this is something that should be fixed by duckdb. Please open an issue in the duckdb-r repository.

@mgirlich mgirlich closed this as completed Nov 2, 2023
@multimeric
Copy link
Author

Can you please elaborate on what part of the dbplyr or DBI API the duckdb package is misusing or how it should be fixed?

@mgirlich
Copy link
Collaborator

mgirlich commented Nov 2, 2023

tbl(con, "metadata.0.2.3.parquet") creates the query

SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"

which is correct. duckdb has the special support for reading directly from files if the table would not be quoted.
There is also an open (but inactive for some time) discussion: duckdb/duckdb#8212

@multimeric
Copy link
Author

I'm confused, is this not identical to what I'm doing?

@mgirlich
Copy link
Collaborator

mgirlich commented Nov 2, 2023

I explained why it doesn't work 😉

You want the query

SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"

You could use a workaround with sql(): tbl(con, sql("metadata.0.2.3.parquet")). But I don't think this is a nice way. Rather, duckdb should have a more explicit helper, e.g. something like from_file() so this would read tbl(con, from_file("metadata.0.2.3.parquet")).

@rsund
Copy link
Contributor

rsund commented Nov 2, 2023

How DuckDB-backend handles this is that it checks whether the table exists and if not it removes quotes so that DuckDB parser is able to do the magic, i.e. to perform a replacement scan (such as reading from file) or execute DuckDB function such as read_parquet('file.parquet').

So putting an explicit helper function there is, in my opinion, something that breakes the magic, i.e. you should (for no reason) know when you are dealing with actual database tables and when you just query from some other source of data.

I haven't checked, but probably ident_q() is no longer able to unquote parts of the query, but if sql() is able to do that, then there should be a relatively easy fix that can be done on the DuckDB-backend side.

In any case, the detected problem is not a bug in the DuckDB-backend, but caused by a breaking change in how dbplyr works. That is unfortunate as this probably breaks a large proportion of data processing pipelines that use DuckDB's dbplyr-backend (using parquet or csv files). Those real-world use cases are not in any package tests, so the magnitude of the breaking change was probably difficult to see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants