dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

multimeric · 2023-11-02T01:17:31Z

DuckDB can execute SELECT statements directly on files (CSV, Parquet etc) without needing to use a function. In this case, the file name is treated as the table name. This is explained in the docs: https://duckdb.org/docs/archive/0.9.1/data/parquet/overview.

Previously this functionality worked fine with dbplyr, up until 2.3.4. However with the upgrade to 2.4.0, it fails. With the below example I'm querying a specific file called metadata.0.2.3.parquet, but you can replace it with any parquet file and the same issue will occur:

duckdb::duckdb() |>
    DBI::dbConnect(drv = _, read_only = TRUE) |>
    dplyr::tbl("metadata.0.2.3.parquet")

It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"
LIMIT 11
Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found!
Candidate tables: "metadata"
Run `rlang::last_trace()` to see where the error occurred.

This may be related to #1388, but the error messages are somewhat different.

The text was updated successfully, but these errors were encountered:

mgirlich · 2023-11-02T05:37:13Z

While the update to 2.4.0 is the reason this failed, I think this is something that should be fixed by duckdb. Please open an issue in the duckdb-r repository.

multimeric · 2023-11-02T05:40:10Z

Can you please elaborate on what part of the dbplyr or DBI API the duckdb package is misusing or how it should be fixed?

mgirlich · 2023-11-02T05:47:53Z

tbl(con, "metadata.0.2.3.parquet") creates the query

SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"

which is correct. duckdb has the special support for reading directly from files if the table would not be quoted.
There is also an open (but inactive for some time) discussion: duckdb/duckdb#8212

multimeric · 2023-11-02T05:50:05Z

I'm confused, is this not identical to what I'm doing?

mgirlich · 2023-11-02T05:53:44Z

I explained why it doesn't work 😉

You want the query

SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"

You could use a workaround with sql(): tbl(con, sql("metadata.0.2.3.parquet")). But I don't think this is a nice way. Rather, duckdb should have a more explicit helper, e.g. something like from_file() so this would read tbl(con, from_file("metadata.0.2.3.parquet")).

rsund · 2023-11-02T19:59:31Z

How DuckDB-backend handles this is that it checks whether the table exists and if not it removes quotes so that DuckDB parser is able to do the magic, i.e. to perform a replacement scan (such as reading from file) or execute DuckDB function such as read_parquet('file.parquet').

So putting an explicit helper function there is, in my opinion, something that breakes the magic, i.e. you should (for no reason) know when you are dealing with actual database tables and when you just query from some other source of data.

I haven't checked, but probably ident_q() is no longer able to unquote parts of the query, but if sql() is able to do that, then there should be a relatively easy fix that can be done on the DuckDB-backend side.

In any case, the detected problem is not a bug in the DuckDB-backend, but caused by a breaking change in how dbplyr works. That is unfortunate as this probably breaks a large proportion of data processing pipelines that use DuckDB's dbplyr-backend (using parquet or csv files). Those real-world use cases are not in any package tests, so the magnitude of the breaking change was probably difficult to see.

multimeric mentioned this issue Nov 2, 2023

Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found! stemangiola/CuratedAtlasQueryR#121

Closed

mgirlich closed this as completed Nov 2, 2023

multimeric mentioned this issue Nov 2, 2023

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables duckdb/duckdb-r#38

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

rsund commented Nov 2, 2023

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables #1390

Comments

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

multimeric commented Nov 2, 2023

mgirlich commented Nov 2, 2023

rsund commented Nov 2, 2023