Hey, I've had a few questions after coming across this project recently:

1. Are there any plans on the roadmap for getting the pg_duckdb extension approved in AWS RDS? We use a managed RDS instance and don't want to set up our own instance or pay another vendor for a cloud offering.
2. Can you create a pg_duckdb table that implicitly reads a Parquet file from S3 whenever you query it?
3. Can that table use Hive-style partitioning, so that the WHERE filters determine which of the multiple Parquet files backing the underlying table/view are read?
4. Can you cache such views to avoid wasting S3 bandwidth? If so, how do you invalidate the cache? Would this be part of a scheduled ETL/ELT pipeline, e.g. periodically ingesting files into the data lake and then recreating the table/view that reads the underlying Parquet file(s)? My main concern is that if a customer-facing frontend uses DuckDB on the backend to return data, the latency (and the cost of downloading the entire Parquet file) may not be acceptable in the long run.
5. Does the pg_duckdb extension store anything in a duck.db file? If yes, what is it used for? If no, can we treat pg_duckdb as "just" an execution engine?
6. How do transactions work across the PostgreSQL and DuckDB engines? E.g. can I read from a Parquet file (or a view/table that queries the underlying Parquet file(s)), build an analytical query, and store the results in a PostgreSQL table?
We don't control what AWS RDS offers on their instances. We're open to having discussions with them, so feel free to request it via a support ticket. If you're interested in a hosted solution, check out Hydra, who helped build pg_duckdb.
I would recommend using a Postgres view to wrap a read_parquet call to do this. Aside from views, there's no equivalent feature in DuckDB that we could expose.
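For example, here is a minimal sketch of that pattern, which also covers the Hive-style partitioning question. The bucket path, column names, and the hive_partitioning option are illustrative, and the exact read_parquet signature and column-access syntax depend on the pg_duckdb version you're running:

```sql
-- Hypothetical Hive-partitioned layout:
--   s3://my-bucket/events/year=2024/month=01/part-0.parquet
-- Wrap the DuckDB read in a plain Postgres view so clients can query it like a table.
CREATE VIEW events AS
SELECT r['event_id'] AS event_id,
       r['user_id']  AS user_id,
       r['year']     AS year,
       r['month']    AS month
FROM read_parquet(
       's3://my-bucket/events/*/*/*.parquet',
       hive_partitioning := true
     ) r;

-- WHERE filters on the partition columns let DuckDB prune which files it reads.
SELECT count(*) FROM events WHERE year = 2024 AND month = 1;
```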
There is a basic caching feature in duckdb.cache. Look for more in this area in the future (including finishing off the docs, sorry about that!). Hydra can also set up a bucket for you that won't have any bandwidth transfer costs when using their service. Cache invalidation is one of the hard problems. We'll have to get back to you on solving that one.
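If it helps, the call looks roughly like this (the object path is made up; check the pg_duckdb version you're running for the exact function name and arguments):

```sql
-- Pull a remote Parquet file into the local DuckDB cache so that subsequent
-- reads don't have to go back to S3.
SELECT duckdb.cache(
  's3://my-bucket/events/year=2024/month=01/part-0.parquet',
  'parquet'
);
```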
No. Such a file would not be a "safe" place to store data for a Postgres system, so it is not used; pg_duckdb can be treated as "just" an execution engine.
See the COPY feature for pushing data out, and INSERT INTO ... SELECT for pulling data in. If you run into issues specifically with transactions, please open a bug report.
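A rough sketch of both directions follows. The paths, table names, and column names are illustrative, and whether a particular mix of DuckDB reads and Postgres writes is allowed inside one transaction depends on the pg_duckdb version, which is exactly the kind of thing to file a bug report about if it surprises you:

```sql
-- Push analytical results out to Parquet on S3; the COPY target being an
-- s3:// URL is what routes this through DuckDB.
COPY (
  SELECT user_id, count(*) AS events
  FROM events
  GROUP BY user_id
) TO 's3://my-bucket/exports/events_per_user.parquet';

-- Pull data read via DuckDB into a regular Postgres heap table.
BEGIN;
CREATE TABLE user_event_counts (user_id bigint, events bigint);
INSERT INTO user_event_counts
SELECT user_id, count(*)
FROM events            -- the view over read_parquet from the earlier sketch
GROUP BY user_id;
COMMIT;
```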