What do the Hydra docs mean by "DuckDB tables"? #440

anentropic · 2024-11-15T15:00:23Z

https://docs.hydra.so/components/duckdb

DuckDB execution is usually enabled automatically when needed. It’s enabled whenever you use DuckDB functions (such as read_parquet), when you query DuckDB tables, and when running COPY table TO 's3://...'

My data is not big enough (< 100GB) for me to be very interested in reading Parquet directly from S3, except maybe as a way to pre-load it into the db. I want it on local SSD for the fastest possible queries.

When I tested standalone duckdb from Python, it seemed faster to query data that had first been loaded into a table via read_parquet than to query it ad hoc via SELECT FROM read_parquet.
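Roughly, the two patterns I compared look like this in standalone DuckDB SQL. This is only a sketch: the file name events.parquet and the table name events are placeholders, not my real data.

```sql
-- Ad hoc: the Parquet file is scanned each time the query runs.
SELECT count(*) FROM read_parquet('events.parquet');

-- Load once into a native DuckDB table, then query the table instead.
CREATE TABLE events AS
SELECT * FROM read_parquet('events.parquet');

SELECT count(*) FROM events;
```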

So I was hoping to do the same via pg_duckdb.

I couldn't find other mentions of the "DuckDB tables" concept in the docs here. Is that a thing, or does it just mean any table in a db with pg_duckdb enabled?

To get the behaviour I want, can I do a regular CREATE TABLE, then INSERT INTO ... SELECT FROM read_parquet, and then SET duckdb.force_execution TO true for my analytics queries, as mentioned in the Hydra docs?
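Concretely, I'm imagining something like the untested sketch below. The table name, columns and S3 path are made up for illustration, and the column definition list after read_parquet is an assumption about the syntax pg_duckdb expects; it may differ between releases.

```sql
-- Hypothetical table; columns should match the Parquet schema.
CREATE TABLE events (
    id    bigint,
    ts    timestamptz,
    value double precision
);

-- Assumption: this pg_duckdb release wants a column definition list
-- on read_parquet. Other releases may use a different form, so check
-- the docs for the installed version.
INSERT INTO events
SELECT id, ts, value
FROM read_parquet('s3://my-bucket/events.parquet')
     AS (id bigint, ts timestamptz, value double precision);

-- Ask pg_duckdb to run later queries in this session through DuckDB.
SET duckdb.force_execution TO true;

SELECT count(*), avg(value) FROM events;
```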

Answered by wuputah

wuputah · 2024-11-15T15:08:36Z

Maintainer

That would be any table that is using duckdb, which currently means either MotherDuck tables or temporary tables.

You can copy / cache Parquet files from S3 to local SSD using the duckdb.cache function; queries will then use the SSD copy automatically once it has been downloaded. We just added some more functions for managing the cache, so you can expect to see those shipped in a future release.
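A rough sketch of both options, for illustration only: the table name, columns and S3 path are made up, and the two-argument form of duckdb.cache shown here is an assumption that should be checked against the pg_duckdb docs for your version.

```sql
-- A temporary table stored by DuckDB rather than in the Postgres heap.
-- It only lives for the current session.
CREATE TEMP TABLE events_tmp (
    id    bigint,
    value double precision
) USING duckdb;

-- Assumed signature: duckdb.cache(path, type).
-- Downloads the remote file to local disk so later reads of the same
-- path can use the local copy.
SELECT duckdb.cache('s3://my-bucket/events.parquet', 'parquet');
```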

2 replies

anentropic Nov 16, 2024
Author

I feel like I'm missing an explanation of how the Postgres and DuckDB sides interact. It seems like they don't interact much?

i.e. I can use it as a regular Postgres db, but that data is not using any special columnar storage(?), and running duckdb functions against it would be like connecting from standalone duckdb to a local Postgres database as a foreign source?

Or I can use the built-in duckdb to read from parquet files etc and do fast queries on that data, via a psql client interface.

Is that basically what this extension provides?

anentropic Nov 16, 2024
Author

Or... no... https://github.com/hydradatabase/hydra is columnar storage, apparently.

Does the Hydra product combine the columnar storage with pg_duckdb? Do they work together, i.e. duckdb execution over columnar Postgres tables? Is that combo faster than pg_duckdb over cached Parquet?
