add statistics to arrow datasets #1838

djouallah · 2023-11-11T10:46:31Z

I am using delta-rs (Python) with DuckDB. It works rather well, and thanks for that, but I noticed that it is substantially slower than reading directly from Parquet. I asked the DuckDB devs, and their answer is that the Arrow dataset does not contain any statistics, so queries that depend on them (join ordering, for example) become problematic.

Is there a way to fix that?

wjones127 · 2023-11-11T17:26:55Z

@djouallah Did the DuckDB devs say how external libraries can provide statistics to DuckDB? We have various kinds of statistics; I just don't know how to provide them to DuckDB and which it needs for join orders.
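
For context on the statistics mentioned above: the Delta transaction log can record per-file statistics in each `add` action, stored as a JSON string with fields such as `numRecords`, `minValues`, `maxValues`, and `nullCount`. The following is a minimal sketch of how a table-level row count could be derived from them. The log entries are invented for illustration and `estimate_row_count` is a hypothetical helper, not a delta-rs API.

```python
import json

# Hypothetical `add` actions, shaped like entries in a Delta transaction log.
# Paths and values are made up for this example.
ADD_ACTIONS = [
    {
        "path": "part-0000.parquet",
        "stats": json.dumps({
            "numRecords": 1000,
            "minValues": {"id": 1},
            "maxValues": {"id": 1000},
            "nullCount": {"id": 0},
        }),
    },
    {
        "path": "part-0001.parquet",
        "stats": json.dumps({
            "numRecords": 500,
            "minValues": {"id": 1001},
            "maxValues": {"id": 1500},
            "nullCount": {"id": 0},
        }),
    },
    # Statistics are optional, so a file may carry none.
    {"path": "part-0002.parquet", "stats": None},
]


def estimate_row_count(add_actions):
    """Sum numRecords over all files that have statistics.

    Returns (total_rows, files_without_stats) so the caller can tell
    whether the total is exact or only a lower bound.
    """
    total = 0
    missing = 0
    for action in add_actions:
        raw = action.get("stats")
        if raw is None:
            missing += 1
            continue
        total += json.loads(raw)["numRecords"]
    return total, missing


total_rows, files_without_stats = estimate_row_count(ADD_ACTIONS)
print(total_rows, files_without_stats)
```

A row count obtained this way is the kind of estimate a query planner could use, provided there is an interface to hand it over.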

r3stl355 · 2023-11-11T19:01:05Z

As far as I know the flow is Deltalake -> Arrow -> DuckDB. If they need stats in DuckDB, does that mean they are not pushing down any query to Deltalake? I don't know if there is a way to add stats to a pyarrow dataframe.

roeap · 2023-11-11T19:13:08Z

pyarrow Datasets are made up of Fragments (I think that's what they call them), which for us correspond to files. These can also contain statistics, but this should already be set. If we pass a reader to DuckDB, I think the only way is to actually iterate through all the batches. In the case of a table, everything is materialized.

So as far as I know, the Dataset generated by delta-rs should contain statistics; whether DuckDB can leverage them in this form is a different question :).

@djouallah, can you confirm you used to_pyarrow_dataset for this?
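
To illustrate what per-file statistics make possible, here is a minimal sketch of file skipping based on min/max values. It is a toy model under stated assumptions, not how delta-rs or pyarrow implement it: `FileStats`, `prune_files`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file statistics for a single integer column named `id`."""
    path: str
    num_records: int
    min_id: int
    max_id: int


# Made-up files with non-overlapping id ranges.
FILES = [
    FileStats("part-0000.parquet", 1000, 1, 1000),
    FileStats("part-0001.parquet", 1000, 1001, 2000),
    FileStats("part-0002.parquet", 500, 2001, 2500),
]


def prune_files(files, lower, upper):
    """Keep only files whose [min_id, max_id] range overlaps [lower, upper].

    A file whose range lies entirely outside the filter cannot contain
    matching rows, so it never needs to be opened.
    """
    return [f for f in files if f.max_id >= lower and f.min_id <= upper]


# Filter: id BETWEEN 1500 AND 2100
kept = prune_files(FILES, 1500, 2100)
print([f.path for f in kept])
```

Pruning of this kind reduces how much data is read, but it does not by itself tell the consuming engine how many rows to expect.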

ion-elgreco · 2023-11-11T19:16:19Z

Polars is able to do pushdowns through pyarrow dataset, so theoretically duckdb should be able to do that as well

wjones127 · 2023-11-11T19:52:59Z

My interpretation of "their answer is that arrow dataset does not contains any statistics" is that it's not about pushdown but about surfacing information to their query planner. For example, they would consider the estimated number of rows in each source table to determine order. So I think this is separate from statistics-based page / file pruning, which we do indeed perform.
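
A toy example of why a planner wants row-count estimates: a hash join is typically cheapest when the hash table is built on the smaller input. The sketch below is not DuckDB's optimizer. `pick_build_side` is a hypothetical function, and the fallback to the right-hand side when no estimate exists is an arbitrary assumption made for illustration.

```python
from typing import Optional


def pick_build_side(left_rows: Optional[int], right_rows: Optional[int]) -> str:
    """Choose which join input to build the hash table on."""
    if left_rows is None or right_rows is None:
        # No cardinality estimate available: fall back to a fixed default.
        # (Arbitrary choice for this sketch.)
        return "right"
    # With estimates, build on the smaller input.
    return "left" if left_rows <= right_rows else "right"


def build_rows(side: str, left_rows: int, right_rows: int) -> int:
    """Number of rows that end up in the hash table for the chosen side."""
    return left_rows if side == "left" else right_rows


# Invented sizes: a small dimension table joined to a large fact table.
DIM_ROWS = 1_000
FACT_ROWS = 10_000_000

blind = pick_build_side(None, None)            # planner sees no statistics
informed = pick_build_side(DIM_ROWS, FACT_ROWS)  # planner sees row counts

print(build_rows(blind, DIM_ROWS, FACT_ROWS))
print(build_rows(informed, DIM_ROWS, FACT_ROWS))
```

In this toy setup the uninformed plan hashes the ten-million-row table while the informed plan hashes only the thousand-row one, which is the sort of difference that would not be addressed by file pruning alone.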

djouallah · 2023-11-12T01:05:44Z

Yes exactly

wjones127 · 2023-11-12T02:42:04Z

Understood. In DataFusion we have this information here: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html

In DuckDB the closest thing I can find is: https://github.com/duckdb/duckdb/blob/a00b28f5d453ff7ec3b3837385f083d0887124ad/src/include/duckdb/storage/statistics/node_statistics.hpp#L16

I don't think Polars has any notion of this.

I'll think about ways this could be put in some abstraction that can be read by DuckDB.

djouallah · 2023-11-12T08:51:48Z

Link to the DuckDB devs' comments:
duckdb/duckdb#4636 (reply in thread)

djouallah added the "enhancement" (New feature or request) label Nov 11, 2023

ion-elgreco added the "binding/python" (Issues for the Python package) label Nov 22, 2023

djouallah mentioned this issue Dec 26, 2023

docs: datafusion integration #1993

Merged

wjones127 mentioned this issue Dec 29, 2023

[Python] Define a Dataset protocol based on Substrait and C Data Interface apache/arrow#37504

Open

djouallah mentioned this issue Jun 3, 2024

Support Table.to_arrow_batch_reader to return RecordBatchReader instead of a fully materialized Arrow Table apache/iceberg-python#786

Merged

djouallah closed this as completed Sep 24, 2024
