add statistics to arrow datasets #1838

Closed
djouallah opened this issue Nov 11, 2023 · 8 comments
Labels
binding/python (Issues for the Python package), enhancement (New feature or request)

Comments

@djouallah

I am using the delta-rs Python package with DuckDB. It works rather well, and thanks for that, but I noticed that it is substantially slower than reading directly from Parquet. I asked the DuckDB devs, and their answer is that the Arrow dataset does not contain any statistics, so many queries, for example those that depend on join ordering, become problematic.

Is there a way to fix that?

@djouallah added the enhancement (New feature or request) label Nov 11, 2023
@wjones127
Collaborator

@djouallah Did the DuckDB devs say how external libraries can provide statistics to DuckDB? We have various kinds of statistics; I just don't know how to provide them to DuckDB and which it needs for join orders.

@r3stl355
Contributor

r3stl355 commented Nov 11, 2023

As far as I know, the flow is Deltalake -> Arrow -> DuckDB. If they need stats in DuckDB, does that mean they are not pushing down any query to Deltalake? I don't know if there is a way to add stats to a pyarrow dataframe.

@roeap
Collaborator

roeap commented Nov 11, 2023

pyarrow Datasets are made up of Fragments (I think that's what they call them), which for us correspond to files. These can also contain statistics, but those should already be set. If we pass a reader to DuckDB, I think the only way is to actually iterate through all the batches. In the case of a table, everything is materialized.

So as far as I know, the Dataset generated by delta-rs should contain statistics; whether DuckDB can leverage them in this form is a different question :).

@djouallah, can you confirm you used to_pyarrow_dataset for this?
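
For reference, the setup being asked about can be sketched roughly as follows. This is a minimal sketch, assuming the `deltalake` and `duckdb` Python packages are installed; the function name `count_rows_via_duckdb`, the view name `t`, and any table path passed in are hypothetical and not taken from the thread.

```python
def count_rows_via_duckdb(table_uri: str) -> int:
    """Count the rows of a Delta table by scanning it through DuckDB."""
    # Imported lazily so the sketch can be loaded without both packages present.
    import duckdb
    from deltalake import DeltaTable

    # delta-rs exposes the table as a pyarrow Dataset (fragments correspond to files).
    dataset = DeltaTable(table_uri).to_pyarrow_dataset()

    con = duckdb.connect()
    # Register the dataset under a hypothetical view name so SQL can refer to it.
    con.register("t", dataset)
    return con.execute("SELECT count(*) FROM t").fetchone()[0]
```

Calling `count_rows_via_duckdb("path/to/delta_table")` (a placeholder path) goes through the same Deltalake -> Arrow -> DuckDB route discussed above.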

@ion-elgreco
Collaborator

Polars is able to do pushdowns through a pyarrow dataset, so theoretically DuckDB should be able to do that as well.

@wjones127
Collaborator

My interpretation of "their answer is that arrow dataset does not contain any statistics" is that it's not about pushdown but about surfacing information to their query planner. For example, they would consider the estimated number of rows in each source table to determine join order. So I think this is separate from statistics-based page / file pruning, which we do indeed perform.
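
The distinction can be illustrated with a small, library-free sketch. It assumes per-file statistics shaped like the `stats` JSON of Delta log `add` actions (`numRecords`, `minValues`, `maxValues`); the file names, values, and helper functions are made up for illustration and are not part of delta-rs or DuckDB.

```python
import json

# Hypothetical file entries; `stats` mimics the JSON string stored in Delta `add` actions.
ORDERS = [
    {"path": "orders-0.parquet",
     "stats": json.dumps({"numRecords": 1000, "minValues": {"id": 1}, "maxValues": {"id": 1000}})},
    {"path": "orders-1.parquet",
     "stats": json.dumps({"numRecords": 1000, "minValues": {"id": 1001}, "maxValues": {"id": 2000}})},
]
CUSTOMERS = [
    {"path": "customers-0.parquet",
     "stats": json.dumps({"numRecords": 50, "minValues": {"id": 1}, "maxValues": {"id": 50}})},
]


def prune_files(files, column, value):
    """File pruning: keep only files whose [min, max] range could contain `value`."""
    kept = []
    for f in files:
        stats = json.loads(f["stats"])
        if stats["minValues"][column] <= value <= stats["maxValues"][column]:
            kept.append(f["path"])
    return kept


def estimated_rows(files):
    """Planner-style cardinality estimate: total record count across all files."""
    return sum(json.loads(f["stats"])["numRecords"] for f in files)


def choose_build_side(tables):
    """Common hash-join heuristic: build the hash table on the smallest input."""
    return min(tables, key=lambda name: estimated_rows(tables[name]))


pruned = prune_files(ORDERS, "id", 1500)
build = choose_build_side({"orders": ORDERS, "customers": CUSTOMERS})
```

Here `prune_files` stands in for the pruning that already happens, while `estimated_rows` and `choose_build_side` stand in for the table-level information a query planner would want to receive before it reads any data.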

@djouallah
Author

Yes, exactly.

@wjones127
Collaborator

Understood. In DataFusion we have this information here: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html

In DuckDB the closest thing I can find is: https://github.com/duckdb/duckdb/blob/a00b28f5d453ff7ec3b3837385f083d0887124ad/src/include/duckdb/storage/statistics/node_statistics.hpp#L16

I don't think Polars has any notion of this.

I'll think about ways this could be put in some abstraction that can be read by DuckDB.

@djouallah
Author

Link to the DuckDB devs' comments:
duckdb/duckdb#4636 (reply in thread)

5 participants