-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add statistics to arrow datasets #1838
Comments
@djouallah Did the DuckDB devs say how external libraries can provide statistics to DuckDB? We have various kinds of statistics; I just don't know how to provide them to DuckDB and which it needs for join orders. |
As far as I know the flow is Deltalake->Arrow->DuckDB, if they need stats in DuckDB does it mean they are not pushing down any query to Deltalake? I don't know if there is a way to add stats to pyarrow dataframe. |
pyarrow So as far as I know, the @djouallah, can you confirm you used |
Polars is able to do pushdowns through pyarrow dataset, so theoretically duckdb should be able to do that as well |
My interpretation of "their answer is that arrow dataset does not contains any statistics" is that it's not about pushdown but about surfacing information to their query planner. For example, they would consider the estimated number of rows in each source table to determine order. So I think this is separate from statistics-based page / file pruning, which we do indeed perform. |
Yes exactly |
Understood. In DataFusion we have this information here: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html In DuckDB the closest thing I can find is: https://github.com/duckdb/duckdb/blob/a00b28f5d453ff7ec3b3837385f083d0887124ad/src/include/duckdb/storage/statistics/node_statistics.hpp#L16 I don't think Polars has any notion of this. I'll think about ways this could be put in some abstraction that can be read by DuckDB. |
link to duckdb devs comments |
I am using delta_rs python with DuckDB, it works rather and thanks for that, but I did notice that it is substantially slower than reading directly from Parquet, I asked the DuckDB devs and their answer is that arrow dataset does not contains any statistics, so a lot of queries that involved join orders for example become problematics
is there a way to fix that ?
The text was updated successfully, but these errors were encountered: