File list pruning with hive partition columns doesn't work with date type #14712
Comments
Thanks, I edited the path when reading from |
We (currently) allow comparisons in the form of first.filter(pl.col("date") == dt.date(2024, 2, 1).strftime(...)).explain() Any expression that might alter the value of the column (e.g. It would be nice if we could provide feedback, such as a warning, when a scan of a hive partitioned dataset ends up reading the whole dataset |
Ah, I guess I made the example slightly too small. What I am actually using, and what is most powerful (I think)
Otherwise, to use literal comparisons I need to loop over the required dates and & them all together. |
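To make that workaround concrete, here is a minimal sketch of the "loop over the required dates" step using only the standard library. The helper name `partition_literals` and the `%Y-%m-%d` format are assumptions for illustration (the format used in the thread is elided); each resulting string would then be compared against the partition column as a literal and the comparisons combined into one filter.

```python
import datetime as dt


def partition_literals(start: dt.date, end: dt.date, fmt: str = "%Y-%m-%d") -> list[str]:
    """Return one formatted string per day in [start, end], inclusive.

    `fmt` is an assumed partition format; adjust it to match the directory names.
    """
    if end < start:
        return []
    days = (end - start).days
    return [(start + dt.timedelta(days=i)).strftime(fmt) for i in range(days + 1)]


# Hypothetical range covering four partitions.
literals = partition_literals(dt.date(2024, 1, 30), dt.date(2024, 2, 2))
print(literals)
```

This keeps the comparison on the string side, which is the form that is currently pruned, at the cost of building one literal per day.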
I agree this should be fixed. We first need to do proper schema inference on hive partitions. Once that is in place, we can use an architecture for hive partitions similar to the one we use for parquet statistics pruning. |
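As a rough illustration of the idea (not how Polars implements it), the sketch below parses `key=value` segments from hive-style paths, infers a date for values that look like ISO dates, and prunes the file list with a typed predicate. All function names and paths here are hypothetical.

```python
import datetime as dt
from typing import Callable


def parse_hive_path(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a hive-style path."""
    parts: dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def infer_value(raw: str):
    """Toy schema inference: read the value as an ISO date, else keep the string."""
    try:
        return dt.date.fromisoformat(raw)
    except ValueError:
        return raw


def prune_files(paths: list[str], column: str, predicate: Callable) -> list[str]:
    """Keep only files whose partition value for `column` satisfies `predicate`."""
    kept = []
    for path in paths:
        parts = parse_hive_path(path)
        if column not in parts:
            kept.append(path)  # no partition value, so the file cannot be skipped
        elif predicate(infer_value(parts[column])):
            kept.append(path)
    return kept


# Hypothetical dataset layout, partitioned on date.
paths = [
    "data/date=2024-01-31/part-0.parquet",
    "data/date=2024-02-01/part-0.parquet",
    "data/date=2024-02-02/part-0.parquet",
]
kept = prune_files(paths, "date", lambda d: d >= dt.date(2024, 2, 1))
print(kept)
```

Because the partition value is typed before the predicate runs, a date comparison can discard files without opening them, which is the behaviour the issue asks for.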
Awesome, is there an issue for tracking the schema inference? I can just follow along on that. |
Not yet, I have created an issue for hive partition schema (#14838) |
Checks
Reproducible example
Original example
Log output
Observe that the first scan, which filters on strings, skips the file, but the second one, which filters on a date, does not.
Issue description
Casting a hive partitioned column from str -> date causes the whole dataset to be scanned, rather than pruning reads as happens with a direct comparison.
Expected behavior
I would expect both types of filter to run at the same speed and to prune the same files from the read.
This issue becomes costly with multiple large datasets that are hive partitioned on date: with the second kind of filter, computation time is linear in the size of the whole dataset.
Installed versions