Provide a way to de-conflict columns that come from hive partitioning vs what's in a physical file #12041
Comments
This would be super helpful. As it is, I can't use polars to load the hive partitioned files I work with and have to fall back to duckdb. I lose the benefit of lazy loading for the files that would most benefit from it.
Not sure about the proposed API, but this feature would definitely be nice to have.
I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.
If that is done consistently, you can just set
@nameexhaustion this one might also be a good data point for the hive partition redesign.
I don't believe this should be fully closed. If a hive partition column conflicts with a parquet column, especially when the data differs, is there really no workaround besides rewriting the data or the partitions?
Right. Can we make a new feature request for the reduced scope? Then we can make a decision about that.
Sounds good, I opened #12041.
I still get duplicated columns on polars 1.6.0 when the column exists in both the hive path and the parquet file. How can I send you the parquet file to reproduce this?
Description
For context, see #12036.
I propose that we deprecate `hive_partitioning` in `scan_parquet` in favor of `hive_partitioning_strategy` (alternatively maybe even repurposing the old name, but that might be more confusing). The new parameter would take `"favor_partition"` (drop the physical column, i.e. don't read it), `"favor_physical"` (not sure about this name, but it means ignore the partition key), `"favor_none"` (which should throw an error if there are conflicts, should be the default, and maps to the old parameter set to `True`), and `None` for no hive partitioning (the equivalent of `hive_partitioning=False`).
I think it's a pretty important feature, as the person querying the data often has no control over how it's written.
Created a feature as per @ritchie46 's request. Thanks!
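To make the proposed semantics concrete, here is a minimal pure-Python sketch of how the four settings could resolve a conflict between partition keys and physical columns. It is not polars code: `resolve_columns` and its arguments are hypothetical names used only for illustration, and the strategy strings are the ones suggested in the description above.

```python
# Hypothetical sketch of the proposed conflict-resolution rules.
# Nothing here is an existing polars API.
STRATEGIES = ("favor_partition", "favor_physical", "favor_none")


def resolve_columns(physical_cols, partition_cols, strategy="favor_none"):
    """Return (physical columns to read, partition columns to attach)."""
    if strategy is None:
        # No hive partitioning: read the file as-is, ignore the path keys.
        return list(physical_cols), []
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Columns present both in the hive path and in the file itself.
    conflicts = [c for c in partition_cols if c in physical_cols]

    if strategy == "favor_none":
        # Proposed default: refuse to guess when both sources define a column.
        if conflicts:
            raise ValueError(f"conflicting columns: {conflicts}")
        return list(physical_cols), list(partition_cols)

    if strategy == "favor_partition":
        # Skip the conflicting physical columns; use the path values.
        kept = [c for c in physical_cols if c not in conflicts]
        return kept, list(partition_cols)

    # "favor_physical": keep the file's columns, drop conflicting path keys.
    kept_keys = [c for c in partition_cols if c not in conflicts]
    return list(physical_cols), kept_keys


# A file under .../date=2024-01-01/ that also stores a "date" column:
print(resolve_columns(["date", "value"], ["date"], "favor_partition"))
print(resolve_columns(["date", "value"], ["date"], "favor_physical"))
```

With `"favor_none"` the same call raises a `ValueError`, which matches the idea that the default should surface the conflict rather than silently pick a side.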