Provide a way to de-conflict columns that come from hive partitioning vs what's in a physical file #12041

Closed
kszlim opened this issue Oct 26, 2023 · 9 comments · Fixed by #17203
Assignees
Labels
A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-high (Priority: high)

Comments

@kszlim
Contributor

kszlim commented Oct 26, 2023

Description

For context see:
#12036

I propose that we deprecate hive_partitioning in scan_parquet in favor of a new hive_partitioning_strategy parameter (alternatively we could repurpose the old name, but that might be more confusing). It would accept "favor_partition" (drop the physical column, i.e. don't read it), "favor_physical" (I'm not sure about this name, but it means ignore the partition key), "favor_none" (throw an error if there are conflicts; this should be the default and maps to the old parameter set to True), and None for no hive partitioning (the equivalent of hive_partitioning=False).
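To make the proposed semantics concrete, here is a minimal pure-Python sketch of how the four options could decide where each column comes from. It is only an illustration: `resolve_columns` and its arguments are hypothetical names, not part of polars, and the strategy strings are the ones proposed above rather than an existing API.

```python
# Hypothetical sketch of the proposed strategies; not polars code.
STRATEGIES = ("favor_partition", "favor_physical", "favor_none")


def resolve_columns(physical, partition_keys, strategy="favor_none"):
    """Return {column name: source}, where source is "physical" or "partition".

    physical:       column names stored in the parquet file
    partition_keys: key names parsed from the hive path (e.g. year=2023)
    strategy:       one of STRATEGIES, or None to disable hive partitioning
    """
    if strategy is None:
        # No hive partitioning: only the columns stored in the file are used.
        return {name: "physical" for name in physical}
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    conflicts = [name for name in partition_keys if name in physical]
    if conflicts and strategy == "favor_none":
        # Proposed default: refuse to guess which source is correct.
        raise ValueError(f"columns in both file and partition path: {conflicts}")

    sources = {name: "physical" for name in physical}
    for name in partition_keys:
        if name in sources and strategy == "favor_physical":
            continue  # keep the value stored in the file
        sources[name] = "partition"  # take the value from the path
    return sources
```

For example, with `physical=["year", "value"]` and `partition_keys=["year"]`, "favor_partition" would take `year` from the path, "favor_physical" would take it from the file, and "favor_none" would raise.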

I think it's a pretty important feature, as the person querying the data often has no control over how it's written.

Created a feature request as per @ritchie46's request. Thanks!

@kszlim kszlim added the enhancement New feature or an improvement of an existing feature label Oct 26, 2023
@jrothbaum

This would be super helpful. As it is, I can't use polars to load the hive partitioned files I work with and have to fall back to duckdb. I lose the benefit of lazy loading for the files that would most benefit from it.

@stinodego stinodego added A-io-partitioning Area: reading/writing (Hive) partitioned files accepted Ready for implementation labels Apr 10, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Apr 10, 2024
@stinodego
Member

Not sure about the proposed API, but this feature would definitely be nice to have.

@jrothbaum

I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.

@stinodego
Member

> I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.

If that is done consistently, you can just set hive_partitioning=False and you're good to go.
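Whether "done consistently" holds for a given dataset can be checked from file paths and file schemas alone. The sketch below is a pure-Python illustration under stated assumptions: `hive_keys` and `keys_always_in_file` are hypothetical helpers, paths use `/` separators with `key=value` directory segments, and the physical column names per file are assumed to be already known (for instance, read from each file's metadata).

```python
# Hypothetical helpers; not part of polars.
def hive_keys(path):
    """Return the key names from `key=value` directory segments of a path."""
    segments = path.split("/")[:-1]  # drop the file name itself
    return [seg.split("=", 1)[0] for seg in segments if "=" in seg]


def keys_always_in_file(files):
    """Check that every partition key is also stored as a physical column.

    files: {path: [physical column names in that file]}
    Returns True only if this holds for every file, in which case reading
    with hive_partitioning=False loses no columns.
    """
    return all(
        set(hive_keys(path)) <= set(columns)
        for path, columns in files.items()
    )
```

If the check fails, for example when the path carries several keys but the file stores only one of them, disabling hive partitioning would drop the keys that exist only in the path.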

@ritchie46
Member

@nameexhaustion this one might also be a good data point for the hive partition redesign.

@kszlim
Contributor Author

kszlim commented Jun 26, 2024

I don't believe this should be fully closed. If you have a hive partition column that conflicts with a parquet column, especially if the data in the two differs, then as far as I can tell you have no workaround besides rewriting the data or the partitions?

@ritchie46
Member

Right. Can we make a new feature request for the reduced scope? Then we can make a decision about that.

@kszlim
Contributor Author

kszlim commented Jun 26, 2024

Sounds good, I opened #12041

@Veiasai
Contributor

Veiasai commented Sep 5, 2024

I still get duplicated columns on polars 1.6.0 when a column exists in both the hive path and the parquet file.

How can I send you the parquet file to reproduce this?

edit:
There are 5 columns in the path, and only 1 of those 5 is in the parquet file. Does that matter?

edit:
Well, I just took a look at the implementation. It seems like it depends on whether the column is the first one in the schema?
https://github.com/pola-rs/polars/pull/17203/files#diff-5fbba3b3c960c05ee1ff71769819814cb53e44b17e2004e33b42a301aa91eb57R166-R170

Projects
Archived in project
6 participants