Provide a way to de-conflict columns that come from hive partitioning vs what's in a physical file #12041

kszlim · 2023-10-26T07:36:29Z

Description

For context see:
#12036

I propose that we deprecate hive_partitioning in scan_parquet in favor of a new hive_partitioning_strategy parameter (alternatively we could repurpose the old name, but that might be more confusing). It would accept "favor_partition" (drop the physical column, i.e. don't read it), "favor_physical" (not sure about this name, but it means ignore the partition key), "favor_none" (throw an error if there are conflicts; this should be the default and maps to the old parameter set to True), and None for no hive partitioning (the equivalent of hive_partitioning=False).
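To make the proposed semantics concrete, here is a minimal pure-Python sketch of how the four options could resolve a name conflict. The function name resolve_columns and its return shape are hypothetical and only for illustration; the strategy names come from the proposal above, and none of this is an existing Polars API.

```python
# Hypothetical strategy names taken from the proposal; not an existing Polars API.
STRATEGIES = ("favor_partition", "favor_physical", "favor_none", None)


def resolve_columns(file_columns, hive_columns, strategy="favor_none"):
    """Return a mapping of output column name -> source ("file" or "hive")."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Start with every physical column coming from the file.
    sources = {name: "file" for name in file_columns}

    # None disables hive partitioning: only the file columns are returned.
    if strategy is None:
        return sources

    for name in hive_columns:
        if name not in sources:
            sources[name] = "hive"  # no conflict, add the partition column
        elif strategy == "favor_partition":
            sources[name] = "hive"  # ignore the physical column
        elif strategy == "favor_physical":
            continue  # ignore the partition key, keep the file column
        else:  # "favor_none": conflicts are an error
            raise ValueError(
                f"column {name!r} exists in both the file and the hive path"
            )
    return sources
```

For example, with file columns ["a", "year"] and hive columns ["year", "month"], "favor_partition" takes year from the path, "favor_physical" takes it from the file, and "favor_none" raises.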

I think it's a pretty important feature, as the person querying the data often has no control over how it's written.

Created a feature request as per @ritchie46's request. Thanks!


jrothbaum · 2023-11-17T19:59:15Z

This would be super helpful. As it is, I can't use polars to load the hive partitioned files I work with and have to fall back to duckdb. I lose the benefit of lazy loading for the files that would most benefit from it.

stinodego · 2024-04-10T06:37:36Z

Not sure about the proposed API, but this feature would definitely be nice to have.

jrothbaum · 2024-04-10T13:20:24Z

I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.
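As background for the type concern, here is a small sketch of how partition keys can be pulled out of a hive-style path. The helper parse_hive_path and the example path are made up for illustration, and the linked issues are not reproduced here. The point is only that values taken from a path start out as plain strings, so a reader has to infer or cast their types, whereas a column stored in the file carries its own type.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Extract key=value directory segments from a hive-style path."""
    parts = {}
    # Only directory segments are considered; the file name itself is skipped.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value  # always a string at this point
    return parts


keys = parse_hive_path("data/year=2023/month=01/part-0.parquet")
# keys == {"year": "2023", "month": "01"}
# Casting "01" to an integer gives 1, so the original text is not recoverable.
```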

stinodego · 2024-04-10T13:29:40Z

> I work with hive partitioned data that has the partition key in the file, which I suspect is done to avoid the casting/type issue from #13892 and #14838 while taking ~0 space (since, by definition, the key is the same for all rows in each partition file). I'm indifferent about the API implementation details.

If that is done consistently, you can just set hive_partitioning=False and you're good to go.

ritchie46 · 2024-06-17T17:30:21Z

@nameexhaustion this one might also be a good data point in the design for the hive partition redesign.

kszlim · 2024-06-26T16:27:28Z

I don't believe this should be fully closed. If you have a hive partition column that conflicts with a parquet column, especially if the data differs, you have no workaround besides rewriting the data or the partitions.

ritchie46 · 2024-06-26T16:57:32Z

Right. Can we make a new feature request for the reduced scope? Then we can make a decision about that.

kszlim · 2024-06-26T17:07:26Z

Sounds good, I opened #17222

Veiasai · 2024-09-05T01:00:10Z

I still get duplicated columns on polars 1.6.0 when a column exists in both the hive path and the parquet file.

How can I send you the parquet file to reproduce this?

edit:
There are 5 columns in the path, and only 1 of the 5 is in the parquet file. Does that matter?

edit:
Well, I just took a look at the implementation. It seems to depend on whether the first hive column is in the file schema?
https://github.com/pola-rs/polars/pull/17203/files#diff-5fbba3b3c960c05ee1ff71769819814cb53e44b17e2004e33b42a301aa91eb57R166-R170
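
To illustrate the suspicion above, here is a simplified pure-Python sketch contrasting a check on only the first hive column with a per-column check. This is not the actual Polars code; both function names are hypothetical, and the data mirrors the case described (5 columns in the path, 1 of them also in the file).

```python
def hive_columns_first_only(file_schema, hive_columns):
    """Illustrative only: decide for ALL hive columns based on the first one."""
    if hive_columns and hive_columns[0] in file_schema:
        return []  # assume every hive column is already in the file
    return list(hive_columns)  # assume none of them are in the file


def hive_columns_per_column(file_schema, hive_columns):
    """Check each hive column against the file schema individually."""
    return [name for name in hive_columns if name not in file_schema]


hive_columns = ["a", "b", "c", "d", "e"]  # 5 columns in the path
file_schema = ["c", "value"]              # only "c" is also stored in the file

first_only = hive_columns_first_only(file_schema, hive_columns)
per_column = hive_columns_per_column(file_schema, hive_columns)
# first_only still contains "c", which would duplicate the file column;
# per_column leaves it out.
```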

kszlim added the enhancement (New feature or an improvement of an existing feature) label Oct 26, 2023

jrothbaum mentioned this issue Apr 9, 2024

Hive partitioning tracking issue #15441

Open

13 tasks

stinodego added the A-io-partitioning (Area: reading/writing (Hive) partitioned files) and accepted (Ready for implementation) labels Apr 10, 2024

github-project-automation bot added this to Backlog Apr 10, 2024

github-project-automation bot moved this to Ready in Backlog Apr 10, 2024

nameexhaustion self-assigned this Jun 19, 2024

nameexhaustion added the P-medium (Priority: medium) and P-high (Priority: high) labels and removed the P-medium (Priority: medium) label Jun 19, 2024

nameexhaustion mentioned this issue Jun 26, 2024

feat: Support loading from datasets where the hive columns are also stored in the file #17203

Merged

ritchie46 closed this as completed in #17203 Jun 26, 2024

github-project-automation bot moved this from Ready to Done in Backlog Jun 26, 2024

kszlim mentioned this issue Jun 26, 2024

Allow remapping of hive partitioning columns (or physical parquet columns) before they're unified. #17222

Open

Veiasai mentioned this issue Sep 8, 2024

fix(rust,python): Fix materialize_hive_partitions that should not rely on whether the first partition is in schema #18606

Closed

nameexhaustion mentioned this issue Sep 9, 2024

fix: Scanning hive partitioned files where hive columns are partially included in the file #18626

Merged
