Hive Partition Schema #14838

Closed
c-peters opened this issue Mar 4, 2024 · 8 comments
Labels
A-io-partitioning Area: reading/writing (Hive) partitioned files accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-goal Priority: aligns with long-term Polars goals

Comments

@c-peters
Collaborator

c-peters commented Mar 4, 2024

Description

When doing hive partitioning it would be great if Polars could support other data types (e.g. dates, datetimes), similar to other frameworks (e.g. duckdb, bigquery).

@c-peters c-peters added the enhancement New feature or an improvement of an existing feature label Mar 4, 2024
@baycoder0

+1

I had a similar issue open before: #12894

@ritchie46 is this something the core team has plans around by chance? If not, I'm willing to take a stab at it given some guidance on the desired design.

@deanm0000
Collaborator

@baycoder0 c-peters is part of the core team.

I think you'd want to get started here:

    // Infer the dtype of a single hive partition value, trying the stricter
    // patterns first and falling back to a string.
    let s = if INTEGER_RE.is_match(value) {
        let value = value.parse::<i64>().ok()?;
        Series::new(name, &[value])
    } else if BOOLEAN_RE.is_match(value) {
        let value = value.parse::<bool>().ok()?;
        Series::new(name, &[value])
    } else if FLOAT_RE.is_match(value) {
        let value = value.parse::<f64>().ok()?;
        Series::new(name, &[value])
    } else if value == "__HIVE_DEFAULT_PARTITION__" {
        // Hive's placeholder for a null partition value.
        Series::new_null(name, 1)
    } else {
        // Anything else is kept as a percent-decoded string.
        Series::new(name, &[percent_decode_str(value).decode_utf8().ok()?])
    };
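For readers following along, here is a minimal, std-only sketch of how a date branch might slot into a cascade like the one above. It is not the Polars implementation: `HiveValue`, `parse_iso_date` and `infer_hive_value` are hypothetical names, the standard-library parsers stand in for the `INTEGER_RE` / `BOOLEAN_RE` / `FLOAT_RE` regexes, percent-decoding is left out, and only the strict `YYYY-MM-DD` form is assumed to count as a date.

```rust
/// Hypothetical stand-in for the one-row `Series` built in the excerpt above.
#[derive(Debug, PartialEq)]
enum HiveValue {
    Int(i64),
    Bool(bool),
    Float(f64),
    /// (year, month, day), accepted only in strict `YYYY-MM-DD` form.
    Date(i32, u32, u32),
    Null,
    Str(String),
}

/// Parse a strict `YYYY-MM-DD` string. Only the month and day ranges are
/// checked; real calendar validation (e.g. Feb 30) is out of scope here.
fn parse_iso_date(value: &str) -> Option<(i32, u32, u32)> {
    let bytes = value.as_bytes();
    let well_formed = bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, b)| match i {
            4 | 7 => *b == b'-',
            _ => b.is_ascii_digit(),
        });
    if !well_formed {
        return None;
    }
    // Safe to slice by byte index: every byte was just checked to be ASCII.
    let year = value[0..4].parse::<i32>().ok()?;
    let month = value[5..7].parse::<u32>().ok()?;
    let day = value[8..10].parse::<u32>().ok()?;
    ((1..=12).contains(&month) && (1..=31).contains(&day)).then_some((year, month, day))
}

/// Same branch order as the excerpt (integer, boolean, float, null, string),
/// with the assumed date branch tried just before the string fallback.
fn infer_hive_value(value: &str) -> HiveValue {
    if let Ok(v) = value.parse::<i64>() {
        return HiveValue::Int(v);
    }
    if let Ok(v) = value.parse::<bool>() {
        return HiveValue::Bool(v);
    }
    // `f64::from_str` also accepts "inf" and "nan"; requiring at least one
    // digit keeps such words as strings in this sketch.
    if value.bytes().any(|b| b.is_ascii_digit()) {
        if let Ok(v) = value.parse::<f64>() {
            return HiveValue::Float(v);
        }
    }
    if value == "__HIVE_DEFAULT_PARTITION__" {
        return HiveValue::Null;
    }
    if let Some((y, m, d)) = parse_iso_date(value) {
        return HiveValue::Date(y, m, d);
    }
    HiveValue::Str(value.to_string())
}
```

With this sketch, `infer_hive_value("2024-03-04")` gives `HiveValue::Date(2024, 3, 4)`, while an out-of-range value such as `"2024-13-04"` falls through to `HiveValue::Str`.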

@deanm0000
Collaborator

The csv datetime parser seems to be over here

@deanm0000
Collaborator

Additionally, this is highly related to #13892

@baycoder0

@deanm0000 I added the initial PR here: #14950. I'd like to get the initial checks. Also, I did exhaustive testing myself and would like to add unit tests. However, the tests in tests/unit/io/test_hive.py seem to be skipped due to PyArrow 15 right now. Should I add them there? And should I add columns to the foods1.ipc and foods2.ipc files that are used to test hive partitioning, or would you prefer that I create new files?

@ritchie46
Member

What we require first is schema inference on hive partitions. Otherwise some partitions may be parsed as strings and/or with different date formats. Something needs to be in place for schema inference, and for communicating the inferred schema between the partitions, first.
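One way to picture that requirement is a two-step pass: classify each partition value on its own, then reduce all values seen for a key to a single dtype that every partition agrees on. The sketch below is only an illustration under stated assumptions. `HiveDtype`, `classify`, `merge`, `partition_value` and `infer_key_dtype` are hypothetical names, and the merge rules (Int and Float widen to Float, any other conflict falls back to Str) are assumed here, not taken from Polars' supertype logic.

```rust
/// Hypothetical dtype tag for one hive partition key.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HiveDtype {
    Null,
    Int,
    Float,
    Str,
}

/// Classify a single raw partition value (a reduced version of the cascade).
fn classify(value: &str) -> HiveDtype {
    if value == "__HIVE_DEFAULT_PARTITION__" {
        HiveDtype::Null
    } else if value.parse::<i64>().is_ok() {
        HiveDtype::Int
    } else if value.parse::<f64>().is_ok() {
        HiveDtype::Float
    } else {
        HiveDtype::Str
    }
}

/// Assumed merge rules: equal dtypes stay, Null yields to anything,
/// Int and Float widen to Float, every other conflict becomes Str.
fn merge(a: HiveDtype, b: HiveDtype) -> HiveDtype {
    use HiveDtype::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Null, x) | (x, Null) => x,
        (Int, Float) | (Float, Int) => Float,
        _ => Str,
    }
}

/// Extract the raw value of `key` from a path such as `data/year=2024/f.parquet`.
fn partition_value<'a>(path: &'a str, key: &str) -> Option<&'a str> {
    path.split('/').find_map(|segment| {
        let (k, v) = segment.split_once('=')?;
        (k == key).then_some(v)
    })
}

/// Reduce every value seen for `key` across all paths to one shared dtype.
fn infer_key_dtype(paths: &[&str], key: &str) -> HiveDtype {
    paths
        .iter()
        .filter_map(|p| partition_value(*p, key))
        .map(classify)
        .fold(HiveDtype::Null, merge)
}
```

The point of the reduction is that a single odd value (say `k=x` among otherwise numeric partitions) changes the dtype for the whole key, rather than leaving individual partitions with mismatched types.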

@fcocquemas

Are you still considering a parameter to pass a hive partitions schema, similar to how you can pass dtypes to override the inference? This is something @deanm0000 suggested in #13892.

There are some cases where it would be nice to override the regexp-based schema inference. For instance:

  • It might be hard to detect all date formats correctly (is this string %d/%m/%Y or %m/%d/%Y?). In some cases (e.g. an 8-digit number), a value may look like a date but really be an Integer or a String.
  • BOOLEAN_RE would match value=TRUE, even when it might represent a String, not a Boolean.

Having schema inference as a default is convenient, but an option to override would be nice.
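A rough sketch of what "inference by default, override when given" could look like, assuming a per-key map of user-supplied dtypes. The names `HiveDtype`, `infer_dtype` and `resolve_dtype` are hypothetical, and the case-insensitive boolean check is only an assumption modelled on the `value=TRUE` example above; none of this is an existing Polars API.

```rust
use std::collections::HashMap;

/// Hypothetical dtype tag for one hive partition key.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HiveDtype {
    Int,
    Bool,
    Str,
}

/// Default inference, used only when the caller gave no dtype for the key.
fn infer_dtype(value: &str) -> HiveDtype {
    if value.parse::<i64>().is_ok() {
        HiveDtype::Int
    } else if value.eq_ignore_ascii_case("true") || value.eq_ignore_ascii_case("false") {
        // Assumed to mirror a case-insensitive boolean pattern.
        HiveDtype::Bool
    } else {
        HiveDtype::Str
    }
}

/// A user-supplied dtype for `key` always wins over inference.
fn resolve_dtype(key: &str, value: &str, overrides: &HashMap<&str, HiveDtype>) -> HiveDtype {
    overrides
        .get(key)
        .copied()
        .unwrap_or_else(|| infer_dtype(value))
}
```

With an override of `Str` for a key named `flag`, the value `TRUE` stays a string, and an 8-digit value under an overridden key is no longer guessed to be an integer; keys without an override still go through inference.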

@stinodego stinodego added accepted Ready for implementation P-goal Priority: aligns with long-term Polars goals labels Mar 29, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Mar 29, 2024
@stinodego stinodego self-assigned this Mar 29, 2024
@stinodego stinodego moved this from Ready to In progress in Backlog Mar 29, 2024
@stinodego stinodego moved this from In progress to Next in Backlog May 21, 2024
@stinodego stinodego moved this from Next to Candidate in Backlog May 26, 2024
@stinodego stinodego removed their assignment May 26, 2024
@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Jul 3, 2024
@nameexhaustion
Collaborator

Datetime support has been added by #17256

@github-project-automation github-project-automation bot moved this from Candidate to Done in Backlog Jul 4, 2024
7 participants