File list pruning with hive partition columns doesn't work with date type #14712

Closed · Fixed by #17545
mwhithead opened this issue Feb 27, 2024 · 6 comments
Assignees: nameexhaustion
Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@mwhithead

mwhithead commented Feb 27, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import os
from datetime import date
from pathlib import Path

os.environ["POLARS_VERBOSE"] = "1"

path = Path(".env/date=2024-01-01/data.parquet")
path.parent.mkdir(exist_ok=True, parents=True)

pl.DataFrame({"x": 1}).write_parquet(path)

print("==== Using string ====")
print(
    pl.scan_parquet(path, hive_partitioning=True, try_parse_hive_dates=False)
    .filter(pl.col("date") == "2023-01-01")
    .explain(),
)

print("==== Using date ====")
print(
    pl.scan_parquet(path, hive_partitioning=True)
    .filter(pl.col("date") == date(2023, 1, 1))
    .explain(),
)
Original example
import polars as pl
import datetime as dt
from pathlib import Path
from random import choice
from string import ascii_uppercase, digits

data_dir = Path("./polars_test")


identifiers = ["".join(choice(ascii_uppercase + digits) for _ in range(12)) for _ in range(1024)]
some_numbers = list(range(128))

dates = pl.date_range(dt.date(2023, 1, 1), dt.date(2024, 2, 26), eager=True)
df = (
    pl.DataFrame(dates, schema=["date"])
    .join(pl.DataFrame(identifiers, schema=["identifier"]), how="cross")
    .join(pl.DataFrame(some_numbers, schema=["number"]), how="cross")
)
df.write_parquet(
    data_dir / "first",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["date"]}
)

pl.Config.set_verbose()
first = pl.scan_parquet("./polars_test/first/date=*/*.parquet")


first.filter(pl.col("date") == "2024-02-01").explain()


first.filter(pl.col("date").str.to_date() == dt.date(2024, 2, 1)).explain()

Log output (original example)

parquet file can be skipped, the statistics were sufficient to apply the predicate.
'\n  Parquet SCAN \n  PROJECT */3 COLUMNS\n  SELECTION: [(col("date")) == (String(2024-01-01))]'
(the same "parquet file can be skipped" line repeats for the remaining files)
hive partitioning: skipped 26 files, first file : /data/mw/polars_test/first/date=2024-02-01/af0866188c4948e5a0d14536321b4986-0.parquet


Then, for the second explain, we don't get any predicate skips:

'\n  Parquet SCAN 26 files: first file: /data/mw/polars_test/first/date=2024-02-01/af0866188c4948e5a0d14536321b4986-0.parquet\n  PROJECT */3 COLUMNS\n  SELECTION: [(col("date").str.strptime([String(raise)])) == (2024-02-01)]'

Log output (minimal example)

==== Using string ====
parquet file can be skipped, the statistics were sufficient to apply the predicate.
hive partitioning: skipped 1 files, first file : .env/date=2024-01-01/data.parquet
DF ["x", "date"]; PROJECT */2 COLUMNS; SELECTION: None
==== Using date ====
parquet file must be read, statistics not sufficient for predicate.
Parquet SCAN [.env/date=2024-01-01/data.parquet]
PROJECT */2 COLUMNS
SELECTION: [(col("date")) == (2023-01-01)]

Observe that the first scan, which uses a string literal, skips the file, but the second one, which uses a date, does not.

Issue description

Casting a hive-partitioned column from str to date causes the whole dataset to be scanned, rather than pruning reads as happens with a direct string comparison.
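For intuition, pruning works by comparing the partition value embedded in each file path against the filter literal, without opening the file; this only works when both sides are directly comparable. A toy sketch of that idea in pure Python (a hypothetical helper, not Polars internals):

```python
from datetime import date

def can_skip(path: str, wanted: date) -> bool:
    """Hypothetical sketch of path-based pruning: extract the "date=..."
    segment from the hive path and compare it to the filter literal."""
    part = next(s for s in path.split("/") if s.startswith("date="))
    return date.fromisoformat(part.removeprefix("date=")) != wanted

print(can_skip(".env/date=2024-01-01/data.parquet", date(2023, 1, 1)))  # True
```

Once an expression such as str.to_date() wraps the column, this direct path-vs-literal comparison is no longer available, which is why the whole dataset gets read.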

Expected behavior

Both types of filter should run at the same speed and prune the same files from the read. The problem is magnified with multiple large hive-partitioned datasets keyed on date: computation time scales linearly with the size of the whole dataset when the second filter form is used.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-4.14.111-1.el7.centos.x86_64-x86_64-with-glibc2.17
Python:               3.9.14 (heads/master:70f9229, Sep 28 2022, 09:57:57) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@mwhithead mwhithead added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 27, 2024
@c-peters
Collaborator

Thanks, I edited the path when reading from data/mw/ to ./

@c-peters
Collaborator

c-peters commented Feb 27, 2024

We (currently) allow comparisons in the form of pl.col("...") > | < | == | != literal. Would there be a problem with the first case, i.e. casting the date to a string?

first.filter(pl.col("date") == dt.date(2024, 2, 1).strftime(...)).explain()

Any expression that might alter the value of the column (e.g. pl.col("date").str.to_date()) would significantly increase the complexity and overhead of hive partitioning, as we would need to run the expression for every file instead of comparing the statistic in the path directly to the literal.

It would be nice if we could provide feedback, e.g. a warning, when a filter on a hive-partitioned dataset ends up scanning the whole dataset.
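The suggested workaround relies on rendering the date literal in exactly the string form used in the partition path, so the comparison stays a plain literal equality that the pruner can evaluate. A minimal sketch of that conversion (stdlib only; the "%Y-%m-%d" format is an assumption matching the date=2024-02-01 style paths seen in this issue):

```python
from datetime import date

# Render the date literal in the same "YYYY-MM-DD" form that appears
# in the hive partition path (e.g. "date=2024-02-01").
target = date(2024, 2, 1).strftime("%Y-%m-%d")
print(target)  # 2024-02-01

# The plain string literal keeps the predicate prunable, e.g.:
#   first.filter(pl.col("date") == target)
```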

@mwhithead
Author

Ah, I guess I made the example slightly too small. What I am actually using, and what is most powerful (I think), is:

first.filter(pl.col("date").str.to_date().is_between(start, end)).explain()

Otherwise, to use literal comparisons I need to loop over the required dates and combine the resulting equality checks. I currently do this in my environment, but it is pretty hacky.
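That manual workaround can be sketched as follows, assuming ISO-formatted partition values; the start and end bounds and the final is_in predicate are illustrative, and the thread does not state whether is_in itself pruned at the time (chaining == comparisons would be the conservative form):

```python
from datetime import date, timedelta

# Illustrative bounds standing in for the real start/end dates.
start, end = date(2024, 2, 1), date(2024, 2, 5)

# Build the list of ISO partition strings between start and end (inclusive).
wanted = [(start + timedelta(days=i)).isoformat()
          for i in range((end - start).days + 1)]
print(wanted[0], wanted[-1])  # 2024-02-01 2024-02-05

# These plain string literals can then drive a prunable filter, e.g.
#   first.filter(pl.col("date").is_in(wanted))
```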

@ritchie46
Member

I agree this should be fixed. We first need to do proper schema inference on hive partitions. Once that is in place, we can reuse an architecture similar to the one we use for parquet statistics pruning.

@mwhithead
Author

Awesome, is there an issue for tracking the schema inference? I can just follow along on that.
Thanks

@c-peters
Collaborator

c-peters commented Mar 4, 2024

Not yet, but I have now created an issue for hive partition schema inference (#14838).

@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Jul 3, 2024
@nameexhaustion nameexhaustion added enhancement New feature or an improvement of an existing feature accepted Ready for implementation P-medium Priority: medium and removed bug Something isn't working labels Jul 5, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 5, 2024
@nameexhaustion nameexhaustion removed the needs triage Awaiting prioritization by a maintainer label Jul 5, 2024
@nameexhaustion nameexhaustion self-assigned this Jul 5, 2024
@nameexhaustion nameexhaustion changed the title hive partitioning unable to skip files when casting column to date File list pruning with hive partition columns doesn't work with date type Jul 5, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jul 11, 2024