File list pruning with hive partition columns doesn't work with date type #14712

Closed · Fixed by #17545
mwhithead opened this issue Feb 27, 2024 · 6 comments
Assignees: nameexhaustion
Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@mwhithead

mwhithead commented Feb 27, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import os
from datetime import date
from pathlib import Path

os.environ["POLARS_VERBOSE"] = "1"

path = Path(".env/date=2024-01-01/data.parquet")
path.parent.mkdir(exist_ok=True, parents=True)

pl.DataFrame({"x": 1}).write_parquet(path)

print("==== Using string ====")
print(
    pl.scan_parquet(path, hive_partitioning=True, try_parse_hive_dates=False)
    .filter(pl.col("date") == "2023-01-01")
    .explain(),
)

print("==== Using date ====")
print(
    pl.scan_parquet(path, hive_partitioning=True)
    .filter(pl.col("date") == date(2023, 1, 1))
    .explain(),
)
Original example
import polars as pl
import datetime as dt
from pathlib import Path
from random import choice
from string import ascii_uppercase, digits

data_dir = Path("./polars_test")


identifiers = ["".join(choice(ascii_uppercase + digits) for _ in range(12)) for _ in range(1024)]
some_numbers = list(range(128))

dates = pl.date_range(dt.date(2023, 1, 1), dt.date(2024, 2, 26), eager=True)
df = (
    pl.DataFrame(dates, schema=["date"])
    .join(pl.DataFrame(identifiers, schema=["identifier"]), how="cross")
    .join(pl.DataFrame(some_numbers, schema=["number"]), how="cross")
)
df.write_parquet(
    data_dir / "first",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["date"]}
)

pl.Config.set_verbose()
first = pl.scan_parquet("./polars_test/first/date=*/*.parquet")


first.filter(pl.col("date") == "2024-02-01").explain()


first.filter(pl.col("date").str.to_date() == dt.date(2024, 2, 1)).explain()

Log output (original example)

parquet file can be skipped, the statistics were sufficient to apply the predicate.
'\n  Parquet SCAN \n  PROJECT */3 COLUMNS\n  SELECTION: [(col("date")) == (String(2024-01-01))]'
(the same "parquet file can be skipped" line repeats for the remaining files)
hive partitioning: skipped 26 files, first file : /data/mw/polars_test/first/date=2024-02-01/af0866188c4948e5a0d14536321b4986-0.parquet


Then, for the second explain, we don't get any predicate skips:

'\n  Parquet SCAN 26 files: first file: /data/mw/polars_test/first/date=2024-02-01/af0866188c4948e5a0d14536321b4986-0.parquet\n  PROJECT */3 COLUMNS\n  SELECTION: [(col("date").str.strptime([String(raise)])) == (2024-02-01)]'

Log output (minimal example)

==== Using string ====
parquet file can be skipped, the statistics were sufficient to apply the predicate.
hive partitioning: skipped 1 files, first file : .env/date=2024-01-01/data.parquet
DF ["x", "date"]; PROJECT */2 COLUMNS; SELECTION: None
==== Using date ====
parquet file must be read, statistics not sufficient for predicate.
Parquet SCAN [.env/date=2024-01-01/data.parquet]
PROJECT */2 COLUMNS
SELECTION: [(col("date")) == (2023-01-01)]

Observe that the first scan, which uses a string literal, skips the file, but the second one, which uses a date, does not.

Issue description

Casting a hive-partitioned column from str to date causes the whole dataset to be scanned, rather than pruning reads as happens with a direct string comparison.
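For intuition, pruning works by comparing the partition value embedded in each file path against the filter literal, without opening the file; this only works when both sides are directly comparable. A toy sketch of that idea in pure Python (a hypothetical helper, not Polars internals):

```python
from datetime import date

def can_skip(path: str, wanted: date) -> bool:
    """Hypothetical sketch of path-based pruning: extract the "date=..."
    segment from the hive path and compare it to the filter literal."""
    part = next(s for s in path.split("/") if s.startswith("date="))
    return date.fromisoformat(part.removeprefix("date=")) != wanted

print(can_skip(".env/date=2024-01-01/data.parquet", date(2023, 1, 1)))  # True
```

Once an expression such as str.to_date() wraps the column, this direct path-vs-literal comparison is no longer available, which is why the whole dataset gets read.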

Expected behavior

Both types of filter should run at the same speed and prune the same files from the read. The problem is magnified with multiple large hive-partitioned datasets keyed on date: computation time scales linearly with the size of the whole dataset when the second filter form is used.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-4.14.111-1.el7.centos.x86_64-x86_64-with-glibc2.17
Python:               3.9.14 (heads/master:70f9229, Sep 28 2022, 09:57:57) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@mwhithead mwhithead added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 27, 2024
@c-peters
Collaborator

Thanks, I edited the path when reading from data/mw/ to ./

@c-peters
Collaborator

c-peters commented Feb 27, 2024

We (currently) allow comparisons in the form of pl.col("...") > | < | == | != literal. Would there be a problem with the first case, i.e. casting the date to a string?

first.filter(pl.col("date") == dt.date(2024, 2, 1).strftime(...)).explain()

Any expression that might alter the value of the column (e.g. pl.col("date").str.to_date()) would significantly increase the complexity and overhead of hive partitioning, as we would need to run the expression for every file instead of comparing the statistic in the path directly to the literal.

It would be nice if we could provide feedback, e.g. a warning, when a filter on a hive-partitioned dataset ends up scanning the whole dataset.
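The suggested workaround relies on rendering the date literal in exactly the string form used in the partition path, so the comparison stays a plain literal equality that the pruner can evaluate. A minimal sketch of that conversion (stdlib only; the "%Y-%m-%d" format is an assumption matching the date=2024-02-01 style paths seen in this issue):

```python
from datetime import date

# Render the date literal in the same "YYYY-MM-DD" form that appears
# in the hive partition path (e.g. "date=2024-02-01").
target = date(2024, 2, 1).strftime("%Y-%m-%d")
print(target)  # 2024-02-01

# The plain string literal keeps the predicate prunable, e.g.:
#   first.filter(pl.col("date") == target)
```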

@mwhithead
Author

Ah, I guess I made the example slightly too small. What I am actually using, and what is most powerful (I think), is:

first.filter(pl.col("date").str.to_date().is_between(start, end)).explain()

Otherwise, to use literal comparisons I need to loop over the required dates and combine the resulting equality checks. I currently do this in my environment, but it is pretty hacky.
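That manual workaround can be sketched as follows, assuming ISO-formatted partition values; the start and end bounds and the final is_in predicate are illustrative, and the thread does not state whether is_in itself pruned at the time (chaining == comparisons would be the conservative form):

```python
from datetime import date, timedelta

# Illustrative bounds standing in for the real start/end dates.
start, end = date(2024, 2, 1), date(2024, 2, 5)

# Build the list of ISO partition strings between start and end (inclusive).
wanted = [(start + timedelta(days=i)).isoformat()
          for i in range((end - start).days + 1)]
print(wanted[0], wanted[-1])  # 2024-02-01 2024-02-05

# These plain string literals can then drive a prunable filter, e.g.
#   first.filter(pl.col("date").is_in(wanted))
```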

@ritchie46
Member

I agree this should be fixed. We first need to do proper schema inference on hive partitions. Once that is in place, we can reuse an architecture similar to the one we use for parquet statistics pruning.

@mwhithead
Author

Awesome, is there an issue for tracking the schema inference? I can just follow along on that.
Thanks

@c-peters
Collaborator

c-peters commented Mar 4, 2024

Not yet, but I have now created an issue for hive partition schema inference (#14838).

@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Jul 3, 2024
@nameexhaustion nameexhaustion added enhancement New feature or an improvement of an existing feature accepted Ready for implementation P-medium Priority: medium and removed bug Something isn't working labels Jul 5, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 5, 2024
@nameexhaustion nameexhaustion removed the needs triage Awaiting prioritization by a maintainer label Jul 5, 2024
@nameexhaustion nameexhaustion self-assigned this Jul 5, 2024
@nameexhaustion nameexhaustion changed the title hive partitioning unable to skip files when casting column to date File list pruning with hive partition columns doesn't work with date type Jul 5, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jul 11, 2024