Hive partitions are corrupted during reads from cloud storage in Polars 1.0.0-rc.1 #17155

Closed
2 tasks done
nameexhaustion opened this issue Jun 24, 2024 · 0 comments · Fixed by #17152
Labels: accepted (Ready for implementation), bug (Something isn't working), P-critical (Priority: critical), python (Related to Python Polars), regression (Issue introduced by a new release)

nameexhaustion (Collaborator) commented Jun 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import os
from polars.testing import assert_frame_equal
from pathlib import Path

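# Force the async reader (the code path used for cloud storage) on local files,
# and limit prefetching to a single file so the example reads more files than that.
os.environ["POLARS_FORCE_ASYNC"] = "1"
os.environ["POLARS_PREFETCH_SIZE"] = "1"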

root = Path(".env/data2")

dfs = [
    pl.DataFrame({"x": 1}),
    pl.DataFrame({"x": 2}),
    pl.DataFrame({"x": 3}),
]

paths = [
    root / "a=1/b=1/data.bin",
    root / "a=2/b=2/data.bin",
    root / "a=3/b=3/data.bin",
]

# Write each frame into its own hive-partitioned directory.
for df, path in zip(dfs, paths):
    path.parent.mkdir(exist_ok=True, parents=True)
    df.write_parquet(path)

lf = pl.scan_parquet(root)

assert_frame_equal(lf.collect(), pl.DataFrame({k: [1, 2, 3] for k in ["x", "a", "b"]}))

Output:

AssertionError: DataFrames are different (value mismatch for column 'a')
[left]:  [1, 1, 1]
[right]: [1, 2, 3]

Log output

No response

Issue description

This was introduced by 306a918. It occurs when more files are read than the calculated prefetch size.
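
To illustrate how a symptom like [1, 1, 1] can arise once files are processed in batches, here is a minimal pure-Python sketch. It is an assumption made for illustration only, not a description of the Polars source or of commit 306a918: the function attach_partitions and its use_global_index flag are hypothetical names. The sketch shows what happens if a partition value is looked up by a file's position inside its batch instead of its position in the full file list.

```python
def attach_partitions(files, partition_values, batch_size, use_global_index):
    """Pair each file with a partition value while processing files in batches.

    Hypothetical helper for illustration; this is not Polars code.
    """
    result = []
    for offset in range(0, len(files), batch_size):
        batch = files[offset : offset + batch_size]
        for i, name in enumerate(batch):
            # `i` restarts at 0 for every batch; only `offset + i`
            # identifies the file within the full list.
            index = offset + i if use_global_index else i
            result.append((name, partition_values[index]))
    return result


files = ["a=1/b=1/data.bin", "a=2/b=2/data.bin", "a=3/b=3/data.bin"]
a_values = [1, 2, 3]

# A batch size of 1 mirrors POLARS_PREFETCH_SIZE=1 in the example above.
buggy = attach_partitions(files, a_values, batch_size=1, use_global_index=False)
fixed = attach_partitions(files, a_values, batch_size=1, use_global_index=True)
```

In this sketch the two variants agree whenever all files fit in a single batch, and diverge only when there are more files than the batch size, which is consistent with the condition described above.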

Expected behavior

The given example passes, with columns a and b each read as [1, 2, 3].
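
For reference, the expected partition values can be read straight off the directory names in the example. The sketch below uses only the standard library; hive_values is a hypothetical helper written for this illustration, not a Polars function, and it returns the values as strings rather than the integers that the example's assertion compares against.

```python
from pathlib import Path


def hive_values(path: Path, root: Path) -> dict[str, str]:
    """Return the key=value pairs encoded in the directories between root and the file.

    Hypothetical helper for illustration; this is not part of Polars.
    """
    relative = path.relative_to(root)
    values = {}
    # Skip the final component, which is the file name itself.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not key=value segments
            values[key] = value
    return values


root = Path(".env/data2")
paths = [
    root / "a=1/b=1/data.bin",
    root / "a=2/b=2/data.bin",
    root / "a=3/b=3/data.bin",
]

# One dict per file, in the same order as `paths`.
expected = [hive_values(p, root) for p in paths]
```

Each file should contribute exactly one row whose a and b values match its own directory, so the values in expected line up row by row with the frame the assertion expects.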

Installed versions

main @ 46ba436

nameexhaustion added the labels bug, python, regression, accepted, P-critical on Jun 24, 2024
nameexhaustion self-assigned this on Jun 24, 2024
github-project-automation bot moved this to Ready in Backlog on Jun 24, 2024
nameexhaustion changed the title from "Hive partitions are not read correctly from cloud storage" to "Hive partitions are not read correctly from cloud storage in 1.0.0 pre-releases" on Jun 24, 2024
nameexhaustion changed the title from "Hive partitions are not read correctly from cloud storage in 1.0.0 pre-releases" to "Hive partitions are corrupted during reads from cloud storage in 1.0.0 pre-releases" on Jun 24, 2024
github-project-automation bot moved this from Ready to Done in Backlog on Jun 24, 2024
nameexhaustion changed the title from "Hive partitions are corrupted during reads from cloud storage in 1.0.0 pre-releases" to "Hive partitions are corrupted during reads from cloud storage in Polars 1.0.0-rc.1" on Jun 24, 2024