scan_parquet() + collect() errors if the parquet file was created with sink_parquet() #365
I got the following error messages:
> pl$scan_parquet(dest2)$collect()
thread '<unnamed>' panicked at 'should not fail: ComputeError(ErrString("cannot concat categoricals coming from a different source; consider setting a global StringCache"))', <redacted>/polars/polars-core/src/frame/mod.rs:923:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
polars: closing concurrent R handler
Error: Execution halted with the following contexts
0: In R: in $collect():
0: During function call [pl$scan_parquet(dest2)$collect()]
1: A polars sub-thread panicked. See panic msg, which is likely more informative than this error: Any { .. }
Maybe it's related to the string cache? I'm not sure about this.
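For readers unfamiliar with the panic message ("cannot concat categoricals coming from a different source; consider setting a global StringCache"), here is a rough illustration of the general problem it describes. This is a toy model in plain Python, not polars' actual implementation, and it does not claim to reproduce the cause of this particular bug. The names `encode_local` and `encode_shared` are hypothetical. The idea: if each chunk maps strings to integer codes independently, the same code can mean different strings in different chunks, so the codes cannot simply be concatenated. A shared cache avoids that.

```python
# Toy model only: NOT polars' implementation, just the general idea
# behind "categoricals from a different source" vs. a shared string cache.

def encode_local(values):
    """Encode strings as integer codes with a mapping private to this chunk."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return codes, mapping


def encode_shared(values, cache):
    """Encode strings as integer codes using a mapping shared across chunks."""
    return [cache.setdefault(v, len(cache)) for v in values]


chunk1 = ["a", "b", "c"]
chunk2 = ["d", "a"]

# Independent encodings: code 0 means "a" in chunk1 but "d" in chunk2,
# so concatenating the raw codes would silently mix up the categories.
codes1, map1 = encode_local(chunk1)
codes2, map2 = encode_local(chunk2)

# Shared encoding: every chunk agrees on what each code means,
# so the codes can be concatenated and decoded safely.
cache = {}
shared1 = encode_shared(chunk1, cache)
shared2 = encode_shared(chunk2, cache)
combined = shared1 + shared2

decode = {code: s for s, code in cache.items()}
decoded = [decode[c] for c in combined]
```

In this sketch the shared `cache` dictionary plays the role that the error message attributes to a global string cache: it keeps the string-to-code mapping consistent across sources.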
Also personally I think this is not likely to be a bug of |
I can't reproduce it with py-polars:
import polars as pl
df = pl.LazyFrame(
{"values": [1, 2, 3], "values2": ["a", "b", "c"]},
schema={"values": pl.Float64, "values2": pl.Categorical},
)
df.sink_parquet("foo.parquet")
pl.scan_parquet("foo.parquet").collect()
Maybe it was fixed in a recent version.
It fails when there are more than 3 values in the categorical column. Simpler reprex:
library(polars)
lf1 <- pl$LazyFrame(values = factor(letters[1:3]))
lf2 <- pl$LazyFrame(values = factor(letters[1:4]))
dest <- tempfile(fileext = ".parquet")
lf1$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (3, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> └────────┘
lf2$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> polars: closing concurrent R handler
#> Error: Execution halted with the following contexts
#> 0: In R: in $collect():
#> 1: A polars sub-thread panicked. See panic msg, which is likely more informative than this error: Any { .. }
I'm working on bumping rust-polars to 0.32.0 and am quite far along.
Looks like #334 solved this:
library(polars)
lf1 <- pl$LazyFrame(values = factor(letters[1:3]))
lf2 <- pl$LazyFrame(values = factor(letters[1:4]))
dest <- tempfile(fileext = ".parquet")
lf1$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (3, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> └────────┘
lf2$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (4, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> │ d      │
#> └────────┘
I think sink_parquet() has some bug inside: running pl$scan_parquet()$collect() errors if the file was created with sink_parquet() but not if it was created with arrow::write_parquet().