import polars as pl

df = pl.DataFrame(
    pl.Series("d", ["foo", "bar"]),
    schema=pl.Schema({"d": pl.Enum(["foo", "bar", "ham"])}),
)
df.write_parquet("test.parquet")
print(pl.read_parquet("test.parquet").schema)
print(pl.scan_parquet("test.parquet").collect_schema())
Schema([('d', Enum(categories=['foo', 'bar', 'ham']))])
Schema([('d', Categorical(ordering='physical'))])
Enum type should be preserved when file is read with scan_parquet
Output from the example should be:
Schema([('d', Enum(categories=['foo', 'bar', 'ham']))])
Schema([('d', Enum(categories=['foo', 'bar', 'ham']))])
--------Version info---------
Polars:              1.10.0
Index type:          UInt32
Platform:            macOS-14.7-arm64-arm-64bit
Python:              3.10.6 (main, Aug 2 2022, 20:27:59) [Clang 14.0.3 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
I don't think the types Categorical and Enum exist in Parquet.
You should use the Arrow file format (Arrow IPC file) to store these types.
>>> import polars as pl
>>> df = pl.DataFrame(pl.Series("d", ["foo", "bar"]), schema=pl.Schema({"d": pl.Enum(["foo", "bar", "ham"])}))
>>> df.write_ipc("test.arrow")
>>> pl.read_ipc("test.arrow")
shape: (2, 1)
┌──────┐
│ d    │
│ ---  │
│ enum │
╞══════╡
│ foo  │
│ bar  │
└──────┘
This can definitely be preserved, either through the Arrow schema or through a POLARS_SCHEMA metadata argument.
ritchie46