Duplicate values from .cat.get_categories #16916

Closed
2 tasks done
uemurax opened this issue Jun 13, 2024 · 1 comment · Fixed by #17041
Labels: A-dtype-categorical (Area: categorical data type), accepted (Ready for implementation), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)


uemurax commented Jun 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame([
    pl.Series('e', ['a', 'b'] * 500_000, pl.Enum(['a', 'b'])),
])
df.write_parquet('data.parquet')
df_1 = pl.read_parquet('data.parquet')
print(df_1.select(pl.col('e').cat.get_categories()))

Log output

shape: (6, 1)
┌─────┐
│ e   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ a   │
│ b   │
│ a   │
│ b   │
└─────┘

Issue description

Duplicate values are returned from .cat.get_categories. This seems to happen when a large DataFrame containing Enum columns is written to a Parquet file and read back.

Expected behavior

shape: (2, 1)
┌─────┐
│ e   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
└─────┘

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.38
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@uemurax added the bug, needs triage, and python labels on Jun 13, 2024
@stinodego added the A-dtype-categorical and P-low labels and removed the needs triage label on Jun 13, 2024
github-project-automation bot moved this to Ready in Backlog on Jun 13, 2024

c-peters (Collaborator) commented

After some digging, the problem occurs in the code below. The large Parquet file is split into multiple row groups, which are read into a DataFrame with multiple chunks. The schema of each individual chunk is correct. However, we evaluate get_categories in parallel over each chunk and then combine the results vertically into one, so the category list is repeated once per chunk, which produces this output.

let df = if self.streamable
    && df.n_chunks() > 1
    && df.height() > POOL.current_num_threads() * 2
    && self.options.run_parallel
{
    let chunks = df.split_chunks().collect::<Vec<_>>();
    let iter = chunks.into_par_iter().map(|mut df| {
        let selected_cols = evaluate_physical_expressions(
            &mut df,
            &self.expr,
            state,
            self.has_windows,
            self.options.run_parallel,
        )?;
        check_expand_literals(selected_cols, df.is_empty(), self.options)
    });
    let df = POOL.install(|| iter.collect::<PolarsResult<Vec<_>>>())?;
    accumulate_dataframes_vertical_unchecked(df)
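
To make the mechanism above concrete, here is a toy model in plain Python. It is not the Polars implementation: the chunk dictionaries and the helper names (`split_into_chunks`, `select_per_chunk`, `select_whole_frame`) are hypothetical and exist only for illustration. It assumes three chunks, which would be consistent with the six rows in the log output but is not confirmed by the issue.

```python
# Toy model only: none of these names are Polars APIs.
ENUM_CATEGORIES = ["a", "b"]


def get_categories(chunk):
    """Return the categories of the chunk's dtype; the rows are irrelevant."""
    return list(chunk["categories"])


def split_into_chunks(values, n_chunks):
    """Split values into roughly equal chunks that all share one dtype."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [
        {"values": values[i:i + size], "categories": ENUM_CATEGORIES}
        for i in range(0, len(values), size)
    ]


def select_per_chunk(chunks):
    """Evaluate the expression on each chunk, then stack the results."""
    out = []
    for chunk in chunks:
        out.extend(get_categories(chunk))
    return out


def select_whole_frame(chunks):
    """Evaluate the expression once for the whole frame."""
    return get_categories(chunks[0])


chunks = split_into_chunks(["a", "b"] * 6, 3)
print(select_per_chunk(chunks))    # ['a', 'b', 'a', 'b', 'a', 'b']
print(select_whole_frame(chunks))  # ['a', 'b']
```

The per-chunk path is only safe for expressions whose output length follows the input rows. get_categories depends on the dtype alone, so stacking its per-chunk results repeats the list once per chunk.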

github-project-automation bot moved this from Ready to Done in Backlog on Jun 18, 2024
@c-peters added the accepted label on Jun 24, 2024
@c-peters self-assigned this on Jun 24, 2024