Duplicate values from `.cat.get_categories` #16916

uemurax · 2024-06-13T05:33:26Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame([
    pl.Series('e', ['a', 'b'] * 500_000, pl.Enum(['a', 'b'])),
])
df.write_parquet('data.parquet')
df_1 = pl.read_parquet('data.parquet')
print(df_1.select(pl.col('e').cat.get_categories()))

Log output

shape: (6, 1)
┌─────┐
│ e   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ a   │
│ b   │
│ a   │
│ b   │
└─────┘

Issue description

Duplicate values are returned from .cat.get_categories. This seems to happen when a large DataFrame containing Enum columns is saved as a Parquet file and loaded back.

Expected behavior

shape: (2, 1)
┌─────┐
│ e   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
└─────┘

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.38
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


c-peters · 2024-06-13T19:12:16Z

After some digging, the problem occurs in the code below. The large Parquet file gets split into multiple row groups, which are read into a DataFrame with multiple chunks. The schema of each individual chunk is correct. However, we evaluate get_categories in parallel over each chunk and then combine the results into one, which produces the duplicated output.

polars/crates/polars-lazy/src/physical_plan/executors/projection.rs

Lines 26 to 44 in 4584b45

    
let df = if self.streamable
    && df.n_chunks() > 1
    && df.height() > POOL.current_num_threads() * 2
    && self.options.run_parallel
{
    let chunks = df.split_chunks().collect::<Vec<_>>();
    let iter = chunks.into_par_iter().map(|mut df| {
        let selected_cols = evaluate_physical_expressions(
            &mut df,
            &self.expr,
            state,
            self.has_windows,
            self.options.run_parallel,
        )?;
        check_expand_literals(selected_cols, df.is_empty(), self.options)
    });
    let df = POOL.install(|| iter.collect::<PolarsResult<Vec<_>>>())?;
    accumulate_dataframes_vertical_unchecked(df)
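
To make the effect concrete, here is a minimal pure-Python sketch of the behavior described above. It is not Polars code: the dict-based chunks and the helper names (`get_categories`, `evaluate_per_chunk`, `evaluate_whole`) are hypothetical stand-ins, and the choice of three chunks is an assumption picked to match the six rows in the log output.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an Enum column split into chunks.
# Every chunk carries the same fixed category list, as each row group does.
ENUM_CATEGORIES = ["a", "b"]


def make_chunk(n_rows):
    values = [ENUM_CATEGORIES[i % 2] for i in range(n_rows)]
    return {"categories": ENUM_CATEGORIES, "values": values}


def get_categories(chunk):
    # Returns the dtype's categories; the result does not depend on row count.
    return list(chunk["categories"])


def evaluate_per_chunk(chunks):
    # Mirrors the parallel path: evaluate on each chunk, then stack vertically.
    with ThreadPoolExecutor() as pool:
        per_chunk = list(pool.map(get_categories, chunks))
    return [cat for result in per_chunk for cat in result]


def evaluate_whole(chunks):
    # Evaluate once on the merged data instead of once per chunk.
    merged = {
        "categories": chunks[0]["categories"],
        "values": [v for c in chunks for v in c["values"]],
    }
    return get_categories(merged)


chunks = [make_chunk(4) for _ in range(3)]  # three chunks: an assumption
print(evaluate_per_chunk(chunks))  # ['a', 'b', 'a', 'b', 'a', 'b']
print(evaluate_whole(chunks))      # ['a', 'b']
```

The per-chunk path is only safe for expressions whose output length follows the input rows. An expression like get_categories returns the same fixed-length result for every chunk, so stacking the per-chunk results repeats it once per chunk.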

uemurax added the bug, needs triage, and python labels Jun 13, 2024

stinodego added the A-dtype-categorical and P-low labels and removed the needs triage label Jun 13, 2024

github-project-automation bot added this to Backlog Jun 13, 2024

github-project-automation bot moved this to Ready in Backlog Jun 13, 2024

c-peters mentioned this issue Jun 18, 2024

fix(rust): fix get categories on multiple row groups #17041

Merged

ritchie46 closed this as completed in #17041 Jun 18, 2024

github-project-automation bot moved this from Ready to Done in Backlog Jun 18, 2024

c-peters added the accepted label Jun 24, 2024

c-peters self-assigned this Jun 24, 2024
