keep more sorted information with DataFrame #10935

Sage0614 · 2023-09-05T16:19:02Z

Problem description

currently when we sort a DataFrame by more than one column, only the first column is marked as sorted:

import polars as pl

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)

test_df = test_df.sort(["idx_1", "idx_2"])

print(test_df)

print(test_df.flags)

output:

shape: (6, 3)
┌───────┬───────┬───────┐
│ idx_1 ┆ idx_2 ┆ value │
│ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 1     │
│ 1     ┆ 5     ┆ 4     │
│ 2     ┆ 4     ┆ 2     │
│ 2     ┆ 6     ┆ 5     │
│ 3     ┆ 5     ┆ 3     │
│ 3     ┆ 6     ┆ 6     │
└───────┴───────┴───────┘
{'idx_1': {'SORTED_ASC': True, 'SORTED_DESC': False}, 'idx_2': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'value': {'SORTED_ASC': False, 'SORTED_DESC': False}}

my questions are:

are the downstream operations aware that the idx_2 is sorted groupwise?
if not, is there a way to flag that idx_2 is group by idx_1 sorted? and any order sensitive groupwise operation can potentially benefit from it?

The text was updated successfully, but these errors were encountered:

mcrumiller · 2023-09-05T17:31:00Z

This seems like a bit of a minefield, but maybe not. If you label A as sorted by B, then if B changes you need to de-flag anything that depends on it, including A, which means that for each column we now need a list of all columns that depend on it. Maybe that's all that needs to be done? That seems to be jumping into some column-intertwinement that either could be opening a can of worms, or new possibilities.

Sage0614 · 2023-09-05T18:17:20Z

@mcrumiller I am asking this because currently polars seems lack a way similar to create key/index in SQL, for example, in R's data.table there's a setkey function which I can set index or physically reorder the data, and the key metadata information is preserved:
https://www.rdocumentation.org/packages/data.table/versions/1.14.8/topics/setkey

the set_sorted seems only be able to track that information on a per column basis but not at the combined level.

if you use it in a way like key/index in SQL, it is a table level information, not a field level information. and it rarely changes

mcrumiller · 2023-09-05T18:21:13Z

@Sage0614 yeah, it does seem quite useful, I agree.

deanm0000 · 2023-11-21T20:48:34Z

To get a sorted key field you can make a struct

For instance:

test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].is_sorted()
### True
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].struct.field('idx_1').is_sorted()
### True
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].struct.field('idx_2').is_sorted()
### False
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key').filter(pl.col('idx_1')==1)['key'].struct.field('idx_2').is_sorted()
### True

Hoeze · 2024-01-06T14:49:34Z

This does not really help as polars does not flag the struct column as being sorted:

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)
test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')

test_df.flags['key']
# {'SORTED_ASC': False, 'SORTED_DESC': False}

Also, I cannot use this property to merge sorted dataframes:

# test_df.merge_sorted(test_df, key="key")
thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/merge_sorted.rs:144:13:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[5], line 10
      1 test_df = pl.DataFrame(
      2     {
      3         "idx_1": [1, 2, 3, 1, 2, 3],
   (...)
      6     }
      7 )
      8 test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')
---> 10 test_df.merge_sorted(test_df, key="key")

File /opt/anaconda/lib/python3.10/site-packages/polars/dataframe/frame.py:10221, in DataFrame.merge_sorted(self, other, key)
  10157 def merge_sorted(self, other: DataFrame, key: str) -> DataFrame:
  10158     """
  10159     Take two sorted DataFrames and merge them by the sorted key.
  10160 
   (...)
  10219     └────────┴─────┘
  10220     """
> 10221     return self.lazy().merge_sorted(other.lazy(), key).collect(_eager=True)

File /opt/anaconda/lib/python3.10/site-packages/polars/lazyframe/frame.py:1706, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1693     comm_subplan_elim = False
   1695 ldf = self._ldf.optimization_toggle(
   1696     type_coercion,
   1697     predicate_pushdown,
   (...)
   1704     _eager,
   1705 )
-> 1706 return wrap_df(ldf.collect())

PanicException: not implemented

I tried this with latest polars v0.22.2:

--------Version info---------
Polars:               0.20.2
Index type:           UInt32
Platform:             Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
Python:               3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Sage0614 added the enhancement New feature or an improvement of an existing feature label Sep 5, 2023

This was referenced Jan 6, 2024

Flag sorted struct-type columns as sorted #13484

Closed

Implement merge_sorted for struct-type columns #13485

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

keep more sorted information with DataFrame #10935

keep more sorted information with DataFrame #10935

Sage0614 commented Sep 5, 2023

mcrumiller commented Sep 5, 2023

Sage0614 commented Sep 5, 2023

mcrumiller commented Sep 5, 2023

deanm0000 commented Nov 21, 2023

Hoeze commented Jan 6, 2024 •

edited

Loading

keep more sorted information with DataFrame #10935

keep more sorted information with DataFrame #10935

Comments

Sage0614 commented Sep 5, 2023

Problem description

mcrumiller commented Sep 5, 2023

Sage0614 commented Sep 5, 2023

mcrumiller commented Sep 5, 2023

deanm0000 commented Nov 21, 2023

Hoeze commented Jan 6, 2024 • edited Loading

Hoeze commented Jan 6, 2024 •

edited

Loading