Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keep more sorted information with DataFrame #10935

Open
Sage0614 opened this issue Sep 5, 2023 · 5 comments
Open

keep more sorted information with DataFrame #10935

Sage0614 opened this issue Sep 5, 2023 · 5 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@Sage0614
Copy link

Sage0614 commented Sep 5, 2023

Problem description

currently when we sort a DataFrame by more than one column, only the first column is marked as sorted:

import polars as pl

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)

test_df = test_df.sort(["idx_1", "idx_2"])

print(test_df)

print(test_df.flags)

output:

shape: (6, 3)
┌───────┬───────┬───────┐
│ idx_1 ┆ idx_2 ┆ value │
│ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 1     │
│ 1     ┆ 5     ┆ 4     │
│ 2     ┆ 4     ┆ 2     │
│ 2     ┆ 6     ┆ 5     │
│ 3     ┆ 5     ┆ 3     │
│ 3     ┆ 6     ┆ 6     │
└───────┴───────┴───────┘
{'idx_1': {'SORTED_ASC': True, 'SORTED_DESC': False}, 'idx_2': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'value': {'SORTED_ASC': False, 'SORTED_DESC': False}}

my questions are:

  1. are the downstream operations aware that the idx_2 is sorted groupwise?
  2. if not, is there a way to flag that idx_2 is group by idx_1 sorted? and any order sensitive groupwise operation can potentially benefit from it?
@Sage0614 Sage0614 added the enhancement New feature or an improvement of an existing feature label Sep 5, 2023
@mcrumiller
Copy link
Contributor

This seems like a bit of a minefield, but maybe not. If you label A as sorted by B, then if B changes you need to de-flag anything that depends on it, including A, which means that for each column we now need a list of all columns that depend on it. Maybe that's all that needs to be done? That seems to be jumping into some column-intertwinement that either could be opening a can of worms, or new possibilities.

@Sage0614
Copy link
Author

Sage0614 commented Sep 5, 2023

@mcrumiller I am asking this because currently polars seems lack a way similar to create key/index in SQL, for example, in R's data.table there's a setkey function which I can set index or physically reorder the data, and the key metadata information is preserved:
https://www.rdocumentation.org/packages/data.table/versions/1.14.8/topics/setkey

the set_sorted seems only be able to track that information on a per column basis but not at the combined level.

if you use it in a way like key/index in SQL, it is a table level information, not a field level information. and it rarely changes

@mcrumiller
Copy link
Contributor

@Sage0614 yeah, it does seem quite useful, I agree.

@deanm0000
Copy link
Collaborator

To get a sorted key field you can make a struct

For instance:

test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].is_sorted()
### True
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].struct.field('idx_1').is_sorted()
### True
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')['key'].struct.field('idx_2').is_sorted()
### False
test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key').filter(pl.col('idx_1')==1)['key'].struct.field('idx_2').is_sorted()
### True

@Hoeze
Copy link

Hoeze commented Jan 6, 2024

This does not really help as polars does not flag the struct column as being sorted:

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)
test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')

test_df.flags['key']
# {'SORTED_ASC': False, 'SORTED_DESC': False}

Also, I cannot use this property to merge sorted dataframes:

# test_df.merge_sorted(test_df, key="key")
thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/merge_sorted.rs:144:13:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[5], line 10
      1 test_df = pl.DataFrame(
      2     {
      3         "idx_1": [1, 2, 3, 1, 2, 3],
   (...)
      6     }
      7 )
      8 test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')
---> 10 test_df.merge_sorted(test_df, key="key")

File /opt/anaconda/lib/python3.10/site-packages/polars/dataframe/frame.py:10221, in DataFrame.merge_sorted(self, other, key)
  10157 def merge_sorted(self, other: DataFrame, key: str) -> DataFrame:
  10158     """
  10159     Take two sorted DataFrames and merge them by the sorted key.
  10160 
   (...)
  10219     └────────┴─────┘
  10220     """
> 10221     return self.lazy().merge_sorted(other.lazy(), key).collect(_eager=True)

File /opt/anaconda/lib/python3.10/site-packages/polars/lazyframe/frame.py:1706, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1693     comm_subplan_elim = False
   1695 ldf = self._ldf.optimization_toggle(
   1696     type_coercion,
   1697     predicate_pushdown,
   (...)
   1704     _eager,
   1705 )
-> 1706 return wrap_df(ldf.collect())

PanicException: not implemented

I tried this with latest polars v0.22.2:

--------Version info---------
Polars:               0.20.2
Index type:           UInt32
Platform:             Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
Python:               3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

4 participants