keep more sorted information with DataFrame #10935
Comments
This seems like a bit of a minefield, but maybe not. If you label |
@mcrumiller I am asking because polars currently seems to lack anything like creating a key/index in SQL. For example, R's data.table has a setkey function with which I can set an index or physically reorder the data, and the key metadata is preserved. set_sorted seems able to track that information only on a per-column basis, not at the combined level. Used like a key/index in SQL, this is table-level information, not field-level information, and it rarely changes. |
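To make the per-column versus combined-level distinction concrete, here is a minimal plain-Python sketch. It does not use polars and is not how polars tracks sortedness; `is_sorted_by` is a hypothetical helper introduced only for illustration, and the data reuses the `idx_1`/`idx_2` values from the example further down the thread.

```python
from typing import Dict, List, Sequence

Row = Dict[str, int]


def is_sorted_by(rows: Sequence[Row], cols: Sequence[str]) -> bool:
    """Hypothetical helper: True if rows are in non-decreasing
    lexicographic order of the given columns."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


rows: List[Row] = [
    {"idx_1": i1, "idx_2": i2}
    for i1, i2 in zip([1, 2, 3, 1, 2, 3], [4, 4, 5, 5, 6, 6])
]

# Sort by the composite key (idx_1, idx_2), as a table-level key would.
rows.sort(key=lambda r: (r["idx_1"], r["idx_2"]))

# Per-column view: each column checked in isolation.
per_column = {c: is_sorted_by(rows, [c]) for c in ("idx_1", "idx_2")}

# Table-level view: the pair of columns checked together.
combined = is_sorted_by(rows, ["idx_1", "idx_2"])

print(per_column)
print(combined)
```

After the composite sort, `idx_2` on its own is no longer monotonic, so a per-column flag cannot express that the table is ordered by the pair. That is the information a table-level key would carry.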
@Sage0614 yeah, it does seem quite useful, I agree. |
To get a sorted key field, you can make a struct. For instance:
|
This does not really help, as polars does not flag the struct column as being sorted:
import polars as pl

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)
test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')
test_df.flags['key']
# {'SORTED_ASC': False, 'SORTED_DESC': False}
Also, I cannot use this property to merge sorted dataframes:
# test_df.merge_sorted(test_df, key="key")
thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/merge_sorted.rs:144:13:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[5], line 10
1 test_df = pl.DataFrame(
2 {
3 "idx_1": [1, 2, 3, 1, 2, 3],
(...)
6 }
7 )
8 test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')
---> 10 test_df.merge_sorted(test_df, key="key")
File /opt/anaconda/lib/python3.10/site-packages/polars/dataframe/frame.py:10221, in DataFrame.merge_sorted(self, other, key)
10157 def merge_sorted(self, other: DataFrame, key: str) -> DataFrame:
10158 """
10159 Take two sorted DataFrames and merge them by the sorted key.
10160
(...)
10219 └────────┴─────┘
10220 """
> 10221 return self.lazy().merge_sorted(other.lazy(), key).collect(_eager=True)
File /opt/anaconda/lib/python3.10/site-packages/polars/lazyframe/frame.py:1706, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
1693 comm_subplan_elim = False
1695 ldf = self._ldf.optimization_toggle(
1696 type_coercion,
1697 predicate_pushdown,
(...)
1704 _eager,
1705 )
-> 1706 return wrap_df(ldf.collect())
PanicException: not implemented
I tried this with the latest polars, v0.22.2:
|
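For reference, the operation attempted in the comment above, merging two inputs that are each already sorted on a composite key, can be sketched in plain Python with the standard library's `heapq.merge`. This only illustrates the intended semantics; it is not the polars `merge_sorted` implementation, and the row tuples and labels are made up for the example.

```python
import heapq

# Each row is (idx_1, idx_2, label); both inputs are already sorted
# on the composite key (idx_1, idx_2).
left = [(1, 4, "a"), (2, 6, "b"), (3, 5, "c")]
right = [(1, 5, "x"), (2, 4, "y"), (3, 6, "z")]


def composite_key(row):
    """Order rows by the pair (idx_1, idx_2), ignoring the label."""
    return (row[0], row[1])


# heapq.merge streams the inputs lazily and relies on each one
# already being sorted by the same key.
merged = list(heapq.merge(left, right, key=composite_key))

for row in merged:
    print(row)
```

The merge is only valid because both inputs are known to be ordered by the same multi-column key, which is the kind of table-level guarantee the thread asks polars to track.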
Problem description
Currently, when we sort a DataFrame by more than one column, only the first column is marked as sorted:
Output:
My questions are: