`maintain_order` in `top_k` is inconsistent with `maintain_order` in other places #15238

MarcoGorelli · 2024-03-22T14:47:47Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [6, 5, 4]})
df.top_k(k=2, by='a', maintain_order=True)

Log output

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
│ 2   ┆ 5   │
└─────┴─────┘

Issue description

In top_k, maintain_order refers to how ties are broken, whereas elsewhere in Polars it refers to whether the original input order is preserved.

Expected behavior

Either

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 5   │
│ 3   ┆ 4   │
└─────┴─────┘

or for maintain_order to be renamed

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


etiennebacher · 2024-05-23T13:31:54Z

It looks like since #16041 the maintain_order argument doesn't do anything in top_k():

https://github.com/pola-rs/polars/blob/main/py-polars/polars/expr/expr.py#L2096-L2097

stinodego · 2024-05-25T12:11:39Z

I seem to recall we have discussed this in the issue triage before. Not sure what conclusion we arrived at. Both maintain_order and ascending are problematic, in my opinion.

nameexhaustion · 2024-05-30T05:08:43Z

It seems the current maintain_order behaves like keep='first' does in DataFrame.unique.

stinodego · 2024-05-30T11:20:14Z

We have decided to remove the following flags:

maintain_order
nulls_last
multithreaded

We will not give any guarantees about the order in which the top/bottom elements are returned. descending will remain as it pertains to the by element and is required to express the desired operation.

In both top_k and bottom_k, null values will always have lowest priority, e.g. the result will only include nulls if the column contains fewer than k non-null elements.
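
The null rule described above can be sketched in plain Python. This is only a model of the stated semantics, not the Polars implementation: the function names `top_k_model` and `bottom_k_model` are hypothetical, `None` stands in for null, and the sorted output order is chosen purely for determinism (the decision above gives no ordering guarantee).

```python
def top_k_model(values, k):
    """Model of the described rule: the k largest values, nulls ranked last."""
    # Non-null values, largest first.
    non_null = sorted((v for v in values if v is not None), reverse=True)
    # Nulls are appended after all non-null values, so they are only
    # selected when fewer than k non-null values exist.
    nulls = [None] * (len(values) - len(non_null))
    return (non_null + nulls)[:k]


def bottom_k_model(values, k):
    """Model of the described rule: the k smallest values, nulls ranked last."""
    non_null = sorted(v for v in values if v is not None)
    nulls = [None] * (len(values) - len(non_null))
    return (non_null + nulls)[:k]


print(top_k_model([3, None, 1, 2], 2))     # no None: enough non-null values
print(bottom_k_model([3, None, 1, 2], 2))  # no None here either
print(top_k_model([None, 5, None], 2))     # None appears: only one non-null value
```

In this model a null shows up in the result only in the last call, where the input holds fewer than k non-null elements.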

In the future, we may include the option to maintain the original order or to return the elements in ascending/descending order.

We will deprecate the parameters in 0.20.31 and remove them in 1.0.0 when implementing the new behavior.

MarcoGorelli · 2024-05-30T11:29:51Z

Thanks for the update!

This

> We will not give any guarantees about the order in which the top/bottom elements are returned.

slightly concerns me, because it means that #10054 isn't addressed by having a by argument in the top_k expression. See this comment #10054 (comment) I made for an explanation of why.

stinodego · 2024-05-30T11:44:22Z

Regarding #10054 (comment) :

I don't see an issue with this. Operations on multi-column expressions always operate on the individual columns. If you want a guarantee that entire rows are preserved, you must use DataFrame.top_k.

In any case, we do want to support a 'stable' top_k that maintains the order in the original DataFrame/column, but that will be a feature we implement after 1.0.0, and we're not sure what the API will be (maybe a maintain_order flag, maybe an order parameter with multiple options, e.g. ["any", "maintain", "ascending", "descending"]).
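
For readers unsure what a 'stable' top_k would mean, here is a minimal pure-Python sketch. It is not a Polars API: the name `stable_top_k` is hypothetical, and breaking ties in favour of earlier positions is an assumption that the thread does not settle.

```python
import heapq


def stable_top_k(values, k):
    """Return the k largest values in the order they appear in the input."""
    # Select the positions of the k largest values. The key (value, -index)
    # makes earlier positions win ties (an assumption, see above).
    winners = heapq.nlargest(
        k, range(len(values)), key=lambda i: (values[i], -i)
    )
    # Restore the original input order by sorting the selected positions.
    return [values[i] for i in sorted(winners)]


print(stable_top_k([1, 2, 3], 2))
```

On the column a = [1, 2, 3] from the original report, this yields 2 then 3, which is the row order requested under "Expected behavior".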

Currently there is a maintain_order parameter and it flat out doesn't work, so that's no good to anyone.

MarcoGorelli · 2024-05-30T12:24:15Z

Agree on removing maintain_order, thanks for doing that

> If you want a guarantee that entire rows are preserved, you must use DataFrame.top_k.

Agree! The trouble is that DataFrame.top_k doesn't have a group_by argument. And when that was suggested in #10054, the response was that the solution was to add a by argument to Expr.top_k.

That would've been fine if there was some ordering guarantee. If there isn't, then I'd like to make the case that #10054 should be reopened.

stinodego · 2024-05-30T12:46:39Z

You're correct. I think in this case the issue should be "Add stable top_k that maintains order of original elements", since this is something we want to support anyway. That will solve the issue without adding a GroupBy.top_k.

MarcoGorelli · 2024-05-30T13:13:57Z

Thanks, have opened an issue

Regarding

> descending will remain as it pertains to the by element and is required to express the desired operation

Without consulting the docs, if I read

pl.col('a').top_k_by('b', k=2, descending=True)

then it looks like it means "take the elements of 'a' corresponding to the top 2 elements of 'b' when it's sorted in a descending manner". i.e. that it's the same as:

pl.col('a').sort_by('b', descending=True).head(2)

But, it's not. It does the opposite:

import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [5, 4, 6]})
print(df.select(pl.col('a').top_k_by('b', k=2, descending=True)))
print(df.select(pl.col('a').sort_by('b', descending=True).head(2)))

shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
└─────┘
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 1   │
└─────┘

It actually matches pl.col('a').sort_by('b', descending=False).head(2)!
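
The relationship reported here can be restated without Polars. The sketch below is a plain-Python model of the two outputs printed above, not library code: `sort_by_head` and `top_k_by_as_reported` are hypothetical names, and the second function encodes only the behaviour observed in this thread on 0.20.x.

```python
def sort_by_head(a, b, k, descending):
    """Model of sort_by(b, descending=...).head(k) applied to column a."""
    # Positions of a, ordered by the matching value in b.
    order = sorted(range(len(a)), key=lambda i: b[i], reverse=descending)
    return [a[i] for i in order[:k]]


def top_k_by_as_reported(a, b, k, descending):
    """Model of the reported top_k_by behaviour: the flag acts inverted."""
    # descending=False selects the largest b values (a plain "top k"), so
    # descending=True ends up selecting the smallest b values.
    return sort_by_head(a, b, k, descending=not descending)


a, b = [1, 2, 3], [5, 4, 6]
print(top_k_by_as_reported(a, b, 2, descending=True))
print(sort_by_head(a, b, 2, descending=True))
print(sort_by_head(a, b, 2, descending=False))
```

The first and third calls select the same elements of a, while the second differs, which is the mismatch being pointed out.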

In that sense, I (and others) have made the case that the behaviour of "descending" is backwards. It's going to get harder to course-correct later.

As hard as it may be, I suggest biting the bullet and reversing the behaviour of descending in 1.0, and that this be made very clear in the upgrade guide.

stinodego · 2024-05-30T13:24:45Z

I believe the existing behavior is correct. In your head example you are sorting a together with b, which is why you end up with different results.

MarcoGorelli · 2024-05-30T13:29:24Z

Sorting 'a' by 'b' is the intention - I've revised the example to use pl.col('a').sort_by('b') so it's more explicit

1. pl.col('a').top_k_by('b', k=2, descending=True)
2. pl.col('a').sort_by('b', descending=True).head(2)
3. pl.col('a').sort_by('b', descending=False).head(2)

Just reading these, would you expect 1. and 2. to match, or 1. and 3.? The current answer is that 1. and 3. match.

stinodego · 2024-05-30T13:46:52Z

Ok, I see now what you mean. I think the parameter should be called reverse in this case... will discuss.

EDIT: We're keeping it this way. It's a bit confusing but there is no real good way to do this. Changing True <-> False isn't going to help here. You could just as well think about taking the top_k_by('b') as sorting by b and taking the tail. In that case, the name works fine.

orlp changed the title ~~maintain_order in top_k is inconsistent with maintain_order in order places~~ maintain_order in top_k is inconsistent with maintain_order in other places Mar 22, 2024

stinodego added the python and bug labels May 25, 2024

stinodego mentioned this issue May 25, 2024

fix(python): Fix boolean trap issue in top_k/bottom_k #16489

Merged

stinodego added the enhancement label and removed the bug label May 25, 2024

stinodego added this to the 1.0.0 milestone May 25, 2024

stinodego added the needs decision label May 25, 2024

stinodego added this to Backlog May 26, 2024

github-project-automation bot moved this to Ready in Backlog May 26, 2024

stinodego moved this from Ready to Next in Backlog May 26, 2024

stinodego added the accepted label and removed the needs decision label May 30, 2024

stinodego self-assigned this May 30, 2024

MarcoGorelli mentioned this issue May 30, 2024

Add stable Expr.top_k #16596

Open

stinodego moved this from Next to In progress in Backlog May 30, 2024

This was referenced May 30, 2024

feat!: Remove deprecated top_k parameters nulls_last, maintain_order, and multithreaded #16599

Merged

bottom_k should not include nulls if the column contains at least k valid elements #16748

Closed

stinodego closed this as completed in #16599 Jun 5, 2024

github-project-automation bot moved this from In progress to Done in Backlog Jun 5, 2024
