Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maintain_order in top_k is inconsistent with maintain_order in other places #15238

Closed
2 tasks done
MarcoGorelli opened this issue Mar 22, 2024 · 12 comments · Fixed by #16599
Closed
2 tasks done

maintain_order in top_k is inconsistent with maintain_order in other places #15238

MarcoGorelli opened this issue Mar 22, 2024 · 12 comments · Fixed by #16599
Assignees
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars
Milestone

Comments

@MarcoGorelli
Copy link
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame({'a': [1, 2, 3], 'b': [6, 5, 4]})
df.top_k(k=2, by='a', maintain_order=True)

Log output

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
│ 2   ┆ 5   │
└─────┴─────┘

Issue description

maintain_order here refers to how ties are broken, whereas in other places in Polars it refers to whether the original input order is preserved

Expected behavior

Either

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 5   │
│ 3   ┆ 4   │
└─────┴─────┘

or for maintain_order to be renamed

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer A-api Area: changes to the public API and removed needs triage Awaiting prioritization by a maintainer bug Something isn't working python Related to Python Polars labels Mar 22, 2024
@orlp orlp changed the title maintain_order in top_k is inconsistent with maintain_order in order places maintain_order in top_k is inconsistent with maintain_order in other places Mar 22, 2024
@etiennebacher
Copy link
Contributor

It looks like since #16041 the maintain_order argument doesn't do anything in top_k():

https://github.com/pola-rs/polars/blob/main/py-polars/polars/expr/expr.py#L2096-L2097

@stinodego stinodego added python Related to Python Polars bug Something isn't working labels May 25, 2024
@stinodego stinodego added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels May 25, 2024
@stinodego stinodego added this to the 1.0.0 milestone May 25, 2024
@stinodego
Copy link
Member

I seem to recall we have discussed this in the issue triage before. Not sure what conclusion we arrived at. Both maintain_order and ascending are problematic, in my opinion.

@stinodego stinodego added the needs decision Awaiting decision by a maintainer label May 25, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 26, 2024
@stinodego stinodego moved this from Ready to Next in Backlog May 26, 2024
@nameexhaustion
Copy link
Collaborator

It seems the current maintain_order behaves like a keep='first' does to DataFrame.unique

@stinodego stinodego added accepted Ready for implementation and removed needs decision Awaiting decision by a maintainer labels May 30, 2024
@stinodego stinodego self-assigned this May 30, 2024
@stinodego
Copy link
Member

stinodego commented May 30, 2024

We have decided to remove the following flags:

  • maintain_order
  • nulls_last
  • multithreaded

We will not give any guarantees about the order in which the top/bottom elements are returned. descending will remain as it pertains to the by element and is required to express the desired operation.

In both top_k and bottom_k, null values will always have lowest priority, e.g. the result will only include nulls if the column contains fewer than k non-null elements.

In the future, we may include the option to maintain the original order or to return the elements in ascending/descending order.

We will deprecate the parameters in 0.20.31 and remove them in 1.0.0 when implementing the new behavior.

@MarcoGorelli
Copy link
Collaborator Author

Thanks for the update!

This

We will not give any guarantees about the order in which the top/bottom elements are returned.

slightly concerns me, because it means that #10054 isn't addressed by having a by argument in the top_k expression. See this comment #10054 (comment) I made for an explanation of why

@stinodego
Copy link
Member

stinodego commented May 30, 2024

Regarding #10054 (comment) :

I don't see an issue with this. Operations on multi-column expressions always operate on the individual columns. If you want a guarantee that entire rows are preserved, you must use DataFrame.top_k.

In any case, we do want to support a 'stable' top_k that maintains the order in the original DataFrame/column, but that will be a feature we implement after 1.0.0, and we're not sure what the API will be (maybe a maintain_order flag, maybe we have an order parameter with multiple options, e.g. ["any", "maintain", "ascending", "descending"].

Currently there is a maintain_order parameter and it flat out doesn't work, so that's no good to anyone.

@MarcoGorelli
Copy link
Collaborator Author

Agree on removing maintain_order, thanks for doing that

If you want a guarantee that entire rows are preserved, you must use DataFrame.top_k.

Agree! The trouble is that DataFrame.top_k doesn't have a group_by argument. And when that was suggested in #10054, the response was that the solution was to add a by argument to Expr.top_k

That would've been fine if there was some ordering guarantee. If there isn't, then I'd like to make the case that #10054 should be reopened

@stinodego
Copy link
Member

You're correct. I think in this case I think the issue should be "Add stable top_k that maintains order of original elements", since this is something we want to support anyway. That will solve the issue without adding a GroupBy.top_k.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented May 30, 2024

Thanks, have opened an issue

Regarding

descending will remain as it pertains to the by element and is required to express the desired operation

Without consulting the docs, if I read

pl.col('a').top_k_by('b', k=2, descending=True)

then it looks like it means "take the elements of 'a' corresponding to the top 2 elements of 'b' when it's sorted in a descending manner". i.e. that it's the same as:

pl.col('a').sort_by('b', descending=True).head(2)

But, it's not. It does the opposite:

df = pl.DataFrame({'a': [1,2,3], 'b': [5,4,6]})
print(df.select(pl.col('a').top_k_by('b', k=2, descending=True)))
print(df.select(pl.col('a').sort_by('b', descending=True).head(2)))
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
└─────┘
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 1   │
└─────┘

It actually matches pl.col('a').sort_by('b', descending=False).head(2)!

In that sense, I (and others) have made the case the behaviour of "descending" is backwards. It's going to get harder to course-correct later.

As hard as it may be, I suggest biting the bullet and reversing the behaviour of descending in 1.0, and that this be made very clear in the upgrade guide

@stinodego
Copy link
Member

I believe the existing behavior is correct. In your head example you are sorting a together with b, which is why you end up with different results.

@MarcoGorelli
Copy link
Collaborator Author

Sorting 'a' by 'b' is the intention - I've revised the example to use pl.col('a').sort_by('b') so it's more explicit

  1. pl.col('a').top_k_by('b', k=2, descending=True)
  2. pl.col('a').sort_by('b', descending=True).head(2)
  3. pl.col('a').sort_by('b', descending=False).head(2)

Just reading these, would you expect 1. and 2. to match, or 1. and 3.? The current answer is that 1. and 3. match.

@stinodego
Copy link
Member

stinodego commented May 30, 2024

Ok, I see now what you mean. I think the parameter should be called reverse in this case... will discuss.

EDIT: We're keeping it this way. It's a bit confusing but there is no real good way to do this. Changing True <-> False isn't going to help here. You could just as well think about taking the top_k_by('b') as sorting by b and taking the tail. In that case, the name works fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
Archived in project
4 participants