-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partition_by should returns tuples as keys when partitioning on a list of a single column, like group_by does #13371
Comments
I think this should just be a matter of checking if |
Note that maybe it's intended to be like this, but at that point group_by should match the behaviour. The main issue is the inconsistency between the two |
I agree, I already raise this behavior here: #11569 |
I actually skimmed that one, but though it was about something else, sorry. In my personal opinion the better option here is to return a tuple in this case, but if its more consistent to return a scalar by considering edit: pandas |
IMHO, |
I was originally going to make I'll close this if/when the PR is accepted |
I agree with the OP here: treat cols = get_partition_columns() # returns list, we don't know the length
out = df.partition_by(columns, as_dict=True)
# now we have to deal with the fact that the list might have been length 1
if len(columns) == 1:
out = {(k,): v for (k,v) in out.items() |
@mcrumiller that's exactly my use case :) |
This behaviour seems maintained also when using selectors, at least according to the docs: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html last example This means that depending on the number of columns matched by the selector a tuple or a scalar is returned. |
I was going to ask if But experimenting with your latest selectors example, there seems to be an issue with >>> [k for k, _ in df.group_by("a", "b", maintain_order=True)]
[(1, 0), (1, 1), (2, 2), (2, 3)] >>> [k for k, _ in df.group_by(cs.all(), maintain_order=True)]
[1, 1, 2, 2] # <- Hmm? |
@cmdlineluser It's because of polars/py-polars/polars/dataframe/group_by.py Lines 107 to 111 in 7ff3bf7
My PR fixes that by counting the actual columns rather than inferring that when |
If Same for the groupby |
do you think adding a flag to force the use of tuples would make sense? |
I don't think we should add a parameter for this. If the current behavior is considered problematic, we should always use tuples to avoid the potential ambiguity. |
Agree. I don't think it's a lot nicer from a user perspective: the user can simply supply a tuple in if they want a tuple out. I don't necessarily consider the packaging of a dtype to necessary be a change in the dtype. In numpy, adding 1 to a scalar returns a scalar, adding 1 to a 2D array returns a 2D array, etc. In this example, we are partitioning by a scalar versus a list, and in the former case we return scalar keys, and in the latter list keys. |
I may have misread @stinodego initial comment. (this leaves out the use of a single selector like |
A single selector may match multiple columns so it has to be a tuple key. In that sense it's the same as a sequence input. |
yes.
This is a good point. The |
agreed the fact that a scalar selector (meaning a single one, not a list of them) may match more than a column complicate things, but this is likely another issue |
I recently received a warning:
However, I believe that maintaining the current state, where different types of 'by' return different keys, is quite beneficial. As mentioned:
This approach offers clear and intuitive semantics. Following the update in #13646 and modifying the code as per the warning, besides the previously discussed semantic concern, the new syntax seems somewhat verbose, requiring an additional pair of parentheses and a comma: - for k, g in df.group_by(....)
+ for (k, ), g in df.group_by(....) Often, I use Therefore, it would be greatly appreciated if this warning could be removed and the deprecation halted as is. Thank you for this wonderful tool, which I use daily. |
I don't think it does. It is clear if you read the docs, but you might try The fact that Regarding your use case: it might indeed be a bit more verbose to write that way. But depending on what you do with the group key, maybe just using the one-tuple is also fine and you may not need to extract it? Or use |
So, when the deprecation of the single key feature in
Unless it's the second scenario, users might be compelled to change a single
How will things end up? |
The |
The signature for option 3 is copied from the source code. If you mean to say that the signature will remain the same even after the deprecation is complete, the following would be equivalent in the future:
However, before the deprecation is fully implemented, users who use Regardless, I respect the decision of the development team. After all, having reached this point, any resolution might seem a bit peculiar. |
That's correct. Unfortunately, single string input will warn for now. That's the deal with deprecation warnings. |
I just got bitten by this warning
I actually didn't read the warning too much. What I thought was that the function signature has changed to only accept list, so I did that, and the code broke, because now the partition has become a tuple. I think that Also, I don't think making the partition key tuple only make sense. Group by in SQL or doing group by manually using dict in python always use the type of the key of the grouping type. It does not make sense that I have to type People that want uniform key type should also use uniform grouping key type, i.e. when grouping by a single column use a tuple with one element, when grouping by two columns use a tuple with two elements. So the responsibility is put on the user itself. The library itself should not force the result type of the key. |
@stinodego What do you think about making a different warning class so that it is easier to selectively ignore? I'm thinking about putting (something like this) in polars exceptions
and then changing the partition_by warning to that class. and then people can do
I'm struggling to get the regex right for the message parameter of filterwarnings. |
I don't think that's a good idea. People running into this warning should ideally address it rather than ignore it. If they really want to ignore it, they can just ignore all deprecation warnings. |
Is the 1.0 behavior really going to be singleton tuples as keys for If the underlying issue is that |
The argument to partition_by is ambiguous between column names, expr and selector. Maybe we could have a |
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Issue description
Passing a list of a single column to partition by should return keys as one element tuple, similarly to what iterating on a groupby does.
The current behaviour is inconsistent with how partition by behaves with more than one by column passed, making hard to generalize code that uses is if this takes the partition columns from the outside.
Expected behavior
When passed a list of columns to partition_by should always return keys as tuple, like how list(group_by) does
Edit
This bahaviours seems to happen also when using selectors in partition by, in particular depending on the number of column matched at runtime by the selector a scalar or a select is returned. This is not great too, but may be another issue. Mentioned here: #13371 (comment)
Installed versions
The text was updated successfully, but these errors were encountered: