partition_by should return tuples as keys when partitioning on a list of a single column, like group_by does #13371

Closed
2 tasks done
CaselIT opened this issue Jan 2, 2024 · 30 comments · Fixed by #13646
Assignees
Labels
accepted Ready for implementation bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@CaselIT
Contributor

CaselIT commented Jan 2, 2024

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1, 1, 2, 2], "b": range(4)})

keys = [k for k, _ in df.group_by("a", maintain_order=True)]
print(keys)
keys = [k for k, _ in df.group_by(["a"], maintain_order=True)]
print(keys)
keys = [k for k, _ in df.group_by(["a", "b"], maintain_order=True)]
print(keys)
keys = list(df.partition_by("a", maintain_order=True, as_dict=True).keys())
print(keys)
keys = list(df.partition_by(["a"], maintain_order=True, as_dict=True).keys())
print(keys)
keys = list(df.partition_by(["a", "b"], maintain_order=True, as_dict=True).keys())
print(keys)

Log output

[1, 2]
[(1,), (2,)]
[(1, 0), (1, 1), (2, 2), (2, 3)]
[1, 2]
[1, 2]
[(1, 0), (1, 1), (2, 2), (2, 3)]

Issue description

Passing a list containing a single column to partition_by should return keys as one-element tuples, similar to what iterating over a group_by does.

The current behaviour is inconsistent with how partition_by behaves when more than one by column is passed, making it hard to write generalized code that receives the partition columns from outside.

Expected behavior

When passed a list of columns, partition_by should always return keys as tuples, like list(group_by) does.
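As a plain-Python illustration (no Polars involved; expected_key is a hypothetical helper, not a Polars API), the requested rule is that the key shape mirrors the shape of the by argument:

```python
# Hypothetical helper sketching the requested convention: the key shape
# mirrors the *input* shape, not the number of columns matched.
def expected_key(by, row):
    """Build the group key for one row from the user-supplied `by` argument."""
    if isinstance(by, str):
        return row[by]                    # scalar input -> scalar key
    return tuple(row[c] for c in by)      # list input -> tuple key, even length 1

print(expected_key("a", {"a": 1, "b": 0}))    # -> 1
print(expected_key(["a"], {"a": 1, "b": 0}))  # -> (1,)
```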

Edit

This behaviour also seems to occur when using selectors in partition_by: depending on the number of columns matched at runtime by the selector, a scalar or a tuple is returned. This is not great either, but may be a separate issue. Mentioned here: #13371 (comment)

Installed versions

--------Version info---------
Polars:               0.20.2
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.8.0
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              14.0.1
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.1.9
@CaselIT CaselIT added bug Something isn't working python Related to Python Polars labels Jan 2, 2024
@deanm0000 deanm0000 self-assigned this Jan 2, 2024
@deanm0000
Collaborator

I think this should just be a matter of checking whether by is a list or tuple, and then in the len(by) == 1 case returning (p[by[0]][0],).

@CaselIT
Contributor Author

CaselIT commented Jan 2, 2024

Note that maybe it's intended to be like this, but in that case group_by should match this behaviour. The main issue is the inconsistency between the two.

@gab23r
Contributor

gab23r commented Jan 2, 2024

I agree; I already raised this behaviour here: #11569

@CaselIT
Contributor Author

CaselIT commented Jan 2, 2024

I actually skimmed that one, but thought it was about something else, sorry.

In my opinion the better option here is to return a tuple in this case, but if it's more consistent to return a scalar by considering by='a' and by=['a'] equivalent, then I believe group_by should do so too. (It would break compatibility with pandas, which behaves like the current group_by does, treating by='a' and by=['a'] as different.)

edit: pandas df.groupby().groups, which returns a similarly keyed dict, behaves like partition_by, having scalar keys for both by='a' and by=['a'].

@gab23r
Contributor

gab23r commented Jan 2, 2024

IMHO, by='a' and by=['a'] are not equivalent, and I would prefer the output types to be aligned with the input types rather than with the data itself. But this idea didn't reach consensus at the time.

@deanm0000
Collaborator

deanm0000 commented Jan 2, 2024

I was originally going to make partition_by act like the existing group_by, but from the linked issue it seems that the consensus is to treat ["a"] as "a", so I made the change to the GroupBy class to check the group count directly instead of basing it on input types.

I'll close this if/when the PR is accepted

@mcrumiller
Contributor

mcrumiller commented Jan 3, 2024

I agree with the OP here: treat "a" as a scalar and ["a"] as a tuple. Otherwise this requires people to add extra conditional layers to their code, something like:

cols = get_partition_columns()  # returns a list; we don't know its length
out = df.partition_by(cols, as_dict=True)

# now we have to deal with the fact that the list might have length 1
if len(cols) == 1:
    out = {(k,): v for k, v in out.items()}
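As a stopgap, a small helper (hypothetical, not part of Polars) can normalize the dict returned by partition_by(..., as_dict=True) so downstream code always sees tuple keys:

```python
def normalize_partition_keys(partitions):
    """Wrap scalar keys into 1-tuples so callers can always unpack tuples.

    Workaround sketch: `partitions` is assumed to be the dict returned by
    partition_by(..., as_dict=True), whose keys today may be scalars or tuples.
    """
    return {k if isinstance(k, tuple) else (k,): v for k, v in partitions.items()}

# With a single by column, current Polars yields scalar keys like {1: ..., 2: ...}:
print(normalize_partition_keys({1: "df1", 2: "df2"}))  # -> {(1,): 'df1', (2,): 'df2'}
```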

@CaselIT
Contributor Author

CaselIT commented Jan 3, 2024

@mcrumiller that's exactly my use case :)

@CaselIT
Contributor Author

CaselIT commented Jan 8, 2024

This behaviour seems to hold also when using selectors, at least according to the docs: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html (see the last example, df.partition_by(cs.string(), as_dict=True)).

This means that depending on the number of columns matched by the selector, a tuple or a scalar is returned.
I think this is very poor API behaviour, but it may be a different issue.

@cmdlineluser
Contributor

I was going to ask whether .partition_by could just be implemented in terms of .group_by so that they behave the same.

But experimenting with your latest selectors example, there seems to be an issue with .group_by as well:

>>> [k for k, _ in df.group_by("a", "b", maintain_order=True)]
[(1, 0), (1, 1), (2, 2), (2, 3)]
>>> [k for k, _ in df.group_by(cs.all(), maintain_order=True)]
[1, 1, 2, 2] # <- Hmm?

@deanm0000
Collaborator

deanm0000 commented Jan 8, 2024

@cmdlineluser It's because of

self._group_names: Iterator[object] | Iterator[tuple[object, ...]]
if isinstance(self.by, (str, pl.Expr)) and not self.more_by:
    self._group_names = iter(group_names.to_series())
else:
    self._group_names = group_names.iter_rows()

My PR fixes that by counting the actual columns rather than inferring that when by is a str or Expr there is only one result column. The good news is that the issue seems to exist only in __iter__.

@stinodego
Member

stinodego commented Jan 11, 2024

If by is a single string or expression, the tuples should have a non-tuple key. In all other instances, it should be a tuple.

Same for the group_by __iter__.
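The rule proposed above can be sketched in plain Python (is_expr is a stand-in for isinstance(by, pl.Expr), so Polars itself is not needed here):

```python
# Sketch of the proposed key-shape rule: scalar keys only for a single
# string or expression; tuple keys in every other case.
def scalar_key(by, more_by=(), *, is_expr=lambda obj: False):
    """True if group keys should be scalars under the proposed rule."""
    return (isinstance(by, str) or bool(is_expr(by))) and not more_by

print(scalar_key("a"))            # -> True:  group_by("a") yields scalar keys
print(scalar_key(["a"]))          # -> False: group_by(["a"]) yields 1-tuple keys
print(scalar_key("a", ("b",)))    # -> False: multiple keys yield tuples
```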

@stinodego stinodego assigned stinodego and unassigned deanm0000 Jan 11, 2024
@CaselIT
Contributor Author

CaselIT commented Jan 11, 2024

If by is a single string or expression, the tuples should have a non-tuple key. In all other instances, it should be a tuple.

Do you think adding a flag to force the use of tuples would make sense?
From a user's perspective, an API that has a stable return type is a lot nicer.

@stinodego
Member

stinodego commented Jan 11, 2024

I don't think we should add a parameter for this. If the current behavior is considered problematic, we should always use tuples to avoid the potential ambiguity.

@mcrumiller
Contributor

I don't think we should add a parameter for this. If the current behavior is considered problematic, we should always use tuples to avoid the potential ambiguity.

Agree. I don't think it's a lot nicer from a user perspective: the user can simply supply a list if they want a tuple out. I don't necessarily consider the packaging of a dtype to be a change in the dtype. In NumPy, adding 1 to a scalar returns a scalar, adding 1 to a 2D array returns a 2D array, etc. In this example, we are partitioning by a scalar versus a list; in the former case we return scalar keys, and in the latter tuple keys.

@CaselIT
Contributor Author

CaselIT commented Jan 11, 2024

I may have misread @stinodego's initial comment.
Is the proposal that a single string returns a scalar while a list/tuple of a single element returns a one-element tuple? If so, that works for me (it was the original issue); sorry if I misunderstood.

(This leaves out the use of a single selector like cs.strings() that may match one or more columns depending on the dataframe, but I would selfishly be fine with ignoring this case since it's not my use case now :))

@stinodego
Member

(this leaves out the use of a single selector like cs.strings() that may match one or more columns depending on the dataframe, but I would selfishly be fine with ignoring this case since it's not my use case now :))

A single selector may match multiple columns, so it has to be a tuple key. In that sense it's the same as a sequence input.

@stinodego stinodego added the accepted Ready for implementation label Jan 11, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 11, 2024
@stinodego stinodego added the P-low Priority: low label Jan 11, 2024
@mcrumiller
Contributor

mcrumiller commented Jan 11, 2024

Is the proposal that single string returns a scalar while a list/tuple of a single element returns a tuple of one element?

yes.

this leaves out the use of a single selector like cs.strings() that may match one or more columns

This is a good point. The cs module is entirely on the Python side, so I think we could detect when selectors are used and always return a tuple in those cases, although it is possible that some selectors are guaranteed to return single columns. I am not sure whether we should try to "push down" the scalar-vs-tuple logic into selectors, as that seems non-obvious.

@CaselIT
Contributor Author

CaselIT commented Jan 11, 2024

I am not sure whether we should try to "push down" the scalar-vs-column logic into selectors, as that seems non-obvious.

Agreed; the fact that a single selector (meaning one selector, not a list of them) may match more than one column complicates things, but this is likely another issue.

@myuanz

myuanz commented Feb 5, 2024

I recently received a warning:

`group_by` iteration will change to always return group identifiers as tuples. Pass `by` as a list to silence this warning.

However, I believe that maintaining the current state, where different types of by return different key types, is quite beneficial. As mentioned:

a single string returns a scalar while a list/tuple of a single element returns a tuple of one element.

This approach offers clear and intuitive semantics.

Following the update in #13646 and modifying the code as per the warning, besides the previously discussed semantic concern, the new syntax seems somewhat verbose, requiring an additional pair of parentheses and a comma:

- for k, g in df.group_by(....)
+ for (k, ), g in df.group_by(....)

Often, I use for ... in df.group_by(...) instead of df.group_by().agg() merely for debugging purposes, and revert to the agg syntax once debugging is complete.

Therefore, it would be greatly appreciated if this warning could be removed and the deprecation halted.

Thank you for this wonderful tool, which I use daily.

@stinodego
Member

This approach offers clear and intuitive semantics.

I don't think it does. It is clear if you read the docs, but you might try group_by(pl.col("a")) and find that you get a tuple instead of a non-tuple key. Or you might try pl.first() and again find that while grouping by a single column, you get a tuple result.

The fact that group_by("a") behaves differently than group_by(col("a")) is the opposite of clear and intuitive. We've had multiple issues advocating for various behaviors. If we always return tuples, there is no more confusion.

Regarding your use case: it might indeed be a bit more verbose to write it that way. But depending on what you do with the group key, maybe just using the one-tuple is fine and you may not need to extract it; or use k[0] wherever you do use it. The benefit is that if you decide to change the group_by call, the type of k remains the same.
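Both consumption patterns mentioned above can be shown with a plain dict standing in for the group_by iterator:

```python
# A plain dict stands in for iterating over a group_by result with tuple keys.
groups = {(1,): "first group", (2,): "second group"}

# Pattern 1: unpack the one-tuple directly in the loop head.
for (k,), g in groups.items():
    print(k, g)

# Pattern 2: keep the tuple and index into it only where the value is needed.
for key, g in groups.items():
    print(key[0], g)
```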

@myuanz

myuanz commented Feb 5, 2024

So, when the deprecation of the single key feature in group_by eventually takes place, what will the parameter of group_by be? The options are:

  1. *by: IntoExpr
  2. by: Iterable[IntoExpr]
  3. by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr

Unless it's the second scenario, users might be compelled to change a single group_by parameter to (key,) to eliminate warnings. However, if we look at the parameters of functions like with_columns, select, agg, and others, they all follow the first scenario. Therefore, no matter which signature is adopted after the deprecation, some users will be hurt:

  1. If options 1 or 3 are chosen, some users will see warnings until the deprecation is complete.
  2. If option 2 is chosen, it will be inconsistent with the parameter conventions of the existing API, causing potential confusion.

How will things end up?

@stinodego
Member

The group_by function signature has been updated with 0.20.7. You can check out the docs.

@myuanz

myuanz commented Feb 5, 2024

The signature for option 3 is copied from the source code. If you mean that the signature will remain the same even after the deprecation is complete, the following would be equivalent in the future:

df.group_by(["a"])
df.group_by("a")

However, before the deprecation is fully implemented, users who use df.group_by("a") and prefer to receive tuple returns would continually face warnings.

Regardless, I respect the decision of the development team. After all, having reached this point, any resolution might seem a bit peculiar.

@stinodego
Member

the following would be equivalent in the future

That's correct.

Unfortunately, single string input will warn for now. That's the deal with deprecation warnings.

@char101

char101 commented Feb 7, 2024

I just got bitten by this warning

DeprecationWarning: `partition_by(..., as_dict=True)` will change to always return tuples as dictionary keys. Pass `by` as a list to silence this warning, e.g. `partition_by(['key'], as_dict=True)`.

I actually didn't read the warning very closely. What I thought was that the function signature had changed to only accept a list, so I did that, and the code broke, because now the partition key has become a tuple. I think the "pass by as a list" notice should either be removed, or be replaced with advice to change your code to process the key as a tuple.

Also, I don't think making the partition key tuple-only makes sense.

A group by in SQL, or a group by done manually using a dict in Python, always uses the type of the grouping key. It does not make sense that I would have to write SELECT (age,), COUNT(1) FROM table GROUP BY (age,) when grouping by a single key.

People who want a uniform key type should also use a uniform grouping key type, i.e. when grouping by a single column use a tuple with one element, and when grouping by two columns use a tuple with two elements. That way the responsibility is put on the user.

The library itself should not force the result type of the key.

@deanm0000
Collaborator

deanm0000 commented Feb 7, 2024

@stinodego What do you think about making a different warning class so that it is easier to selectively ignore?

I'm thinking about putting (something like this) in polars exceptions

class PartitionByDeprecationWarning(DeprecationWarning):
    """Warning issued when partition_by used with as_dict and single key"""

and then changing the partition_by warning to that class.

and then people can do

warnings.filterwarnings("ignore", category=PartitionByDeprecationWarning)

I'm struggling to get the regex right for the message parameter of filterwarnings.
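A self-contained sketch of this idea (the class name follows the proposal above; nothing here is current Polars API). The subclass filter suppresses only this warning while other DeprecationWarnings still surface, which sidesteps the message-regex problem:

```python
import warnings

class PartitionByDeprecationWarning(DeprecationWarning):
    """Warning issued when partition_by is used with as_dict and a single key."""

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    # Ignore only the subclass; plain DeprecationWarnings are still recorded.
    warnings.filterwarnings("ignore", category=PartitionByDeprecationWarning)
    warnings.warn("single-key partition_by", PartitionByDeprecationWarning)
    warnings.warn("some other deprecation", DeprecationWarning)

print(len(caught))  # -> 1: only the plain DeprecationWarning got through

# The message-regex alternative needs escaped parentheses, e.g.:
# warnings.filterwarnings("ignore", message=r"`partition_by\(")
```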

@stinodego
Member

stinodego commented Feb 7, 2024

I don't think that's a good idea. People running into this warning should ideally address it rather than ignore it. If they really want to ignore it, they can just ignore all deprecation warnings.

@s-banach
Contributor

s-banach commented Feb 20, 2024

Is the 1.0 behavior really going to be singleton tuples as keys for df.partition_by("a", as_dict=True)?
It seems like the only reasoning is that some users may write pl.col("a") instead of "a".

If the underlying issue is that pl.col("a") is expected to behave identically to "a", you should address the underlying issue by making polars translate pl.col("a") to "a" under the hood. Changing the behavior of group_by and partition_by is just treating a symptom.
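That translation could be sketched as follows (pure Python; Col is a stand-in for a bare pl.col(...) expression, and real Polars would need expression metadata to detect a plain column reference):

```python
# Hypothetical sketch of "translate pl.col('a') to 'a' under the hood".
class Col:
    """Stand-in for a bare column expression like pl.col("a")."""
    def __init__(self, name):
        self.name = name

def normalize_by(by):
    """Rewrite a bare column expression to its string name; pass others through."""
    if isinstance(by, Col):
        return by.name
    return by

print(normalize_by(Col("a")))  # -> a
print(normalize_by("a"))       # -> a
```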

@char101

char101 commented Feb 20, 2024

The argument to partition_by is ambiguous between column names, exprs, and selectors.

Maybe we could have a partition_by_col function which only accepts column names as strings. If two or more column names are given, then the dictionary is nested.

9 participants