group_by should replace existing grouping #385

holgerbrandl · 2014-04-14T14:50:23Z

I would love to regroup my data using a chained dplyr command:

group_by(diamonds, color) %.% filter(mean(carat)>0.24) %.% group_by(cut) %.% filter(mean(depth)>60)

The result is:

Source: local data frame [53,940 x 10]
Groups: color, cut

   carat       cut color clarity depth table price    x    y    z
1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
...

The second group_by does not replace the first grouping; it is just added to it. It took me a while to realize that it needs to be group_by(cut, add=F).

Admittedly this is more of a usability preference, but especially when it comes to grouped operations it's easy to miss an incorrect grouping, so it would imho be clearer (especially for new users) to change the default of add to FALSE. Then the above code would behave as written.

In a chained operation users might think of adding add=F. However, when scripting unchained (line by line), many will tend to forget that they are working with a grouped table and will get unexpected results when applying another group_by to it.

This seems related to #121 which was tagged as fixed, but I can't see how the fix works.

hadley · 2014-04-14T15:07:50Z

I think you're probably right, and group_by() should default to add = FALSE. I'm a little worried about breaking existing code, but dplyr is still so young that this is probably ok.
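
For readers on a current release, here is a minimal sketch of the replace-by-default behaviour proposed here. It assumes dplyr 1.0.0 or later, where the chaining operator is `%>%` rather than `%.%` and the argument is spelled `.add` rather than `add`. The built-in `mtcars` data and the variable names are illustrative only and do not come from this thread.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0. A second group_by() replaces the earlier grouping.
replaced <- mtcars %>%
  group_by(cyl) %>%
  group_by(gear)
group_vars(replaced)

# Adding to the existing grouping has to be requested explicitly.
added <- mtcars %>%
  group_by(cyl) %>%
  group_by(gear, .add = TRUE)
group_vars(added)
```

Checking `group_vars()` after each regrouping step is a cheap way to catch the kind of silent grouping mistake described above.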

holgerbrandl · 2014-04-14T21:01:44Z

Thanks a lot for the fix.

Looking through the diff, it seems that you've made a minor mistake when fixing the documentation: Lines 30+31 in group-by.r should be "To instead add to the existing groups use \code{add = TRUE}".

statsandwich · 2014-04-15T20:20:50Z

I like this change!

Actually, I somehow was not aware of the 'add' argument, so had been wrapping in as.data.frame() as a hack for chaining.

hadley · 2014-04-15T20:22:48Z

@statsandwich there's also ungroup().
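
As a rough illustration of that suggestion, the sketch below drops the grouping with ungroup() before regrouping, which avoids relying on the default of the add argument. It assumes a dplyr version that provides `group_vars()` (0.7.0 or later); the `mtcars` data and the variable names are illustrative only.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# ungroup() removes all grouping, so the result carries no group variables.
flat <- ungroup(by_cyl)
group_vars(flat)

# Regrouping the ungrouped table yields only the new grouping.
regrouped <- group_by(flat, gear)
group_vars(regrouped)
```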

statsandwich · 2014-04-15T20:46:58Z

The reason I didn't use ungroup() was that I frankly did not understand what was going on with the Groups: when I was chaining operations. The behavior was unpredictable to me, so I sought the intermediary comfort of a non-dplyr-class object.

Only saying this as an aside to understand how a less-sophisticated programmer may think. Not dplyr's problem.

hadley added this to the v0.2 milestone Apr 14, 2014

hadley closed this as completed in f9bcca0 Apr 14, 2014

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
