Update `group_by()` algorithm to utilize `vec_locate_sorted_groups()` #6297

DavisVaughan · 2022-06-14T21:00:17Z

Supersedes #6018, which is too out of date
Closes #4406

Linked tidyup: https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md

Main changes:

- `group_by()` now internally computes and orders groups with `vec_locate_sorted_groups()`, which is often much faster than the previous approach.
- `group_by()` now internally sorts character vector group columns in the C locale rather than the system locale. This affects the ordering of results from functions that use the group data, like `summarise()` and `group_split()`. This is consistent with the new default of `arrange()`, which also uses the C locale.
- A temporary boolean global option, ~~`dplyr.legacy_group_by_locale`~~ `dplyr.legacy_locale`, has been added that lets users revert to respecting the system locale. This is mainly for users who are on a deadline and need it to "just work" like it did before. It is better to explicitly call `arrange(.locale =)` after `summarise()` instead. I tried to make it clear that this option will be removed in the future, and gave an example of what to do instead.
  - We will update `arrange()` to respect this as well.
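To make the ordering change concrete, here is a minimal sketch of what a user might see and of the recommended alternative to the legacy option. It assumes dplyr 1.1.0 or later; the data frame and the `"en"` locale are illustrative choices, and a non-C `.locale` needs the stringi package to be installed.

```r
library(dplyr)

df <- tibble(x = c("a", "A", "Z", "b"), y = 1:4)

# Group keys are now sorted in the C locale, so uppercase letters
# come before lowercase ones regardless of the system locale.
by_c <- df %>%
  group_by(x) %>%
  summarise(y = sum(y))

by_c$x
# C locale order: "A" "Z" "a" "b"

# Preferred over setting `dplyr.legacy_locale`: reorder the result explicitly.
# The "en" locale is an illustrative choice and requires stringi.
if (requireNamespace("stringi", quietly = TRUE)) {
  by_en <- by_c %>% arrange(x, .locale = "en")
}
```

The explicit `arrange()` call keeps the locale choice visible in the code rather than hidden in a session-wide option.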


DavisVaughan · 2022-06-14T21:05:29Z

tests/testthat/test-grouped-df.r

+test_that("using the global option `dplyr.legacy_group_by_locale` forces the system locale", {
+  on_mac <- identical(tolower(Sys.info()[["sysname"]]), "darwin")
+  skip_if_not(on_mac, message = "Not on Mac. Unsure if we can use 'en_US' locale.")
+
+  local_options(dplyr.legacy_group_by_locale = TRUE)
+  withr::local_collate("en_US")
+
+  df <- tibble(x = c("a", "A", "Z", "b"))
+  result <- compute_groups(df, "x")
+  expect_identical(result$x, c("a", "A", "b", "Z"))
+})


This seemed like the best I could do to have a test that used a non-C locale so we could ensure the global option was working.

On Windows it isn't called "en_US" except in R >=4.2.0 with the new UTF-8 support

On Linux there might not be extensive locale support

I kind of needed a `testthat::skip_if_not_on_os()` here.

I think you can just skip on Windows. I'm pretty sure it should be safe to assume that Linux has an en_US locale.
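One way the suggestion to skip only on Windows could look is sketched below. This is not the test that was merged: `can_set_collate()` is a hypothetical helper added here to guard against machines where `"en_US"` is unavailable, the test goes through `group_keys()` rather than the internal `compute_groups()`, and it uses the final option name `dplyr.legacy_locale`. The expected ordering is the one asserted in the original test.

```r
library(testthat)

# Hypothetical helper: TRUE if LC_COLLATE can be set to `locale` here.
# Sys.setlocale() returns "" (with a warning) when the request fails.
can_set_collate <- function(locale) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  !identical(suppressWarnings(Sys.setlocale("LC_COLLATE", locale)), "")
}

test_that("`dplyr.legacy_locale` makes group_by() use the system locale", {
  skip_on_os("windows")
  skip_if_not_installed("dplyr")
  skip_if_not_installed("withr")
  skip_if_not(
    can_set_collate("en_US"),
    message = "Can't set the 'en_US' collation locale."
  )

  withr::local_options(dplyr.legacy_locale = TRUE)
  withr::local_collate("en_US")

  df <- dplyr::tibble(x = c("a", "A", "Z", "b"))
  keys <- dplyr::group_keys(dplyr::group_by(df, x))

  expect_identical(keys$x, c("a", "A", "b", "Z"))
})
```

Checking whether the locale can actually be set trades a little extra code for not having to assume anything about a given Linux image.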

R/group-by.r

hadley · 2022-06-28T15:20:17Z

R/group-by.r

+#'
+#' Prior to dplyr 1.1.0, character vector grouping columns were ordered in the
+#' system locale. If you need to temporarily revert to this behavior, you can
+#' set the global option `dplyr.legacy_group_by_locale` to `TRUE`, but this


Could we make a single option that brings back the legacy behaviour of arrange()?

We could have these two global options

    # Picked up in `dplyr_locale()` and used by `arrange()` as the default stringi locale
    # (this one is already implemented)
    dplyr.locale = NULL / string

    # If `TRUE`:
    # - `arrange()` would always use `vec_order_base()` and would ignore `.locale`
    # - `group_by()` would use `vec_order_base()`
    dplyr.legacy_locale = NULL / bool

Yeah, that sounds good to me.

I updated this PR to respect `dplyr.legacy_locale` in `group_by()`. `arrange()` will take a little work, so I'll do that in a follow-up PR. I do think this is a good idea, even if it leaves us with some extra technical debt to maintain in `arrange()`.
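As a rough illustration of how a boolean option like this can switch between the two orderings, here is a base R sketch. It is not dplyr's implementation: `use_legacy_locale()` and `order_chr()` are hypothetical names, base `order()` stands in for `vec_order_base()`, and `order(method = "radix")` stands in for the C-locale path.

```r
# Null-default operator, defined here so the sketch is self-contained.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical helper: read and validate the global option.
use_legacy_locale <- function() {
  out <- getOption("dplyr.legacy_locale") %||% FALSE
  if (!is.logical(out) || length(out) != 1L || is.na(out)) {
    stop(
      "Global option `dplyr.legacy_locale` must be a single `TRUE` or `FALSE`.",
      call. = FALSE
    )
  }
  out
}

# Hypothetical dispatcher for ordering a character vector.
order_chr <- function(x) {
  if (use_legacy_locale()) {
    order(x)                    # respects the system locale (LC_COLLATE)
  } else {
    order(x, method = "radix")  # always uses the C locale for characters
  }
}

x <- c("a", "A", "Z", "b")
```

With the option unset, `x[order_chr(x)]` gives the C locale order `"A" "Z" "a" "b"`; with `options(dplyr.legacy_locale = TRUE)` the result depends on the system's collation locale.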

R/grouped-df.r

hadley · 2022-06-28T15:22:12Z


DavisVaughan commented Jun 14, 2022

DavisVaughan requested a review from hadley June 14, 2022 21:16

hadley reviewed Jun 28, 2022

DavisVaughan force-pushed the feature/group-by-vec-locate-sorted-groups branch from 896ea2e to f0a86df Compare July 5, 2022 18:52

markfairbanks mentioned this pull request Jul 6, 2022

S3 method overloading on namespace load tidyverse/dtplyr#312

Closed

DavisVaughan mentioned this pull request Jul 11, 2022

dplyr 1.1.0 revdep tracker #6262

Closed

DavisVaughan added 7 commits July 11, 2022 12:44

Use vec_locate_sorted_groups() in group_by()

548a530

Update the group_by() documentation with a section on ordering

6870640

NEWS bullet

4e708f3

Don't mention that this is an implementation detail

7b1ef26

Simplify code with %||%

480587e

Refine existing locale checker to work with LC_COLLATE

390fa80

Switch to dplyr.legacy_locale

a71ee8d

DavisVaughan force-pushed the feature/group-by-vec-locate-sorted-groups branch from 6b6258a to a71ee8d Compare July 11, 2022 16:44

DavisVaughan merged commit 5e27ef1 into tidyverse:main Jul 11, 2022

DavisVaughan deleted the feature/group-by-vec-locate-sorted-groups branch July 11, 2022 17:20

DavisVaughan mentioned this pull request Jul 11, 2022

Bring back legacy arrange() behavior with dplyr.legacy_locale #6327

Merged

DavisVaughan mentioned this pull request Sep 28, 2022

Consider sorting instead of hashing for uniqueness r-lib/vctrs#736

Closed
