-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update group_by()
algorithm to utilize vec_locate_sorted_groups()
#6297
Update group_by()
algorithm to utilize vec_locate_sorted_groups()
#6297
Conversation
tests/testthat/test-grouped-df.r
Outdated
test_that("using the global option `dplyr.legacy_group_by_locale` forces the system locale", { | ||
on_mac <- identical(tolower(Sys.info()[["sysname"]]), "darwin") | ||
skip_if_not(on_mac, message = "Not on Mac. Unsure if we can use 'en_US' locale.") | ||
|
||
local_options(dplyr.legacy_group_by_locale = TRUE) | ||
withr::local_collate("en_US") | ||
|
||
df <- tibble(x = c("a", "A", "Z", "b")) | ||
result <- compute_groups(df, "x") | ||
expect_identical(result$x, c("a", "A", "b", "Z")) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemed like the best I could do to have a test that used a non-C locale so we could ensure the global option was working.
- On Windows it isn't called
"en_US"
except in R >=4.2.0 with the new UTF-8 support - On Linux there might not be extensive locale support
I kind of needed testthat::skip_if_not_on_os()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just skip on windows. I'm pretty sure it should be safe to assume that linux has a en_US
locale.
R/group-by.r
Outdated
#' | ||
#' Prior to dplyr 1.1.0, character vector grouping columns were ordered in the | ||
#' system locale. If you need to temporarily revert to this behavior, you can | ||
#' set the global option `dplyr.legacy_group_by_locale` to `TRUE`, but this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make a single option that brings back the legacy behaviour of arrange()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have these two global options
# Picked up in `dplyr_locale()` and used by `arrange()` as the default stringi locale
# (this one is already implemented)
dplyr.locale = NULL / string
# If `TRUE`:
# - `arrange()` would always use `vec_order_base()` and would ignore `.locale`
# - `group_by()` would use `vec_order_base()`
dplyr.legacy_locale = NULL / bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this pr to respect dplyr.legacy_locale
in group_by()
. arrange()
will take a little work so I'll do that in a follow up PR - but I do think this is a good idea, even if it leaves us with some extra technical debt to maintain in arrange()
tests/testthat/test-grouped-df.r
Outdated
test_that("using the global option `dplyr.legacy_group_by_locale` forces the system locale", { | ||
on_mac <- identical(tolower(Sys.info()[["sysname"]]), "darwin") | ||
skip_if_not(on_mac, message = "Not on Mac. Unsure if we can use 'en_US' locale.") | ||
|
||
local_options(dplyr.legacy_group_by_locale = TRUE) | ||
withr::local_collate("en_US") | ||
|
||
df <- tibble(x = c("a", "A", "Z", "b")) | ||
result <- compute_groups(df, "x") | ||
expect_identical(result$x, c("a", "A", "b", "Z")) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just skip on windows. I'm pretty sure it should be safe to assume that linux has a en_US
locale.
896ea2e
to
f0a86df
Compare
6b6258a
to
a71ee8d
Compare
Supersedes #6018, too out of date
Closes #4406
Linked tidyup, https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md
Main changes:
group_by()
now internally computes and orders groups withvec_locate_sorted_groups()
, which is often much faster than the previous approachgroup_by()
now internally sorts character vector group columns in the C locale rather than the system locale. This affects the ordering of the result of functions that use the group data, likesummarise()
andgroup_split()
. Note that this is consistent with the new default ofarrange()
, which also uses the C locale.A temporary boolean global option,
dplyr.legacy_group_by_locale
dplyr.legacy_locale
, has been added that allows users to revert back to respecting the system locale. This is mainly for users that are on a deadline and need it to "just work" like it did before. It is better to explicitly callarrange(.locale =)
aftersummarise()
instead. I tried to make it clear that this option will be removed in the future, and gave an example of what to do instead.arrange()
to respect this as well."8965bce6-c05d-498f-a9f3-edc8d8f49740"