Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update group_by() algorithm to utilize vec_locate_sorted_groups() #6297

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jun 14, 2022

Supersedes #6018, too out of date
Closes #4406

Linked tidyup, https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md

Main changes:

  • group_by() now internally computes and orders groups with vec_locate_sorted_groups(), which is often much faster than the previous approach

  • group_by() now internally sorts character vector group columns in the C locale rather than the system locale. This affects the ordering of the result of functions that use the group data, like summarise() and group_split(). Note that this is consistent with the new default of arrange(), which also uses the C locale.

  • A temporary boolean global option, dplyr.legacy_group_by_locale dplyr.legacy_locale, has been added that allows users to revert back to respecting the system locale. This is mainly for users that are on a deadline and need it to "just work" like it did before. It is better to explicitly call arrange(.locale =) after summarise() instead. I tried to make it clear that this option will be removed in the future, and gave an example of what to do instead.

    • We will update arrange() to respect this as well.

"8965bce6-c05d-498f-a9f3-edc8d8f49740"

Comment on lines 205 to 214
test_that("using the global option `dplyr.legacy_group_by_locale` forces the system locale", {
on_mac <- identical(tolower(Sys.info()[["sysname"]]), "darwin")
skip_if_not(on_mac, message = "Not on Mac. Unsure if we can use 'en_US' locale.")

local_options(dplyr.legacy_group_by_locale = TRUE)
withr::local_collate("en_US")

df <- tibble(x = c("a", "A", "Z", "b"))
result <- compute_groups(df, "x")
expect_identical(result$x, c("a", "A", "b", "Z"))
})
Copy link
Member Author

@DavisVaughan DavisVaughan Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed like the best I could do to have a test that used a non-C locale so we could ensure the global option was working.

  • On Windows it isn't called "en_US" except in R >=4.2.0 with the new UTF-8 support
  • On Linux there might not be extensive locale support

I kind of needed testthat::skip_if_not_on_os()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just skip on windows. I'm pretty sure it should be safe to assume that linux has a en_US locale.

@DavisVaughan DavisVaughan requested a review from hadley June 14, 2022 21:16
R/group-by.r Outdated Show resolved Hide resolved
R/group-by.r Outdated
#'
#' Prior to dplyr 1.1.0, character vector grouping columns were ordered in the
#' system locale. If you need to temporarily revert to this behavior, you can
#' set the global option `dplyr.legacy_group_by_locale` to `TRUE`, but this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make a single option that brings back the legacy behaviour of arrange()?

Copy link
Member Author

@DavisVaughan DavisVaughan Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have these two global options

# Picked up in `dplyr_locale()` and used by `arrange()` as the default stringi locale
# (this one is already implemented)
dplyr.locale = NULL / string

# If `TRUE`:
# - `arrange()` would always use `vec_order_base()` and would ignore `.locale`
# - `group_by()` would use `vec_order_base()`
dplyr.legacy_locale = NULL / bool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds good to me.

Copy link
Member Author

@DavisVaughan DavisVaughan Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this pr to respect dplyr.legacy_locale in group_by(). arrange() will take a little work so I'll do that in a follow up PR - but I do think this is a good idea, even if it leaves us with some extra technical debt to maintain in arrange()

R/grouped-df.r Outdated Show resolved Hide resolved
Comment on lines 205 to 214
test_that("using the global option `dplyr.legacy_group_by_locale` forces the system locale", {
on_mac <- identical(tolower(Sys.info()[["sysname"]]), "darwin")
skip_if_not(on_mac, message = "Not on Mac. Unsure if we can use 'en_US' locale.")

local_options(dplyr.legacy_group_by_locale = TRUE)
withr::local_collate("en_US")

df <- tibble(x = c("a", "A", "Z", "b"))
result <- compute_groups(df, "x")
expect_identical(result$x, c("a", "A", "b", "Z"))
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just skip on windows. I'm pretty sure it should be safe to assume that linux has a en_US locale.

@DavisVaughan DavisVaughan force-pushed the feature/group-by-vec-locate-sorted-groups branch from 6b6258a to a71ee8d Compare July 11, 2022 16:44
@DavisVaughan DavisVaughan merged commit 5e27ef1 into tidyverse:main Jul 11, 2022
@DavisVaughan DavisVaughan deleted the feature/group-by-vec-locate-sorted-groups branch July 11, 2022 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

group_by performance: potential for easy and substantial improvement
2 participants