Default dplyr_locale() to C
The C locale is reproducible across all R sessions and OSes, and is much faster, which makes it a good default.

It will also make the default of `arrange()` continue to align with what `group_by() + summarize()` returns, since that will also unconditionally use the C locale.
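
As a rough illustration of the C-locale ordering described above, the sketch below uses base R's radix sort, which sorts character vectors in the C locale regardless of the session locale. It does not call dplyr; it only stands in for the ordering that `arrange()` defaults to after this commit.

```r
# Base R only: method = "radix" sorts character vectors in the C locale,
# so the result does not depend on the system locale.
x <- c("a", "b", "C", "B", "c")

c_sorted <- sort(x, method = "radix")
print(c_sorted)
#> [1] "B" "C" "a" "b" "c"

# The C locale compares code points, and uppercase ASCII letters have
# smaller code points than lowercase ones.
print(utf8ToInt("B") < utf8ToInt("a"))
#> [1] TRUE
```

Because the comparison is by code point rather than by language-specific collation rules, the same input gives the same order on every platform.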
DavisVaughan committed Jun 10, 2022
1 parent 5f78c31 commit 0221c59
Showing 10 changed files with 165 additions and 134 deletions.
17 changes: 10 additions & 7 deletions NEWS.md
@@ -1,12 +1,15 @@
 # dplyr (development version)

-* `arrange()` now uses a faster algorithm for sorting character vectors,
-  which is heavily inspired by data.table's `forder()`. Additionally, the
-  default locale is now American English, which is a breaking change from
-  the previous behavior which utilized the system locale. The new `.locale`
-  argument can be used to adjust this. For a fuller explanation, refer to this
-  [tidyup](https://github.com/tidyverse/tidyups/blob/main/003-dplyr-radix-ordering.md)
-  which outlines and justifies this change (#4962).
+* `arrange()` now uses a faster algorithm for sorting character vectors, which
+  is heavily inspired by data.table's `forder()`. Additionally, the default
+  locale for sorting character vectors is now the C locale, which is a breaking
+  change from the previous behavior that utilized the system locale. The new
+  `.locale` argument can be used to adjust this to, say, the American English
+  locale, which is an optional feature that requires the stringi package. This
+  change improves reproducibility across R sessions and operating systems. For a
+  fuller explanation, refer to this
+  [tidyup](https://github.com/tidyverse/tidyups/blob/main/003-dplyr-radix-ordering.md)
+  which outlines and justifies this change (#4962).

 * `tbl_sum()` is no longer reexported from tibble (#6284).

21 changes: 12 additions & 9 deletions R/arrange.R
@@ -40,19 +40,22 @@
 #' grouped data frames only.
 #' @param .locale The locale to sort character vectors in.
 #'
-#' - Defaults to [dplyr_locale()], which uses the American English locale
-#' if the stringi package is installed and the global default locale
-#' has not been altered. See the help page for [dplyr_locale()] for the
+#' - Defaults to [dplyr_locale()], which uses the `"C"` locale unless this is
+#' explicitly overridden. See the help page for [dplyr_locale()] for the
 #' exact details.
 #'
-#' - If a single string is supplied, then this will be used as the sorting
-#' locale. For example, `"fr"` will sort with the French locale. This
-#' requires the stringi package. Use [stringi::stri_locale_list()] to
-#' generate a list of possible locale identifiers.
+#' - If a single string from [stringi::stri_locale_list()] is supplied, then
+#' this will be used as the locale to sort with. For example, `"en"` will
+#' sort with the American English locale. This requires the stringi package.
 #'
-#' - If `"C"` is supplied, then character vectors will be sorted in the C
-#' locale. This does not require stringi and is often much faster than
+#' - If `"C"` is supplied, then character vectors will always be sorted in the
+#' C locale. This does not require stringi and is often much faster than
 #' supplying a locale identifier.
+#'
+#' The C locale is not the same as English locales, such as `"en"`,
+#' particularly when it comes to case-sensitivity. This is explained in
+#' more detail in the help page of [dplyr_locale()] under the `Default locale`
+#' section.
 #' @family single table verbs
 #' @examples
 #' arrange(mtcars, cyl, disp)
88 changes: 41 additions & 47 deletions R/locale.R
@@ -7,21 +7,24 @@
 #'
 #' ## Default locale
 #'
-#' - If stringi >=1.5.3 is installed, the default locale is set to American
-#' English, represented by the locale identifier `"en"`.
+#' The default locale returned by `dplyr_locale()` is the C locale, identical
+#' to explicitly supplying `.locale = "C"`.
 #'
-#' - If stringi is not installed or is older than 1.5.3, the default locale
-#' falls back to the C locale, represented by `"C"`. When this occurs, a
-#' warning will be thrown encouraging you to either install stringi or
-#' replace usage of `dplyr_locale()` with `"C"` to explicitly force the C
-#' locale.
+#' The C locale is not exactly the same as English locales, such as `"en"`. The
+#' main difference is that the C locale groups the English alphabet by _case_,
+#' while most English locales group the alphabet by _letter_. For example,
+#' `c("a", "b", "C", "B", "c")` will sort as `c("B", "C", "a", "b", "c")` in the
+#' C locale, with all uppercase letters coming before lowercase letters, but
+#' will sort as `c("a", "b", "B", "c", "C")` in an English locale. This often
+#' makes little practical difference during data analysis, because both return
+#' identical results when case is consistent between observations.
 #'
 #' ## Global override
 #'
 #' To override the above default behavior, you can set the global option,
-#' `dplyr.locale`, to either `"C"` or a stringi locale identifier from
-#' [stringi::stri_locale_list()] to globally alter the default locale.
-#' Setting this option to anything other than `"C"` requires stringi >=1.5.3.
+#' `dplyr.locale`, to a stringi locale identifier from
+#' [stringi::stri_locale_list()] to globally alter the default locale. This
+#' requires stringi >=1.5.3.
 #'
 #' We generally recommend that you set the `.locale` argument of [arrange()]
 #' explicitly rather than overriding the global locale, if possible.
@@ -30,31 +33,51 @@
 #' scope through the use of [rlang::local_options()] or [rlang::with_options()].
 #' This can be useful when a package that you don't control calls `arrange()`
 #' internally.
+#'
+#' ## Reproducibility
+#'
+#' The C locale has the benefit of being completely reproducible across all
+#' supported R versions and operating systems with no extra effort.
+#'
+#' If you set `.locale` to an option from [stringi::stri_locale_list()], then
+#' stringi must be installed by anyone who wants to run your code. If you
+#' utilize this in a package, then stringi should be placed in `Imports`.
 #' @export
 #' @keywords internal
 #' @examplesIf dplyr:::has_minimum_stringi()
-#' # Default locale is American English
+#' # Default locale is C
 #' dplyr_locale()
 #'
-#' # This Danish letter is typically sorted after `z`
-#' df <- tibble(x = x <- c("o", "p", "\u00F8", "z"))
+#' df <- tibble(x = c("a", "b", "C", "B", "c"))
 #' df
 #'
-#' # The American English locale sorts it right after `o`
+#' # The C locale groups the English alphabet by case, placing uppercase letters
+#' # before lowercase letters. This is the default.
 #' arrange(df, x)
 #'
-#' # Explicitly override `.locale` to `"da"` for Danish ordering
-#' arrange(df, x, .locale = "da")
+#' # The American English locale groups the alphabet by letter.
+#' # Explicitly override `.locale` with `"en"` for this ordering.
+#' arrange(df, x, .locale = "en")
 #'
 #' # Or temporarily override the `dplyr.locale` global option, which is useful
 #' # if `arrange()` is called from a function you don't control
 #' col_sorter <- function(df) {
 #'   arrange(df, x)
 #' }
 #'
-#' rlang::with_options(dplyr.locale = "da", {
+#' rlang::with_options(dplyr.locale = "en", {
 #'   col_sorter(df)
 #' })
+#'
+#' # This Danish letter is expected to sort after `z`
+#' df <- tibble(x = c("o", "p", "\u00F8", "z"))
+#' df
+#'
+#' # The American English locale sorts it right after `o`
+#' arrange(df, x, .locale = "en")
+#'
+#' # Using `"da"` for Danish ordering gives the expected result
+#' arrange(df, x, .locale = "da")
 dplyr_locale <- function() {
   locale <- peek_option("dplyr.locale")

@@ -65,36 +88,7 @@ dplyr_locale <- function() {
     abort("If set, the global option `dplyr.locale` must be a string.")
   }

-  dplyr_locale_default()
-}
-
-dplyr_locale_default <- function(has_stringi = has_minimum_stringi()) {
-  if (has_stringi) {
-    "en"
-  } else {
-    warn_locale_fallback()
-    "C"
-  }
-}
-
-warn_locale_fallback <- function() {
-  header <- paste0(
-    "`dplyr_locale()` attempted to default to the American English locale (\"en\"), ",
-    "but the required package, stringi >=1.5.3, is not installed."
-  )
-
-  bullets <- c(
-    i = "Falling back to the C locale.",
-    i = paste0(
-      "Silence this warning by installing stringi or by ",
-      "explicitly replacing usage of `dplyr_locale()` with \"C\"."
-    )
-  )
-
-  warn(
-    message = c(header, bullets),
-    class = "dplyr_warn_locale_fallback"
-  )
+  "C"
 }

 has_minimum_stringi <- function() {
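
The option lookup that remains in `dplyr_locale()` after this change can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: `resolve_locale()` and `with_locale_option()` are hypothetical names, `getOption()` stands in for `peek_option()`, `stop()` stands in for `abort()`, and the portion of the real function that the diff does not show is not reproduced.

```r
# Hypothetical stand-in for the simplified dplyr_locale(): return the
# `dplyr.locale` option if it is set to a single string, else "C".
resolve_locale <- function() {
  locale <- getOption("dplyr.locale")

  if (is.null(locale)) {
    # No global override: unconditionally default to the C locale.
    return("C")
  }

  if (!is.character(locale) || length(locale) != 1L || is.na(locale)) {
    stop(
      "If set, the global option `dplyr.locale` must be a string.",
      call. = FALSE
    )
  }

  locale
}

# Hypothetical helper that sets the option only while `code` is evaluated,
# similar in spirit to rlang::with_options().
with_locale_option <- function(locale, code) {
  old <- options(dplyr.locale = locale)
  on.exit(options(old), add = TRUE)
  force(code)
}

options(dplyr.locale = NULL)                  # start from a clean state
resolve_locale()                              # "C"
with_locale_option("en", resolve_locale())    # "en", only inside the call
resolve_locale()                              # "C" again
```

Scoping the override this way keeps the global default at `"C"` for everything outside the wrapped call, which matches the recommendation in the help text to prefer a temporary override over a session-wide one.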
23 changes: 13 additions & 10 deletions man/arrange.Rd
23 changes: 13 additions & 10 deletions man/arrange_all.Rd
62 changes: 43 additions & 19 deletions man/dplyr_locale.Rd

The diffs for these generated files are not rendered.

2 changes: 1 addition & 1 deletion tests/testthat/_snaps/arrange.md
@@ -50,7 +50,7 @@
 Error in `arrange()`:
 ! If `.locale` is a character vector, it must be a single string.

----
+# arrange validates that `.locale` must be one from stringi

 Code
 arrange(df, .locale = "x")
12 changes: 0 additions & 12 deletions tests/testthat/_snaps/locale.md
@@ -6,15 +6,3 @@
 Error in `dplyr_locale()`:
 ! If set, the global option `dplyr.locale` must be a string.

-# `dplyr_locale()` falls back to the C locale with a warning if stringi is not available
-
-Code
-dplyr_locale_default(has_stringi = FALSE)
-Condition
-Warning:
-`dplyr_locale()` attempted to default to the American English locale ("en"), but the required package, stringi >=1.5.3, is not installed.
-i Falling back to the C locale.
-i Silence this warning by installing stringi or by explicitly replacing usage of `dplyr_locale()` with "C".
-Output
-[1] "C"
-