UTF-8 in usage being converted to symbols again #748

bwiernik · 2018-06-25T01:19:00Z

Some UTF-8 characters, such as \u2212 and \u221e are being exported incorrectly in the usage section. On Windows, these are being converted to (- and 8, respectively). On Mac, they are being converted to their unicode symbols, rather than remaining as escaped sequences.

I thought this would have been fixed in #649, but apparently not.

Edit: Added details that it is failing in a different way on Mac.

Reprex:

#' Format numbers for presentation
#'
#' A function to format numbers and logical values as characters for display purposes.
#' Includes control over formatting of decimal digits, leading zeros, sign characters,
#' and characters to replace logical, NA, NaN, and Inf values. Factors are converted
#' to strings. Strings are returned verbatim.
#'
#' @encoding UTF-8
#'
#' @param x A vector, matrix, or data.frame of numbers to format
#' @param digits The number of decimal digits desired (used strictly; default: 2)
#' @param decimal.mark The character to use for the decimal point (defaults to locale default: \code{getOption("OutDec")})
#' @param leading0 How to print leading zeros on decimals. Can be logical to print (\code{TRUE}) or suppress (\code{FALSE}) leading zeros or a character string to subsitute for leading zeros. If \code{"conditional"} (default), leading zeros are shown if a column contains any absolute values greater than 1 and suppressed otherwise. If \code{"figure"}, leading zeros are replaced with a figure space (\href{https://unicode-table.com/en/2007/}{\code{U+2007}}) if a column contains any absolute values greater than 1 and suppressed otherwise.
#' @param drop0integer Logical. Should trailing decimal zeros be dropped for integers?
#' @param neg.sign Character to use as negative sign. Defaults to minus-sign (\href{https://unicode-table.com/en/2212/}{\code{U+2212}}).
#' @param pos.sign Character to use as positive sign. Set to \code{FALSE} to suppress. If \code{"figure"} (default), the positive sign is a figure-space (\href{https://unicode-table.com/en/2007/}{\code{U+2007}}) if a column contains any negative numbers and suppressed otherwise.
#' @param big.mark Character to mark between each \code{big.interval} digits \emph{before} the decimal point. Set to \code{FALSE} to suppress. Defaults to the SI/ISO 31-0 standard-recommened thin-spaces (\href{https://unicode-table.com/en/202f/}{\code{U+202F}}).
#' @param big.interval See \code{big.mark} above; defaults to 3.
#' @param small.mark Character to mark between each \code{small.interval} digits \emph{after} the decimal point. Set to \code{FALSE} to suppress. Defaults to the SI/ISO 31-0 standard-recommened thin-spaces (\href{https://unicode-table.com/en/202f/}{\code{U+202F}}).
#' @param small.interval See \code{small.mark} above; defaults to 3.
#' @param na.mark Character to replace \code{NA} and \code{NaN} values. Defaults to em-dash (\href{https://unicode-table.com/en/2014/}{\code{U+2014}}))
#' @param lgl.mark A length 2 vector containing characters to replace \code{TRUE} and \code{FALSE}. Defaults to c("+", "\href{https://unicode-table.com/en/2212/}{\code{U+2212}}").
#' @param inf.mark A length 2 vector containing characters to replace \code{Inf} and \code{-Inf}. Defaults to c("+\href{https://unicode-table.com/en/221e/}{\code{U+221e}}", "\href{https://unicode-table.com/en/2212/}{\code{U+2212}}\href{https://unicode-table.com/en/221e/}{\code{U+221e}}").
#'
#' @export
#' @examples
#' # format_num() converts numeric values to characters with the specified formatting options.
#' # By default, thousands digit groups are separated by thin spaces, negative signs are replaced
#' # with minus signs, and positive signs and leading zeros are replaced with figure spaces
#' # (which have the same width as numbers and minus signs). These options ensure that all
#' # results will align neatly in columns when tabled.
#' format_num(x = c(10000, 1000, 2.41, -1.20, 0.41, -0.20))
#'
#' # By default, format_num() uses your computer locale's default decimal mark as
#' # the decimal point. To force the usage of "." instead (e.g., for submission to
#' # a U.S. journal), set decimal.mark = ".":
#' format_num(x = .41, decimal.mark = ".")
#'
#' # By default, format_num() separates groups of large digits using thin spaces.
#' # This is following the international standard for scientific communication (SI/ISO 31-0),
#' # which advises against using "." or "," to seprate digits because doing so can lead
#' # to confusion for human and computer readers because "." and "," are also used
#' # as decimal marks in various countries. If you prefer to use commmas to separate
#' # large digit groups, set big.mark = ",":
#' format_num(x = 10000, big.mark = ",")
format_num <- function(x, digits = 2L, decimal.mark = getOption("OutDec"),
                       leading0 = "conditional", drop0integer = FALSE,
                       neg.sign = "minus", pos.sign = "figure",
                       big.mark = "thinspace", big.interval = 3L,
                       small.mark = "thinspace", small.interval = 3L,
                       na.mark = "\u2014", lgl.mark = c("+", "\u2212"),
                       inf.mark = c("+\u221e", "\u2212\u221e") ){

        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  {if(is.numeric(x)) abs(x - round(x)) < tol else FALSE}

        # Input checking
        if(length(lgl.mark) == 1) lgl.mark <- c(lgl.mark, lgl.mark)
        if(length(inf.mark) == 1) inf.mark <- c(inf.mark, inf.mark)

        if(is.null(dim(x))) {
                x_type <- "vector"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else if("tbl_df" %in% class(x)) {
                x_type <- "tibble"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else if("matrix" %in% class(x)) {
                x_type <- "matrix"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else x_type <- "other"

        # Classify inputs
        which_logical  <- purrr::modify(x, is.logical) %>% as.matrix()
        which_integers <- as.matrix(purrr::modify(x, is.wholenumber))
        which_integers[is.na(which_integers)] <- FALSE
        which_infinite <- purrr::modify(x, is.infinite) %>% as.matrix()
        which_numeric <-
                purrr::modify(x, ~ is.numeric(.x) & !is.na(.x)) &
                !which_infinite &
                !which_integers %>% as.matrix()

        if(neg.sign == "minus") neg.sign <- "\u2212"
        if(big.mark == "thinspace") big.mark <- "\u202F"
        if(small.mark == "thinspace") small.mark <- "\u202F"

        if(pos.sign == FALSE) flag <- "" else flag <- "+"

        # Initial formatting for each type of data
        out <- x
        out[is.na(out)] <- na.mark
        ### Convert characters, factors, date/times, and and other unsupported
        ### types to character
        out <- purrr::modify_if(out, ~ !is.numeric(.x) & !is.logical(.x) & !is.complex(.x), as.character)
        out[which_logical & x == TRUE] <- lgl.mark[1]
        out[which_logical & x == FALSE] <- lgl.mark[2]
        out[which_infinite & x == Inf] <- inf.mark[1]
        out[which_infinite & x == -Inf] <- inf.mark[2]

        out <- purrr::modify_if(out, is.complex,
                                ~ format(.x, trim = TRUE, digits = digits, nsmall = digits,
                                         scientific = FALSE, big.mark = big.mark,
                                         big.interval = big.interval, small.mark = small.mark,
                                         small.interval = small.interval,
                                         decimal.mark = decimal.mark, drop0trailing = FALSE))

        out[which_integers] <- as.integer(x[which_integers]) %>%
                formatC(digits = digits, format = "f", flag = flag,
                        decimal.mark = decimal.mark,
                        big.mark = big.mark, big.interval = big.interval,
                        small.mark = small.mark, small.interval = small.interval,
                        drop0trailing = drop0integer)

        out[which_numeric] <- as.double(x[which_numeric]) %>%
                formatC(digits = digits, format = "f", flag = flag,
                        decimal.mark = decimal.mark,
                        big.mark = big.mark, big.interval = big.interval,
                        small.mark = small.mark, small.interval = small.interval,
                        drop0trailing = FALSE)


        # Clean up unicode big.mark and small.mark
        out[] <- dplyr::mutate_all(out,
                                   funs(stringr::str_replace_all(.data$.,
                                                                 paste0("(",paste(rev(strsplit(sub(" ", big.mark, " "),"")[[1]]), collapse=""),"|",sub(" ", big.mark, " "),")"),
                                                                 big.mark)))
        out[] <- dplyr::mutate_all(out,
                                   funs(stringr::str_replace_all(.data$.,
                                                                 paste0("(",paste(rev(strsplit(sub(" ", small.mark, " "),"")[[1]]), collapse=""),"|",sub(" ", small.mark, " "),")"),
                                                                 small.mark)))

        # Clean up leading zeros
        switch(leading0,
               "TRUE" = {},
               "FALSE" = out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark))),
               conditional = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark)))
               },
               figure = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1\u2007", decimal.mark)))
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark)))
               },
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", leading0, decimal.mark)))}
        )

        # Clean up positive signs
        switch(pos.sign,
               "TRUE" = {},
               "FALSE" = {},
               figure = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) i < 0, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, "^\\+", ""))
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & any(if(is.numeric(i)) i < 0, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, "^\\+", "\u2007"))
               },
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^\\+", pos.sign))}
        )

        # Clean up negative signs
        switch(neg.sign,
               "TRUE" = {},
               "FALSE" = {},
               "-" = {},
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^-", neg.sign))}
        )

        if(x_type == "tibble") {
                out <- as_tibble(out, validate = FALSE)
        } else if(x_type == "vector") {
                out <- unlist(out)
        } else if(x_type == "matrix") {
                out <- as.matrix(out)
        }

        return(out)
}

The text was updated successfully, but these errors were encountered:

bwiernik · 2019-02-02T21:25:57Z

Here is a minimal example. This function converts a numeric vector to character and replaces - with a specified character, defaulting to \u2007

#' Format numbers for presentation
#'
#' @encoding UTF-8
#'
#' @param neg.sign Character to use as negative sign. Defaults to minus-sign (\href{https://unicode-table.com/en/2212/}{\code{U+2212}}).
#' @export
#' @examples
#' format_num(x = c(10000, 1000, 2.41, -1.20, 0.41, -0.20))

format_num <- function(x, neg.sign = "\u2212") {
     x <- formatC(x, digits = 2)
     # Clean up negative signs
     switch(neg.sign,
            "TRUE" = {},
            "FALSE" = {},
            "-" = {},
            # else =
            {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^-", neg.sign))}
     )
     return(out)
}

When I document this in roxygen, the \u2007 in the default usage section is converted to a the minus sign character. On Windows, it is further corrupted as shown as <U+2007>. This produces the following warning from devtoools::check() on Windows:

Mismatches in argument default values:
      Name: 'neg.sign' Code: "-" Docs: "-"

When documenting, roxygen should render the text in usage directly as-is, without any interpretation.

hadley · 2019-08-22T07:24:38Z

Thanks, that's helpful. That allowed me to create this minimal reprex:

library(roxygen2)

roc_proc_text(rd_roclet(),  "
  #' Correct compilation with explicit usage section
  #' @usage format_num(neg.sign = '\\u2212')
  works <- function(neg.sign = '\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> format_num(neg.sign = '\u2212')
#> }

roc_proc_text(rd_roclet(), "
  #' Incorrect compilation if usage is determined from the function call
  #' roxygen2 renders the actual unicode characters in the documentation
  fails_1 <- function(neg.sign = '\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> fails_1(neg.sign = "−")
#> }

roc_proc_text(rd_roclet(), "
  #' Also incorrect compilation if usage is determined from the function call with double escapes
  #' roxygen2 renders both slashes in the documentation and the function behavior changes
  fails_2 <- function(neg.sign = '\\\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> fails_2(neg.sign = "\\\\u2212")
#> }

^{Created on 2019-08-22 by the reprex package (v0.3.0)}

hadley · 2019-08-22T07:28:35Z

This is going to be hard (impossible?) to fix, because roxygen currently looks at the abstract syntax tree where that distinction is lost:

f1 <- quote(function(neg.sign = '\u2212') {})
f1
#> function(neg.sign = "−") {
#> }

^{Created on 2019-08-22 by the reprex package (v0.3.0)}

bwiernik · 2019-08-23T02:33:04Z

If it can’t be fixed, maybe a note about needing an explicit usage section when using unicode characters in function arguments should be added?

hadley · 2019-09-10T22:57:22Z

I can't recreate this R CMD check failure on any platform: https://dev.azure.com/r-lib/roxygen2/_build/results?buildId=194. My documentation at b741e0e

bwiernik changed the title ~~UTF-8 in usage being converted on Windows~~ UTF-8 in usage being converted to symbols again Jun 25, 2018

This comment has been minimized.

Sign in to view

hadley closed this as completed Jun 28, 2018

This comment has been minimized.

Sign in to view

hadley reopened this Jun 28, 2018

hadley added the reprex needs a minimal reproducible example label Jul 24, 2018

This comment has been minimized.

Sign in to view

hadley added bug an unexpected problem or unintended behavior rd ✍️ and removed reprex needs a minimal reproducible example labels Aug 22, 2019

pbreheny added a commit to pbreheny/breheny that referenced this issue Aug 28, 2019

plotl documentation fixed manually...see Roxygen bug r-lib/roxygen2#748

ff30089

hadley closed this as completed Sep 10, 2019

bwiernik mentioned this issue Apr 4, 2021

mismatch of UTF-8 charachters between code and docs usage section #1186

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 in usage being converted to symbols again #748

UTF-8 in usage being converted to symbols again #748

bwiernik commented Jun 25, 2018 •

edited by hadley

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

bwiernik commented Feb 2, 2019 •

edited by hadley

Loading

This comment has been minimized.

This comment has been minimized.

hadley commented Aug 22, 2019

hadley commented Aug 22, 2019

bwiernik commented Aug 23, 2019

hadley commented Sep 10, 2019

UTF-8 in usage being converted to symbols again #748

UTF-8 in usage being converted to symbols again #748

Comments

bwiernik commented Jun 25, 2018 • edited by hadley Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

bwiernik commented Feb 2, 2019 • edited by hadley Loading

This comment has been minimized.

This comment has been minimized.

hadley commented Aug 22, 2019

hadley commented Aug 22, 2019

bwiernik commented Aug 23, 2019

hadley commented Sep 10, 2019

bwiernik commented Jun 25, 2018 •

edited by hadley

Loading

bwiernik commented Feb 2, 2019 •

edited by hadley

Loading