Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 in usage being converted to symbols again #748

Closed
bwiernik opened this issue Jun 25, 2018 · 11 comments
Closed

UTF-8 in usage being converted to symbols again #748

bwiernik opened this issue Jun 25, 2018 · 11 comments
Labels
bug an unexpected problem or unintended behavior rd ✍️

Comments

@bwiernik
Copy link
Contributor

bwiernik commented Jun 25, 2018

Some UTF-8 characters, such as \u2212 and \u221e are being exported incorrectly in the usage section. On Windows, these are being converted to (- and 8, respectively). On Mac, they are being converted to their unicode symbols, rather than remaining as escaped sequences.

I thought this would have been fixed in #649, but apparently not.

Edit: Added details that it is failing in a different way on Mac.

Reprex:

#' Format numbers for presentation
#'
#' A function to format numbers and logical values as characters for display purposes.
#' Includes control over formatting of decimal digits, leading zeros, sign characters,
#' and characters to replace logical, NA, NaN, and Inf values. Factors are converted
#' to strings. Strings are returned verbatim.
#'
#' @encoding UTF-8
#'
#' @param x A vector, matrix, or data.frame of numbers to format
#' @param digits The number of decimal digits desired (used strictly; default: 2)
#' @param decimal.mark The character to use for the decimal point (defaults to locale default: \code{getOption("OutDec")})
#' @param leading0 How to print leading zeros on decimals. Can be logical to print (\code{TRUE}) or suppress (\code{FALSE}) leading zeros or a character string to subsitute for leading zeros. If \code{"conditional"} (default), leading zeros are shown if a column contains any absolute values greater than 1 and suppressed otherwise. If \code{"figure"}, leading zeros are replaced with a figure space (\href{https://unicode-table.com/en/2007/}{\code{U+2007}}) if a column contains any absolute values greater than 1 and suppressed otherwise.
#' @param drop0integer Logical. Should trailing decimal zeros be dropped for integers?
#' @param neg.sign Character to use as negative sign. Defaults to minus-sign (\href{https://unicode-table.com/en/2212/}{\code{U+2212}}).
#' @param pos.sign Character to use as positive sign. Set to \code{FALSE} to suppress. If \code{"figure"} (default), the positive sign is a figure-space (\href{https://unicode-table.com/en/2007/}{\code{U+2007}}) if a column contains any negative numbers and suppressed otherwise.
#' @param big.mark Character to mark between each \code{big.interval} digits \emph{before} the decimal point. Set to \code{FALSE} to suppress. Defaults to the SI/ISO 31-0 standard-recommened thin-spaces (\href{https://unicode-table.com/en/202f/}{\code{U+202F}}).
#' @param big.interval See \code{big.mark} above; defaults to 3.
#' @param small.mark Character to mark between each \code{small.interval} digits \emph{after} the decimal point. Set to \code{FALSE} to suppress. Defaults to the SI/ISO 31-0 standard-recommened thin-spaces (\href{https://unicode-table.com/en/202f/}{\code{U+202F}}).
#' @param small.interval See \code{small.mark} above; defaults to 3.
#' @param na.mark Character to replace \code{NA} and \code{NaN} values. Defaults to em-dash (\href{https://unicode-table.com/en/2014/}{\code{U+2014}}))
#' @param lgl.mark A length 2 vector containing characters to replace \code{TRUE} and \code{FALSE}. Defaults to c("+", "\href{https://unicode-table.com/en/2212/}{\code{U+2212}}").
#' @param inf.mark A length 2 vector containing characters to replace \code{Inf} and \code{-Inf}. Defaults to c("+\href{https://unicode-table.com/en/221e/}{\code{U+221e}}", "\href{https://unicode-table.com/en/2212/}{\code{U+2212}}\href{https://unicode-table.com/en/221e/}{\code{U+221e}}").
#'
#' @export
#' @examples
#' # format_num() converts numeric values to characters with the specified formatting options.
#' # By default, thousands digit groups are separated by thin spaces, negative signs are replaced
#' # with minus signs, and positive signs and leading zeros are replaced with figure spaces
#' # (which have the same width as numbers and minus signs). These options ensure that all
#' # results will align neatly in columns when tabled.
#' format_num(x = c(10000, 1000, 2.41, -1.20, 0.41, -0.20))
#'
#' # By default, format_num() uses your computer locale's default decimal mark as
#' # the decimal point. To force the usage of "." instead (e.g., for submission to
#' # a U.S. journal), set decimal.mark = ".":
#' format_num(x = .41, decimal.mark = ".")
#'
#' # By default, format_num() separates groups of large digits using thin spaces.
#' # This is following the international standard for scientific communication (SI/ISO 31-0),
#' # which advises against using "." or "," to seprate digits because doing so can lead
#' # to confusion for human and computer readers because "." and "," are also used
#' # as decimal marks in various countries. If you prefer to use commmas to separate
#' # large digit groups, set big.mark = ",":
#' format_num(x = 10000, big.mark = ",")
format_num <- function(x, digits = 2L, decimal.mark = getOption("OutDec"),
                       leading0 = "conditional", drop0integer = FALSE,
                       neg.sign = "minus", pos.sign = "figure",
                       big.mark = "thinspace", big.interval = 3L,
                       small.mark = "thinspace", small.interval = 3L,
                       na.mark = "\u2014", lgl.mark = c("+", "\u2212"),
                       inf.mark = c("+\u221e", "\u2212\u221e") ){

        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  {if(is.numeric(x)) abs(x - round(x)) < tol else FALSE}

        # Input checking
        if(length(lgl.mark) == 1) lgl.mark <- c(lgl.mark, lgl.mark)
        if(length(inf.mark) == 1) inf.mark <- c(inf.mark, inf.mark)

        if(is.null(dim(x))) {
                x_type <- "vector"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else if("tbl_df" %in% class(x)) {
                x_type <- "tibble"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else if("matrix" %in% class(x)) {
                x_type <- "matrix"
                x <- as.data.frame(x, stringsAsFactors = FALSE)
        } else x_type <- "other"

        # Classify inputs
        which_logical  <- purrr::modify(x, is.logical) %>% as.matrix()
        which_integers <- as.matrix(purrr::modify(x, is.wholenumber))
        which_integers[is.na(which_integers)] <- FALSE
        which_infinite <- purrr::modify(x, is.infinite) %>% as.matrix()
        which_numeric <-
                purrr::modify(x, ~ is.numeric(.x) & !is.na(.x)) &
                !which_infinite &
                !which_integers %>% as.matrix()

        if(neg.sign == "minus") neg.sign <- "\u2212"
        if(big.mark == "thinspace") big.mark <- "\u202F"
        if(small.mark == "thinspace") small.mark <- "\u202F"

        if(pos.sign == FALSE) flag <- "" else flag <- "+"

        # Initial formatting for each type of data
        out <- x
        out[is.na(out)] <- na.mark
        ### Convert characters, factors, date/times, and and other unsupported
        ### types to character
        out <- purrr::modify_if(out, ~ !is.numeric(.x) & !is.logical(.x) & !is.complex(.x), as.character)
        out[which_logical & x == TRUE] <- lgl.mark[1]
        out[which_logical & x == FALSE] <- lgl.mark[2]
        out[which_infinite & x == Inf] <- inf.mark[1]
        out[which_infinite & x == -Inf] <- inf.mark[2]

        out <- purrr::modify_if(out, is.complex,
                                ~ format(.x, trim = TRUE, digits = digits, nsmall = digits,
                                         scientific = FALSE, big.mark = big.mark,
                                         big.interval = big.interval, small.mark = small.mark,
                                         small.interval = small.interval,
                                         decimal.mark = decimal.mark, drop0trailing = FALSE))

        out[which_integers] <- as.integer(x[which_integers]) %>%
                formatC(digits = digits, format = "f", flag = flag,
                        decimal.mark = decimal.mark,
                        big.mark = big.mark, big.interval = big.interval,
                        small.mark = small.mark, small.interval = small.interval,
                        drop0trailing = drop0integer)

        out[which_numeric] <- as.double(x[which_numeric]) %>%
                formatC(digits = digits, format = "f", flag = flag,
                        decimal.mark = decimal.mark,
                        big.mark = big.mark, big.interval = big.interval,
                        small.mark = small.mark, small.interval = small.interval,
                        drop0trailing = FALSE)


        # Clean up unicode big.mark and small.mark
        out[] <- dplyr::mutate_all(out,
                                   funs(stringr::str_replace_all(.data$.,
                                                                 paste0("(",paste(rev(strsplit(sub(" ", big.mark, " "),"")[[1]]), collapse=""),"|",sub(" ", big.mark, " "),")"),
                                                                 big.mark)))
        out[] <- dplyr::mutate_all(out,
                                   funs(stringr::str_replace_all(.data$.,
                                                                 paste0("(",paste(rev(strsplit(sub(" ", small.mark, " "),"")[[1]]), collapse=""),"|",sub(" ", small.mark, " "),")"),
                                                                 small.mark)))

        # Clean up leading zeros
        switch(leading0,
               "TRUE" = {},
               "FALSE" = out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark))),
               conditional = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark)))
               },
               figure = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1\u2007", decimal.mark)))
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) abs(i) >= 1, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", decimal.mark)))
               },
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, paste0("^(\\+|-?)0", decimal.mark), paste0("\\1", leading0, decimal.mark)))}
        )

        # Clean up positive signs
        switch(pos.sign,
               "TRUE" = {},
               "FALSE" = {},
               figure = {
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & !any(if(is.numeric(i)) i < 0, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, "^\\+", ""))
                       out <- dplyr::mutate_if(out,
                                               sapply(x, function(i) {is.numeric(i) & any(if(is.numeric(i)) i < 0, na.rm = TRUE)}),
                                               function(i) stringr::str_replace_all(i, "^\\+", "\u2007"))
               },
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^\\+", pos.sign))}
        )

        # Clean up negative signs
        switch(neg.sign,
               "TRUE" = {},
               "FALSE" = {},
               "-" = {},
               # else =
               {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^-", neg.sign))}
        )

        if(x_type == "tibble") {
                out <- as_tibble(out, validate = FALSE)
        } else if(x_type == "vector") {
                out <- unlist(out)
        } else if(x_type == "matrix") {
                out <- as.matrix(out)
        }

        return(out)
}
@bwiernik bwiernik changed the title UTF-8 in usage being converted on Windows UTF-8 in usage being converted to symbols again Jun 25, 2018
@hadley

This comment has been minimized.

@hadley hadley closed this as completed Jun 28, 2018
@bwiernik

This comment has been minimized.

@bwiernik

This comment has been minimized.

@hadley

This comment has been minimized.

@hadley hadley reopened this Jun 28, 2018
@hadley hadley added the reprex needs a minimal reproducible example label Jul 24, 2018
@bwiernik
Copy link
Contributor Author

bwiernik commented Feb 2, 2019

Here is a minimal example. This function converts a numeric vector to character and replaces - with a specified character, defaulting to \u2007

#' Format numbers for presentation
#'
#' @encoding UTF-8
#'
#' @param neg.sign Character to use as negative sign. Defaults to minus-sign (\href{https://unicode-table.com/en/2212/}{\code{U+2212}}).
#' @export
#' @examples
#' format_num(x = c(10000, 1000, 2.41, -1.20, 0.41, -0.20))

format_num <- function(x, neg.sign = "\u2212") {
     x <- formatC(x, digits = 2)
     # Clean up negative signs
     switch(neg.sign,
            "TRUE" = {},
            "FALSE" = {},
            "-" = {},
            # else =
            {out[] <- purrr::map(out, ~ stringr::str_replace_all(.x, "^-", neg.sign))}
     )
     return(out)
}

When I document this in roxygen, the \u2007 in the default usage section is converted to a the minus sign character. On Windows, it is further corrupted as shown as <U+2007>. This produces the following warning from devtoools::check() on Windows:

Mismatches in argument default values:
      Name: 'neg.sign' Code: "-" Docs: "-"

When documenting, roxygen should render the text in usage directly as-is, without any interpretation.

@hadley

This comment has been minimized.

@bwiernik

This comment has been minimized.

@hadley
Copy link
Member

hadley commented Aug 22, 2019

Thanks, that's helpful. That allowed me to create this minimal reprex:

library(roxygen2)

roc_proc_text(rd_roclet(),  "
  #' Correct compilation with explicit usage section
  #' @usage format_num(neg.sign = '\\u2212')
  works <- function(neg.sign = '\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> format_num(neg.sign = '\u2212')
#> }

roc_proc_text(rd_roclet(), "
  #' Incorrect compilation if usage is determined from the function call
  #' roxygen2 renders the actual unicode characters in the documentation
  fails_1 <- function(neg.sign = '\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> fails_1(neg.sign = "−")
#> }

roc_proc_text(rd_roclet(), "
  #' Also incorrect compilation if usage is determined from the function call with double escapes
  #' roxygen2 renders both slashes in the documentation and the function behavior changes
  fails_2 <- function(neg.sign = '\\\\u2212') {}
")[[1]]$get_field("usage")
#> \usage{
#> fails_2(neg.sign = "\\\\u2212")
#> }

Created on 2019-08-22 by the reprex package (v0.3.0)

@hadley
Copy link
Member

hadley commented Aug 22, 2019

This is going to be hard (impossible?) to fix, because roxygen currently looks at the abstract syntax tree where that distinction is lost:

f1 <- quote(function(neg.sign = '\u2212') {})
f1
#> function(neg.sign = "−") {
#> }

Created on 2019-08-22 by the reprex package (v0.3.0)

@hadley hadley added bug an unexpected problem or unintended behavior rd ✍️ and removed reprex needs a minimal reproducible example labels Aug 22, 2019
@bwiernik
Copy link
Contributor Author

If it can’t be fixed, maybe a note about needing an explicit usage section when using unicode characters in function arguments should be added?

@hadley
Copy link
Member

hadley commented Sep 10, 2019

I can't recreate this R CMD check failure on any platform: https://dev.azure.com/r-lib/roxygen2/_build/results?buildId=194. My documentation at b741e0e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior rd ✍️
Projects
None yet
Development

No branches or pull requests

2 participants