mismatch of UTF-8 characters between code and docs usage section #1186

unDocUMeantIt · 2021-01-16T23:53:04Z

i've had some escaped UTF-8 characters in function arguments for several years. today, for the first time, the win-builder package check for R-devel complains about a mismatch between the function code and the usage section of the docs generated by roxygen2 (7.1.1-1cran1.2004.0 ubuntu package):

  Mismatches in argument default values:
    Name: 'heur.fix' Code: list(pre = c("�", "'"), suf = c("�", "'")) Docs: list(pre = c("’", "'"), suf = c("’", "'"))

in the function code, the respective characters are escaped as \u2019; in the .Rd files they appear unescaped. try:

library(roxygen2)
roc_proc_text(rd_roclet(), "
  #' Example
  #' @export
  foo <- function(x=\"\u2019\"){}"
)

shouldn't roxygen2 keep the function code as-is?

bwiernik · 2021-04-04T16:03:10Z

Related issue: #748

gaborcsardi · 2021-04-05T07:07:00Z

roxygen2 needs to parse the function definition, to find the arguments, default values, etc. Then it uses deparse() to create text again from the code. Unfortunately deparse() is not the inverse of parse() in R.

In a UTF-8 locale the escaped Unicode characters are not restored:

❯ deparse(parse(text = "function(x=\"\u2019\"){}"))
[1] "structure(expression(function(x = \"’\") {"
[2] "}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L, "
[3] "2L, 0L, 0L, 0L, 1L, 2L), srcfile = <environment>, class = \"srcref\"))"

and in a non-UTF-8 locale they are escaped differently:

❯ Sys.setlocale(locale = "C")
[1] "C/C/C/C/C/en_US.UTF-8"

> deparse(parse(text = "function(x=\"\u2019\"){}"))
[1] "structure(expression(function(x = \"<U+2019>\") {"
[2] "}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L, "
[3] "2L, 0L, 0L, 0L, 1L, 2L), srcfile = <environment>, class = \"srcref\"))"

So to fix this we'd need to change how roxygen2 parses/deparses the code. Maybe it is possible to restore the original code from the parse tree, but this is not trivial, and the parse tree also has bugs.
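
As a rough illustration of what "restoring the original code" could look like, here is a minimal sketch that reads a function's source reference instead of deparsing it. It assumes the code was parsed with keep.source = TRUE, and it only recovers the whole function text; extracting individual argument defaults from it is the non-trivial part and is not attempted here.

```r
# Parse with source references kept. The doubled backslash means the parsed
# code contains the literal escape sequence \u2019, as in a package source file.
code <- 'function(x = "\\u2019") {}'
f <- eval(parse(text = code, keep.source = TRUE))

# The srcref attribute points back at the text as it was written,
# so the escape sequence survives unchanged.
original <- as.character(attr(f, "srcref"))
print(original)
```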

A workaround to your problem is to supply a @usage tag explicitly.

hadley · 2021-04-16T12:40:37Z

We could maybe just re-escape all non-ASCII characters on the theory that you probably need to do that to appease R CMD check anyway?
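
A minimal sketch of what such re-escaping could look like in base R. The helper name escape_non_ascii is hypothetical and not part of roxygen2; it assumes UTF-8 convertible input without NA values and leaves ASCII characters untouched.

```r
# Hypothetical helper (not a roxygen2 function): replace every non-ASCII
# character in each string with a \uXXXX (or \UXXXXXXXX) escape sequence.
escape_non_ascii <- function(x) {
  escape_one <- function(s) {
    cps <- utf8ToInt(enc2utf8(s))             # code points of the string
    chars <- intToUtf8(cps, multiple = TRUE)  # one element per character
    esc <- ifelse(cps <= 0xFFFF,
                  sprintf("\\u%04x", cps),
                  sprintf("\\U%08x", cps))
    # Keep ASCII as-is, substitute the escape for everything else.
    paste(ifelse(cps < 128L, chars, esc), collapse = "")
  }
  vapply(x, escape_one, character(1), USE.NAMES = FALSE)
}

cat(escape_non_ascii("x = \"\u2019\""), "\n")
# prints: x = "\u2019"
```

Applied to deparsed usage text, this would turn the right single quotation mark back into \u2019 regardless of how it was written originally, so a character typed literally in the source would also end up escaped.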

gaborcsardi · 2021-04-16T12:50:35Z

It seems that if we parse from a file, then the escaped form is kept in the parse data:

cat("function(x=\"\\u2019\"){x}\n", file = tmp <- tempfile())
getParseData(parse(tmp, keep.source = TRUE))
#>    line1 col1 line2 col2 id parent          token terminal      text
#> 19     1    1     1   23 19      0           expr    FALSE          
#> 1      1    1     1    8  1     19       FUNCTION     TRUE  function
#> 2      1    9     1    9  2     19            '('     TRUE         (
#> 3      1   10     1   10  3     19 SYMBOL_FORMALS     TRUE         x
#> 4      1   11     1   11  4     19     EQ_FORMALS     TRUE         =
#> 5      1   12     1   19  5      7      STR_CONST     TRUE "\\u2019"
#> 7      1   12     1   19  7     19           expr    FALSE          
#> 6      1   20     1   20  6     19            ')'     TRUE         )
#> 16     1   21     1   23 16     19           expr    FALSE          
#> 10     1   21     1   21 10     16            '{'     TRUE         {
#> 11     1   22     1   22 11     13         SYMBOL     TRUE         x
#> 13     1   22     1   22 13     16           expr    FALSE          
#> 12     1   23     1   23 12     16            '}'     TRUE         }

Created on 2021-04-16 by the reprex package (v2.0.0)

hadley · 2021-04-16T13:18:41Z

WAT

gaborcsardi · 2021-04-16T13:25:09Z

Yeah, it is the same when parsing from text, of course. So the escaped form is in the parse data, but deparse() changes it to the Unicode characters.
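
The difference can be sketched side by side with parse(text = ...). This is only an illustration, not how roxygen2 works internally; the deparse() output depends on the locale, so none is shown.

```r
# The doubled backslash makes the parsed code contain the literal
# escape sequence \u2019, just as it would appear in a source file.
code <- 'function(x = "\\u2019") {}'
exprs <- parse(text = code, keep.source = TRUE)

# The parse data keeps each token exactly as it was written.
pd <- utils::getParseData(exprs)
str_tokens <- pd$text[pd$token == "STR_CONST"]
print(str_tokens)

# deparse() works from the parsed value, so the escaped form is lost here.
print(deparse(exprs[[1]]))
```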

hadley · 2022-07-10T22:53:40Z

It doesn't seem like there's much we can do about this, and it should naturally become less important as more Windows users switch to R 4.2.

hadley mentioned this issue Apr 16, 2021

Unicode Remapping Causes Function-Documentation Mismatch #1121

Closed

billdenney mentioned this issue Jul 9, 2021

avoid collision of µg and mg becoming mg in make_clean_names sfirke/janitor#448

Closed

hadley closed this as completed Jul 10, 2022
