mismatch of UTF-8 characters between code and docs usage section #1186

Closed
unDocUMeantIt opened this issue Jan 16, 2021 · 7 comments

@unDocUMeantIt
i've had some escaped UTF-8 characters in function arguments for several years. today, for the first time, the win-builder package check for R-devel complains about a mismatch between the function code and the usage section of the docs generated by roxygen2 (7.1.1-1cran1.2004.0 ubuntu package):

  Mismatches in argument default values:
    Name: 'heur.fix' Code: list(pre = c("�", "'"), suf = c("�", "'")) Docs: list(pre = c("’", "'"), suf = c("’", "'"))

in the function code, the respective characters are escaped as \u2019, in the .Rd files they appear unescaped, try:

library(roxygen2)
roc_proc_text(rd_roclet(), "
  #' Example
  #' @export
  foo <- function(x=\"\u2019\"){}"
)

shouldn't roxygen2 keep the function code as-is?

@bwiernik
Contributor

bwiernik commented Apr 4, 2021

Related issue: #748

@gaborcsardi
Member

roxygen2 needs to parse the function definition, to find the arguments, default values, etc. Then it uses deparse() to create text again from the code. Unfortunately deparse() is not the inverse of parse() in R.

In a UTF-8 locale the escaped Unicode characters are not restored:

❯ deparse(parse(text = "function(x=\"\u2019\"){}"))
[1] "structure(expression(function(x = \"’\") {"
[2] "}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L, "
[3] "2L, 0L, 0L, 0L, 1L, 2L), srcfile = <environment>, class = \"srcref\"))"

and in a non-UTF-8 locale they are escaped differently:

❯ Sys.setlocale(locale = "C")
[1] "C/C/C/C/C/en_US.UTF-8"

> deparse(parse(text = "function(x=\"\u2019\"){}"))
[1] "structure(expression(function(x = \"<U+2019>\") {"
[2] "}), srcfile = <environment>, wholeSrcref = structure(c(1L, 0L, "
[3] "2L, 0L, 0L, 0L, 1L, 2L), srcfile = <environment>, class = \"srcref\"))"

So to fix this we'd need to change how roxygen2 parses/deparses the code. Maybe it is possible to restore the original code from the parse tree, but this is not trivial, and the parse tree also has bugs.

A workaround to your problem is to supply a @usage tag explicitly.

@hadley
Member

hadley commented Apr 16, 2021

We could maybe just re-escape all non-ASCII characters on the theory that you probably need to do that to appease R CMD check anyway?
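
A minimal sketch of that re-escaping idea in base R. The helper name `escape_non_ascii()` is hypothetical and not part of roxygen2; it only illustrates turning every non-ASCII code point in a deparsed string back into a `\uXXXX` (or `\UXXXXXXXX`) escape, and it assumes the input is valid UTF-8 without `NA` values.

```r
# Hypothetical helper (not a roxygen2 function): replace every non-ASCII
# code point in each element of `x` with an R-style Unicode escape.
escape_non_ascii <- function(x) {
  vapply(x, function(s) {
    cps <- utf8ToInt(enc2utf8(s))               # code points of the string
    chars <- intToUtf8(cps, multiple = TRUE)    # one element per character
    out <- ifelse(
      cps > 0xFFFF,
      sprintf("\\U%08x", cps),                  # outside the BMP: \UXXXXXXXX
      ifelse(cps > 127L, sprintf("\\u%04x", cps), chars)
    )
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# The result holds a literal backslash followed by "u2019",
# so cat() shows the escaped source form.
cat(escape_non_ascii("x\u2019y"), "\n")
```

Whether applying such a step to generated usage would satisfy the check in every locale is not established in this thread.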

@gaborcsardi
Member

It seems that if we parse from a file, then the escaped form is kept in the parse data:

cat("function(x=\"\\u2019\"){x}\n", file = tmp <- tempfile())
getParseData(parse(tmp, keep.source = TRUE))
#>    line1 col1 line2 col2 id parent          token terminal      text
#> 19     1    1     1   23 19      0           expr    FALSE          
#> 1      1    1     1    8  1     19       FUNCTION     TRUE  function
#> 2      1    9     1    9  2     19            '('     TRUE         (
#> 3      1   10     1   10  3     19 SYMBOL_FORMALS     TRUE         x
#> 4      1   11     1   11  4     19     EQ_FORMALS     TRUE         =
#> 5      1   12     1   19  5      7      STR_CONST     TRUE "\\u2019"
#> 7      1   12     1   19  7     19           expr    FALSE          
#> 6      1   20     1   20  6     19            ')'     TRUE         )
#> 16     1   21     1   23 16     19           expr    FALSE          
#> 10     1   21     1   21 10     16            '{'     TRUE         {
#> 11     1   22     1   22 11     13         SYMBOL     TRUE         x
#> 13     1   22     1   22 13     16           expr    FALSE          
#> 12     1   23     1   23 12     16            '}'     TRUE         }

Created on 2021-04-16 by the reprex package (v2.0.0)

@hadley
Member

hadley commented Apr 16, 2021

WAT

@gaborcsardi
Member

Yeah, it is the same when parsing from text, of course... so the escaped form is in the parse data, but deparse() changes it to the Unicode characters.
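
The contrast can be seen side by side in a small sketch that parses from text with `keep.source = TRUE`. The variable names are illustrative only, and the deparsed form depends on the session locale, so no output is claimed for it here.

```r
# Source text containing the escape sequence \u2019 (backslash kept literally).
src <- 'function(x = "\\u2019") {}'
expr <- parse(text = src, keep.source = TRUE)

# Parse data keeps the literal exactly as it was written in the source.
pd <- utils::getParseData(expr)
literal <- pd$text[pd$token == "STR_CONST"]
cat("parse data:", literal, "\n")

# deparse() works on the parsed value, where the escape is already resolved
# to the character U+2019; how it is printed depends on the locale.
fn_call <- expr[[1]]                 # the unevaluated `function` call
deparsed <- deparse(fn_call[[2]]$x)  # default value of the formal `x`
cat("deparse():", deparsed, "\n")
```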

@hadley
Member

hadley commented Jul 10, 2022

It doesn't seem like there's much we can do about this, and it should naturally become less important as more Windows users switch to R 4.2.

@hadley hadley closed this as completed Jul 10, 2022