Problems with special characters #654

Closed
natgoodman opened this issue Aug 22, 2017 · 11 comments
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@natgoodman

Greetings

After encountering problems with a few special characters I undertook a comprehensive test to see what worked and what didn’t. My test process involved generating @param tags with descriptions containing special characters in 4 contexts: normal text, quoted text, normal code, and quoted code. For each context, I attempted three ways to get the character to work: naked, escaped, and double-escaped. By way of example, the test lines for special character ‘$’ (with apologies for the messed up formatting of the 'code' cases) are

  • #' @param param0003 text unescaped normal: $
  • #' @param param0021 text unescaped quoted: "$"
  • #' @param param0039 text escaped normal: \$
  • #' @param param0057 text escaped quoted: "\$"
  • #' @param param0075 text double normal: \\$
  • #' @param param0093 text double quoted: "\\$"
  • #' @param param0111 code unescaped normal: `$`
  • #' @param param0129 code unescaped quoted: `"$"`
  • #' @param param0147 code escaped normal: `\$`
  • #' @param param0165 code escaped quoted: `"\$"`
  • #' @param param0183 code double normal: `\\$`
  • #' @param param0201 code double quoted: `"\\$"`
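
The twelve variants follow a regular pattern (context × escaping × quoting), so they can be generated mechanically. Below is a minimal sketch in R; the script actually used was written in Perl, and the helper name `make_test_lines` and the sequential parameter numbers are hypothetical:

```r
# Hypothetical helper: build the 12 roxygen test lines for one special character.
make_test_lines <- function(ch) {
  # Prefix placed before the character for each escaping level.
  escapes <- c(unescaped = "", escaped = "\\", double = "\\\\")

  # expand.grid varies the first column fastest, giving the same order as
  # the list above: quoting, then escaping, then context.
  grid <- expand.grid(
    quoting = c("normal", "quoted"),
    escape  = names(escapes),
    context = c("text", "code"),
    stringsAsFactors = FALSE
  )

  body <- paste0(escapes[grid$escape], ch)
  body <- ifelse(grid$quoting == "quoted", paste0('"', body, '"'), body)
  body <- ifelse(grid$context == "code", paste0("`", body, "`"), body)

  sprintf(
    "#' @param param%04d %s %s %s: %s",
    seq_len(nrow(grid)), grid$context, grid$escape, grid$quoting, body
  )
}

writeLines(make_test_lines("$"))
```

Calling the helper once per character and concatenating the results gives a file that can be passed through devtools::document.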

I placed the test lines in a .R file (attached), converted the roxygen comments to Rd using devtools::document, and converted the Rd to HTML using tools::Rd2HTML. Every so often I also produced a PDF using R CMD Rd2pdf to be safe, and never saw a case where the HTML conversion worked but the PDF conversion failed.

The special characters I tested were & % $ # _ { } ~ ^ \ @ [ ] ( ) {} [] (). I included balanced pairs - {}, [], () - since balanced and unbalanced work differently in some contexts. These are the 10 LaTeX special characters, plus a few that I saw mentioned as special in roxygen or Rd, plus parentheses for good measure.

The table below shows what needs to be typed to get each special character rendered correctly, or 'NONE' if none of my attempts worked.

spcl  text-normal            text-quoted  code-normal  code-quoted
#     # (but see note 2)     "#"          `#`          `"#"`
$     $                      "$"          `$`          `"$"`
%     %                      "%"          NONE         NONE
&     &                      "&"          `&`          `"&"`
(     (                      "("          `(`          `"("`
()    ()                     "()"         `()`         `"()"`
)     )                      ")"          `)`          `")"`
@     @                      "@"          `@`          `"@"`
[     [                      "["          `[`          `"["`
[]    []                     "[]"         `[]`         `"[]"`
\     NONE (but see note 3)  "\"          `\`          NONE (but see note 3)
]     ]                      "]"          `]`          `"]"`
^     ^                      "^"          `^`          `"^"`
_     _                      "_"          `_`          `"_"`
{     NONE                   NONE         `{`          `"{"`
{}    \{\}                   "\{\}"       `{}`         `"{}"`
}     NONE                   NONE         `}`          NONE
~     ~                      "~"          `~`          `"~"`

Summary

  1. The following work without drama: $ & ( () ) @ [ [] ] ^ _ ~
  2. # works without drama except at start-of-text where (in a separate test) it triggers an error (“attempt to apply non-function”).
  3. The \ failures only occur when it is at the end of the text or code. It works without drama elsewhere.
  4. % works when escaped in text but, as reported by others, it doesn’t work in code.
  5. As reported by others, unbalanced braces don’t work in text. Open brace works in code when escaped. Close brace works in normal code when escaped but not in quoted code.
  6. Balanced braces have to be double escaped in text, but work naked in code.

Reasoning that the problems may be caused by Rd limitations, I redid the test generating Rd directly. The results were far better: everything worked except \ in quoted code. I’m happy to provide the Rd results should that be of interest.

It goes without saying I’m also happy to provide the Perl script I used to generate the test cases or modify the script to run additional test cases as you wish.

roxescape.R.zip

@natgoodman
Author

I reran with roxygen2 6.0.1.9000, which I hope is the latest dev version. A few more cases produced output, but there was no change to the final results. Attached are text files with detailed results from the two runs. The columns are pretty self-explanatory, I think, except perhaps these

  • render - what the output actually looked like in a web browser (it's the HTML with some obvious cleanup)
  • correct - does the output match what we want to see
  • okay - did the input generate an error somewhere along the way
  • annotation - my manual annotation of errors or incorrect rendering; I did this more carefully in the first (cran) run than the second (dev).

To see the differences that might affect the final results, I cut the first 9 columns and diff'ed them (attached).

A new bug has crept in: & is being translated to &amp; in code.

roxescape.cran.txt
roxescape.dev.txt
roxescape.dev-vs-cran.diff.txt

@hadley hadley added the bug an unexpected problem or unintended behavior label Oct 1, 2017
@hadley hadley added this to the v6.2.0 milestone Aug 22, 2019
@hadley
Member

hadley commented Sep 10, 2019

The problems with the HTML escapes for &, <, and > appear to be fixed:

roxygen2:::markdown("`a && b`")
#> [1] "`a && b`"
roxygen2:::markdown("`<>`")
#> [1] "`<>`"

Created on 2019-09-10 by the reprex package (v0.3.0)

@hadley
Member

hadley commented Sep 10, 2019

To unpack what's going on here, here are a few helpers to convert markdown to Rd (using roxygen2), and then convert Rd to text and html (using the tools package)

library(purrr)

roxy_md <- function(x) {
  roxygen2:::markdown(x)
}

parse_rd <- function(x) {
  con <- textConnection(x)
  on.exit(close(con), add = TRUE)
  
  tryCatch(
    tools::parse_Rd(con, fragment = TRUE, encoding = "UTF-8"),
    warning = function(cnd) NULL
  )
}

rd_text <- function(x) {
  x <- parse_rd(x)
  if (is.null(x)) return(NA_character_)
  
  path <- tempfile()
  tools::Rd2txt(x, path, fragment = TRUE)
  gsub("\n$", "", readChar(path, 100))
}

rd_html <- function(x) {
  x <- parse_rd(x)
  if (is.null(x)) return(NA_character_)
  
  path <- tempfile()
  tools::Rd2HTML(x, path, fragment = TRUE)
  gsub("\n$", "", readChar(path, 1000))
}

rd_deparse <- function(x) {
  paste0(as.character(x, deparse = TRUE), collapse = "")
}

Let's first look at what happens when we surround symbols with ` (I've weeded down the list to the cases that I think are most crucial; please let me know if I've missed something important):

symbols <- c(
  "&",  # needs escaping in html
  "%",  # Rd comment
  "{}", # matched braces
  "{",  # unmatched brace
  "\\", # single backslash
  "\\\\" # double backslash
)

tibble::tibble(
  code = paste0("`", symbols, "`"),
  rd = map_chr(code, roxy_md),
  text = map_chr(rd, rd_text),
  html = map_chr(rd, rd_html)
)

This yields the following table (not using reprex here since the additional layer of quoting when processed via Rmd is giving me different results):

# A tibble: 6 x 4
  code     rd             text   html                 
  <chr>    <chr>          <chr>  <chr>                
1 `&`      "\\code{&}"    ‘&’    <p><code>&amp;</code>
2 `%`      "\\code{\\%}"  ‘%’    <p><code>%</code>    
3 `{}`     "\\code{{}}"   ‘{}’   <p><code>{}</code>   
4 `{`      "\\code{{}"    NA     NA                   
5 "`\\`"   "\\code{\\}"   NA     NA                   
6 "`\\\\`" "\\code{\\\\}" "‘\\’" "<p><code>\\</code>" 

I think that's correct, as { and \\ are not valid R code, so parse_Rd() fails on them. I'm a bit surprised that \code{\\\\} works, but there are definitely a few minor bugs so it doesn't surprise me too much.

It's a bit more informative to surround the symbols in quotes because that should make a string that is always parseable:

tibble::tibble(
  code = paste0("`\"", symbols, "\"`"),
  rd = map_chr(code, roxy_md),
  text = map_chr(rd, rd_text),
  html = map_chr(rd, rd_html)
)
  code         rd                 text       html                       
  <chr>        <chr>              <chr>      <chr>                      
1 "`\"&\"`"    "\\code{\"&\"}"    "‘\"&\"’"  "<p><code>\"&amp;\"</code>"
2 "`\"%\"`"    "\\code{\"\\%\"}"  "‘\"%\"’"  "<p><code>\"%\"</code>"    
3 "`\"{}\"`"   "\\code{\"{}\"}"   "‘\"{}\"’" "<p><code>\"{}\"</code>"   
4 "`\"{\"`"    "\\code{\"{\"}"    "‘\"{\"’"  "<p><code>\"{\"</code>"    
5 "`\"\\\"`"   "\\code{\"\\\"}"   NA         NA                         
6 "`\"\\\\\"`" "\\code{\"\\\\\"}" NA         NA                         

I think row 5 is correct because it generates "\" (i.e. an unterminated string). Why does row 6 fail? It's "\\", i.e. a string containing a single backslash.
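
On the R side alone, the two quoted-backslash cases can be checked with base R's parse(). This is only a sanity check of the "valid R code" reasoning, with `parses` a hypothetical helper name; it says nothing about how parse_Rd() treats the same text inside \code{}:

```r
# Hypothetical helper: TRUE if `code` is syntactically complete R code.
parses <- function(code) {
  tryCatch({
    parse(text = code, keep.source = FALSE)
    TRUE
  }, error = function(e) FALSE)
}

row5 <- '"\\"'    # the three characters "\"  (what row 5 puts inside \code{})
row6 <- '"\\\\"'  # the four characters "\\" (what row 6 puts inside \code{})

parses(row5)  # FALSE: the backslash escapes the closing quote, so the string never ends
parses(row6)  # TRUE: a complete string holding a single backslash
```

So as far as R's own parser is concerned, row 6 is well formed, which is what makes its failure surprising.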

@hadley
Member

hadley commented Sep 10, 2019

Maybe the problem is that we're translating `x` to \code{x} when a better translation of the semantics of markdown backticks would be \verb{x}. We'd then do our own escaping of \, {, and } so that `}` would be translated to \verb{\}}.

(Making this change would also require adjusting pkgdown since currently it only auto-links the contents of \code, not of \verb)

@hadley
Member

hadley commented Sep 10, 2019

Or maybe if the code parses, generate \code{} and if it doesn't, use \verb{}?
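
A rough sketch of that conditional translation, to make the idea concrete. This is not roxygen2's actual implementation: the function names are hypothetical, "parses" is taken to mean that base R's parse() accepts the text, and percent escaping (which the tables above show roxygen2 already applies) is left out:

```r
# Hypothetical helper: TRUE if `x` is syntactically valid R code.
can_parse <- function(x) {
  tryCatch({
    parse(text = x, keep.source = FALSE)
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical helper: prefix each \, { and } with a backslash.
# Working character by character avoids any double-escaping.
escape_verb <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  special <- chars %in% c("\\", "{", "}")
  chars[special] <- paste0("\\", chars[special])
  paste0(chars, collapse = "")
}

# Translate the contents of a markdown code span to an Rd macro:
# \code{} when the text parses as R, \verb{} with escaping otherwise.
md_code_to_rd <- function(x) {
  if (can_parse(x)) {
    paste0("\\code{", x, "}")
  } else {
    paste0("\\verb{", escape_verb(x), "}")
  }
}

cat(md_code_to_rd("a && b"), "\n")  # parses, so it stays in \code{}
cat(md_code_to_rd("}"), "\n")       # does not parse, so it becomes \verb{\}}
```

Under this rule the unmatched brace and lone backslash from the first table would fall into the \verb{} branch instead of producing Rd that fails to parse.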

hadley added a commit that referenced this issue Sep 10, 2019
@natgoodman
Author

Impressive! It's been a long time since I looked at this issue, but it seems you've fixed it. I agree with your solution to conditionally generate \code vs \verb. Perhaps there could also be a switch in the API that turns off \code completely in cases where there's no need to distinguish the cases.

@hadley hadley closed this as completed in 732ec14 Sep 11, 2019