Problems with special characters #654

natgoodman · 2017-08-22T23:48:59Z

Greetings

After encountering problems with a few special characters I undertook a comprehensive test to see what worked and what didn’t. My test process involved generating @param tags with descriptions containing special characters in 4 contexts: normal text, quoted text, normal code, and quoted code. For each context, I attempted three ways to get the character to work: naked, escaped, and double-escaped. By way of example, the test lines for special character ‘$’ (with apologies for the messed up formatting of the 'code' cases) are

#' @param param0003 text unescaped normal: $
#' @param param0021 text unescaped quoted: "$"
#' @param param0039 text escaped normal: \$
#' @param param0057 text escaped quoted: "\$"
#' @param param0075 text double normal: \\$
#' @param param0093 text double quoted: "\\$"
#' @param param0111 code unescaped normal: `$`
#' @param param0129 code unescaped quoted: `"$"`
#' @param param0147 code escaped normal: `\$`
#' @param param0165 code escaped quoted: `"\$"`
#' @param param0183 code double normal: `\\$`
#' @param param0201 code double quoted: `"\\$"`

I placed the test lines in a .R file (attached), converted the roxygen to Rd using devtools::document, and converted the Rd to HTML using tools::Rd2HTML. Every so often I also produced a PDF using R CMD Rd2pdf, just to be safe, and never saw a case where the HTML conversion worked but the PDF conversion had problems.

The special characters I tested were & % $ # _ { } ~ ^ \ @ [ ] ( ) {} [] (). I included balanced pairs - {}, [], () - since balanced and unbalanced work differently in some contexts. These are the 10 LaTeX special characters, plus a few that I saw mentioned as special in roxygen or Rd, plus parentheses for good measure.
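The test-line scheme described earlier (two contexts, three escaping levels, with and without quotes) can be sketched in R. This is only an illustrative reconstruction, not the Perl script used for the report: `make_test_lines()` is a hypothetical helper, and it numbers parameters sequentially rather than with the stride seen in the sample lines.

```r
# Hypothetical helper: build the 12 roxygen @param test lines for one
# special character (text/code x unescaped/escaped/double x normal/quoted).
make_test_lines <- function(ch, start = 1L) {
  escapes <- c(unescaped = "", escaped = "\\", double = "\\\\")

  # expand.grid() varies the first column fastest, which reproduces the
  # ordering of the sample lines: quoting, then escaping, then context.
  grid <- expand.grid(
    quoting = c("normal", "quoted"),
    escape = names(escapes),
    context = c("text", "code"),
    stringsAsFactors = FALSE
  )

  body <- paste0(escapes[grid$escape], ch)
  body <- ifelse(grid$quoting == "quoted", paste0("\"", body, "\""), body)
  body <- ifelse(grid$context == "code", paste0("`", body, "`"), body)

  sprintf(
    "#' @param param%04d %s %s %s: %s",
    start + seq_len(nrow(grid)) - 1L,
    grid$context, grid$escape, grid$quoting, body
  )
}

lines <- make_test_lines("$")
writeLines(lines)
```

Calling the helper once per special character and concatenating the results would give a test file along the lines of the attached one.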

The table below shows what needs to be typed to get each special character rendered correctly, or 'NONE' if none of my attempts worked.

spcl	text-normal	text-quoted	code-normal	code-quoted
#	# (but see note 2)	"#"	`#`	`"#"`
$	$	"$"	`$`	`"$"`
%	%	"%"	NONE	NONE
&	&	"&"	`&`	`"&"`
(	(	"("	`(`	`"("`
()	()	"()"	`()`	`"()"`
)	)	")"	`)`	`")"`
@	@	"@"	`@`	`"@"`
[	[	"["	`[`	`"["`
[]	[]	"[]"	`[]`	`"[]"`
\	NONE (but see note 3)	"\"	`\`	NONE (but see note 3)
]	]	"]"	`]`	`"]"`
^	^	"^"	`^`	`"^"`
_	_	"_"	`_`	`"_"`
{	NONE	NONE	`{`	`"{"`
{}	\{\}	"\{\}"	`{}`	`"{}"`
}	NONE	NONE	`}`	NONE
~	~	"~"	`~`	`"~"`

Summary

1. The following work without drama: $ & ( () ) @ [ [] ] ^ _ ~
2. # works without drama except at start-of-text, where (in a separate test) it triggers an error (“attempt to apply non-function”).
3. The \ failures only occur when it is at the end of the text or code. It works without drama elsewhere.
4. % works when escaped in text but, as reported by others, it doesn’t work in code.
5. As reported by others, unbalanced braces don’t work in text. Open brace works in code when escaped. Close brace works in normal code when escaped but not in quoted code.
6. Balanced braces have to be double escaped in text, but work naked in code.

Reasoning that the problems may be caused by Rd limitations, I redid the test generating Rd directly. The results were far better: everything worked except \ in quoted code. I’m happy to provide the Rd results should that be of interest.

It goes without saying I’m also happy to provide the Perl script I used to generate the test cases or modify the script to run additional test cases as you wish.

roxescape.R.zip

natgoodman · 2017-08-23T22:30:41Z

I reran with roxygen2 6.0.1.9000, which I hope is the latest dev version. A few more cases produced output, but there was no change to the final results. Attached are text files with detailed results from the two runs. The columns are pretty self-explanatory, I think, except perhaps these

render - what the output actually looked like in a web browser (it's the HTML with some obvious cleanup)
correct - does the output match what we want to see
okay - whether the input was processed without generating an error somewhere along the way
annotation - my manual annotation of errors or incorrect rendering; I did this more carefully in the first (cran) run than the second (dev).

To see the differences that might affect the final results, I cut the first 9 columns and diff'ed them (attached).

A new bug has crept in: & is being translated to &amp; in code.

roxescape.cran.txt
roxescape.dev.txt
roxescape.dev-vs-cran.diff.txt

hadley · 2019-09-10T18:24:29Z

The problems with the HTML escapes for &, <, and > appear to be fixed:

roxygen2:::markdown("`a && b`")
#> [1] "`a && b`"
roxygen2:::markdown("`<>`")
#> [1] "`<>`"

Created on 2019-09-10 by the reprex package (v0.3.0)

hadley · 2019-09-10T19:09:55Z

To unpack what's going on here, here are a few helpers to convert markdown to Rd (using roxygen2), and then convert Rd to text and html (using the tools package)

library(purrr)

roxy_md <- function(x) {
  roxygen2:::markdown(x)
}

parse_rd <- function(x) {
  con <- textConnection(x)
  on.exit(close(con), add = TRUE)
  
  tryCatch(
    tools::parse_Rd(con, fragment = TRUE, encoding = "UTF-8"),
    warning = function(cnd) NULL
  )
}

rd_text <- function(x) {
  x <- parse_rd(x)
  if (is.null(x)) return(NA_character_)
  
  path <- tempfile()
  tools::Rd2txt(x, path, fragment = TRUE)
  gsub("\n$", "", readChar(path, 100))
}

rd_html <- function(x) {
  x <- parse_rd(x)
  if (is.null(x)) return(NA_character_)
  
  path <- tempfile()
  tools::Rd2HTML(x, path, fragment = TRUE)
  gsub("\n$", "", readChar(path, 1000))
}

rd_deparse <- function(x) {
  paste0(as.character(x, deparse = TRUE), collapse = "")
}

Let's first look at what happens when we surround symbols with ` (I've weeded down the list to the cases that I think are most crucial; please let me know if I've missed something important):

symbols <- c(
  "&",  # needs escaping in html
  "%",  # Rd comment
  "{}", # matched braces
  "{",  # unmatched brace
  "\\", # single backslash
  "\\\\" # double backslash
)

tibble::tibble(
  code = paste0("`", symbols, "`"),
  rd = map_chr(code, roxy_md),
  text = map_chr(rd, rd_text),
  html = map_chr(rd, rd_html)
)

This yields the following table (not using reprex here since the additional layer of quoting when processed via Rmd is giving me different results):

# A tibble: 6 x 4
  code     rd             text   html                 
  <chr>    <chr>          <chr>  <chr>                
1 `&`      "\\code{&}"    ‘&’    <p><code>&amp;</code>
2 `%`      "\\code{\\%}"  ‘%’    <p><code>%</code>    
3 `{}`     "\\code{{}}"   ‘{}’   <p><code>{}</code>   
4 `{`      "\\code{{}"    NA     NA                   
5 "`\\`"   "\\code{\\}"   NA     NA                   
6 "`\\\\`" "\\code{\\\\}" "‘\\’" "<p><code>\\</code>"

I think that's correct, as { and \\ are not valid R code, so parse_Rd() fails on them. I'm a bit surprised that \code{\\\\} works, but there are definitely a few minor bugs so it doesn't surprise me too much.
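The brace behaviour can be checked with the tools package alone, without going through roxygen2. The following is a minimal sketch; `rd_parses()` is a hypothetical helper name, and it treats any warning or error from `tools::parse_Rd()` as a parse failure, mirroring the `parse_rd()` helper above.

```r
# Hypothetical helper: TRUE if an Rd fragment parses cleanly,
# FALSE if tools::parse_Rd() signals a warning or an error.
rd_parses <- function(rd) {
  con <- textConnection(rd)
  on.exit(close(con), add = TRUE)

  tryCatch(
    {
      tools::parse_Rd(con, fragment = TRUE, encoding = "UTF-8")
      TRUE
    },
    warning = function(cnd) FALSE,
    error = function(cnd) FALSE
  )
}

balanced <- rd_parses("\\code{{}}")   # matched braces inside \code
unmatched <- rd_parses("\\code{{}")   # unmatched opening brace
print(c(balanced = balanced, unmatched = unmatched))
```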

It's a bit more informative to surround the symbols in quotes because that should make a string that is always parseable:

tibble::tibble(
  code = paste0("`\"", symbols, "\"`"),
  rd = map_chr(code, roxy_md),
  text = map_chr(rd, rd_text),
  html = map_chr(rd, rd_html)
)

  code         rd                 text       html                       
  <chr>        <chr>              <chr>      <chr>                      
1 "`\"&\"`"    "\\code{\"&\"}"    "‘\"&\"’"  "<p><code>\"&amp;\"</code>"
2 "`\"%\"`"    "\\code{\"\\%\"}"  "‘\"%\"’"  "<p><code>\"%\"</code>"    
3 "`\"{}\"`"   "\\code{\"{}\"}"   "‘\"{}\"’" "<p><code>\"{}\"</code>"   
4 "`\"{\"`"    "\\code{\"{\"}"    "‘\"{\"’"  "<p><code>\"{\"</code>"    
5 "`\"\\\"`"   "\\code{\"\\\"}"   NA         NA                         
6 "`\"\\\\\"`" "\\code{\"\\\\\"}" NA         NA

I think row 5 is correct because it generates "\" (i.e. an unterminated string). Why does row 6 fail? It's "\\", i.e. a string containing a single backslash.

hadley · 2019-09-10T19:18:36Z

Maybe the problem is that we're translating `x` to \code{x} when really a better translation of the semantics of markdown backticks would be to translate it to \verb{x}. We'd then do our own escaping of \, {, and } so that `{` would be translated to \verb{\{}.

(Making this change would also require adjusting pkgdown since currently it only auto-links the contents of \code, not of \verb)

hadley · 2019-09-10T19:27:34Z

Or maybe if the code parses, generate \code{} and if it doesn't, use \verb{}?
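A rough sketch of that idea, using only base R. This is not roxygen2's actual implementation: `can_parse()`, `escape_verb()`, and `backtick_to_rd()` are hypothetical names, and the sketch leaves out the % escaping that \code{} needs.

```r
# Hypothetical sketch: pick \code{} when the text parses as R code,
# otherwise fall back to \verb{} with \, {, and } escaped.

# TRUE if `x` is syntactically valid R code.
can_parse <- function(x) {
  tryCatch(
    {
      parse(text = x, keep.source = FALSE)
      TRUE
    },
    error = function(e) FALSE
  )
}

# Prefix each backslash and brace with a backslash.
escape_verb <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  special <- chars %in% c("\\", "{", "}")
  chars[special] <- paste0("\\", chars[special])
  paste(chars, collapse = "")
}

backtick_to_rd <- function(code) {
  if (can_parse(code)) {
    paste0("\\code{", code, "}")
  } else {
    paste0("\\verb{", escape_verb(code), "}")
  }
}

cat(backtick_to_rd("x + 1"), "\n")
cat(backtick_to_rd("{"), "\n")
```

Keeping \code{} for parseable input preserves whatever downstream tools do with \code contents (such as the pkgdown auto-linking mentioned above), while unparseable fragments still render verbatim.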

Fixes #654

natgoodman · 2019-09-11T12:03:17Z

Impressive! It's been a long time since I looked at this issue, but it seems you've fixed it. I agree with your solution to conditionally generate \code vs \verb. Perhaps there could also be a switch in the API that turns off \code completely in cases where there's no need to distinguish the cases.

hadley added the bug (an unexpected problem or unintended behavior) label Oct 1, 2017

jennybc referenced this issue in r-lib/usethis Jan 15, 2018

More battling with preformatted in Rd

0f62b1d

hadley added this to the v6.2.0 milestone Aug 22, 2019

hadley added a commit that referenced this issue Sep 10, 2019

Conditionally generate \code or \verb

de1a41c

Fixes #654

hadley closed this as completed in 732ec14 Sep 11, 2019
