correctly parse non-ASCII characters of R files in Windows #532

shrektan · 2016-10-24T12:53:10Z

If there are non-ASCII characters in R files (such as Chinese characters), roxygen2 raises an error, because base::parse() internally calls readLines() without explicitly setting the encoding to UTF-8. The default value of encoding in readLines() is "unknown", which is not what we want on Windows, since we all agree to build R packages with UTF-8 scripts.

So I replaced parse() with code taken from devtools.

I hope this PR gets approved. Otherwise we have to keep using stringi::stri_escape_unicode(), leaving a lot of escapes like "\u4e2d\u6587" in the source, which are not very human-friendly.
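To make the trade-off concrete, here is a minimal base-R sketch. The temporary file and the variable names are illustrative and are not taken from the PR. It writes two Chinese characters to a file as UTF-8 bytes, reads them back with and without encoding = "UTF-8", and compares the result with the escaped spelling that would otherwise have to be typed by hand.

```r
# Illustrative only: the file and names below are not from roxygen2.
escaped <- "\u4e2d\u6587"   # escaped spelling of two Chinese characters

# Write the characters as raw UTF-8 bytes, so the file content does not
# depend on the session's native encoding.
tmp <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8(escaped)), tmp)

# encoding = "UTF-8" marks the strings as UTF-8; the default leaves them
# marked "unknown", i.e. to be interpreted in the native encoding.
marked   <- readLines(tmp, warn = FALSE, encoding = "UTF-8")
unmarked <- readLines(tmp, warn = FALSE)

Encoding(marked)            # "UTF-8"
identical(marked, escaped)  # TRUE: same text as the escaped literal
Encoding(unmarked)          # "unknown"
```

On a session whose native encoding is not UTF-8 (for example CP936), the unmarked string is the one that ends up misinterpreted.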

Thanks.

Reference to r-lib/devtools#1378

NOTES

source_package() uses base::sys.source() to parse functions. Unfortunately, base::sys.source() has the same issue: it does not set the encoding in readLines(). So, based on base::sys.source(), I defined a function sys_source() that parses UTF-8 encoded scripts correctly.

hadley · 2016-10-24T13:44:12Z

Wouldn't it be better to use the Encoding field from the DESCRIPTION?

hadley

Can you please also add a unit test?

hadley · 2016-10-24T13:48:11Z

R/source.R


  env
 }

+sys_source <- function(file, envir = baseenv()) {


If you copy code from base R, you must include an attribution of the source, check that the licenses are compatible, and add base R to the Authors@R. Instead, I think you can create a file() connection with the correct encoding and pass that to sys.source()

Unfortunately, it does not seem possible to use sys.source(), because it does not accept a file() connection as input. Here is the first line of the source code of sys.source():

if (!(is.character(file) && file.exists(file))) stop(gettextf("'%s' is not an existing file", file))

My proposal is to use source() instead. Does that make sense to you?

expr <- substitute(
  quote(source(file, encoding = "UTF-8", keep.source = FALSE, local = TRUE)),
  list(file = file)
)
eval(expr, envir = envir)

Why do you need substitute() + eval()?

My bad, I did not notice that local accepts an environment. So simply using source(file, encoding = "UTF-8", keep.source = FALSE, local = envir) is fine.
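The conclusion of this exchange can be sketched as follows. The helper name source_utf8() and the temporary script are hypothetical, and the code actually merged into roxygen2 may differ. The sketch assumes the session's native encoding can represent the characters in the script.

```r
# Hypothetical helper; roxygen2's actual implementation may differ.
source_utf8 <- function(file, envir = new.env(parent = globalenv())) {
  # local = envir evaluates the file's expressions directly in `envir`,
  # so no substitute() + eval() wrapper is needed.
  source(file, encoding = "UTF-8", keep.source = FALSE, local = envir)
  invisible(envir)
}

# Write a one-line script containing non-ASCII text as UTF-8 bytes.
script <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8("greeting <- \"\u4e2d\u6587\"\n")), script)

env <- source_utf8(script)
exists("greeting", envir = env, inherits = FALSE)  # TRUE
```

Because the objects land in env rather than in the calling frame, the caller decides where the sourced definitions live, which is what source_package() needs.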

hadley · 2016-10-24T13:49:13Z

R/parse.R

@@ -25,7 +25,12 @@ parse_text <- function(text, registry = default_tags(), global_options = list())
 }

 parse_blocks <- function(file, env, registry, global_options = list()) {
-  parsed <- parse(file = file, keep.source = TRUE)
+
+  lines <- readLines(file, warn = FALSE, encoding = "UTF-8")


As below, it would be simpler to do

con <- file(file, encoding = "UTF-8")
on.exit(close(con))
parsed <- parse(con, keep.source = TRUE)

Thanks. I found that the encoding of srcfile also has to be set explicitly to preserve the srcref attribute:

con <- file(file, encoding = "UTF-8")
on.exit(close(con), add = TRUE)
parsed <- parse(con, keep.source = TRUE, srcfile = srcfile(file, "UTF-8"))
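Wrapped in a function, the snippet can be tried on a small file. The wrapper name parse_utf8() and the sample script are assumptions made for this illustration, not part of the PR, and the sketch assumes a session whose native encoding can represent the characters.

```r
# Hypothetical wrapper around the snippet above; not the merged roxygen2 code.
parse_utf8 <- function(file) {
  con <- file(file, encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  # Passing a srcfile with an explicit encoding keeps the srcref attribute
  # and records how the source lines should be re-read later.
  parse(con, keep.source = TRUE, srcfile = srcfile(file, "UTF-8"))
}

# A tiny script with a non-ASCII roxygen comment, written as UTF-8 bytes.
script <- tempfile(fileext = ".R")
code <- "#' \u4e2d\u6587\nf <- function() 1\n"
writeBin(charToRaw(enc2utf8(code)), script)

parsed <- parse_utf8(script)
length(parsed)                      # 1 expression: the assignment to f
srcref <- attr(parsed, "srcref")[[1]]
attr(srcref, "srcfile")$encoding    # "UTF-8"
```

The on.exit() call only makes sense inside a function, which is why the three lines are wrapped here rather than run at top level.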

shrektan · 2016-10-25T00:30:47Z

Thanks for your quick response. Could you please give me a hint on how to get the Encoding field from the DESCRIPTION file? I would like to add this small feature.

BTW, will anybody actually use non-UTF-8 scripts to write packages? That does not seem like good practice to me...

hadley · 2016-10-25T14:49:44Z

Yeah, maybe you're right. It's unusual to use anything other than UTF-8, and if it turns out to be a problem for someone we can fix it then.

shrektan · 2016-11-16T15:11:40Z

Note that the unit test hasn't been completed yet. I will finish it soon. I will also try to support encodings other than UTF-8.

shrektan · 2016-11-24T15:42:17Z

@hadley I updated this PR with the following improvements:

Improved the code based on your suggestions,
Text can now be read according to the Encoding field in DESCRIPTION,
Added a unit test.
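One way the Encoding field could be looked up is with base R's read.dcf(). This is a sketch only: the helper desc_encoding(), its UTF-8 default, and the throwaway package directories are assumptions for illustration, and the PR may have read the field differently.

```r
# Hypothetical helper: returns the declared encoding, defaulting to UTF-8.
desc_encoding <- function(pkg_dir, default = "UTF-8") {
  # read.dcf() returns a character matrix; a missing field comes back as NA.
  desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Encoding")
  enc <- desc[[1, "Encoding"]]
  if (is.na(enc)) default else enc
}

# Two throwaway package skeletons: one declares an encoding, one does not.
pkg_gb <- tempfile("pkg")
dir.create(pkg_gb)
writeLines(c("Package: demo", "Version: 0.0.1", "Encoding: GB2312"),
           file.path(pkg_gb, "DESCRIPTION"))

pkg_plain <- tempfile("pkg")
dir.create(pkg_plain)
writeLines(c("Package: demo", "Version: 0.0.1"),
           file.path(pkg_plain, "DESCRIPTION"))

desc_encoding(pkg_gb)     # "GB2312"
desc_encoding(pkg_plain)  # "UTF-8"
```

The returned value could then be passed as the encoding argument of file() or source() in place of a hard-coded "UTF-8".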

hadley · 2016-11-28T17:53:50Z

Looking good. Using devtools::build_win(), can you please confirm that this doesn't cause any problems on Windows?

shrektan · 2016-12-01T12:02:30Z

@hadley I'm not able to use devtools::build_win(), because I just can't receive the email (I definitely changed the author's email to my personal email)...

However, I also have a Windows machine. I think running devtools::check() on the Windows machine is effectively the same as using devtools::build_win(), right? With one additional patch, the check succeeds. Below is the log:

LOG of `devtools::check()`

Restarting R session...

> devtools::check()
Updating roxygen2 documentation
Loading roxygen2
Writing NAMESPACE
Writing update_collate.Rd
Writing markdown-internals.Rd
Writing double_escape_md.Rd
Writing markdown-test.Rd
Writing namespace_roclet.Rd
Writing object_format.Rd
Writing is_s3_generic.Rd
Writing object.Rd
Writing rd_roclet.Rd
Writing roclet.Rd
Writing roclet_find.Rd
Writing roc_proc_text.Rd
Writing roxygen2-package.Rd
Writing roxygenize.Rd
Writing load_options.Rd
Writing source_package.Rd
Writing roxy_tag.Rd
Writing vignette_roclet.Rd
Setting env vars -----------------------------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD build  \
  "D:/RWD/roxygen" --no-resave-data --no-manual 

* checking for file 'D:/RWD/roxygen/DESCRIPTION' ... OK
* preparing 'roxygen2':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... OK
* cleaning src
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'roxygen2_5.0.1.9000.tar.gz'

Setting env vars -----------------------------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_ : FALSE
_R_CHECK_FORCE_SUGGESTS_: FALSE
Checking roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD check  \
  "C:\Users\amc038\AppData\Local\Temp\RtmpiCDMhc/roxygen2_5.0.1.9000.tar.gz" --as-cran --timings  \
  --no-manual 

* using log directory 'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck'
* using R version 3.3.2 (2016-10-31)
* using platform: i386-w64-mingw32 (32-bit)
* using session charset: CP936
* using options '--no-manual --as-cran'
* checking for file 'roxygen2/DESCRIPTION' ... OK
* this is package 'roxygen2' version '5.0.1.9000'
* checking package namespace information ... OK
* checking package dependencies ...Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/src/contrib:
  Line starting '<!--//////////////// ...' is malformed!
 NOTE
Package suggested but not available for checking: 'covr'
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking whether package 'roxygen2' can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking 'build' directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* loading checks for arch 'i386'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* loading checks for arch 'x64'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking compiled code ... OK
 WARNING
'qpdf' is needed for checks on size reduction of PDFs
* checking installed files from 'inst/doc' ... OK
* checking files in 'vignettes' ... OK
* checking examples ...
** running examples for arch 'i386' ... OK
** running examples for arch 'x64' ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
** running tests for arch 'i386' ...
  Running 'testthat.R'
 OK
** running tests for arch 'x64' ...
  Running 'testthat.R'
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking re-building of vignette outputs ... OK
* DONE

Status: 1 WARNING, 1 NOTE
See
  'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck/00check.log'
for details.


R CMD check results
0 errors | 0 warnings | 1 note 
checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'

Session info

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roxygen2_5.0.1.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8          digest_0.6.10        crayon_1.3.2         withr_1.0.2         
 [5] commonmark_0.9       R6_2.2.0             magrittr_1.5         stringi_1.1.2       
 [9] testthat_1.0.2.9000  xml2_1.0.0           brew_1.0-6           devtools_1.12.0.9000
[13] desc_1.0.1           tools_3.3.2          stringr_1.1.0        memoise_1.0.0

hadley · 2016-12-01T14:31:02Z

tests/testthat/test-nonASCII.R

+  # Because the parse in testthat::test don't specify encoding to UTF-8 as well,
+  # we have to use `stringi::stri_escape_unicode` to avoid errors.
+  expect_true(
+    any(grepl("\u6211\u7231\u4e2d\u6587", cnChar)) &&


It would be better to use two expect_true() statements here

Why does one string get wrapped with enc2utf8() and not the other? A comment would be helpful.

Sorry... the latter one is a typo.

hadley · 2016-12-01T14:32:18Z

tests/testthat/test-nonASCII.R

+  cnChar <- readLines(file.path(test_pkg, "man", "printChineseMsg.Rd"), encoding = "UTF-8")
+
+  # Because the parse in testthat::test don't specify encoding to UTF-8 as well,
+  # we have to use `stringi::stri_escape_unicode` to avoid errors.


Just say "so we have to use unicode escapes" (it doesn't matter how you generated them)

hadley · 2016-12-01T14:33:12Z

tests/testthat/testNonASCII/R/a.r

+# This script is intended to be saved in GB2312 to test if non UTF-8 encoding is
+# supported.
+
+#' ����ע��


The fact that this previews correctly makes me worried that it isn't actually saved as GB2312.

I had the same doubt at first, but it seems GitHub has added a feature that guesses the file encoding. As I remember, files encoded in something other than UTF-8 used not to display correctly. But when I checked my repos just now, they display correctly as well.

Below is what I saw in RStudio: [screenshot not preserved]
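Independently of how GitHub or RStudio renders the file, the worry can be checked on the raw bytes. The sketch below is illustrative only: the helper is_utf8_file() and the two temporary files are hypothetical, and the GB2312 bytes are hard-coded for the two characters U+4E2D U+6587 rather than taken from the test file in this PR.

```r
# Bytes of the two characters U+4E2D U+6587 in each encoding.
gb2312_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))
utf8_bytes   <- charToRaw(enc2utf8("\u4e2d\u6587"))

f_gb   <- tempfile(fileext = ".R")
f_utf8 <- tempfile(fileext = ".R")
writeBin(gb2312_bytes, f_gb)
writeBin(utf8_bytes, f_utf8)

# Hypothetical helper: TRUE if the file's bytes form valid UTF-8.
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

is_utf8_file(f_gb)    # FALSE: these GB2312 bytes are not valid UTF-8
is_utf8_file(f_utf8)  # TRUE
```

A file containing only ASCII passes this check under either encoding, so the test is only informative when the file holds non-ASCII text, as tests/testthat/testNonASCII/R/a.r does.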

hadley · 2016-12-01T14:33:49Z

Thanks for your help with this - a few more minor comments.

shrektan · 2016-12-02T00:15:51Z

You're welcome. I've learned more by submitting this PR 👍

hadley · 2016-12-02T15:21:38Z

Thanks!

hadley approved these changes Oct 24, 2016

shrektan force-pushed the parseNonASCII branch from 14d6273 to 0bb12da Compare November 24, 2016 15:37

add supports for nonASCII chars in roxygenize

8937f4b

shrektan force-pushed the parseNonASCII branch from 0bb12da to 8937f4b Compare November 24, 2016 15:38

hadley reviewed Dec 1, 2016

shrektan force-pushed the parseNonASCII branch from 5f66da4 to 3be15de Compare December 2, 2016 00:08

ensure write UTF-8 rd files

000b831

shrektan force-pushed the parseNonASCII branch from 3be15de to 000b831 Compare December 2, 2016 00:13

hadley merged commit 8d879e9 into r-lib:master Dec 2, 2016

MKLau mentioned this pull request Mar 14, 2018

plot.lindeman has non-ASCII encoding SEELab/enaR#275

Closed
