Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly parse non-ASCII characters of R files in Windows #532

Merged
merged 2 commits into from
Dec 2, 2016

Conversation

shrektan
Copy link
Contributor

@shrektan shrektan commented Oct 24, 2016

If there's non-ASCII characters in R files (like Chinese Chars), roxygen2 will result an error, because the base::parse() function internally uses readLines() without explicit setting the encoding to UTF-8. Instead, the default value of encoding in readLines() is unknow, which is not desired in windows, since we all agree to build R package with UTF-8 scripts.

So I replaced parse() with the codes from devtools().

Hope this PR gets approved. Or we have to use stringi::stri_escape_unicode() again and again, leaving
a lot of chars like "\u4e2d\u6587", which is not so human-friendly.

Thanks.

Reference to r-lib/devtools#1378

NOTES

source_package() uses base::sys.source() to parse functions. Unfortunately, base::sys.source() has the same issue that no setting encoding in readLines(). So, based on base::sys.source(), I defined a function sys_source() to parse UTF-8 encoded script correctly.

@hadley
Copy link
Member

hadley commented Oct 24, 2016

Wouldn't it be better to use the Encoding field from the DESCRIPTION?

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add a unit test?


env
}

sys_source <- function(file, envir = baseenv()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you copy code from base R, you must include an attribution of the source, check that the licenses are compatible, and add base R to the Authors@R. Instead, I think you can create a file() connection with the correct encoding and pass that to sys.source()

Copy link
Contributor Author

@shrektan shrektan Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it seems like not possible to use sys.source(), because it doesn't allow file() connection as the input. Here's the first line source code of sys.source():

    if (!(is.character(file) && file.exists(file))) 
        stop(gettextf("'%s' is not an existing file", file))

Copy link
Contributor Author

@shrektan shrektan Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal is to use source(), does it make sense for you?

 expr <-
    substitute(
      quote(source(file, encoding = "UTF-8", keep.source = FALSE, local = TRUE)),
      list(file = file)
    )
  eval(expr, envir = envir)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need substitute() + eval()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, not notice local accepts environment values. So just use source(file, encoding = "UTF-8", keep.source = FALSE, local = envir) is ok then.

@@ -25,7 +25,12 @@ parse_text <- function(text, registry = default_tags(), global_options = list())
}

parse_blocks <- function(file, env, registry, global_options = list()) {
parsed <- parse(file = file, keep.source = TRUE)

lines <- readLines(file, warn = FALSE, encoding = "UTF-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As below, it would be simpler to do

con <- file(file, encoding = "UTF-8")
on.exit(close(con))
parsed <- parse(con, keep.source = TRUE)

Copy link
Contributor Author

@shrektan shrektan Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Find it needs to explicitly set the encoding of srcfile to preserve the srcref attribution:

  con <- file(file, encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  parsed <- parse(con, keep.source = TRUE, srcfile = srcfile(file, "UTF-8"))

@shrektan
Copy link
Contributor Author

shrektan commented Oct 25, 2016

Thanks for you quick respond. Would you please give me a hint of how to get the Encoding field from the DESCRIPTION file? I would like to add this little feature.

BTW, actually will anybody use non-UTF8 script to write packages? Seems like not a good practice for me...

@hadley
Copy link
Member

hadley commented Oct 25, 2016

Yeah, maybe your right. It's unusual to use anything other than UTF-8, and if turns out to be a problem for someone we can fix it then.

@shrektan
Copy link
Contributor Author

Note the unit testing hasn't been completed. Will finish it soon. Will also try to implement encoding other than UTF-8.

@shrektan
Copy link
Contributor Author

@hadley I updated this PR with improvements on :

  1. Improve the codes based on your suggestions,
  2. Could read text based on Encoding field in DESCRIPTION,
  3. Add a unit test.

@hadley
Copy link
Member

hadley commented Nov 28, 2016

Looking good. Using devtools::build_win() can you please confirm that this doesn't cause any problems on windows?

@shrektan
Copy link
Contributor Author

shrektan commented Dec 1, 2016

@hadley I'm not able to use devtools::build_win(), because I just can't receive the email (I definitely changed the author's email to my personal email)...

However, I have a windows machine as well. I think run devtools::check() in the windows machine is literally the same as using devtools::build_win(), right? Then with one additional patch, the check succeeds. Below is the log:

LOG of devtools::check()

Restarting R session...

> devtools::check()
Updating roxygen2 documentation
Loading roxygen2
Writing NAMESPACE
Writing update_collate.Rd
Writing markdown-internals.Rd
Writing double_escape_md.Rd
Writing markdown-test.Rd
Writing namespace_roclet.Rd
Writing object_format.Rd
Writing is_s3_generic.Rd
Writing object.Rd
Writing rd_roclet.Rd
Writing roclet.Rd
Writing roclet_find.Rd
Writing roc_proc_text.Rd
Writing roxygen2-package.Rd
Writing roxygenize.Rd
Writing load_options.Rd
Writing source_package.Rd
Writing roxy_tag.Rd
Writing vignette_roclet.Rd
Setting env vars -----------------------------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD build  \
  "D:/RWD/roxygen" --no-resave-data --no-manual 

* checking for file 'D:/RWD/roxygen/DESCRIPTION' ... OK
* preparing 'roxygen2':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... OK
* cleaning src
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'roxygen2_5.0.1.9000.tar.gz'

Setting env vars -----------------------------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_ : FALSE
_R_CHECK_FORCE_SUGGESTS_: FALSE
Checking roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD check  \
  "C:\Users\amc038\AppData\Local\Temp\RtmpiCDMhc/roxygen2_5.0.1.9000.tar.gz" --as-cran --timings  \
  --no-manual 

* using log directory 'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck'
* using R version 3.3.2 (2016-10-31)
* using platform: i386-w64-mingw32 (32-bit)
* using session charset: CP936
* using options '--no-manual --as-cran'
* checking for file 'roxygen2/DESCRIPTION' ... OK
* this is package 'roxygen2' version '5.0.1.9000'
* checking package namespace information ... OK
* checking package dependencies ...Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/src/contrib:
  Line starting '<!--//////////////// ...' is malformed!
 NOTE
Package suggested but not available for checking: 'covr'
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking whether package 'roxygen2' can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking 'build' directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* loading checks for arch 'i386'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* loading checks for arch 'x64'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking compiled code ... OK
 WARNING
'qpdf' is needed for checks on size reduction of PDFs
* checking installed files from 'inst/doc' ... OK
* checking files in 'vignettes' ... OK
* checking examples ...
** running examples for arch 'i386' ... OK
** running examples for arch 'x64' ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
** running tests for arch 'i386' ...
  Running 'testthat.R'
 OK
** running tests for arch 'x64' ...
  Running 'testthat.R'
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking re-building of vignette outputs ... OK
* DONE

Status: 1 WARNING, 1 NOTE
See
  'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck/00check.log'
for details.


R CMD check results
0 errors | 0 warnings | 1 note 
checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'

Session info

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roxygen2_5.0.1.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8          digest_0.6.10        crayon_1.3.2         withr_1.0.2         
 [5] commonmark_0.9       R6_2.2.0             magrittr_1.5         stringi_1.1.2       
 [9] testthat_1.0.2.9000  xml2_1.0.0           brew_1.0-6           devtools_1.12.0.9000
[13] desc_1.0.1           tools_3.3.2          stringr_1.1.0        memoise_1.0.0  

# Because the parse in testthat::test don't specify encoding to UTF-8 as well,
# we have to use `stringi::stri_escape_unicode` to avoid errors.
expect_true(
any(grepl("\u6211\u7231\u4e2d\u6587", cnChar)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use two expect_true() statements here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does one string get wrapped with enc2utf8() and not the other? A comment would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry... the later one is a typo....

cnChar <- readLines(file.path(test_pkg, "man", "printChineseMsg.Rd"), encoding = "UTF-8")

# Because the parse in testthat::test don't specify encoding to UTF-8 as well,
# we have to use `stringi::stri_escape_unicode` to avoid errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just say "so we have to use unicode escapes" (it doesn't matter how you generated them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

# This script is intended to be saved in GB2312 to test if non UTF-8 encoding is
# supported.

#' ����ע��
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this previews correctly makes me worried that it isn't actually saved as GB2312.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubted as well in the first beginning, but it seems like GitHub added a new feature to guess the file encoding. In the past, I remember the file encoded other than UTF-8 doesn't display well. But now when I checked my repos, they display good as well.

Below is the what I saw in RStudio:

image

image

@hadley
Copy link
Member

hadley commented Dec 1, 2016

Thanks for your help with this - a view more minor comments.

@shrektan
Copy link
Contributor Author

shrektan commented Dec 2, 2016

You're welcome. I've learned more by submitting this PR 👍

@hadley hadley merged commit 8d879e9 into r-lib:master Dec 2, 2016
@hadley
Copy link
Member

hadley commented Dec 2, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants