-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
correctly parse non-ASCII characters of R files in Windows #532
Conversation
Wouldn't it be better to use the Encoding field from the DESCRIPTION? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please also add a unit test?
|
||
env | ||
} | ||
|
||
sys_source <- function(file, envir = baseenv()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you copy code from base R, you must include an attribution of the source, check that the licenses are compatible, and add base R to the Authors@R
. Instead, I think you can create a file()
connection with the correct encoding and pass that to sys.source()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it seems like not possible to use sys.source()
, because it doesn't allow file()
connection as the input. Here's the first line source code of sys.source()
:
if (!(is.character(file) && file.exists(file)))
stop(gettextf("'%s' is not an existing file", file))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposal is to use source()
, does it make sense for you?
expr <-
substitute(
quote(source(file, encoding = "UTF-8", keep.source = FALSE, local = TRUE)),
list(file = file)
)
eval(expr, envir = envir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need substitute()
+ eval()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, not notice local
accepts environment
values. So just use source(file, encoding = "UTF-8", keep.source = FALSE, local = envir)
is ok then.
@@ -25,7 +25,12 @@ parse_text <- function(text, registry = default_tags(), global_options = list()) | |||
} | |||
|
|||
parse_blocks <- function(file, env, registry, global_options = list()) { | |||
parsed <- parse(file = file, keep.source = TRUE) | |||
|
|||
lines <- readLines(file, warn = FALSE, encoding = "UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As below, it would be simpler to do
con <- file(file, encoding = "UTF-8")
on.exit(close(con))
parsed <- parse(con, keep.source = TRUE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Find it needs to explicitly set the encoding of srcfile
to preserve the srcref
attribution:
con <- file(file, encoding = "UTF-8")
on.exit(close(con), add = TRUE)
parsed <- parse(con, keep.source = TRUE, srcfile = srcfile(file, "UTF-8"))
Thanks for you quick respond. Would you please give me a hint of how to get the Encoding field from the DESCRIPTION file? I would like to add this little feature. BTW, actually will anybody use non-UTF8 script to write packages? Seems like not a good practice for me... |
Yeah, maybe your right. It's unusual to use anything other than UTF-8, and if turns out to be a problem for someone we can fix it then. |
Note the unit testing hasn't been completed. Will finish it soon. Will also try to implement encoding other than |
14d6273
to
0bb12da
Compare
0bb12da
to
8937f4b
Compare
@hadley I updated this PR with improvements on :
|
Looking good. Using |
@hadley I'm not able to use However, I have a windows machine as well. I think run LOG of
|
# Because the parse in testthat::test don't specify encoding to UTF-8 as well, | ||
# we have to use `stringi::stri_escape_unicode` to avoid errors. | ||
expect_true( | ||
any(grepl("\u6211\u7231\u4e2d\u6587", cnChar)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use two expect_true()
statements here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does one string get wrapped with enc2utf8()
and not the other? A comment would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry... the later one is a typo....
cnChar <- readLines(file.path(test_pkg, "man", "printChineseMsg.Rd"), encoding = "UTF-8") | ||
|
||
# Because the parse in testthat::test don't specify encoding to UTF-8 as well, | ||
# we have to use `stringi::stri_escape_unicode` to avoid errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say "so we have to use unicode escapes" (it doesn't matter how you generated them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
# This script is intended to be saved in GB2312 to test if non UTF-8 encoding is | ||
# supported. | ||
|
||
#' ����ע�� |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this previews correctly makes me worried that it isn't actually saved as GB2312.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your help with this - a view more minor comments. |
5f66da4
to
3be15de
Compare
3be15de
to
000b831
Compare
You're welcome. I've learned more by submitting this PR 👍 |
Thanks! |
If there's non-ASCII characters in R files (like Chinese Chars),
roxygen2
will result an error, because thebase::parse()
function internally usesreadLines()
without explicit setting theencoding
toUTF-8
. Instead, the default value ofencoding
inreadLines()
isunknow
, which is not desired in windows, since we all agree to build R package with UTF-8 scripts.So I replaced
parse()
with the codes fromdevtools()
.Hope this PR gets approved. Or we have to use
stringi::stri_escape_unicode()
again and again, leavinga lot of chars like
"\u4e2d\u6587"
, which is not so human-friendly.Thanks.
Reference to r-lib/devtools#1378
NOTES
source_package()
usesbase::sys.source()
to parse functions. Unfortunately,base::sys.source()
has the same issue that no setting encoding inreadLines()
. So, based onbase::sys.source()
, I defined a functionsys_source()
to parse UTF-8 encoded script correctly.