non-UTF8 characters fail to parse on Windows #6141

Closed
MichaelChirico opened this issue May 19, 2024 · 3 comments · Fixed by #6150
Labels
encoding (issues related to Encoding), platform-specific, tests

Comments

@MichaelChirico
Member

https://github.com/Rdatatable/data.table/blob/2b52437c9ca9f83591fc615fdc1bbad286fd0b6b/inst/tests/tests.Rraw#L18450

As observed here:

https://github.com/Rdatatable/data.table/actions/runs/9140204969/job/25133295034

Error in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) : 
  18450:119: unexpected INCOMPLETE_STRING
18449: DT = data.table(strrep(ja_ichi, 1L:4L))
18450: test(2253.10, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c("
                                                                                                                             ^
In addition: Warning message:
In parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/runneradmin/AppData/Local/Temp/RtmpYBE91X/RLIBS_6bc22f371e3/data.table/tests/tests.Rraw'
@joshhwuu
Member

joshhwuu commented May 22, 2024

It seems the code fails to parse wherever I used Japanese characters directly in the tests. I should be able to get around this with strrep; however, I'm not confident that the tests will pass even once the file parses, similar to the issues we had with AppVeyor encoding in #5096. In that case, should I isolate these tests for the Windows platform and manually change the system locale so that they pass? Testing that may be a bit trickier, though, since I don't usually build on Windows.

I was able to reproduce the failure on my Linux system by setting the locale to match the one in the GHA run. Refactoring the tests so that they don't use the Japanese characters literally, wrapping the set of tests in Sys.setlocale(), and then restoring the original locale allows the tests to pass. I saw some earlier tests that also did it this way. I just need a way to test it with the Windows GHA; how could I go about doing this? @MichaelChirico Thanks!
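For readers following along, the approach described above could be sketched roughly as follows. This is only an illustration, not the code from the eventual fix: `with_ctype` is a hypothetical helper name, and writing `ja_ichi` as the escape `"\u4e00"` (the character for "one") is an assumption about how the test variable might be defined so that the source file stays ASCII-only.

```r
# ASCII-only source: the escape avoids putting a literal multibyte
# character in the file, so parse() does not depend on the locale.
# (Assumed definition; the name ja_ichi comes from the test file.)
ja_ichi = "\u4e00"

# Hypothetical helper: evaluate `code` under a temporary LC_CTYPE,
# restoring the previous setting on exit, even if `code` errors.
with_ctype = function(locale, code) {
  old = Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(code)  # `code` is a lazy promise, so it runs after the switch
}

# Usage: "C" is used here only because it is available everywhere.
before = Sys.getlocale("LC_CTYPE")
inside = with_ctype("C", Sys.getlocale("LC_CTYPE"))
after = Sys.getlocale("LC_CTYPE")
```

The `on.exit()` call is what keeps a failing test from leaving the session in the temporary locale.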

@joshhwuu
Member

Also wanted to add that while I was working on this, test(2194.7) was also failing with my locale settings:

Tue May 21 19:36:42 2024  endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Vancouver', Sys.getlocale()=='C', l10n_info()=='MBCS=FALSE; UTF-8=FALSE; Latin-1=FALSE; codeset=ANSI_X3.4-1968', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==16; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==16; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 8 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
  Error in stopf("%d error(s) out of %d. Search %s for test number(s) %s. Duration: %s.",  : 
    1 error(s) out of 11223. Search tests/tests.Rraw for test number(s) 2194.7. Duration: 15.2s elapsed (17.3s cpu).
  Calls: test.data.table -> stopf -> raise_condition -> signal
  In addition: Warning message:
  In readLines(testDir("issue_563_fread.txt")) :
    invalid input found on input connection '/home/joshhwuu/work/data.table/data.table.Rcheck/data.table/tests/issue_563_fread.txt'

This seems to be the same test described in #5484, so I just wanted to note that it is reproducible.

@joshhwuu
Member

joshhwuu commented Jul 12, 2024

Looks like this has come back:

https://github.com/Rdatatable/data.table/actions/runs/9908622817/job/27374797919

The issue seems to remain on some Windows versions that don't have a UTF-8 locale installed, so setting the locale doesn't work:

In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored
Execution halted

Thoughts on skipping the tests if the correct locale isn't installed? I'm not sure of the best way to handle it.
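One possible shape for such a skip, as a minimal sketch: `Sys.setlocale()` is documented to return an empty string (with a warning) when the OS refuses the request, so the result can be checked before running the locale-dependent tests. The helper name `can_set_ctype` is hypothetical, and this is not necessarily how the linked fix handles it.

```r
# Hypothetical helper: TRUE if the OS honors a request for `locale`.
can_set_ctype = function(locale) {
  old = Sys.getlocale("LC_CTYPE")
  # Always put the original setting back, whether or not the probe worked.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # A refused request returns "" and signals a warning, not an error.
  nzchar(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)))
}

if (can_set_ctype("en_US.UTF-8")) {
  # ... run the locale-dependent tests here ...
} else {
  message("Skipping: en_US.UTF-8 locale is not available on this system")
}
```

Probing first avoids the "cannot be honored" warning shown above reaching the test run.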


2 participants