Changed char.trunc to better handle combining and full-width multibyte characters #6048

joshhwuu · 2024-04-04T10:33:15Z

Second PR, see: #5995.

The old version of char.trunc in print.data.table used nchar(x) internally, which works fine in most cases but not for combining characters such as "á" (the Latin "a" plus a combining acute accent): nchar(x) counts these as two separate characters. In this case, nchar(x, 'width') works better.
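
For illustration, a minimal sketch of the difference between the two nchar() types. It assumes a UTF-8 session, and the definitions of accented_a and ja_ichi are assumptions about what the test variables of the same name hold, not copied from the PR:

```r
# "a" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one glyph
accented_a <- "a\u0301"
# U+4E00, a full-width CJK character: one code point, two display columns
ja_ichi <- "\u4e00"

x <- c(accented_a, ja_ichi)
chars <- nchar(x, type = "chars")  # number of code points per string
width <- nchar(x, type = "width")  # number of display columns per string

# Neither count alone matches the number of readable characters (1 for both):
# chars overcounts the combining sequence, width overcounts the full-width one.
print(data.frame(chars = chars, width = width,
                 row.names = c("accented_a", "ja_ichi")))
```

Taking the smaller of the two counts gives one readable character in both cases, which is one plausible way to decide when a string is too long.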

This PR changes char.trunc to detect combining and full-width characters and index accordingly. Additionally, strtrim() now replaces substr() because, as noted in ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").
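
The point of the quoted advice can be sketched as follows. This assumes a UTF-8 session; truncate_display() is a hypothetical helper written for illustration and is not data.table's char.trunc. Its rule of comparing the smaller of the character count and the width against the limit, and doubling the column budget for full-width strings, is an assumption inferred from the tests discussed in this thread:

```r
ja_ichi <- "\u4e00"      # full-width: 1 character, 2 columns (assumed definition)
accented_a <- "a\u0301"  # combining: 2 characters, 1 column (assumed definition)

wide <- strrep(ja_ichi, 4L)
cols_substr <- nchar(substr(wide, 1L, 4L), type = "width")  # substr() limits characters
cols_strtrim <- nchar(strtrim(wide, 4L), type = "width")    # strtrim() limits columns

# Hypothetical helper: keep at most `limit` readable characters, then append dots.
truncate_display <- function(x, limit, dots = "...") {
  width <- nchar(x, type = "width")
  chars <- nchar(x, type = "chars")
  full_width <- width > chars
  # NA strings are never truncated; the is.na() guard keeps the index NA-free
  too_long <- !is.na(x) & pmin(width, chars) > limit
  if (!any(too_long)) return(x)
  # strtrim() works in columns, so full-width strings get two columns per character
  budget <- limit * ifelse(full_width[too_long], 2L, 1L)
  x[too_long] <- paste0(strtrim(x[too_long], budget), dots)
  x
}

res <- truncate_display(c(strrep(accented_a, 4L), wide, NA), limit = 3L)
print(res)
```

Strings that mix combining and full-width characters are not handled specially here; the sketch only covers the homogeneous cases the tests exercise.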

Tasks:

Find a way to make tests work on all encodings and pass build tests (Works now because Appveyor dropped?)
Refactor tests for readability
Update NEWS.md
Add comment describing potential issues and suggestions to char.trunc

codecov · 2024-04-04T10:41:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.51%. Comparing base (502c59e) to head (2bb5da3).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6048      +/-   ##
==========================================
- Coverage   97.53%   97.51%   -0.02%     
==========================================
  Files          80       80              
  Lines       14916    14918       +2     
==========================================
- Hits        14548    14547       -1     
- Misses        368      371       +3

tdhock · 2024-04-04T15:46:58Z

this looks good to me.
make sure to add your name as a contributor in DESCRIPTION, and add an item to NEWS.md

joshhwuu · 2024-04-04T19:09:17Z

Done!

MichaelChirico · 2024-04-04T19:55:28Z

Works now because Appveyor dropped?

Interesting that it passes on GHA-on-Windows but not Appveyor, hmm. Let's try submitting this to CRAN's winbuilder service to see if it passes there at least.

MichaelChirico · 2024-04-04T20:19:13Z

OK, it passes on winbuilder:

https://win-builder.r-project.org/vpDSszNIo19V/00check.log

I am happy to merge and revisit this if someone else encounters the issue. It may be related to GHA and CRAN using UCRT-enabled machines. I see all the Windows checkers on CRAN use utf8, and in fact there are no non-UTF8 checkers left: https://cran.r-project.org/web/checks/check_flavors.html#r-release-windows-x86_64

Likely there's an issue in the tests for non-UTF8 users, but it will be easier to work with a user encountering this who can do fast iteration on a fix themselves.
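
A user who hits this could start by checking what their session reports. The snippet below is only a sketch of such a check using base R's l10n_info(); how the tests themselves should react to a non-UTF-8 session is left open here:

```r
# l10n_info() describes the current session's locale
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

accented_a <- "a\u0301"  # assumed definition: "a" plus combining acute accent
if (is_utf8) {
  message("UTF-8 session; display width of accented_a: ",
          nchar(accented_a, type = "width"))
} else {
  message("Non-UTF-8 session (", Sys.getlocale("LC_CTYPE"),
          "); multibyte print tests may behave differently here")
}
```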

MichaelChirico · 2024-04-04T20:21:39Z

inst/tests/tests.Rraw

+clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output
+# Tests for combining character latin a and acute accent, single row
+DT = data.table(strrep(accented_a, 4L))
+test(2253.01, gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 4L), options=list(datatable.prettyprint.char = 4L))


stylistic request before we merge -- use style test(n, options=, ...) to make options= clear up-front.

Can you also try going back to the simple output= approach of the tests? It will make it a lot cleaner to read. Hopefully the simpler test also passes on windows GHA, if not we can revert back to this functioning version.

joshhwuu · 2024-04-04

I'll do this once I get home later! I used this style as I found it easier to read the expected output of the test.

For the first few tests, the simplified output= works well:

...
test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))
test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
...

But for the later tests (once we have more columns/rows), I find that using this approach requires something like:

test(2253.07, options=list(datatable.prettyprint.char = 4L), DT, output=" V1\n1: á\n2: áá\n3: ááá\n4: áááá")
# vs
test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c("á", "áá", "ááá", "áááá"))

I'm not a huge fan of how the expected output looks, but LMK if you think this way is more readable and I can refactor the rest of the tests. For now, I'll commit the options syntax style change and using the simple approach with the first few tests.
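
To show what the clean_regex variant does, here is a small base-R sketch. The printed vector is a hypothetical stand-in for capture.output(print(DT)); the real spacing and header rows depend on the data.table version and print options:

```r
# Hypothetical captured print output: one header line, then numbered rows
printed <- c("     V1",
             "1:    \u00e1",
             "2:   \u00e1\u00e1",
             "3:  \u00e1\u00e1\u00e1")

clean_regex <- "^\\d+:\\s+"  # row number, colon, and the padding after it

body_rows <- printed[-1L]                    # drop the header line
cleaned <- gsub(clean_regex, "", body_rows)  # leaves only the cell contents
print(cleaned)
```

The expected value then becomes a plain character vector of cell contents, so it does not have to encode the column padding.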

MichaelChirico · 2024-04-04

Ah, I had something else in mind, but it looks like the force push erased the history in the other branch. Oh well. I won't belabor it any further :)

tdhock · 2024-04-04T20:57:07Z

great thanks for the contribution. I invited you to https://github.com/orgs/Rdatatable/teams/project-members which gives you permission to create branches in this repo, so next time you make a PR, can you please create the branch in this repo? (instead of your fork)

joshhwuu added 2 commits April 4, 2024 02:49

changed char.trunc to better handle full-width and combining characters

98c2b64

Tests added, make pretty later

725c037

joshhwuu requested review from MichaelChirico and jangorecki as code owners April 4, 2024 10:33

joshhwuu added 4 commits April 4, 2024 11:58

Added comment to char.trunc for future issues/suggestions

03d75fb

Refactored tests for readability, added multiple rows/columns tests

fd89e81

Updated NEWS.md

da9cd01

Added myself as contributor in DESCRIPTION

e94d863

MichaelChirico reviewed Apr 4, 2024

changed test style to have options in front

2bb5da3

MichaelChirico approved these changes Apr 4, 2024

Merge branch 'master' into trunc

618909c

MichaelChirico merged commit 420e60b into Rdatatable:master Apr 4, 2024
3 checks passed

joshhwuu deleted the trunc branch April 4, 2024 23:17

joshhwuu mentioned this pull request Aug 30, 2024

char.trunc fails with NA #6441

Closed
