Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed char.trunc to better handle combining and full-width multibyte characters #6048

Merged
merged 8 commits into from
Apr 4, 2024

Conversation

joshhwuu
Copy link
Member

@joshhwuu joshhwuu commented Apr 4, 2024

Closes #5096

Second PR, see: #5995.

The old version of trunc.char in data.table.print used nchar(x) internally which works fine in most cases except when dealing with combining characters such as "á" (The Latin "a" and accent) as nchar(x) recognizes two distinct readable characters. In this case using nchar(x, 'width') works better

This PR changes trunc.char to recognize when the char is combining or full-width and indexes accordingly. Additionally, strtrim() is used now in place of substr() because as mentioned in ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").

Tasks:

  • Find a way to make tests work on all encodings and pass build tests (Works now because Appveyor dropped?)
  • Refactor tests for readability
  • Update NEWS.md
  • Add comment describing potential issues and suggestions to char.trunc

Copy link

codecov bot commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.51%. Comparing base (502c59e) to head (2bb5da3).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6048      +/-   ##
==========================================
- Coverage   97.53%   97.51%   -0.02%     
==========================================
  Files          80       80              
  Lines       14916    14918       +2     
==========================================
- Hits        14548    14547       -1     
- Misses        368      371       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tdhock
Copy link
Member

tdhock commented Apr 4, 2024

this looks good to me.
make sure to add your name as a contributor in DESCRIPTION, and add an item to NEWS.md

@joshhwuu
Copy link
Member Author

joshhwuu commented Apr 4, 2024

Done!

@MichaelChirico
Copy link
Member

Works now because Appveyor dropped?

Interesting that it passes on GHA-on-Windows but not Appveyor, hmm. Let's try submitting this to CRAN's winbuilder service to see if it passes there at least.

@MichaelChirico
Copy link
Member

OK, it passes on winbuilder:

https://win-builder.r-project.org/vpDSszNIo19V/00check.log

I am happy to merge and revisit this if someone else encounters the issue. It may be related to GHA and CRAN using UCRT-enabled machines. I see all the Windows checkers on CRAN use utf8, and in fact there are no non-UTF8 checkers left: https://cran.r-project.org/web/checks/check_flavors.html#r-release-windows-x86_64

Likely there's an issue in the tests for non-UTF8 users, but it will be easier to work with a user encountering this who can do fast iteration on a fix themselves.

clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output
# Tests for combining character latin a and acute accent, single row
DT = data.table(strrep(accented_a, 4L))
test(2253.01, gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 4L), options=list(datatable.prettyprint.char = 4L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistic request before we merge -- use style test(n, options=, ...) to make options= clear up-front.

Can you also try going back to the simple output= approach of the tests? It will make it a lot cleaner to read. Hopefully the simpler test also passes on windows GHA, if not we can revert back to this functioning version.

Copy link
Member Author

@joshhwuu joshhwuu Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this once I get home later! I used this style as I found it easier to read the expected output of the test.

Copy link
Member Author

@joshhwuu joshhwuu Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first few tests, the simplified output= works well:

...
test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))
test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
...

But for the later tests (Once we have more columns/rows), I find that using this approach requires something like:

test(2253.07, options=list(datatable.prettyprint.char = 4L), DT, output="     V1\n1:    á\n2:   áá\n3:  ááá\n4: áááá")
# vs
test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c("", "áá", "ááá", "áááá"))

I'm not a huge fan of how the expected output looks, but LMK if you think this way is more readable and I can refactor the rest of the tests. For now, I'll commit the options syntax style change and using the simple approach with the first few tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I had something else in mind but it looks like the force push erased the history in the other branch. oh well. I won't belabor it any further :)

@tdhock
Copy link
Member

tdhock commented Apr 4, 2024

great thanks for the contribution. I invited you to https://github.com/orgs/Rdatatable/teams/project-members which gives you permission to create branches in this repo, so next time you make a PR, can you please create the branch in this repo? (instead of your fork)

@MichaelChirico MichaelChirico merged commit 420e60b into Rdatatable:master Apr 4, 2024
3 checks passed
@joshhwuu joshhwuu deleted the trunc branch April 4, 2024 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

truncate.char misbehaves for multibyte characters
3 participants