Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

joshhwuu · 2024-03-13T06:45:36Z

The old version of trunc.char in data.table.print used nchar(x) internally which works fine in most cases except when dealing with combining characters such as "á" (The Latin "a" and accent) as nchar(x) recognizes two distinct readable characters. In this case using nchar(x, 'width') works as intended.

This PR changes trunc.char to recognize when the char is combining or full-width and indexes accordingly. Additionally, strtrim() is used now in place of substr() because as mentioned in ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").

inst/tests/tests.Rraw

NEWS.md

R/print.data.table.R

inst/tests/tests.Rraw

MichaelChirico · 2024-03-13T19:19:23Z

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

joshhwuu · 2024-03-13T20:46:46Z

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

By this do you mean in tests.Rraw I should replace the non-Latin characters and combining characters with Unicode escapes? Eg:

"á" -> "\u0061\u0301"

MichaelChirico · 2024-03-14T00:22:53Z

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

By this do you mean in tests.Rraw I should replace the non-Latin characters and combining characters with Unicode escapes? Eg:
"á" -> "\u0061\u0301"

Yes, I see other similar examples in tests.Rraw you can search for \u there.

R/print.data.table.R

inst/tests/tests.Rraw

R/print.data.table.R

inst/tests/tests.Rraw

MichaelChirico · 2024-03-14T07:17:53Z

Hmm still failing on windows. I'm not very familiar with how to get this working on windows, but take a look at the test failure output on Appveyor to see if it makes sense to you.

joshhwuu · 2024-03-16T11:48:13Z

It seems that the Appveyor test likes this format of test better:

test(2248.01, capture.output(print(DT))[2], paste("1:", strrep(accented_a, 4L)))

so instead of giving us "Test 2248.xx did not produce the correct result" the error becomes mismatched strings:

Test 2248.01 ran without errors but failed check that x equals y:
  > x = capture.output(print(DT))[2] 
  First 1 of 1 (type 'character'): 
  [1] "1: a�a�a�a�"
  > y = paste("1:", strrep(accented_a, 4L)) 
  First 1 of 1 (type 'character'): 
  [1] "1: a�a�a�a�"
  1 string mismatch

Maybe because with the Latin-1 charset some Windows machines have can't resolve the acute accent part of accented a: \u0301 it defaults to saying the strings are mismatched?

joshhwuu · 2024-03-19T02:04:58Z

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)
accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'
DT = data.table(strrep(accented_a, 4L))
print(DT)

# output:
#                  V1
#              <char>
# 1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

MichaelChirico · 2024-03-19T02:08:39Z

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.
options(datatable.prettyprint.char=8L)
accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'
DT = data.table(strrep(accented_a, 4L))
print(DT)

# output:
#                  V1
#              <char>
# 1: aÌ<81>aÌ<81>aÌ...
I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅

don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

joshhwuu · 2024-03-19T02:13:25Z

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.
options(datatable.prettyprint.char=8L)

accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'

DT = data.table(strrep(accented_a, 4L))

print(DT)

output:

V1

1: aÌ<81>aÌ<81>aÌ...
I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅

don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

Thanks so much! As I mentioned before, this was one of my first contributions to open source ever and you've made the experience super enjoyable and I really learned a lot!

codecov · 2024-03-19T07:43:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.53%. Comparing base (096b20f) to head (f2cc3b1).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5995   +/-   ##
=======================================
  Coverage   97.53%   97.53%           
=======================================
  Files          80       80           
  Lines       14916    14919    +3     
=======================================
+ Hits        14548    14551    +3     
  Misses        368      368

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

MichaelChirico · 2024-03-19T17:48:56Z

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)

accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'

DT = data.table(strrep(accented_a, 4L))

print(DT)

output:

V1

1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅
don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

Thanks so much! As I mentioned before, this was one of my first contributions to open source ever and you've made the experience super enjoyable and I really learned a lot!

HTH! It's quite an interesting first PR & I hope you're not deterred! Encoding remains one of the messier problems to try and solve. You might like to read 1 2 3 for some more advanced reading, I still refer to those whenever I need to handle something particularly complicated like here.

joshhwuu · 2024-04-04T07:50:58Z

Oops, looks like I accidentally closed while messing around with git. Not sure how to recover this but I don't mind remaking the changes with another PR, I think it should be a cleaner PR as well, I wanted to take another crack at passing the Windows build test. Do you think that is okay @MichaelChirico?

ben-schwen · 2024-04-04T08:28:54Z

Oops, looks like I accidentally closed while messing around with git. Not sure how to recover this but I don't mind remaking the changes with another PR, I think it should be a cleaner PR as well, I wanted to take another crack at passing the Windows build test. Do you think that is okay @MichaelChirico?

You can reopen the PR or open a new one (up to you). From a developing/educating point of view I would find it more useful to clean up this one since it also contains the history of things that did not work (lessons learned).
Regarding the amount of commits, we squash when merging so that's not a problem. Note that we also try to avoid force push because it messes up the collaboration with others on the same branch.

joshhwuu · 2024-04-04T08:32:00Z

Thanks! I'm not an expert with git so I'm going to avoid messing around with it anymore - so I think I'll open a new PR. Gives me a chance to work through the issue again to refresh my mind.

joshhwuu requested review from MichaelChirico and jangorecki as code owners March 13, 2024 06:45

MichaelChirico reviewed Mar 13, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 13, 2024

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 13, 2024

View reviewed changes

R/print.data.table.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 13, 2024

View reviewed changes

R/print.data.table.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 13, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

R/print.data.table.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

R/print.data.table.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico reviewed Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

joshhwuu commented Mar 14, 2024

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

MichaelChirico mentioned this pull request Mar 14, 2024

Run GHA on Windows, drop Appveyor test #5998

Closed

MichaelChirico mentioned this pull request Mar 17, 2024

Better test failure with non-ASCII characters #6007

Merged

joshhwuu closed this Mar 19, 2024

joshhwuu force-pushed the master branch from 1b61ddd to 54f9048 Compare March 19, 2024 07:33

joshhwuu reopened this Mar 19, 2024

joshhwuu closed this Apr 3, 2024

joshhwuu force-pushed the master branch from f2cc3b1 to 096b20f Compare April 3, 2024 04:09

joshhwuu mentioned this pull request Apr 4, 2024

Changed char.trunc to better handle combining and full-width multibyte characters #6048

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

joshhwuu commented Mar 13, 2024

MichaelChirico commented Mar 13, 2024

joshhwuu commented Mar 13, 2024

MichaelChirico commented Mar 14, 2024

MichaelChirico commented Mar 14, 2024

joshhwuu commented Mar 16, 2024 •

edited

Loading

joshhwuu commented Mar 19, 2024

MichaelChirico commented Mar 19, 2024

joshhwuu commented Mar 19, 2024

output:

V1

1: aÌ<81>aÌ<81>aÌ...

codecov bot commented Mar 19, 2024 •

edited

Loading

MichaelChirico commented Mar 19, 2024

output:

V1

1: aÌ<81>aÌ<81>aÌ...

joshhwuu commented Apr 4, 2024

ben-schwen commented Apr 4, 2024

joshhwuu commented Apr 4, 2024 •

edited

Loading

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

Conversation

joshhwuu commented Mar 13, 2024

MichaelChirico commented Mar 13, 2024

joshhwuu commented Mar 13, 2024

MichaelChirico commented Mar 14, 2024

MichaelChirico commented Mar 14, 2024

joshhwuu commented Mar 16, 2024 • edited Loading

joshhwuu commented Mar 19, 2024

MichaelChirico commented Mar 19, 2024

joshhwuu commented Mar 19, 2024

output:

V1

1: aÌ<81>aÌ<81>aÌ...

codecov bot commented Mar 19, 2024 • edited Loading

Codecov Report

MichaelChirico commented Mar 19, 2024

output:

V1

1: aÌ<81>aÌ<81>aÌ...

joshhwuu commented Apr 4, 2024

ben-schwen commented Apr 4, 2024

joshhwuu commented Apr 4, 2024 • edited Loading

joshhwuu commented Mar 16, 2024 •

edited

Loading

codecov bot commented Mar 19, 2024 •

edited

Loading

joshhwuu commented Apr 4, 2024 •

edited

Loading