Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

Closed
wants to merge 0 commits into from

Conversation

joshhwuu
Copy link
Member

Closes #5096

The old version of trunc.char in data.table.print used nchar(x) internally which works fine in most cases except when dealing with combining characters such as "á" (The Latin "a" and accent) as nchar(x) recognizes two distinct readable characters. In this case using nchar(x, 'width') works as intended.

This PR changes trunc.char to recognize when the char is combining or full-width and indexes accordingly. Additionally, strtrim() is used now in place of substr() because as mentioned in ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").

NEWS.md Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

@joshhwuu
Copy link
Member Author

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

By this do you mean in tests.Rraw I should replace the non-Latin characters and combining characters with Unicode escapes? Eg:

"" -> "\u0061\u0301"

@MichaelChirico
Copy link
Member

The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead.

By this do you mean in tests.Rraw I should replace the non-Latin characters and combining characters with Unicode escapes? Eg:

"" -> "\u0061\u0301"

Yes, I see other similar examples in tests.Rraw you can search for \u there.

inst/tests/tests.Rraw Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

Hmm still failing on windows. I'm not very familiar with how to get this working on windows, but take a look at the test failure output on Appveyor to see if it makes sense to you.

@joshhwuu
Copy link
Member Author

joshhwuu commented Mar 16, 2024

It seems that the Appveyor test likes this format of test better:

test(2248.01, capture.output(print(DT))[2], paste("1:", strrep(accented_a, 4L)))

so instead of giving us "Test 2248.xx did not produce the correct result" the error becomes mismatched strings:

Test 2248.01 ran without errors but failed check that x equals y:
  > x = capture.output(print(DT))[2] 
  First 1 of 1 (type 'character'): 
  [1] "1: a�a�a�a�"
  > y = paste("1:", strrep(accented_a, 4L)) 
  First 1 of 1 (type 'character'): 
  [1] "1: a�a�a�a�"
  1 string mismatch

Maybe because with the Latin-1 charset some Windows machines have can't resolve the acute accent part of accented a: \u0301 it defaults to saying the strings are mismatched?

@joshhwuu
Copy link
Member Author

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)
accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'
DT = data.table(strrep(accented_a, 4L))
print(DT)

# output:
#                  V1
#              <char>
# 1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

@MichaelChirico
Copy link
Member

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)
accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'
DT = data.table(strrep(accented_a, 4L))
print(DT)

# output:
#                  V1
#              <char>
# 1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅

don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

@joshhwuu
Copy link
Member Author

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)

accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'

DT = data.table(strrep(accented_a, 4L))

print(DT)

output:

V1

1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅

don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

Thanks so much! As I mentioned before, this was one of my first contributions to open source ever and you've made the experience super enjoyable and I really learned a lot!

Copy link

codecov bot commented Mar 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.53%. Comparing base (096b20f) to head (f2cc3b1).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5995   +/-   ##
=======================================
  Coverage   97.53%   97.53%           
=======================================
  Files          80       80           
  Lines       14916    14919    +3     
=======================================
+ Hits        14548    14551    +3     
  Misses        368      368           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichaelChirico
Copy link
Member

@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to trunc.char incorrectly. For example, when it tests a line such as:

áááá, nchar() = 4

it passes instead:

a�a�a�a�, nchar()=8

and causes issues with the truncation itself. I can't for sure tell if this is a case that trunc.char should address because from what I can tell even the current version of trunc.char behaves funny when dealing with certain UTF-8 characters encoded in Latin-1. The current version also doesn't test the truncation of any non-Latin-1 characters.

options(datatable.prettyprint.char=8L)

accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'

DT = data.table(strrep(accented_a, 4L))

print(DT)

output:

V1

1: aÌ<81>aÌ<81>aÌ...

I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings.

PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with data.table and I want to work with some more areas of the project. Thanks!

thanks for your efforts & sorry the testing is turning out to be painful 😅
don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time.

Thanks so much! As I mentioned before, this was one of my first contributions to open source ever and you've made the experience super enjoyable and I really learned a lot!

HTH! It's quite an interesting first PR & I hope you're not deterred! Encoding remains one of the messier problems to try and solve. You might like to read 1 2 3 for some more advanced reading, I still refer to those whenever I need to handle something particularly complicated like here.

@joshhwuu
Copy link
Member Author

joshhwuu commented Apr 4, 2024

Oops, looks like I accidentally closed while messing around with git. Not sure how to recover this but I don't mind remaking the changes with another PR, I think it should be a cleaner PR as well, I wanted to take another crack at passing the Windows build test. Do you think that is okay @MichaelChirico?

@ben-schwen
Copy link
Member

Oops, looks like I accidentally closed while messing around with git. Not sure how to recover this but I don't mind remaking the changes with another PR, I think it should be a cleaner PR as well, I wanted to take another crack at passing the Windows build test. Do you think that is okay @MichaelChirico?

You can reopen the PR or open a new one (up to you). From a developing/educating point of view I would find it more useful to clean up this one since it also contains the history of things that did not work (lessons learned).
Regarding the amount of commits, we squash when merging so that's not a problem. Note that we also try to avoid force push because it messes up the collaboration with others on the same branch.

@joshhwuu
Copy link
Member Author

joshhwuu commented Apr 4, 2024

Thanks! I'm not an expert with git so I'm going to avoid messing around with it anymore - so I think I'll open a new PR. Gives me a chance to work through the issue again to refresh my mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

truncate.char misbehaves for multibyte characters
3 participants