Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

truncate.char misbehaves for multibyte characters #5096

Closed
MichaelChirico opened this issue Aug 6, 2021 · 7 comments · Fixed by #6048
Closed

truncate.char misbehaves for multibyte characters #5096

MichaelChirico opened this issue Aug 6, 2021 · 7 comments · Fixed by #6048

Comments

@MichaelChirico
Copy link
Member

This is a follow-up to the comments in #3414

Basically print.data.table internals currently use nchar() for printing in a way that won't work well with multibyte characters. it should be as simple as using nchar(., type='width') in the internal truncate.char and then writing tests.

@joshhwuu
Copy link
Member

joshhwuu commented Mar 7, 2024

Hi! I'm interested in working on this issue - may I be assigned? Thanks!

@joshhwuu
Copy link
Member

joshhwuu commented Mar 7, 2024

Hi @MichaelChirico, I was wondering if there is an MRE you had for this issue. I noticed that copying multibyte characters such as Chinese or Kanji works fine with print, eg:

DT <- data.table('一', '二', '三', '四', '五')
print(DT)

output:
V1 V2 V3 V4 V5
1: 一 二 三 四 五

Which seems correct to me. Another case I tried was trying to print data tables with unicode:

dt <- data.table('\x65')
print(dt)

output:
V1
1: e

and

#(x100 == U+0100, which is the latin capital A with Macron: Ā)
dt <- data.table('\x100')
print(dt)

output:
V1
1: \0200

which obviously doesn't look right, but I'm not 100% sure if this is the case that the issue is intended to fix.

@MichaelChirico
Copy link
Member Author

this constraint needs to be binding for the multi byte issue to bind:

idx = which(nchar(x) > trunc.char)

@joshhwuu
Copy link
Member

joshhwuu commented Mar 10, 2024

@MichaelChirico sorry for the pings, it's my first time contributing so I want to make sure my understanding is correct. I experimented with this but I can't seem to find many examples where type='width' will produce a better output than the default 'char'.

type='width' concatenates ". . ." with multibyte characters when truncating.

A few cases I tried with type='width' were:

options(datatable.prettyprint.char=1L)
DT = data.table("")
print(DT)
#        V1
#    <char>
# 1:  一...

options(datatable.prettyprint.char=3L)
DT = data.table("一一")
print(DT)
#         V1
#     <char>
# 1: 一一...

options(datatable.prettyprint.char=4L)
DT = data.table("一一")
print(DT)
#        V1
#    <char>
# 1:   一一

This doesn't look right to me, and I think it's because when it comes to double-wide multibyte characters such as Chinese or Kanji, nchar() with type='width' returns 2 whereas the default type='char' returns 1. And because of the constraint mentioned here:

idx = which(nchar(x) > trunc.char)

we incorrectly return the character vector with ". . ." concatenated to the end unless we have datatable.prettyprint.char >= 2 * nchar(x).

However I found one niche case where type='width' is better, when we use combining characters such as "á" with the Latin "a" and the combining acute accent.

Printing combining characters with character truncating.

options(datatable.prettyprint.char=1L)
DT = data.table("")
print(DT)
## output with type='char'
#        V1
#    <char>
# 1:   a...
## output with type='width'
#        V1
#    <char>
# 1:      á

I believe this is because nchar() with type='char' counts "á" as 2 distinct characters and 'width' correctly approximates it to only have a width of 1.

Please advise if I am understanding the problem wrong, thanks!

@MichaelChirico
Copy link
Member Author

Thanks for the investigation! I confirm your findings and agree that it may not be so simple as just changing to type='width'. substr() is also involved; note this from ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").

PS, it seems options(datatable.prettyprint.char) is completely undocumented, so we should address that too.

@joshhwuu
Copy link
Member

I'm currently looking into some solutions for the failing combining characters case, and I came up with a relatively simple solution:

char.trunc = function(x, trunc.char = getOption("datatable.prettyprint.char")) {
  trunc.char = max(0L, suppressWarnings(as.integer(trunc.char[1L])), na.rm=TRUE)
  if (!is.character(x) || trunc.char <= 0L) return(x)
  widthSize = nchar(x, 'width')
  charSize = nchar(x)
  isFull = widthSize > charSize
  idx = which(pmin(widthSize, charSize) > trunc.char)
  x[idx] = paste0(strtrim(x[idx], as.integer(ifelse(isFull, trunc.char * 2, trunc.char))), "...")
  x
}

I ran this against some basic tests and I'm quite confident that this covers the combined, full-width, and half-width character cases correctly. However, there is an issue with printing strings that consist of a mixture of combined characters and other characters, because both nchar(x) and nchar(x, 'width') wouldn't give the correct index to start our truncation at. This issue exists in the current version of char.trunc as well. The only way I can think of solving this problem is by traversing character by character, decrementing appropriately for a full-width, half-width, or combined character, which would be a lot slower but would allow for us to correctly truncate strings such as "á一」". I'm wondering if it is worth sacrificing the performance to be able to cover this niche case or even consider adding an option to use either the traversal method or the method I proposed above.

If the proposed solution is viable, I was wondering if I can create a PR for this issue. As for the missing documentation, I noticed that as well when I was working on my repro, and I would love to help with adding that as well! I'm just not 100% sure if I should work on both issues in the same PR or not.

@MichaelChirico
Copy link
Member Author

Thanks for your thorough investigation! Please proceed to a PR.

I would love to help with adding that as well!

Let's try and keep one issue per PR. The documentation issue is now #5994.

Performance shouldn't be much of a concern, since print.data.table() typically only displays a few rows of data --> very few things are computationally intensive on an absolute scale. However, it seems nobody has run into this issue yet, so I am fine skipping it for now -- maybe just note in a comment about the potential issue/proposed solution for future readers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment