truncate.char misbehaves for multibyte characters #5096

MichaelChirico · 2021-08-06T02:09:26Z

This is a follow-up to the comments in #3414

Basically print.data.table internals currently use nchar() for printing in a way that won't work well with multibyte characters. it should be as simple as using nchar(., type='width') in the internal truncate.char and then writing tests.

joshhwuu · 2024-03-07T01:58:04Z

Hi! I'm interested in working on this issue - may I be assigned? Thanks!

joshhwuu · 2024-03-07T22:54:14Z

Hi @MichaelChirico, I was wondering if there is an MRE you had for this issue. I noticed that copying multibyte characters such as Chinese or Kanji works fine with print, eg:

DT <- data.table('一', '二', '三', '四', '五')
print(DT)

output:
V1 V2 V3 V4 V5
1: 一二三四五

Which seems correct to me. Another case I tried was trying to print data tables with unicode:

dt <- data.table('\x65')
print(dt)

output:
V1
1: e

and

#(x100 == U+0100, which is the latin capital A with Macron: Ā)
dt <- data.table('\x100')
print(dt)

output:
V1
1: \0200

which obviously doesn't look right, but I'm not 100% sure if this is the case that the issue is intended to fix.

MichaelChirico · 2024-03-07T23:03:05Z

this constraint needs to be binding for the multi byte issue to bind:

data.table/R/print.data.table.R

Line 235 in 815e054

idx = which(nchar(x) > trunc.char)

joshhwuu · 2024-03-10T23:09:24Z

@MichaelChirico sorry for the pings, it's my first time contributing so I want to make sure my understanding is correct. I experimented with this but I can't seem to find many examples where type='width' will produce a better output than the default 'char'.

`type='width'` concatenates ". . ." with multibyte characters when truncating.

A few cases I tried with type='width' were:

options(datatable.prettyprint.char=1L)
DT = data.table("一")
print(DT)
#        V1
#    <char>
# 1:  一...

options(datatable.prettyprint.char=3L)
DT = data.table("一一")
print(DT)
#         V1
#     <char>
# 1: 一一...

options(datatable.prettyprint.char=4L)
DT = data.table("一一")
print(DT)
#        V1
#    <char>
# 1:   一一

This doesn't look right to me, and I think it's because when it comes to double-wide multibyte characters such as Chinese or Kanji, nchar() with type='width' returns 2 whereas the default type='char' returns 1. And because of the constraint mentioned here:

data.table/R/print.data.table.R

Line 235 in 815e054

idx = which(nchar(x) > trunc.char)

we incorrectly return the character vector with ". . ." concatenated to the end unless we have datatable.prettyprint.char >= 2 * nchar(x).

However I found one niche case where type='width' is better, when we use combining characters such as "á" with the Latin "a" and the combining acute accent.

Printing combining characters with character truncating.

options(datatable.prettyprint.char=1L)
DT = data.table("á")
print(DT)
## output with type='char'
#        V1
#    <char>
# 1:   a...
## output with type='width'
#        V1
#    <char>
# 1:      á

I believe this is because nchar() with type='char' counts "á" as 2 distinct characters and 'width' correctly approximates it to only have a width of 1.

Please advise if I am understanding the problem wrong, thanks!

MichaelChirico · 2024-03-11T02:06:03Z

Thanks for the investigation! I confirm your findings and agree that it may not be so simple as just changing to type='width'. substr() is also involved; note this from ?substr:

These functions are often used with nchar to truncate a display. That does not really work (you want to limit the width, not the number of characters, so it would be better to use strtrim), but at least make sure you use the default nchar(type = "c").

PS, it seems options(datatable.prettyprint.char) is completely undocumented, so we should address that too.

joshhwuu · 2024-03-12T23:55:59Z

I'm currently looking into some solutions for the failing combining characters case, and I came up with a relatively simple solution:

char.trunc = function(x, trunc.char = getOption("datatable.prettyprint.char")) {
  trunc.char = max(0L, suppressWarnings(as.integer(trunc.char[1L])), na.rm=TRUE)
  if (!is.character(x) || trunc.char <= 0L) return(x)
  widthSize = nchar(x, 'width')
  charSize = nchar(x)
  isFull = widthSize > charSize
  idx = which(pmin(widthSize, charSize) > trunc.char)
  x[idx] = paste0(strtrim(x[idx], as.integer(ifelse(isFull, trunc.char * 2, trunc.char))), "...")
  x
}

I ran this against some basic tests and I'm quite confident that this covers the combined, full-width, and half-width character cases correctly. However, there is an issue with printing strings that consist of a mixture of combined characters and other characters, because both nchar(x) and nchar(x, 'width') wouldn't give the correct index to start our truncation at. This issue exists in the current version of char.trunc as well. The only way I can think of solving this problem is by traversing character by character, decrementing appropriately for a full-width, half-width, or combined character, which would be a lot slower but would allow for us to correctly truncate strings such as "á一｣". I'm wondering if it is worth sacrificing the performance to be able to cover this niche case or even consider adding an option to use either the traversal method or the method I proposed above.

If the proposed solution is viable, I was wondering if I can create a PR for this issue. As for the missing documentation, I noticed that as well when I was working on my repro, and I would love to help with adding that as well! I'm just not 100% sure if I should work on both issues in the same PR or not.

MichaelChirico · 2024-03-13T01:25:50Z

Thanks for your thorough investigation! Please proceed to a PR.

I would love to help with adding that as well!

Let's try and keep one issue per PR. The documentation issue is now #5994.

Performance shouldn't be much of a concern, since print.data.table() typically only displays a few rows of data --> very few things are computationally intensive on an absolute scale. However, it seems nobody has run into this issue yet, so I am fine skipping it for now -- maybe just note in a comment about the potential issue/proposed solution for future readers.

MichaelChirico added beginner-task print labels Aug 6, 2021

MichaelChirico assigned joshhwuu Mar 7, 2024

MichaelChirico mentioned this issue Mar 13, 2024

options(datatable.prettyprint.char) is not documented #5994

Closed

joshhwuu mentioned this issue Mar 13, 2024

Changed trunc.char to better handle combining chars and full-width multibyte chars #5995

Closed

joshhwuu mentioned this issue Apr 4, 2024

Changed char.trunc to better handle combining and full-width multibyte characters #6048

Merged

4 tasks

MichaelChirico closed this as completed in #6048 Apr 4, 2024

joshhwuu mentioned this issue May 22, 2024

non-UTF8 characters fail to parse on Windows #6141

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

truncate.char misbehaves for multibyte characters #5096

truncate.char misbehaves for multibyte characters #5096

MichaelChirico commented Aug 6, 2021

joshhwuu commented Mar 7, 2024

joshhwuu commented Mar 7, 2024

MichaelChirico commented Mar 7, 2024

joshhwuu commented Mar 10, 2024 •

edited by MichaelChirico

Loading

MichaelChirico commented Mar 11, 2024

joshhwuu commented Mar 12, 2024

MichaelChirico commented Mar 13, 2024

truncate.char misbehaves for multibyte characters #5096

truncate.char misbehaves for multibyte characters #5096

Comments

MichaelChirico commented Aug 6, 2021

joshhwuu commented Mar 7, 2024

joshhwuu commented Mar 7, 2024

MichaelChirico commented Mar 7, 2024

joshhwuu commented Mar 10, 2024 • edited by MichaelChirico Loading

type='width' concatenates ". . ." with multibyte characters when truncating.

Printing combining characters with character truncating.

MichaelChirico commented Mar 11, 2024

joshhwuu commented Mar 12, 2024

MichaelChirico commented Mar 13, 2024

joshhwuu commented Mar 10, 2024 •

edited by MichaelChirico

Loading

`type='width'` concatenates ". . ." with multibyte characters when truncating.