Add native UTF8 support for InputText and remove ImWchar buffer #7925

alektron · 2024-08-27T19:40:02Z

!! Please see the notes on the docking branch issue at the bottom !!

Overview

To implement common text editing operations, Dear ImGui uses a modified version of stb_textedit under the hood. stb_textedit currently only works for fixed size character encodings (e.g. ASCII or UTF16). Since Dear ImGui uses UTF8 for its API a lot of conversion between UTF8 and UTF16 are being done under the hood.
This PR is making the necessary adjustments to stb_textedit and Dear ImGui to support native UTF8 editing and get rid of the additional UTF16 buffer and conversions.

Details

CurLenW and the TextW buffer in ImGuiInputTextState got removed completely. TextA and CurLenA are still there (I also kept the 'A' postfix. I am not sure if there is any good reason to keep it). I didn't put too much thought into it yet but it may or may not be possible to even remove the TextA buffer and get rid of copying the user buffer entirely? Even if that is the case, it might be desirable to still keep it around for potential future implementations of the ImGuiInputTextFlags_NoLiveEdit flag
(InputText, InputInt, InputFloat: add ImGuiInputTextFlags_NoLiveEdit flag #701). Also note that the comments for CurLenA and TextA might not be entirely fitting anymore. I adjusted them a little bit but I am not sure how much of the information there is still true.
The changes to stb_textedit are for the most part based on the suggestions from this related issue. Some more changes were necessary. Including a change to stb_textedit_key that adds the decoded UTF8 character byte sequence to the parameters. This means we now have UTF8 specific code in stb_textedit.
Valid values can be passed for other encodings but it really only makes sense for UTF8. One could maybe think of a better, more encoding agnostic, solution but to be honest, I am not sure if that is worth the additional complexity. I guess it depends on how much we are willing to deviate from the original stb_textedit.
There was an outstanding FIXME-OPT about having to copy/convert the buffer to UTF16 for read-only InputText to allow selection/cursor usage. This should now not be an issue anymore and I removed it.
The test suite ran succesfully. Of course some tests had to be adjusted slightly since TextW doesn't exist anymore. I am not sure how many UTF8 specific edge cases are even tested so that may not be all too meaningful.
I did of course also do some manual testing and encountered no issues with mixes of European and Japanese glyphs.
Mouse clicking/dragging, arrow key navigation (in rows and between rows), arrow key whole word navigation, all seem to be working as expected. The callback demos in the imgui_demo work as well (except for some minor issues that have been there before e.g. an 'ü' can not be made uppercase by the barebones demo callback implementations)
Performance was not tested. As far as I can tell there isn't any more rescanning happening now than it was before. Here and there there is some forward/backwards scanning e.g. to check for multi byte sequences adjacent to the cursor but nothing crazy. If anything I'd expect a performance improvement due to all the conversions that are now gone.
There might be a lot more code that could be simplified now. I did not get too far into it. For now I would like to keep the changes manageable so they can be reviewed easier. But I do expect this to open the door for some more improvements.

NOTE: Docking branch

Due to an oversight on my end all changes were made to the docking branch instead of master.
A low level change like this would probably be prefered to be implemented in the master branch and then be merged into docking.
Pulling the changes into master and doing the necessary adjustments is not impossible but it's also not trivial. If necessary I would be willing to do it. For now the pull request is only for the docking branch. Some feedback on this would be appreciated.

ocornut · 2024-09-05T13:03:57Z

Thank you very much. This looks pretty good and I will try to review it shortly. I have already rebased the branch locally to look at it. It's been indeed the much needed first step toward a fuller refactor of InputText() so I appreciate it.

The truth is, rokups has done this work already but as part of a larger refactor.
I will need to look at both work. Yours here has the advantage that it is limited to that one step and therefore for me is an attractive merge. Rokas's one has the advantage of (I assume) more naturally fitting with his remaining changes. I'll figure things out.

ocornut · 2024-09-05T14:40:57Z

Here are some changes I made:

I introduced a stb_textedit_text() function. legacy stb_textedit_key() path can still call stb_textedit_text() but imgui calls this directly. Added equivalent ImGuiInputTextState::OnCharPressed().
tweaked InputTextCalcTextSize() so it doesn't need to call a function for ASCII. it's a potential debug-mode perf regression for large buffers. using similar code as ImFont::CalcTextSizeA() and added note that ideally should would be shared (no hurry with that).
I added a note that STB_TEXTEDIT_GETCHAR() was at this point more of a STB_TEXTEDIT_GETBYTE() and verified that it is used accordingly (ascii compares or block copy).

// With our UTF-8 use of stb_textedit:
// - STB_TEXTEDIT_GETCHAR is nothing more than a a "GETBYTE". It's only used to compare to ascii or to copy blocks of text so we are fine.
// - One exception is the STB_TEXTEDIT_IS_SPACE feature which would expect a full char in order to handle full-width space such as 0x3000 (see ImCharIsBlankW).
// - ...but we don't use that feature.

If while working on this (before or after) you can come up with any extra test to add to test suite please let me know! even if you don't want to add them yourself, i'll log every idea.
We can remove the A suffixes later after this is merged and settled.

This needs further works:

Couldn't IMSTB_TEXTEDIT_GETPREVCHARINDEX_IMPL() simply use ImTextFindPreviousUtf8Codepoint() ? much less code and likely faster.
STB_TEXTEDIT_MOVEWORDLEFT_IMPL, STB_TEXTEDIT_MOVEWORDRIGHT_MAC, STB_TEXTEDIT_MOVEWORDRIGHT_WIN as we don't handle utf-8 codepoints in there?
insert mode using STB_TEXTEDIT_DELETECHARS(str, state->cursor, 1); is likely broken?
the InputTextReconcileUndoStateAfterUserCallback() path is sufficiently obscure and rare that I believe we should design some tests for it.
rebase on master.

…tibyte characters. for ocornut/imgui#7925

…rl+arrow on multibyte characters. for ocornut/imgui#7925

ocornut · 2024-09-05T19:47:28Z

I think I have fixed all the things except InputTextReconcileUndoStateAfterUserCallback().
Would you mind reviewing my commits?

If for some reason you need to run or modify tests, the modified test is in a public branch: https://github.com/ocornut/imgui_test_engine/tree/features/input_text_7925

About InputTextReconcileUndoStateAfterUserCallback(), notice the comment:

// Find the shortest single replacement we can make to get the new text from the old text.
// Important: needs to be run before TextW is rewritten with the new characters because calling STB_TEXTEDIT_GETCHAR() at the end.
// FIXME: Ideally we should transition toward (1) making InsertChars()/DeleteChars() update undo-stack (2) discourage (and keep reconcile) or obsolete (and remove reconcile) accessing buffer directly.

The part about needing to run before TextW is rewritten is essentially broken now, for the good reason that we were using this buffer as an "old buffer". Solving this the same way would annoyingly requires us to make a copy of the full text before running the callback. aka possibly worse case every frame since the Always callback is technically allowed to modify character...

PLAN A

We only do a copy and handle reconcile on selected callbacks: Completion, History. Seems limiting.
We implement what is suggested in the comment aka make InsertChars()/DeleteChars() update undo stack themselves. And we introduce a slightly breaking feature where directly modifying the buffer will break reconcile.
Having no reconcile can literally break undo stack and crash: Editing the buffer in InputTextCallback desyncs the undo/redo stack #4947, Editing the buffer in InputTextCallback desyncs the undo/redo stack (Fix #4947) #4949. Actually it will only crash/mess-up with things if string len changed. So we could I suppose track if InsertChars/DeleteChars() have been used. If they have NOT been used and user set BufDirty = true AND BufTextLen changed, then we could decide to assert.

That seems all possible to implement. The slight annoyance is that it means multiple calls to e.g. InsertChars() would lead to multiple undo steps, so that's a little bit of a regression now. It feels nicer if that's not the case...

PLAN B

We introduce a new callback data e.g. BeginDirectBufEdit(). InsertChars/DeleteChar can call it, and we instruct people modifying text in callback to call this first instead of setting BufDirty=true. There we can do a copy.

PLAN C

We get lazy and always do a full copy + reconcile.

alektron · 2024-09-05T22:04:03Z

First of all, I am very glad my PR is helpful even if it ended up requiring more work on your end than I expected.
Unfortunately I will not be able to work on this until at least Monday. I will get back to you in more detail then.
Some quick notes:

Couldn't IMSTB_TEXTEDIT_GETPREVCHARINDEX_IMPL() simply use ImTextFindPreviousUtf8Codepoint() ? much less code and likely faster.

I was somehow thinking, I only want to handle correctly formed byte sequences when in reality ImTextFindPreviousUtf8Codepoint is not only much simpler, it also behaves better for ill formed sequences.

insert mode using STB_TEXTEDIT_DELETECHARS(str, state->cursor, 1); is likely broken?

Indeed, I must have missed that.

STB_TEXTEDIT_MOVEWORDLEFT_IMPL, STB_TEXTEDIT_MOVEWORDRIGHT_MAC, STB_TEXTEDIT_MOVEWORDRIGHT_WIN as we don't handle utf-8 codepoints in there?

I believe the functions do not technically have to handle that directly since ImTextFindPreviousUtf8Codepoint inside is_word_boundary_from_right/is_word_boundary_from_left will basically make it work even when the outer loop is walking through a multibyte sequence byte by byte. It will however do the same work multiple times per multibyteseq which is indeed a bit wasteful.

If while working on this (before or after) you can come up with any extra test to add to test suite please let me know! even if you don't want to add them yourself, i'll log every idea.

I'll see what I can do

For everything else I will have to get back to you next week, sorry.
Especially the whole InputTextReconcileUndoStateAfterUserCallback() topic. I already had some trouble understanding the purpose of it when working on this. I'll have to read into it in more detail first.

ocornut · 2024-09-09T15:22:50Z

I believe the functions do not technically have to handle that directly since ImTextFindPreviousUtf8Codepoint inside is_word_boundary_from_right/is_word_boundary_from_left will basically make it work even when the outer loop is walking through a multibyte sequence byte by byte. It will however do the same work multiple times per multibyteseq which is indeed a bit wasteful.

It definitively broke a few of the new test cases I added (basic testing of prev/next words on mulit-byte character).
Note that I forgot to actually push that change the other day, now it is pushed.

I have rebased the branch today following unrelated changes which conflicted dearimgui/dear_bindings#47 (comment)

alektron · 2024-09-10T17:37:46Z

Correct me if I'm wrong but we have been copying/converting on every frame before as well, right?
In here:

// Apply ASCII value
if (!is_readonly)
{
    state->TextAIsValid = true;
    state->TextA.resize(state->TextW.Size * 4 + 1);
    ImTextStrToUtf8(state->TextA.Data, state->TextA.Size, state->TextW.Data, NULL);
}

Not saying that we should not strive to be better but if we are not worse than before I'm almost tempted to go with plan C for now.

This could still be seen as a win:

The copying could be limited to InputTexts that use callbacks, which would at least spare us the copying for non-callback widgets, which might still be substantial.
Memory usage will stay similar for those cases but the code complexity (which may not even be that complex) would at least be very local at the callback handling site. Compared to the TextA/TextW conversions sprinkled everywhere in InputText() that we had before
Copying is now a memcpy instead of a conversion
ReadOnly InputText does not need any copying anymore (I think?)

That being said, being unsatisfied with the unnecessary copies is the main reason I started this PR, so I'm not too happy with that solution either.
However it would still be the path of least resistance and the least user code breaking change.
It would also allow the rest of the changes to settle and the issue could be independently picked up at a later time.

It definitively broke a few of the new test cases I added (basic testing of prev/next words on mulit-byte character).

You're right. It generally worked but some nuances in regard to whitespaces (should the cursor end up before or after the whitespace) were behaving differently.

ocornut · 2024-09-11T09:54:56Z

You are right that Plan C wouldn't be a regression and can always be a step to better. I will work on this for now. It's easy to implement but mostly I would like to implement some testing.

ocornut · 2024-09-11T10:17:23Z

ReadOnly InputText does not need any copying anymore (I think?)

Unfortunately it does because of this: (but only when there's a callback)
8a946b6

…text_callback_replace" with undo tests aimed to excercise reconcile code. for ocornut/imgui#7925

ocornut · 2024-09-11T11:10:34Z

Added new tests ocornut/imgui_test_engine@7341b8b.
Implemented Plan C a118e54.
While running tests discovered a new bug. I was puzzled because it only happened on the first test run, but it turns out that callback_buf / callback_data.Buf pointers may be be invalidated because of the resize:

if (callback_data.BufTextLen > backup_current_text_length && is_resizable)
   state->TextA.resize(state->TextA.Size + (callback_data.BufTextLen - backup_current_text_length)); // Worse case scenario resize
memcpy(state->TextA.Data, callback_data.Buf, callback_data.BufTextLen);

As it turns out none of this is necessary anymore! Since the callback ran over our buffer anyhow. We just need to update TextA.Size.

Fixed a crash in DebugNodeInputTextState(). I am not sure what your change was for but it is incorrect since unused records are not initialized.

Future improvements:

Implement Plan B (explicit edit requests for callback handlers not using DeleteChars()/InsertChars()).
DeleteChars()/InsertChars() have been using different code path because they always dealt with UTF-8 buffer, so that may also be simplified.
There's likely quite a few things to further clean up. e.g. renaming fields.

ocornut · 2024-09-11T12:28:17Z

Fixed infinite loop in clipboard paste filtering 1af5884
Optimized and simplified some search loops, now can use memchr/strchr 0020036

That's large optimization is extremely valuable, and couldn't be done before because we had to support ImWchar and they can be either 16 or 32 bytes.

Quick test pasting a ~900 KB file, VS2022 x64 Debug Mode:

BEFORE
large, top, ~3.65 ms
large, bottom ~4.0 ms
select all, top ~3.7 ms
select all, bottom, ~6.0 ms

BRANCH BEFORE LAST COMMIT (memchr/strchr)
large, top ~2.2 ms
large, bottom ~2.5 ms
select all, top ~2.2 ms
select all, bottom ~4.5 ms

BRANCH AFTER LAST COMMIT
large, top ~0.87 ms
large, bottom ~1.2 ms
select all, top, ~0.9 ms
select all, bottom, ~1.9 ms

This quite good.

I am pretty sure they are other good opportunities that will now become easier to see and take advantage of.
Among other, the multi-line renderer already established top and bottom visible lines, and this work is duplicated by the AddText() call.

… the wchar buffer. (ocornut#7925) WIP (requires subsequent commits for fixes)

It seems sensible to push this change in stb_textedit repo eventually.

…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (ocornut#7925)

+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().

…ixes character filtering. (ocornut#7925) Refer to imgui_test_suite for tests.

…uch faster! (ocornut#7925)

… the wchar buffer. (#7925) WIP (requires subsequent commits for fixes)

It seems sensible to push this change in stb_textedit repo eventually.

…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (#7925)

+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().

…ixes character filtering. (#7925) Refer to imgui_test_suite for tests.

ocornut · 2024-09-11T12:46:54Z

Merged into main branch. Thanks for your help!

colordiver · 2024-09-11T12:52:21Z

Hello!🌞
I read your messages on mail, when I start testing, need time for creating step-by-step, I n00b, need more time.

…chr() shaves 0.2 ms on 865k multi-line text case. Approximately 20%. (#7925)

ocornut · 2024-09-16T12:54:06Z

BRANCH AFTER LAST COMMIT
large, top ~0.87 ms
large, bottom ~1.2 ms
select all, top, ~0.9 ms
select all, bottom, ~1.9 ms

Pushed a small optimization which shaves a further 0.2 ms in debug mode on the large text case.
Also one in the inactive case (where we were doing a naive loop in InputTextCalcTextLenAndLineCount(), could have been optimized before this PR).

Among other, the multi-line renderer already established top and bottom visible lines, and this work is duplicated by the AddText() call.

This seems very possible and valuable but requires a little more work than expected so I am putting that aside.

…ultiline path. (#7925)

Broken by abd07f6

ocornut · 2024-09-17T19:07:38Z

Found another bug from Initial commit abd07f6

I luckily noticed it because it broke the memory editor which used callback_data.SelectionEnd.

Fixed by f7ba645!

ocornut added the inputtext label Sep 3, 2024

ocornut added this to the v1.92 milestone Sep 5, 2024

ocornut force-pushed the TextEditNativeUtf8_docking branch from ea012fe to b9a7042 Compare September 5, 2024 14:40

ocornut changed the base branch from docking to master September 5, 2024 14:43

ocornut force-pushed the TextEditNativeUtf8_docking branch 2 times, most recently from e91acc2 to 82ed98d Compare September 5, 2024 18:47

ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 5, 2024

TestSuite: amend "widgets_inputtext_cursor" to test ctrl+arrow on mul…

a9bf5be

…tibyte characters. for ocornut/imgui#7925

ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 5, 2024

TestSuite: amend "widgets_inputtext_cursor_prevnext_words" to test ct…

2303543

…rl+arrow on multibyte characters. for ocornut/imgui#7925

ocornut force-pushed the TextEditNativeUtf8_docking branch from 2dedbf2 to ce5f7ab Compare September 10, 2024 13:03

ocornut force-pushed the TextEditNativeUtf8_docking branch from ce5f7ab to a118e54 Compare September 11, 2024 11:08

ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 11, 2024

TestSuite: amend "widgets_inputtext_callback_misc" and "widgets_input…

7341b8b

…text_callback_replace" with undo tests aimed to excercise reconcile code. for ocornut/imgui#7925

alektron and others added 7 commits September 11, 2024 14:30

InputText: added native support for UTF-8 text editing and god rid of…

f37019e

… the wchar buffer. (ocornut#7925) WIP (requires subsequent commits for fixes)

InputText: amends: renames, coding style. (ocornut#7925)

3ca611e

InputText: amends: add stb_textedit_text() api. (ocornut#7925)

017c3a6

It seems sensible to push this change in stb_textedit repo eventually.

InputText: amends: add note about STB_TEXTEDIT_GETCHAR. tweak InputTe…

1e3fc01

…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (ocornut#7925)

InputText: amends: fixed next/prev word implementation. (ocornut#7925)

93d8b5d

+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().

InputText: amends: fixed undo-stack reconcile. fixed metrics crash. f…

6f39291

…ixes character filtering. (ocornut#7925) Refer to imgui_test_suite for tests.

InputText: amends: now can use memchr()/strchr() for line counting. M…

0008f3d

…uch faster! (ocornut#7925)

ocornut force-pushed the TextEditNativeUtf8_docking branch from 0020036 to 0008f3d Compare September 11, 2024 12:42

ocornut added the optimization label Sep 11, 2024

ocornut merged commit 8807b01 into ocornut:master Sep 11, 2024
6 checks passed

ocornut pushed a commit that referenced this pull request Sep 11, 2024

InputText: added native support for UTF-8 text editing and god rid of…

abd07f6

… the wchar buffer. (#7925) WIP (requires subsequent commits for fixes)

ocornut added a commit that referenced this pull request Sep 11, 2024

InputText: amends: renames, coding style. (#7925)

d1b7817

ocornut added a commit that referenced this pull request Sep 11, 2024

InputText: amends: add stb_textedit_text() api. (#7925)

1674fe9

It seems sensible to push this change in stb_textedit repo eventually.

ocornut added a commit that referenced this pull request Sep 11, 2024

InputText: amends: add note about STB_TEXTEDIT_GETCHAR. tweak InputTe…

e240bc1

…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (#7925)

ocornut added a commit that referenced this pull request Sep 11, 2024

InputText: amends: fixed next/prev word implementation. (#7925)

3d1e593

+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().

ocornut added a commit that referenced this pull request Sep 11, 2024

InputText: amends: fixed undo-stack reconcile. fixed metrics crash. f…

19accb1

…ixes character filtering. (#7925) Refer to imgui_test_suite for tests.

ocornut added a commit that referenced this pull request Sep 16, 2024

InputText: optimization for large text: using memchr() instead of str…

b53d91a

…chr() shaves 0.2 ms on 865k multi-line text case. Approximately 20%. (#7925)

ocornut added a commit that referenced this pull request Sep 16, 2024

InputText: optimize InputTextCalcTextLenAndLineCount() for inactive m…

4925695

…ultiline path. (#7925)

ocornut added a commit that referenced this pull request Sep 17, 2024

InputText: fixed not filling callback's SelectionEnd. (#7925)

f7ba645

Broken by abd07f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add native UTF8 support for InputText and remove ImWchar buffer #7925

Add native UTF8 support for InputText and remove ImWchar buffer #7925

alektron commented Aug 27, 2024 •

edited

Loading

ocornut commented Sep 5, 2024

ocornut commented Sep 5, 2024 •

edited

Loading

ocornut commented Sep 5, 2024

alektron commented Sep 5, 2024 •

edited

Loading

ocornut commented Sep 9, 2024

alektron commented Sep 10, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

colordiver commented Sep 11, 2024

ocornut commented Sep 16, 2024

ocornut commented Sep 17, 2024

Add native UTF8 support for InputText and remove ImWchar buffer #7925

Add native UTF8 support for InputText and remove ImWchar buffer #7925

Conversation

alektron commented Aug 27, 2024 • edited Loading

Overview

Details

NOTE: Docking branch

ocornut commented Sep 5, 2024

ocornut commented Sep 5, 2024 • edited Loading

ocornut commented Sep 5, 2024

alektron commented Sep 5, 2024 • edited Loading

ocornut commented Sep 9, 2024

alektron commented Sep 10, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

ocornut commented Sep 11, 2024

colordiver commented Sep 11, 2024

ocornut commented Sep 16, 2024

ocornut commented Sep 17, 2024

alektron commented Aug 27, 2024 •

edited

Loading

ocornut commented Sep 5, 2024 •

edited

Loading

alektron commented Sep 5, 2024 •

edited

Loading