Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native UTF8 support for InputText and remove ImWchar buffer #7925

Merged
merged 7 commits into from
Sep 11, 2024

Conversation

alektron
Copy link
Contributor

@alektron alektron commented Aug 27, 2024

!! Please see the notes on the docking branch issue at the bottom !!

Overview

To implement common text editing operations, Dear ImGui uses a modified version of stb_textedit under the hood. stb_textedit currently only works for fixed size character encodings (e.g. ASCII or UTF16). Since Dear ImGui uses UTF8 for its API a lot of conversion between UTF8 and UTF16 are being done under the hood.
This PR is making the necessary adjustments to stb_textedit and Dear ImGui to support native UTF8 editing and get rid of the additional UTF16 buffer and conversions.

Details

  • CurLenW and the TextW buffer in ImGuiInputTextState got removed completely. TextA and CurLenA are still there (I also kept the 'A' postfix. I am not sure if there is any good reason to keep it). I didn't put too much thought into it yet but it may or may not be possible to even remove the TextA buffer and get rid of copying the user buffer entirely? Even if that is the case, it might be desirable to still keep it around for potential future implementations of the ImGuiInputTextFlags_NoLiveEdit flag
    (InputText, InputInt, InputFloat: add ImGuiInputTextFlags_NoLiveEdit flag #701). Also note that the comments for CurLenA and TextA might not be entirely fitting anymore. I adjusted them a little bit but I am not sure how much of the information there is still true.

  • The changes to stb_textedit are for the most part based on the suggestions from this related issue. Some more changes were necessary. Including a change to stb_textedit_key that adds the decoded UTF8 character byte sequence to the parameters. This means we now have UTF8 specific code in stb_textedit.
    Valid values can be passed for other encodings but it really only makes sense for UTF8. One could maybe think of a better, more encoding agnostic, solution but to be honest, I am not sure if that is worth the additional complexity. I guess it depends on how much we are willing to deviate from the original stb_textedit.

  • There was an outstanding FIXME-OPT about having to copy/convert the buffer to UTF16 for read-only InputText to allow selection/cursor usage. This should now not be an issue anymore and I removed it.

  • The test suite ran succesfully. Of course some tests had to be adjusted slightly since TextW doesn't exist anymore. I am not sure how many UTF8 specific edge cases are even tested so that may not be all too meaningful.
    I did of course also do some manual testing and encountered no issues with mixes of European and Japanese glyphs.
    Mouse clicking/dragging, arrow key navigation (in rows and between rows), arrow key whole word navigation, all seem to be working as expected. The callback demos in the imgui_demo work as well (except for some minor issues that have been there before e.g. an 'ü' can not be made uppercase by the barebones demo callback implementations)

  • Performance was not tested. As far as I can tell there isn't any more rescanning happening now than it was before. Here and there there is some forward/backwards scanning e.g. to check for multi byte sequences adjacent to the cursor but nothing crazy. If anything I'd expect a performance improvement due to all the conversions that are now gone.

  • There might be a lot more code that could be simplified now. I did not get too far into it. For now I would like to keep the changes manageable so they can be reviewed easier. But I do expect this to open the door for some more improvements.

NOTE: Docking branch

Due to an oversight on my end all changes were made to the docking branch instead of master.
A low level change like this would probably be prefered to be implemented in the master branch and then be merged into docking.
Pulling the changes into master and doing the necessary adjustments is not impossible but it's also not trivial. If necessary I would be willing to do it. For now the pull request is only for the docking branch. Some feedback on this would be appreciated.

@ocornut
Copy link
Owner

ocornut commented Sep 5, 2024

Thank you very much. This looks pretty good and I will try to review it shortly. I have already rebased the branch locally to look at it. It's been indeed the much needed first step toward a fuller refactor of InputText() so I appreciate it.

The truth is, rokups has done this work already but as part of a larger refactor.
I will need to look at both work. Yours here has the advantage that it is limited to that one step and therefore for me is an attractive merge. Rokas's one has the advantage of (I assume) more naturally fitting with his remaining changes. I'll figure things out.

@ocornut ocornut added this to the v1.92 milestone Sep 5, 2024
@ocornut
Copy link
Owner

ocornut commented Sep 5, 2024

Here are some changes I made:

  • I introduced a stb_textedit_text() function. legacy stb_textedit_key() path can still call stb_textedit_text() but imgui calls this directly. Added equivalent ImGuiInputTextState::OnCharPressed().
  • tweaked InputTextCalcTextSize() so it doesn't need to call a function for ASCII. it's a potential debug-mode perf regression for large buffers. using similar code as ImFont::CalcTextSizeA() and added note that ideally should would be shared (no hurry with that).
  • I added a note that STB_TEXTEDIT_GETCHAR() was at this point more of a STB_TEXTEDIT_GETBYTE() and verified that it is used accordingly (ascii compares or block copy).
// With our UTF-8 use of stb_textedit:
// - STB_TEXTEDIT_GETCHAR is nothing more than a a "GETBYTE". It's only used to compare to ascii or to copy blocks of text so we are fine.
// - One exception is the STB_TEXTEDIT_IS_SPACE feature which would expect a full char in order to handle full-width space such as 0x3000 (see ImCharIsBlankW).
// - ...but we don't use that feature.
  • If while working on this (before or after) you can come up with any extra test to add to test suite please let me know! even if you don't want to add them yourself, i'll log every idea.
  • We can remove the A suffixes later after this is merged and settled.

This needs further works:

  • Couldn't IMSTB_TEXTEDIT_GETPREVCHARINDEX_IMPL() simply use ImTextFindPreviousUtf8Codepoint() ? much less code and likely faster.
  • STB_TEXTEDIT_MOVEWORDLEFT_IMPL, STB_TEXTEDIT_MOVEWORDRIGHT_MAC, STB_TEXTEDIT_MOVEWORDRIGHT_WIN as we don't handle utf-8 codepoints in there?
  • insert mode using STB_TEXTEDIT_DELETECHARS(str, state->cursor, 1); is likely broken?
  • the InputTextReconcileUndoStateAfterUserCallback() path is sufficiently obscure and rare that I believe we should design some tests for it.
  • rebase on master.

@ocornut ocornut changed the base branch from docking to master September 5, 2024 14:43
@ocornut ocornut force-pushed the TextEditNativeUtf8_docking branch 2 times, most recently from e91acc2 to 82ed98d Compare September 5, 2024 18:47
ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 5, 2024
ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 5, 2024
@ocornut
Copy link
Owner

ocornut commented Sep 5, 2024

I think I have fixed all the things except InputTextReconcileUndoStateAfterUserCallback().
Would you mind reviewing my commits?

If for some reason you need to run or modify tests, the modified test is in a public branch: https://github.com/ocornut/imgui_test_engine/tree/features/input_text_7925

About InputTextReconcileUndoStateAfterUserCallback(), notice the comment:

// Find the shortest single replacement we can make to get the new text from the old text.
// Important: needs to be run before TextW is rewritten with the new characters because calling STB_TEXTEDIT_GETCHAR() at the end.
// FIXME: Ideally we should transition toward (1) making InsertChars()/DeleteChars() update undo-stack (2) discourage (and keep reconcile) or obsolete (and remove reconcile) accessing buffer directly.

The part about needing to run before TextW is rewritten is essentially broken now, for the good reason that we were using this buffer as an "old buffer". Solving this the same way would annoyingly requires us to make a copy of the full text before running the callback. aka possibly worse case every frame since the Always callback is technically allowed to modify character...

PLAN A

That seems all possible to implement. The slight annoyance is that it means multiple calls to e.g. InsertChars() would lead to multiple undo steps, so that's a little bit of a regression now. It feels nicer if that's not the case...

PLAN B

  • We introduce a new callback data e.g. BeginDirectBufEdit(). InsertChars/DeleteChar can call it, and we instruct people modifying text in callback to call this first instead of setting BufDirty=true. There we can do a copy.

PLAN C

  • We get lazy and always do a full copy + reconcile.

@alektron
Copy link
Contributor Author

alektron commented Sep 5, 2024

First of all, I am very glad my PR is helpful even if it ended up requiring more work on your end than I expected.
Unfortunately I will not be able to work on this until at least Monday. I will get back to you in more detail then.
Some quick notes:

Couldn't IMSTB_TEXTEDIT_GETPREVCHARINDEX_IMPL() simply use ImTextFindPreviousUtf8Codepoint() ? much less code and likely faster.

I was somehow thinking, I only want to handle correctly formed byte sequences when in reality ImTextFindPreviousUtf8Codepoint is not only much simpler, it also behaves better for ill formed sequences.

insert mode using STB_TEXTEDIT_DELETECHARS(str, state->cursor, 1); is likely broken?

Indeed, I must have missed that.

STB_TEXTEDIT_MOVEWORDLEFT_IMPL, STB_TEXTEDIT_MOVEWORDRIGHT_MAC, STB_TEXTEDIT_MOVEWORDRIGHT_WIN as we don't handle utf-8 codepoints in there?

I believe the functions do not technically have to handle that directly since ImTextFindPreviousUtf8Codepoint inside is_word_boundary_from_right/is_word_boundary_from_left will basically make it work even when the outer loop is walking through a multibyte sequence byte by byte. It will however do the same work multiple times per multibyteseq which is indeed a bit wasteful.

If while working on this (before or after) you can come up with any extra test to add to test suite please let me know! even if you don't want to add them yourself, i'll log every idea.

I'll see what I can do

For everything else I will have to get back to you next week, sorry.
Especially the whole InputTextReconcileUndoStateAfterUserCallback() topic. I already had some trouble understanding the purpose of it when working on this. I'll have to read into it in more detail first.

@ocornut
Copy link
Owner

ocornut commented Sep 9, 2024

I believe the functions do not technically have to handle that directly since ImTextFindPreviousUtf8Codepoint inside is_word_boundary_from_right/is_word_boundary_from_left will basically make it work even when the outer loop is walking through a multibyte sequence byte by byte. It will however do the same work multiple times per multibyteseq which is indeed a bit wasteful.

It definitively broke a few of the new test cases I added (basic testing of prev/next words on mulit-byte character).
Note that I forgot to actually push that change the other day, now it is pushed.

I have rebased the branch today following unrelated changes which conflicted dearimgui/dear_bindings#47 (comment)

@alektron
Copy link
Contributor Author

Correct me if I'm wrong but we have been copying/converting on every frame before as well, right?
In here:

// Apply ASCII value
if (!is_readonly)
{
    state->TextAIsValid = true;
    state->TextA.resize(state->TextW.Size * 4 + 1);
    ImTextStrToUtf8(state->TextA.Data, state->TextA.Size, state->TextW.Data, NULL);
}

Not saying that we should not strive to be better but if we are not worse than before I'm almost tempted to go with plan C for now.

This could still be seen as a win:

  • The copying could be limited to InputTexts that use callbacks, which would at least spare us the copying for non-callback widgets, which might still be substantial.
  • Memory usage will stay similar for those cases but the code complexity (which may not even be that complex) would at least be very local at the callback handling site. Compared to the TextA/TextW conversions sprinkled everywhere in InputText() that we had before
  • Copying is now a memcpy instead of a conversion
  • ReadOnly InputText does not need any copying anymore (I think?)

That being said, being unsatisfied with the unnecessary copies is the main reason I started this PR, so I'm not too happy with that solution either.
However it would still be the path of least resistance and the least user code breaking change.
It would also allow the rest of the changes to settle and the issue could be independently picked up at a later time.


It definitively broke a few of the new test cases I added (basic testing of prev/next words on mulit-byte character).

You're right. It generally worked but some nuances in regard to whitespaces (should the cursor end up before or after the whitespace) were behaving differently.

@ocornut
Copy link
Owner

ocornut commented Sep 11, 2024

You are right that Plan C wouldn't be a regression and can always be a step to better. I will work on this for now. It's easy to implement but mostly I would like to implement some testing.

@ocornut
Copy link
Owner

ocornut commented Sep 11, 2024

ReadOnly InputText does not need any copying anymore (I think?)

Unfortunately it does because of this: (but only when there's a callback)
8a946b6

ocornut added a commit to ocornut/imgui_test_engine that referenced this pull request Sep 11, 2024
…text_callback_replace" with undo tests aimed to excercise reconcile code.

for ocornut/imgui#7925
@ocornut
Copy link
Owner

ocornut commented Sep 11, 2024

  • Added new tests ocornut/imgui_test_engine@7341b8b.
  • Implemented Plan C a118e54.
  • While running tests discovered a new bug. I was puzzled because it only happened on the first test run, but it turns out that callback_buf / callback_data.Buf pointers may be be invalidated because of the resize:
if (callback_data.BufTextLen > backup_current_text_length && is_resizable)
   state->TextA.resize(state->TextA.Size + (callback_data.BufTextLen - backup_current_text_length)); // Worse case scenario resize
memcpy(state->TextA.Data, callback_data.Buf, callback_data.BufTextLen);

As it turns out none of this is necessary anymore! Since the callback ran over our buffer anyhow. We just need to update TextA.Size.

  • Fixed a crash in DebugNodeInputTextState(). I am not sure what your change was for but it is incorrect since unused records are not initialized.

Future improvements:

  • Implement Plan B (explicit edit requests for callback handlers not using DeleteChars()/InsertChars()).
  • DeleteChars()/InsertChars() have been using different code path because they always dealt with UTF-8 buffer, so that may also be simplified.
  • There's likely quite a few things to further clean up. e.g. renaming fields.

@ocornut
Copy link
Owner

ocornut commented Sep 11, 2024

  • Fixed infinite loop in clipboard paste filtering 1af5884
  • Optimized and simplified some search loops, now can use memchr/strchr 0020036

That's large optimization is extremely valuable, and couldn't be done before because we had to support ImWchar and they can be either 16 or 32 bytes.

Quick test pasting a ~900 KB file, VS2022 x64 Debug Mode:

BEFORE
large, top, ~3.65 ms
large, bottom ~4.0 ms
select all, top ~3.7 ms
select all, bottom, ~6.0 ms

BRANCH BEFORE LAST COMMIT (memchr/strchr)
large, top ~2.2 ms
large, bottom ~2.5 ms
select all, top ~2.2 ms
select all, bottom ~4.5 ms

BRANCH AFTER LAST COMMIT
large, top ~0.87 ms
large, bottom ~1.2 ms
select all, top, ~0.9 ms
select all, bottom, ~1.9 ms

This quite good.

I am pretty sure they are other good opportunities that will now become easier to see and take advantage of.
Among other, the multi-line renderer already established top and bottom visible lines, and this work is duplicated by the AddText() call.

alektron and others added 7 commits September 11, 2024 14:30
… the wchar buffer. (ocornut#7925)

WIP (requires subsequent commits for fixes)
It seems sensible to push this change in stb_textedit repo eventually.
…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (ocornut#7925)
+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().
…ixes character filtering. (ocornut#7925)

Refer to imgui_test_suite for tests.
@ocornut ocornut merged commit 8807b01 into ocornut:master Sep 11, 2024
6 checks passed
ocornut pushed a commit that referenced this pull request Sep 11, 2024
… the wchar buffer. (#7925)

WIP (requires subsequent commits for fixes)
ocornut added a commit that referenced this pull request Sep 11, 2024
It seems sensible to push this change in stb_textedit repo eventually.
ocornut added a commit that referenced this pull request Sep 11, 2024
…xtCalcTextSize() to use similar debug-friendly logic as ImFont:CalcTextSizeA(). misc small tidying up. (#7925)
ocornut added a commit that referenced this pull request Sep 11, 2024
+ replace IMSTB_TEXTEDIT_GETPREVCHARINDEX code with ImTextFindPreviousUtf8Codepoint().
ocornut added a commit that referenced this pull request Sep 11, 2024
…ixes character filtering. (#7925)

Refer to imgui_test_suite for tests.
@ocornut
Copy link
Owner

ocornut commented Sep 11, 2024

Merged into main branch. Thanks for your help!

@colordiver
Copy link

Hello!🌞
I read your messages on mail, when I start testing, need time for creating step-by-step, I n00b, need more time.

ocornut added a commit that referenced this pull request Sep 16, 2024
…chr() shaves 0.2 ms on 865k multi-line text case. Approximately 20%. (#7925)
@ocornut
Copy link
Owner

ocornut commented Sep 16, 2024

BRANCH AFTER LAST COMMIT
large, top ~0.87 ms
large, bottom ~1.2 ms
select all, top, ~0.9 ms
select all, bottom, ~1.9 ms

Pushed a small optimization which shaves a further 0.2 ms in debug mode on the large text case.
Also one in the inactive case (where we were doing a naive loop in InputTextCalcTextLenAndLineCount(), could have been optimized before this PR).

Among other, the multi-line renderer already established top and bottom visible lines, and this work is duplicated by the AddText() call.

This seems very possible and valuable but requires a little more work than expected so I am putting that aside.

ocornut added a commit that referenced this pull request Sep 16, 2024
ocornut added a commit that referenced this pull request Sep 17, 2024
@ocornut
Copy link
Owner

ocornut commented Sep 17, 2024

Found another bug from Initial commit abd07f6
image

I luckily noticed it because it broke the memory editor which used callback_data.SelectionEnd.

Fixed by f7ba645!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants