Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some cleanup, fixing, and enhancements to TextFormatter #3275

Closed
dodexahedron opened this issue Feb 27, 2024 · 41 comments
Closed

Some cleanup, fixing, and enhancements to TextFormatter #3275

dodexahedron opened this issue Feb 27, 2024 · 41 comments
Assignees
Labels
bug enhancement testing Issues related to testing v2 For discussions, issues, etc... relavant for v2 work-in-progress
Milestone

Comments

@dodexahedron
Copy link
Collaborator

While doing a cleanup and performance pass on TextFormatter, the loops in the FindHotKey method made go hmmm.

They're basically Rune-ified IndexOf operations, looking for the hotkey specifier Rune and then the following Rune as the hotkey, so long as it isn't FFFD.

But the state machine doesn't account for the character following the hotkey specifier being FFFD.

When that happens, the state machine keeps running until it finds the next non-specifier and non-FFFD Rune and calls that the hotkey. That's of course not expected, but what makes it even more dangerous is that the index provided in the hotPos out parameter will be the index of the specifier, as with normal cases, but the following index is not the hotkey that it chose.

The naive fix is simple: Just have it reset curHotPos to -1 each time it finds Rune.ReplacementChar.

I've added several test cases that cover various sequences containing the hotkey specifier in at least one position and FFFD in at least one position.

I also see some simple ways to squeeze almost free performance out of that method, so I'm doing that as well (this was in the middle of performance work, after all). Mostly some span stuff and avoiding as many loop iterations as I can figure out without spending too much time on just that method.

@dodexahedron dodexahedron added bug enhancement testing Issues related to testing v2 For discussions, issues, etc... relavant for v2 labels Feb 27, 2024
@dodexahedron dodexahedron added this to the v2 milestone Feb 27, 2024
@dodexahedron dodexahedron self-assigned this Feb 27, 2024
@dodexahedron
Copy link
Collaborator Author

So...

As I'm analyzing this, I'm noticing there are some invalid test cases (which I think was probably already known, based on comments around those tests).

In particular, the cases that include codepoint 0xE0B0 are incorrect.

Those shouldn't be returning as symbols, because that codepoint is in a private use area. If you feed that value to Rune.IsSymbol(thatRune), you'll get false, and if you feed it to Rune.GetUnicodeCategory(thatRune), you'll get UnicodeCategory.PrivateUse.

Those test cases should expect a false result, and we should probably have tests that include codepoints from each category, as well as combinations, both for scalars and for surrogate pairs, to ensure these methods are actually behaving properly. I'll at least be sure to fix the cases or the code, as appropriate. For that situation, it's both code and expected test result that need fixing.

I have what appears, so far, to be at least a more "correct" implementation (MUCH faster, too, for the majority of likely typical cases). I'll describe it in a follow-up comment here.

@dodexahedron
Copy link
Collaborator Author

The new implementation of the FindHotKey method that I have is partially complete, in that it works seemingly correctly for BMP values, but still needs me to make the fixes to non-BMP cases.

The only loop that has been addressed, so far, is the first one in the method.

The first thing it does is checks if it can simply do an ordinal char scan of the string or if it has to resort to using the Rune enumerator, by checking if the hotkey specifier Rune is in the BMP. If it is, we can do an ordinal scan for that character, even if the string also contains surrogate pairs (that particular point is made in the Microsoft Learn Supplementary API Remarks docs for working with strings and makes sense based on how it's all defined anyway.

Anyway...

  1. If specifier is in the BMP, proceed to 1a, performing the loop beginning at index 0 of the string and ending one character before the end of the string (since it has to be a pair of chars to be valid):
    1a. If at end of the loop, proceed with 2. Otherwise, find IndexOf the specifier, beginning at index, and set index to the resulting value. Proceed to 1b.
    1b. If index was < 0 (not found) or we would go out of bounds, proceed to 1b1. Else proceed to 1c.
    1b1. Set curHotPos to index. Terminate loop and proceed with 2.
    1c. Get the rune that starts at index + 1 as potentialHotKeyRune. Proceed to 1d.
    1d. If all of the following are true of potentialHotKeyRune, proceed to 1e. Else increment index and proceed from 1a.
    • Is not FFFD
    • Is not same as specifier
    • Is in the BMP
    • Is a letter, digit, symbol, or punctuation
      1e. Set curHotKey to that rune and curHotPos to index. Terminate loop and proceed with 2.
  2. If either of the following are true, perform the currently existing full unicode loop. Otherwise, skip that and just proceed with the rest of the method, as we've done what needs to be done up to that point:
  • specifier is not in the BMP
  • curHotPos > 0 and the rune that starts at curHotPos is not in the BMP (meaning either the previous loop didn't run (specifier not in BMP) or it did run but found a non-BMP rune after the specifier and bailed to enable fallback to this method).

That last step, for handling cases of either non-BMP specifier or when the specifier is BMP but the character following it is not, is currently still the original implementation, which I need to fix as described earlier. Just haven't gotten to that point yet.

When the first loop can run, this is dozens of times faster, in debug builds, and almost 400x faster in release builds, even before addressing the other loops that may need to run.

@dodexahedron
Copy link
Collaborator Author

Ugh. Sorry the formatting didn't come out like I wanted with that. Some things didn't get indented which should have and some got indented which shouldn't have.

Here is the code for the first loop, which is the main new part. Note most of the ifs are inverted to avoid nesting:

ReadOnlySpan<char> stringSpan = text.AsSpan ();
if (hotKeySpecifier.IsBmp)
{
  for(int index = 0;index < stringSpan.Length; index++)
  {
    index = text.IndexOf ((char)hotKeySpecifier.Value, index);

    if (index < 0 || index + 1 >= text.Length)
    {
      curHotPos = index;

      break;
    }

    if (!Rune.TryGetRuneAt (text, index + 1, out Rune potentialHotKeyRune))
    {
      continue;
    }

    if (potentialHotKeyRune == Rune.ReplacementChar
      || potentialHotKeyRune == hotKeySpecifier
      || !potentialHotKeyRune.IsBmp
      || (!Rune.IsLetterOrDigit (potentialHotKeyRune)
        && !Rune.IsPunctuation (potentialHotKeyRune)
        && !Rune.IsSymbol (potentialHotKeyRune)))
    {
      continue;
    }

    curHotKey = Rune.GetRuneAt (text, index + 1);
    curHotPos = index;

    break;
  }
}

@dodexahedron
Copy link
Collaborator Author

And here is the slightly modified second part, which is the fallback to full unicode, with the index reset fix applied:

if (!hotKeySpecifier.IsBmp || (curHotPos >= 0 && !Rune.GetRuneAt (text,curHotPos).IsBmp))
{
  int index = 0;
  foreach (Rune c in stringSpan.EnumerateRunes ())
  {
    if (c == Rune.ReplacementChar)
    {
      curHotPos = -1;
    }
    else if (c == hotKeySpecifier)
    {
      curHotPos = index;
    }
    else if (curHotPos > -1)
    {
      curHotKey = c;

      break;
    }

    index++;
  }
}

I'm not done with that, yet, so it may still have issues.

Note that using either or both of these with the current tests will fail on a few of them, so it's not worth bothering with that unless you're just curious. This is just so y'all can see what I'm thinking with this in case any of it raises any red flags.

@dodexahedron
Copy link
Collaborator Author

I'm totally going to benchmark this against the TextElementEnumerator, too, now that we have a cross-platform guarantee of consistent and correct behavior with it (which we didn't have when supporting .net Framework or .net < 5.0, making it pretty much useless before).

I would almost put money on it outperforming the Rune implementation and being more correct, implicitly, since it handles all code points, including surrogates, in multiple encodings, out of the box. And with the way there was SIMD love given to a lot of text-related stuff in .net 8, I even bet it might give my little BMP semi-optimization a run for its money, too.

@dodexahedron
Copy link
Collaborator Author

Also, about TextElementEnumerator and the StringInfo class...

It uses ICU, now, which means it uses the current "gold standard" unicode implementation, and will evolve with Unicode, unlike certain methods and properties of char and Rune which may be incorrect, incomplete, non-standard, or not agree with each other, even for the same codepoints (yes, even still in .net 8).

@BDisp
Copy link
Collaborator

BDisp commented Feb 28, 2024

That sounds interesting and I'm curious.

@tig
Copy link
Collaborator

tig commented Feb 28, 2024

TL;DR.

However, i scanned it and <3.

That said, it's dorky that HotKeySpecifier is a Rune. The only valid values are those typable on a keyboard with a single key press.

IMO, HotKeySpecifier should be Key.

@BDisp
Copy link
Collaborator

BDisp commented Feb 28, 2024

At the time I remember that I warned about this and made some changes in v1 to not use FFFD. I'm glad you're moving to Key. Thanks.

@dodexahedron
Copy link
Collaborator Author

dodexahedron commented Feb 28, 2024

TL;DR.

However, i scanned it and <3.

That said, it's dorky that HotKeySpecifier is a Rune. The only valid values are those typable on a keyboard with a single key press.

IMO, HotKeySpecifier should be Key.

Agreed.

In that case, the whole thing can just operate on a Span<char>.

Could be a tangible speedup on first load of something, especially if there are a ton of views.

@dodexahedron
Copy link
Collaborator Author

dodexahedron commented Feb 29, 2024

Ok, so...

With that restriction, what constitutes a match only has to satisfy the following conditions:

  • Must be specifier character immediately followed by a valid hotkey
  • Valid specifiers and hotkeys are BMP, not in the PUA, and must be in any of the following unicode categories:
    • Letter
    • Digit
    • Punctuation
    • Symbol

One behavior I'm torn on is the current code's treatment of the hotkey specifier in the string, at all, especially if there are multiple specifier characters in the string. Right now, the behavior is to find the first specifier followed by a different character than the specifier, and then ignore the rest for hotkey purposes, but still remove them for display purposes.

A couple of issues with that:

  • The code to find the width of the text subtracts the hotkey specifier width if it is found at all in the string, unconditionally, even if it didn't actually become a hotkey because it didn't pass the above conditions.
  • That's not intuitive.
  • Why not let the consumer use the same character for the specifier and hotkey?
    • This is potentially especially troublesome if the text is generated programmatically or from some data source, and that text has the hotkey specifier character in it.

So, I'd like to change the behavior to allow that character to be displayed, where it isn't in the actually selected hotkey position. But I'd leave the selection behavior the same (first one found is it and the search stops there).

All that really changes is that strings such as "__Text", which would show up as "_Text" right now and have a hotkey of T would still be displayed the same way (though the 0th character was deleted, not the 1st) and the hotkey would be underscore.

@dodexahedron
Copy link
Collaborator Author

Also

I think the behavior should be modified so that, if the HotKey property is already set (ie not KeyCode.Null), none of this even runs, so that explicitly setting the hotkey always wins.

@BDisp
Copy link
Collaborator

BDisp commented Feb 29, 2024

All that really changes is that strings such as "__Text", which would show up as "_Text" right now and have a hotkey of T would still be displayed the same way (though the 0th character was deleted, not the 1st) and the hotkey would be underscore.

I concours with the displaying of the specifier if more than once is provided because that it's the intention of the user.

@tig
Copy link
Collaborator

tig commented Feb 29, 2024

Also

I think the behavior should be modified so that, if the HotKey property is already set (ie not KeyCode.Null), none of this even runs, so that explicitly setting the hotkey always wins.

I concur.

@dodexahedron
Copy link
Collaborator Author

Coolio.

All that stuff should be pretty easy to do, then.

As I mentioned, I had started that branch off intending to just do opportunistic performance stuff for TextFormatter.

But, I think I'm going to separate at least some of it to a separate branch/PR just to contain scope a little bit. The thing is that I can't totally separate it all without intentionally doing superfluous work to make clean intermediate stages. So I guess we'll see what I have when I'm finished. 😅

@dodexahedron
Copy link
Collaborator Author

One thing is we repeat a bunch of operations quite a bit.

So I'm trying to unwind some of that in TextFormatter and avoid at least some of that duplication of effort.

One that has the potential to be particularly heinous in a few ways is that most property setters in that class unconditionally set NeedsFormat, even if no value change occurred.

That of course can lead to unnecessary re-draws, which is a potentially large performance hit, especially with some of the work that whole process involves.

There's also a ton of implicit memcopy work in there, and the EnableNeedsFormat method is a non-trivial source of it, since it ends up mostly handling structs, both passed and returned by value, and then assigned to a field. I count at least 3, but possibly as many as 6 copies between the public API surface and NeedsFormat's backing field, just from following a few possible code paths. That's easily fixed though, by reference passing, so no big deal any more there.

But what worries me more is that the class is very not thread-safe. And it even has comments that claim it is, probably from old code. But that code isn't there and, in fact, there's literally no synchronization that has any effect anywhere in that class.

As far as NeedsFormat goes... While getting or setting a boolean is an atomic operation, in .net-land, the big problem (which isn't performance-related) is a race condition that can prevent necessary re-draws, if a property is set (causing NeedsFormat to get flipped to true) at any time before the re-draw completes and sets it false again.

And especially since consumer code can extend stuff or inject behaviors in various ways, the impact and result of that race condition is unbounded and unpredictable.

I'm going to end up at least implementing some basic synchronization in that class in a few places, because there are some serious problems. And some of those are things I've mentioned before but haven't gotten around to working on, yet, soooo... I guess now here it goes!

While I usually am fine with doing a little bit of extra work to make an easy-to-grok change progression across commits, this stuff is going to be beyond that threshold in several spots. So, There's a good chance I'll just end up re-titling this issue to cover what it ends up as. 😅

As a heads-up, probably a good idea to minimize or avoid entirely any significant modifications to TextFormatter.cs, if that doesn't hinder work items you've got going too much. If that's too much of a pain, then don't worry about it. I'll figure out how to make it all play nicely when it's time for a PR. 🤷‍♂️

Aaaannnnnnyway... Back to it!

If anyone has any wishlist items for TextFormatter, now is a great time to speak up!

@dodexahedron
Copy link
Collaborator Author

Actually... At least to avoid one source of potential conflicts...

I'm going to cherry pick a few commits from what I've done so far that aren't breaking changes but are structurally quite different right now, and then rebase on that and go from there (dropping the cherry-picked commits from the rebase, of course).

That way, if anyone does need to make changes to those files, they'll at least be much easier to follow.

@dodexahedron
Copy link
Collaborator Author

There we go.

#3282

That PR is a pretty minimal set of changes and some are also things that I myself have done multiple times but that somehow kept not making it into my PRs. 😅

tig added a commit that referenced this issue Mar 1, 2024
…_I_go_to_town

Pre-work for #3275 - csproj & R# tweaks, and separating some code
@dodexahedron dodexahedron changed the title Improper parsing in TextFormatter.FindHotKey if hotkey specifier followed by unicode replacement character(s) Some cleanup, fixing, and enhancements to TextFormatter Mar 3, 2024
@dodexahedron
Copy link
Collaborator Author

Renamed this to reflect the expanded scope better.

@dodexahedron
Copy link
Collaborator Author

#3287 is relevant to this, but posted as a separate issue for visibility, since it's a pretty core type.

That's being dealt with in this work.

@dodexahedron
Copy link
Collaborator Author

Hey @tig @BDisp

TextFormatter.FindHotKey has a bool parameter to enable old behavior of using first capital letter instead of the character following a hotkey specifier.

Do we want that or can it be removed?

I vote remove for many reasons, not the least of which is that this is a pretty small thing compared to many other breaking changes. 🤷‍♂️

If it's left in, we do have to have this method support unicode, as well as anything that supports hotkeys, or else codepoints outside BMP that are classified as capitals won't be handled properly.

@dodexahedron
Copy link
Collaborator Author

About that...

As it turns out, nothing internal to the library actually even uses it explicitly except for RadioGroup, which passes true unconditionally, which looks like old behavior that's almost certainly not intended, now, as far as I can tell.

@dodexahedron
Copy link
Collaborator Author

Also for what it's worth, that change only required removing that code path from FindHotkey itself, removing the two test methods explicitly for that behavior, and removing the parameter from the calls in RadioGroup to compile and pass all tests, so it really wasn't being used or respected anywhere else, anyway.

But I did that in a specific commit. So, if y'all still want it, I can just revert that commit to bring it back.

@BDisp
Copy link
Collaborator

BDisp commented Mar 7, 2024

I think @tig already decided to force specifying the hot key specifier for all situations. I don't know why RadioGroup wasn't considered too.

@dodexahedron
Copy link
Collaborator Author

Cool. I already went forward with that assumption anyway.

Right now I'm trying to avoid as many enumerations of the string as possible within the scope of that method, without requiring other changes, and am benchmarking a few of the options available that can handle the fact that the string might be unicode, but that we only want values under U+D7FF for both the specifier and the hotkey.

In earlier discussion, we were thinking the use of Rune wasn't necessary, here. Unfortunately, unless we also impose the <= U+D7FF restriction on the actual text, we still have to care because the position will be different if any other characters before it are beyond the BMP. So, even if the parameter is changed to a char, it still ends up needing to be a Rune at some point or else there's unnecessary casting that the compiler unfortunately actually does emit, even in optimized builds.

There are multiple unicode-aware ways to do it now, including the old method of always going through it with the StringRuneEnumerator, and each one has different considerations/consequences for performance and code complexity. Since most should be fairly short strings, it's probably best to approach it with that assumption, optimizing for the most common case.

While I'd like to make other changes that would have even bigger impact, such as retaining the values and only performing the search again if the text is altered, I'm holding off on that kind of thing, for now, as those things will involve other code in more places and will need more extensive testing, just in case there are unforeseen side effects of that kind of thing.

@dodexahedron
Copy link
Collaborator Author

Whoops. Finger tapped and clicked the button as I was scrolling. 🤦‍♂️

@dodexahedron
Copy link
Collaborator Author

I also have a major love/hate relationship with Rune (more hate than love).

Why Microsoft thought it would be cool to add in a UTF-32 (which nobody really uses) type that - oh yeah - is only aware of scalars (which misses the point of huge parts of what unicode offers), among plenty of other issues, to deal with text, when the rest of the world was already on the UTF-8 train and dotnet already had and understood the UTF-8 encoding is just.... 🤯

@tig
Copy link
Collaborator

tig commented Mar 7, 2024

Do we want that or can it be removed?

Nuke from orbit please.

@dodexahedron
Copy link
Collaborator Author

Already done. :)

And I have a new implementation for this method, now, which:

  • Is zero-allocation for all inputs.
    • Not even an enumerator allocation.
  • Doesn't create any additional Runes while it runs.
  • Fixes existing unit test cases that were marked as bugged, so those test cases now look for the proper value (and pass with it).
  • Has the new desired behavior of letting the specifier key be a legal value for the hotkey, if it finds two of them consecutively.

My code isn't even special - it just adapts for what does the real unicode-related magic, which is a specific overload of a static StringInfo method that is intended for exactly this sort of use case/pattern, but which is non-obvious and not particularly well-documented, IMO. I actually came across its existence and applicability by chance from a performance-related question issue filed on the dotnet runtime repo that I almost didn't even look at.

Now to expand the test cases for more positive proof for that unit and this will be ready to go.

Remember this branch contains other mostly cleanup and performance-related stuff, as well, that was part of some rabbit holes I dove down in the process.

I rebased it on v2_develop before I got back to work on it, today, so it's all up to date and conflict-free, with one (already resolved) exception. There was a dead/unused method I had removed that got modified in another PR, and I just resolved that by letting it use the code from v2_develop and didn't bother removing it again, since it was just opportunistic and isn't in scope.

@dodexahedron
Copy link
Collaborator Author

Performance-wise, avoiding heap allocations is major, in any code, when possible.

Crude benchmarks showed this new implementation to be slightly better than on-par with an already-improved version I had written, in the worst case, but usually a LOT faster than that implementation, and faster in all cases vs the original with the capital search removed.

@dodexahedron
Copy link
Collaborator Author

Oh yeah. I also have some work to finish that I started earlier in the branch around implementing INotifyPropertyChanged.

I mean I guess I can submit it as-is and finish that on another branch if y'all want, since it isn't broken - just doesn't actually raise the event for all properties, yet. 🤷‍♂️

@dodexahedron
Copy link
Collaborator Author

Oh, that's right!

That's what actually initially led to me finding the other things I've addressed so far in this issue other than FindHotKey itself. So I guess it's all related enough I should just finish it in the same branch.

@dodexahedron
Copy link
Collaborator Author

This was a fun rebase.

Until I realized the entire conflict I was resolving across a few dozen files was just because I had already refactored that diagnostic flag enum in a different way before your changes to it. 🤦‍♂️

Oh well. All I had to do was drop my commit during the rebase and it's like it never happened, and no more conflict. 🥳

Still in the middle of the biggest part of this work but progress marches on.

@dodexahedron
Copy link
Collaborator Author

I'm working on a source generator, right now, relevant to this and other work items I've got in progress, but which also applies to/is intended for use with any enum type defined in the project. It is opt-in, only doing anything if the appropriate attribute decorates an enum. Its specific behavior is also configurable, to a degree, via properties exposed on the attribute for each enum.

The purpose of it is to standardize code related to enums (such as extension methods) and to provide significant performance benefits for various common and simple, yet deceptively expensive operations, especially for Flags enums and enums defined with unsigned underlying types.

I have other source generators planned, but this one is the only one I'm working on now, since it is highly relevant to the code I'm currently working on in this and related issues.

The source generator code itself has to live in its own dll project (otherwise that would be a chicken-or-egg problem/self-referencing dependency), but is only relevant at design/build time of the library itself, as with any other code generator. In other words, it's not an additional file in the compiled artifacts of the Terminal.Gui csproj. The only results of the generator visible in the compiled code are the code it generates - again, just like any other source generator.

I also would like to, as a future enhancement, add automatic unit test generation for generated code, for sake of formality and completeness, even though the generator itself will already have unit tests proving validity of its own generated code on sample enums.

@dodexahedron
Copy link
Collaborator Author

dodexahedron commented Mar 18, 2024

This first generator is also only for extension methods.

The next one that I intend to make is also enum-related, but is for generation of struct types associated with specific enum types, especially Flags enums, again for standardization, performance, easier use, functionality enums don't have (such as mutually exclusive flag value enforcement), singletons, wider type conversion capability, etc.

Both that and the one I'm working on for extension methods generate partial classes, by default (can be overridden via the attribute), to allow further customizations of generated code when and where necessary.

@dodexahedron
Copy link
Collaborator Author

I'm also writing it with InternalsVisibleTo the whole Terminal.Gui namespace, as well as to TerminalGuiDesigner, as it's quite likely to be useful there, as well.

@dodexahedron
Copy link
Collaborator Author

Analyzers are written and working for the new GenerateEnumExtensionMethodsAttribute, and the actual source generator is partially written.

The analyzer currently performs two diagnostics, injecting a compile error if they do not pass. Those are for ensuring the attribute is only used on enums the generator is designed to support, and are, for the initial version:

  • The enum must not be defined in the global:: namespace
  • The enum must have an underlying type of int or uint

Currently, the generator outputs a copy of the attribute class, so it can be used without being explicitly defined (and will always be what the generator expects), as well as the static class for the extension methods it's supposed to generate. That class is empty, though, at the moment, as I'm still wiring up the scanning of metadata needed to generate the methods. Once I'm done putting that together, generating the code should be fairly straightforward, as that process basically consists of writing the code as if manually implementing those methods and then just copying it into templates for the generator to use.

@tig
Copy link
Collaborator

tig commented Apr 15, 2024

I do believe this was fixed in #3275. Closing.

@tig tig closed this as completed Apr 15, 2024
@dodexahedron
Copy link
Collaborator Author

Not really. They were/are related, but have quite a bit of different work involved. This issue has a broader scope.

I have a rather large body of outstanding work that I'm just now coming back to.

Apologies, as I thought I had given notice a few weeks ago that I would be busy for a bit. I apparently did not, however, which sucks. My bad, guys. 😔

One of the source generators I mentioned is finished for a few enum extension methods that can be used in place of some pretty hot built-ins that are non-ideal. All that I really need to do to finish that particular part up is add/finish tests for it. I'm going to open a separate issue for that, though, once I'm closer to done with it, since it did stray quite a bit from the original intent of this issue and is its own unit all by itself. However, the results of the generator(s) will be used in the final implementation of the rest of the work I have outstanding.

I have plenty more I want to add to the generators, but I'd rather get it in place as soon as I'm done with this unit, so everyone can benefit from it sooner rather than later and so I don't have to keep rebasing. 😅

I've also written a PowerShell module for convenience, to work around the deficiency Visual Studio still has around referencing Analyzers/Generators in projects in the same solution. It's optional, basically just performs pre-builds of the analyzer project, and only needs to be run once. Or you can just open the solution, build the generators project, close visual studio, and open the solution again. VS just can't unload/reload an analyzer it hosts, once it's loaded.

That PowerShell module would probably make sense to merge with any other PowerShell script(s), to be both cleaner and more idiomatic PowerShell, too.

@dodexahedron
Copy link
Collaborator Author

Also, these generators replace some of the other work I had done for this as well as similar code for other types, so I'll probably end up mostly scrapping my current work and just cherry-picking a few key commits into a new branch, rather than trying to reconcile all that plus this plus all the other great work that's been done while I was MIA.

@dodexahedron
Copy link
Collaborator Author

Wow this was a big one and went all over the place.

Wasn't really completed, though.

And yes, it is the one that corresponds to at least most of the code I dug up that was talked about in your PR @tig.

Regardless, I think this one should remain closed in favor of #3617 and/or a new one (or probably multiple) for several reasons, not the least of which is there have been a TON of changes since this issue was originally written.

Many of the root concepts from the code I wrote mostly still apply, but the code itself largely doesn't, and it's not constrained to just TextFormatter, both because of TextFormatter being everywhere but also because other things needed supporting changes as well.

Here's a short list1 of some of the most impactful things we can do/use, related to performance, accuracy, maintainability, and probably other concerns as well that don't need to get enumerated right now:

  • Use of Rune can (and should) be significanly reduced.
  • For proper unicode handling, significantly less code in many places, and a net performance increase, especially in some hot paths, StringInfo is our friend.2
  • Similarly, TextInfo and CharUnicodeInfo are also very useful for other specific purposes.
  • Our already pretty small but quite hot dependency on that wcwidth package can and ideally should be reduced or eliminated, for several reasons, with several available options at our disposal, from easy to still pretty easy but time-consuming.

In any case, I guess we'll get to it when we get to it.

I'll likely just opportunistically slip a couple of the lowest hanging fruit things in with other work, since a few are literally one-liners, some even with no changes necessary to callers. The bigger stuff is going to be a lot easier to tackle when things aren't such rapidly moving targets.

So... I'll just drop a tag for myself for later.

@dodexahedron

Footnotes

  1. Yes, I have a muuuuch longer thing typed up for when it's time to do the work, but it doesn't belong right here right now.

  2. The remarks on that doc are very out of date, and the mention that it uses Unicode 8 is not true of .net 8. It uses ICU, so whatever ICU is on the system is what you get. Currently, that looks to be Unicode 15.1, which is current. That also means there's reduced risk of us or consumers attempting to do something in code that isn't legal for the target system, because the responsibility for compliance is shifted to ICU, which is the gold standard these days. We would not need to be checking codepoints for specific values for range validation and such, any more. It's on the consumer and their system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug enhancement testing Issues related to testing v2 For discussions, issues, etc... relavant for v2 work-in-progress
Projects
None yet
Development

No branches or pull requests

3 participants