
Parts of xterm.js should run in a web worker #1515

Open · 2 tasks done · Tyriar opened this issue Jun 15, 2018 · 36 comments
Labels: area/performance, type/enhancement (Features or improvements to existing features)

Tyriar (Member) commented Jun 15, 2018

I didn't see an issue tracking this yet.

Blocked by:

Related:

jerch (Member) commented Jun 17, 2018

Any drafts/thoughts yet in this direction? From my limited, parser-centric view it seems the renderer and the data input + parsing are fighting over CPU time, so my first wild guess is to move all the core parts (data input + parser + most of the terminal class) into a web worker and let only the DOM-related stuff reside in the main thread. This will require some event translation, but that's already part of the new layered proposal. The tricky part will be getting the buffer data moved to the main thread without much runtime delay (SharedArrayBuffer might solve this in the future without a copy, but it is currently deactivated in all engines due to Spectre, so we're going to have to do it "the hard way").

Tyriar (Member, Author) commented Jun 17, 2018

My thinking is to move core/base into a worker and then ui/base/public runs in the UI thread. How buffer state is synchronized still needs some thinking.

mofux (Contributor) commented Jun 17, 2018

One idea is to maintain the buffer only in the worker, and then have a more renderer friendly view model in the main thread that can be incrementally updated by the worker.

We could also rewrite core in rust as a WebAssembly module, but I think that is out of scope for now 😅

Tyriar (Member, Author) commented Jun 19, 2018

One idea is to maintain the buffer only in the worker, and then have a more renderer friendly view model in the main thread that can be incrementally updated by the worker.

Yeah, that would be my preference. I think monaco needs a more complete view of the buffer though.

We could also rewrite core in rust as a WebAssembly module, but I think that is out of scope for now 😅

Yeah, seems like quite the effort to undertake 😄. Cleaning up the TS beats it out, at least for now, imo.

mofux (Contributor) commented Jun 20, 2018

I gave this some more thought; here's a rough overview of how it could work. The UI could subscribe to a view and specify the number of rows it wants to see, as well as the offset of rows from either the top or the bottom. The core would then track buffer changes that affect that viewport and emit updates with update instructions. The goal should be that the UI does not need to store any state; it should simply consume the updates and draw / redraw based on them.

[screenshot: diagram of the proposed viewport subscription model]

// subscribe to a new view
const view = terminal.createViewport({ rows: 20, offset: 0, sticky: true });

// listen for view updates and modify the UI accordingly
view.onUpdate((changeDecorations /*changes*/, terminalState /*infos*/) => {

  // TODO:
  // specify what changeDecorations could look like, maybe like this:
  for (let change of changeDecorations) {
     switch (change.type) {
        case ChangeType.DELETE: {
           // some rows have left the viewport
           break;
        }
        case ChangeType.ADD: {
           // some rows have been added to the viewport
           break;
        }
        case ChangeType.MODIFY: {
           // existing viewport rows have been modified
           break;
        }
     }
  }
});

// terminal viewport is scrolled in the UI, update the viewport offset
myViewportElement.addEventListener('wheel', (evt) => {
  // calculate the new offset
  let offset = ...

  // NOTE: this will likely trigger the onUpdate callback if things have changed
  view.setOffset({ offset, sticky: offset > 0 });
})

jerch (Member) commented Jun 20, 2018

@mofux Hmm, not sure about this - in theory the inner terminal core does not need to hold any line data beyond the actual terminal height to work properly. The only exception to this is resizing, which does not occur very often. Maybe we can turn this into an advantage and only hold the "hot parts" of the buffer in the core, delegating the scrollback data buffering to the renderer? It would need back-propagation for resize events though. I have not thought deeper into this, just some random thoughts so far.

mofux (Contributor) commented Jun 20, 2018

@jerch The scrollback has to be maintained somewhere, and I think it is better to have it in the core for this reason:
At some point I want addons like the linkifier to live in the core, subscribe to buffer changes, and then create decorations on the buffer. A decoration is like a sticky range (startRow, startColumn, endRow, endColumn) that provides rendering information to the frontend renderer (via the onUpdate listener). Cell styles like background, foreground, bold etc. would also end up as decorations. This gives us some very big advantages: they only need to be calculated when a hot buffer line is updated, and since these calculations will eventually run in the WebWorker, they won't block the UI thread.
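
To make the decoration idea concrete, here is a minimal sketch of what such a range could look like; the names (IDecoration and its fields) are hypothetical, not an existing xterm.js API:

interface IDecoration {
  // sticky range in buffer coordinates
  startRow: number;
  startColumn: number;
  endRow: number;
  endColumn: number;
  // rendering information consumed by the frontend renderer via onUpdate
  style?: {
    foreground?: string;
    background?: string;
    bold?: boolean;
  };
  // e.g. the link target a linkifier living in core would attach
  linkTarget?: string;
}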

mofux (Contributor) commented Jun 20, 2018

Maybe this design makes more sense: We only maintain the buffer for the height of the pty, and then we hold our decorations for the pty screen height + scrollback:

[screenshot: diagram of a pty-height buffer with decorations covering screen height + scrollback]

jerch (Member) commented Jun 20, 2018

@mofux Yupp, you are right, I totally forgot the addons. From an API design perspective it also seems more convenient to hold scrollback data in the offscreen part. What bugs me about the current buffer design is the fact that we mostly hold rendering info in the buffer for every cell, even for the scrollback data, which can't be accessed by normal "terminal means" anymore (and thus won't change anymore, except for resize once it supports reflow). This is a very memory-hungry layout.

With your second layout sheet we might be able to create a less expensive scrollback buffer (the purple part) that is closer to rendering needs (merged attributes and cells maybe?), while the orange part still gives quick access to all attributes on a cell-by-cell basis. I am still worried about the amount of data that we have to send between the worker threads to get something new shown.

Also not sure yet how addons will play into this; some gut feeling tells me we might have to build different API-level entries if we want critical parts to be customizable from outside (like a core, pre- and post-render API).

mofux (Contributor) commented Jun 25, 2018

One more thought regarding the architecture (and the title of the issue): Instead of just running core in a web worker, it would be nice (and in most cases even more useful) to consider running core in a server-side node process (close to where the pty process is), which then talks to the UI (browser) via RPC / websockets. Technically, there shouldn't be much of a difference between communicating with a web worker vs. communicating with a server process; it's only a different transport layer.

[screenshot: diagram of core running in a server-side node process next to the pty]

jerch (Member) commented Jun 27, 2018

@mofux Imho this is a ground-shaking thing; there are several pros and cons to it. What comes to my mind:

pros:

  • unicode versions can be synced more easily between pty and terminal core; affects wcwidth calculation, grapheme and bidi handling (not yet supported)
  • restoring old terminal sessions can be implemented much more easily
  • browser memory consumption is no concern anymore
  • the frontend part can be really slim, in the extreme down to a pixel image as the terminal representation (like a screen scraper)

cons:

  • xterm.js depends on a server part; browser-only deployment is no longer possible unless someone ports the nodejs core part back to browser engines again
  • negative impact on server performance (cpu and mem); this is not an issue for local apps like hyper or vscode, but will hurt remote server apps with 1k+ terminals really badly - note: frontend power is kind of free for a service provider
  • depending on the "stubbiness" of the frontend, the server-client comm will explode (we would have to request and transmit view data over and over for small changes)
  • latency for remote apps will go up
  • Why bother with a self-written core part at all? Reinventing the square wheel? Just take some well-established C terminal lib, scrape the output and forward it to the browser, with some key and mouse handling there as the interaction layer. Done.

Don't take the last point too seriously; it is more rant than an objective argument. I still tend toward the client core thing for a simple reason: creating a transport layer for bytestrings is much easier/faster than handling different API states across components in different machine locations, imho. But I might be wrong about that.

mofux (Contributor) commented Jun 28, 2018

@jerch Thanks for your thoughts, really appreciated! Please consider the ideas mentioned above as a testbed for discussion, in order to develop a vision for future enhancements.

The main goal I'm seeing is to separate the UI from the core, so we can offload the work performed by the parser to a separate thread, unblocking the UI thread.

The second goal is to shape an interface (contract) that allows core and UI to talk to each other via some kind of IPC (be it worker.postMessage or a WebSocket, or even traditional direct communication if both are running in the same thread like it is now). The challenge here is to keep the memory footprint minimal (we don't want to maintain the buffer + scrollback twice) and the communication overhead and payload as low as possible. If we get this right, it also opens up the opportunity to support different renderers more easily (e.g. monaco-editor).

xterm.js depends on a server part; browser-only deployment is no longer possible unless someone ports the nodejs core part back to browser engines again

I think we misunderstood each other here. I'm thinking of core as being a DOM-independent piece of JS code that can run in the browser, a WebWorker or Node.js without any porting. It should definitely be up to the developer to decide where core should run (depending on their use-case). My thought was to decouple core and ui as much as possible, so we could potentially support all these scenarios.

depending on the "stubbiness" of the frontend, the server-client comm will explode (we would have to request and transmit view data over and over for small changes)

It depends. If we only send incremental updates to the UI this shouldn't be much of a problem.

latency for remote apps will go up

At the moment we are sending all the pty data to the browser frontend (with the same latency burden), which is really painful in scenarios where xterm.js is running in a browser that is far away from the server. I've seen situations where the data stream from the pty would spam the websocket so hard that it would eventually disconnect because pings didn't come through anymore. Catching all the data at the server and only sending view updates to the browser (maybe throttled) could improve this situation. The thing is, we can't skip processing parts of the pty data stream because it has to consistently update the buffer - we have to eat it all. But we can skip / throttle updates of a view that reads from a consistent buffer state. And that's where I see the big advantage of running core on the server side. We're not forced to send the whole pty stream to the client anymore; we only send view updates. Commands like ls -lR / that rotate the buffer + scrollback multiple times a frame won't hurt the UI anymore.
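
A minimal sketch of that skip/throttle idea, with hypothetical names; the point is that every pty chunk still mutates the buffer, while view updates are flushed at most once per frame:

// the core consumes ALL pty data, keeping the buffer consistent,
// but only notifies the view at a bounded rate
class ThrottledViewEmitter {
  private _dirty = false;

  constructor(private _sendViewUpdate: () => void) {}

  // called on every buffer change, however fast the pty stream is
  markDirty(): void {
    if (!this._dirty) {
      this._dirty = true;
      setTimeout(() => this._flush(), 16); // roughly one 60Hz frame
    }
  }

  private _flush(): void {
    this._dirty = false;
    this._sendViewUpdate(); // build the update from the consistent buffer state
  }
}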

jerch (Member) commented Jul 7, 2018

I think we misunderstood each other here. I'm thinking of core as being a DOM-independent piece of JS code that can run in the browser, a WebWorker or Node.js without any porting. It should definitely be up to the developer to decide where core should run (depending on their use-case). My thought was to decouple core and ui as much as possible, so we could potentially support all these scenarios.

👍

It depends. If we only send incremental updates to the ui this shouldn't be much of a problem.

Yupp. There is a small "but" though - if we only send incremental updates, the frontend part needs a way to hold the old data and merge the new data onto it. Oh - and the backend part needs some abstraction to filter updated content from old stuff. Maybe we can establish some buffer cache key thing at row level, or even better (for the amount of comm) / worse (for runtime + memory usage) at cell level (a real "write-through" cell data thingy).
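
A minimal sketch of that row-level cache-key idea, with hypothetical names; each row carries a version counter and the backend only ships rows whose version differs from the client's last known state:

interface IRowUpdate {
  row: number;
  version: number;
  cells: Uint32Array; // packed cell data for that row
}

class RowVersionTracker {
  private _versions: number[] = [];

  // called by the core whenever a buffer row is written
  markDirty(row: number): void {
    this._versions[row] = (this._versions[row] ?? 0) + 1;
  }

  // collect only the rows that changed since the client's last sync
  diffSince(clientVersions: number[], readRow: (row: number) => Uint32Array): IRowUpdate[] {
    const updates: IRowUpdate[] = [];
    for (let row = 0; row < this._versions.length; row++) {
      if ((clientVersions[row] ?? -1) !== this._versions[row]) {
        updates.push({ row, version: this._versions[row], cells: readRow(row) });
      }
    }
    return updates;
  }
}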

About the latency / server-client-comm update thing:
This depends a lot on the granularity of updates we aim for. The sexiness of the current approach is its simplicity - we don't need a special protocol, we just pump the pty bytestream. Once we decide to go for a higher-level API transport we have to build a protocol layer to support this and that. I am not against such an approach, but it is still some work to lay it out in a way that third-party users can integrate easily into their very different environments.

Last but not least, I think possible changes from #791 should be part of the considerations; some of my suggestions there might raise the burden of getting easy and cheap updates delivered (esp. my storage & pointer approaches there might end up being contradictory).

Tyriar (Member, Author) commented Jul 9, 2018

Yupp. There is a small "but" though - if we only send incremental updates, the frontend part needs a way to hold the old data and merge the new data onto it.

The canvas renderer already does this for the viewport:

private _state: GridCache<CharData>;

Tyriar (Member, Author) commented Oct 7, 2019

Closing as out of scope in the interest of keeping the issue list small, as this probably won't happen for years, if at all.

Tyriar closed this as completed Oct 7, 2019
Tyriar (Member, Author) commented Nov 5, 2020

Reopening this as it would be a nice direction to go to ensure the main thread stays responsive during heavy workloads.

I played with web workers a bit lately, and I would imagine this would work by using a shared array buffer if it's supported, and if not, by handing the write buffer data over to the worker (and not persisting it in the main thread). The main challenge imo is how you're meant to easily get embedders to leverage the workers, as there are multiple files required then; xterm.js may also get bundled into a different location.

jerch (Member) commented Nov 6, 2020

@Tyriar Some investigations I did in that field:

A shared array buffer (SAB) with our own atomic read/write locks is the fastest way to get data "moved" between worker threads. It still loses 20-30% to thread sync / locks compared to single-threaded promise code moving data around (tested without any workload on the data, so those numbers are just for the data "moving" part). Downside of this solution: it is hard to get right (and maintain?), and it may not work in all browsers due to Spectre security patches.

Normal object transfer is OK-ish and the only valid fallback if SAB + atomics are not available. It runs ~50% slower than a SAB solution, and thus might penalize screen update latency (no clue if this will be perceivable in the end).

Since the buffer is made of typed arrays, a SAB solution could be done easily. Still, a fallback might be needed to cover engines without SABs. Last but not least, the worker-mainthread interface will be challenging, since we have so many events/callbacks into browser code that need to be plugged in without race conditions.
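
A minimal sketch of the two transport paths described above ('parser.worker.js' and the message shapes are hypothetical); with a SAB both sides see the same memory, while the fallback transfers a regular ArrayBuffer:

const worker = new Worker('parser.worker.js');

if (typeof SharedArrayBuffer !== 'undefined') {
  // shared path: allocate once, hand the same memory to the worker
  const sab = new SharedArrayBuffer(1024 * 1024);
  worker.postMessage({ type: 'init-sab', buffer: sab });
  const cells = new Uint32Array(sab);
  cells[0] = 42; // visible to the worker without any copy
} else {
  // fallback: transfer ownership of a plain ArrayBuffer (zero-copy move,
  // but the main thread loses access until it is transferred back)
  const chunk = new ArrayBuffer(1024 * 1024);
  worker.postMessage({ type: 'chunk', buffer: chunk }, [chunk]);
}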

Tyriar (Member, Author) commented Nov 6, 2020

Yes, we'd definitely need a fallback, as Safari doesn't have SAB for example. My exploration of workers and SAB is that they're pretty awesome; you just need to think about how fallbacks and packaging would work (2 build targets? rely on dynamic imports? how do embedders bundle?).

Good point on callbacks and events; what exactly we would pull into the worker is not clear cut at all. You can see by zooming into a profile that, arguably, parsing isn't the expensive bit:

[screenshot: profile zoomed into the write/parse call stack]

So maybe in an ideal world the buffer should also be a shared array buffer owned by the worker thread, and only events indicating which ranges changed (or something like that) would be sent over to the main thread.

Another thing to think about is how transferControlToOffscreen fits into all this. It would be extremely cool if we had a renderer thread, a parse/buffer thread, and a main thread that barely does anything, keeping the application extremely responsive even when heavy parsing/rendering is going on.

It's definitely clear that this would be a pretty radical change though; there's no way I'll have time to play with this any time soon, but it's always fun to talk about. We can think about this as we shape the architecture and maybe do small parts of it first (like having the webgl renderer optionally run in a worker). I also want an issue to point duplicates at for VS Code dropping many frames when the terminal is busy.

jerch (Member) commented Nov 6, 2020

Another thing to think about is how transferControlToOffscreen fits into all this. It would be extremely cool if we had a renderer thread, a parse/buffer thread, and a main thread that barely does anything, keeping the application extremely responsive even when heavy parsing/rendering is going on.

Indeed. This could even be made lock-free for the normal PRINT action (which, in conjunction with Terminal.scroll, covers like 95% of heavy terminal activity) if we resort to copy-on-write for a line, with hooking in the updated line as one genuine atomic action (possible as long as we stick to a single writer - single/multiple reader pattern). Locks will prolly still be needed for other actions like ED, resize and such (things that manipulate more than one line at a time).

Yes, the webgl renderer seems to be a perfect fit to test out new paths in this direction, as only the webgl context is currently allowed for the offscreen canvas. Not even sure if we can gain anything for the DOM renderer here - would it be faster to pre-construct the DOM strings in a worker and move the content as a big update batch to the main thread? Idk, never tested anything like that.

Edit:
On the code level I wonder if we could get away with a decorator pattern that "maps" functionality to different thread targets. Something like this:

@bufferWorker
@main
class Terminal ... {
  // not decorated things get spawned on all thread targets listed on the class
  public doXY(...) {...}
  
  // only in bufferWorker
  @bufferWorker
  public doSomethingOnBuffer(...) {...}
}

It is just a rough idea atm; I'm not even sure decorators can be (mis)used for that type of macro-precompilation stuff. It certainly would introduce another precompile step, but it would make those definitions much easier at the code level. Oh well, it also would break IntelliSense, so it might not be a good idea at all. Is there anything in TS to do macro-like precompile tasks?

arsinclair commented
Since all related issues in the vscode repository are closed and they point to this issue, I am going to ask here: is there any workaround for vscode to be able to print long lines without freezing?

This problem sometimes makes vscode unusable, as some packages print their debug information as a sequence of very long lines.

jerch (Member) commented Dec 3, 2020

@arsinclair This is definitely a bug in some terminal-buffer-consuming handler in vscode; xterm.js itself does not have that issue. Imho the right way to fix this would be to address the issue in the slowpoke function.

Hacky workaround:
You could try to identify the slowpoke and remove it from the event. But do this only if you know what you are doing, as it might break vscode terminal integration with weird unwanted side effects (which might turn out really bad if the terminal is connected to a real system). I don't know the handlers there, thus I cannot tell if it is safe to simply remove any of them.

Tyriar (Member, Author) commented Dec 3, 2020

@arsinclair you're probably seeing microsoft/vscode#100338

arsinclair commented Dec 3, 2020

@Tyriar, I don't think so - there are no URLs in my output. It is just a set of comma-separated UUIDs.

@jerch, I'm not sure what the "slowpoke function" you're referring to is. Could you elaborate?

jerch (Member) commented Dec 4, 2020

@arsinclair By "slowpoke function" I mean some code that eats your precious CPU cycles for non-obvious reasons. The linkifier would be a candidate if you have very long auto-wrapping lines. It does not matter whether there are any links in your output; identifying them may itself take pretty long.

reduardo7 commented
What about it?!

arsinclair commented Dec 16, 2020

if you have very long auto-wrapping lines

Yes, this is exactly the case.

It does not matter whether there are any links in your output; identifying them may itself take pretty long

The problem is, once I scroll to the long-lines block, the CPU load jumps to 100% and never goes down. I've tried waiting for several hours to see if it ever finishes.

I am neither an xterm.js nor a vscode developer, so I am not familiar with the codebase, and I don't know why it is happening or how to fix it.
Since I didn't have time to deal with it, I just redirected all output to a .log file and then read that file separately.

Tyriar (Member, Author) commented Sep 25, 2022

I was thinking a bit about this today. This is a very high level diagram of what we have today:

[diagram: high-level overview of the current architecture]

So what parts would we want to move to the worker? We could, for example, move CoreTerminal in its entirety, which is essentially moving xterm-headless to the worker and then having a mostly async API on the browser side. While this is somewhat elegant, as xterm would essentially use xterm-headless directly, it does complicate the API quite significantly.

Here's my proposal which was covered somewhat above but without pretty pictures:

[diagram: proposed architecture with the parser running in a worker]

The small buffer line SharedArrayBuffers would be shared on every update; I do not expect this to be expensive. This allows us to avoid any object transfer overhead, as the worker thread only needs to accept a few pointers.

We would allow the parser protocol to also run in the main thread (via dynamic import) when SAB/workers are not supported or not enabled:

[diagram: parser protocol running on the main thread as a fallback]

So with the above we would come up with a strictly defined "parser protocol", and then allow it to be wrapped in a web worker (eg. WorkerParser and MainThreadParser, both wrapping Parser). The MainThreadParser is a refactor we could work towards over time, after which it would eventually be trivial to create WorkerParser.
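
As a rough illustration of how the two wrappers could share one implementation, here is a minimal sketch; the names (IParser, the message shapes, 'parser.worker.js') are hypothetical, not the actual xterm.js types:

// the strictly defined "parser protocol" both wrappers implement
interface IParser {
  write(data: Uint8Array): void;
  onEvent(listener: (e: unknown) => void): void;
}

// runs the parser directly on the main thread (the fallback path)
class MainThreadParser implements IParser {
  constructor(private _parser: IParser) {}
  write(data: Uint8Array): void { this._parser.write(data); }
  onEvent(listener: (e: unknown) => void): void { this._parser.onEvent(listener); }
}

// proxies the same protocol to a parser instance inside a web worker
class WorkerParser implements IParser {
  private _worker = new Worker('parser.worker.js');
  write(data: Uint8Array): void {
    // transfer the chunk to avoid copying it into the worker
    this._worker.postMessage({ type: 'write', data }, [data.buffer]);
  }
  onEvent(listener: (e: unknown) => void): void {
    this._worker.addEventListener('message', e => listener(e.data));
  }
}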

Using this architecture we could also reduce latency by hooking the threads up directly to the pty:

[diagram: threads hooked up directly to the pty]

The refactor would involve:

  • Creating a single Parser class which is the entry point; it will eventually accept all input and send out all output.
  • Removing all service usage inside Parser and all its dependents (WriteBuffer, InputHandler, etc.). For example, instead of relying on the options service, the options are sent over to Parser whenever they're updated, and Parser maintains a parallel copy. This would also involve exposing many events, essentially an event for anything incoming data can do that is not represented in IBufferLine.
  • Figuring out how to handle extended attributes and combining characters, which store additional data on top of IBufferLine._data. This could for example move to an auxiliary SharedArrayBuffer that is shared between all lines*.

Then later:

  • Create MainThreadParser which wraps Parser
  • Switch IBufferLine to use SharedArrayBuffers conditionally depending on options/environment
  • Create WorkerParser which wraps Parser in a web worker.

* I also explored combining all IBufferLine data into a single ArrayBuffer in Tyriar/xterm.js#array_buffer_list. This has some promising performance benefits for resizing the terminal; however, it is quite tricky to integrate because of how ICircularList works.

Tyriar (Member, Author) commented Sep 25, 2022

Here's a little proof of concept that uses a worker to change parts of the first buffer line: https://github.com/Tyriar/xterm.js/tree/worker_poc

Tyriar (Member, Author) commented Sep 25, 2022

Also, after thinking about HTMLCanvasElement.transferControlToOffscreen, I don't think we should do this, as it would result in a "tearing" effect where some parts of the terminal (eg. decorations) would be out of date while the cells updated underneath them.

jerch (Member) commented Sep 25, 2022

Imho the answer to "what should go into a worker" should be carefully derived from the runtime metrics we have seen so far. From there we can identify a working goal for xterm.js:

  • What's actually a nuisance on the main thread?
  • What can be workered without too much code bloat?
  • What must stay on the mainthread?

The last question can be answered pretty quickly - everything that needs the browser DOM API has to stay on main. Currently that's mostly final IO stuff - "I" for input from keyboard and mouse (thus all DOM event handlers), and "O" for everything that makes its appearance on the screen. Final here means everything at the first (input) or last (output) stages, while we could slice into the functionality sooner or later (from a callstack perspective) and put things on either the worker or the main side. (Sidenote - I am not sure whether offscreen canvas is yet in a usable state to move output completely off main; my last tests regarding this for the image addon were not so promising, as engines diverge too much in supported features.)

Regarding the first question, things cannot be answered that easily, as it depends a lot on the sequences used in the incoming data. For pretty common ls -l output I remember roughly these numbers on the input chain (with the webgl renderer):

  • 50 - 55% for WriteBuffer._innerWrite (total input chain, including all below in callstack), consisting of:
    • 3-5% for Utf8ToUtf32.decode
    • 5-8% for EscapeSequenceParser.parse
    • 20-24% for InputHandler.print (total, thus incl. buffer writes)
    • 17-20% for InputHandler.lineFeed (total, thus incl. Terminal.scroll with line recycling)

For very data-intensive payloads (e.g. image data for the image addon), utf8 decoding and escape sequence parsing will jump to ~25%, but otherwise they stay far below 20% of the workload. From an isolation-of-concerns perspective, both (utf8 decoding and sequence parsing) could easily be made faster by different approaches (up to 4 times faster with wasm), but the overall benefit would be small if we don't get the "buffer writing" any better, as that causes ~50% of the runtime for common ls data.

And that's the point where I want to question your schematics above, where the buffer still resides on the mainthread side. Imho we won't get much relief on the mainthread if we go that route. Additionally, the parser protocol in between will create work on top, eating a lot of the benefit. (Sidenote: I am currently removing the image decoder worker thread from the image addon sixel path for that reason, as it puts too much work on top making everything worse, while sixel in wasm is so fast with much shorter "data lines".)

So how about that:

  • worker: every aspect of the input chain up to the terminal buffer gets moved into a worker (incl. a possibility to directly feed chunks to the worker, which lifts the heavy burden of driving the websocket chunk ingestion / event loop blocking off the mainthread)
  • mainthread: maintains all event-driven things (mouse/keyboard input), and the final screen output via the DOM, canvas and webgl renderers (the latter can maybe be moved to a worker in the future, once browser engines have sorted out their disputes)
  • worker - main communication: forward user interaction stuff (main --> worker), forward screen updates (buffer slices from worker to mainthread, possibly non-blocking if carefully designed at the buffer level)
  • addons/parser hooks: well, that's a very tricky topic; we probably would have to build a 2-stage addon interface for both the parser/buffer worker + mainthread stuff, further complicated by the need for synchronization primitives (atomics, mutexes etc.)

(Note this is only a quick writeup, not covering the details yet. And as always - those details would kind of answer the 2nd question, as things moved to the "other side" might get really hard to maintain/expose.)

  • I also explored combining all IBufferLine data into a single ArrayBuffer in Tyriar/xterm.js#array_buffer_list. This has some promising performance benefits for resizing the terminal; however, it is quite tricky to integrate because of how ICircularList works.

At the beginning of the buffer rewrite back in 2018 I started with exactly that. It is really much better performance-wise than the current single-bufferline abstraction. (Cannot quite remember, but I think the speed benefit was ~2.5x for normal buffer actions, incl. resize.) In the end we did not go that path back then because of the need for strict custom memory management, which felt way too C-ish, and we had no clean story for how to ensure that from ICircularList (the line recycling there is kind of the rest of that allocation/gc-lowering approach that made it into usable code).

Tyriar (Member, Author) commented Sep 26, 2022

What's actually a nuisance on the main thread?

The 2 big things that happen on the main thread are processing input via the write buffer/input handler/etc. and rendering. I'm pretty convinced at this point that the rendering should remain on the main thread in order to ensure all rendering occurs in the same frame. For one, we have it performing pretty well now; and secondly, issues like microsoft/vscode#145801 would happen again if we moved it, but AFAICT would be impossible to fix. This class of problem is much more noticeable on lower-specced hardware, but even on good hardware it's too big a hit on the UX imo to not have synchronized rendering. Additionally, I think moving the rendering over to another thread would end up being too much of an undertaking and never happen because of that.

So that just leaves processing of input. Note that a big part of this comes in when the embedder does a lot of other things, such as having many other terminals that may be busy, or other UI components fighting for CPU time (monaco, VS Code's workbench, etc.).

And that's the point where I want to question your schematics above, where the buffer still resides on the mainthread side. Imho we won't get much relief on the mainthread if we go that route. Additionally, the parser protocol in between will create work on top, eating a lot of the benefit. (Sidenote: I am currently removing the image decoder worker thread from the image addon sixel path for that reason, as it puts too much work on top making everything worse, while sixel in wasm is so fast with much shorter "data lines".)

Yes, this plan would only make sense if the additional parser protocol did not cause additional overhead. I believe we could pull it off by having most interactions with the buffer (and flags, etc.) done via SharedArrayBuffers, which are essentially zero cost; we just need to send over a signal that a change has been made.

Let's dig into the expensive examples you gave and see what would live on the worker and what would happen on main. Let me know if this doesn't sound right so I can fix my understanding:

Utf8ToUtf32.decode

100% worker (-3-5% off cpu)

EscapeSequenceParser.parse

100% worker (-5-8% off cpu)

InputHandler.print

Worker for everything except:

  • Fire onA11yChar when screenReaderMode is on
  • IOscLinkService.addLineToLink, this creates marker/links used by OscLinkProvider
  • Some signal that a scroll, splice, general update etc. occurred. This may create the IBufferLine's SAB on the worker side

IDirtyRowService doesn't seem to actually communicate outside #4147, which also applies to InputHandler.lineFeed.

InputHandler.lineFeed

Worker for everything except:

  • A signal that a scroll occurred in the buffer; again, I think most of this could be done on the worker side via a SAB.
  • A signal to change the buffer's x/y. This sort of thing, along with flags, could also be done entirely via SAB:
    this._bufferInfoBuffer[Offsets.X]++;
    this._bufferInfoBuffer[Offsets.Y]++;
    I think this would mean we'd need a signal that a change happened, which would refire an event.

So given the above, what we would communicate back are events like:

interface IProtocolMessage {
    events: IProtocolEvent<ProtocolEvents>[];
}

interface IProtocolEvent<T extends ProtocolEvents> {
    type: T;
    payload: ProtocolEventTypes[T];
}

const enum ProtocolEvents {
    Scroll,
    AddLineToLink,
    CursorChange,
    A11yChar
}

interface ProtocolEventTypes {
  [ProtocolEvents.Scroll]: { lineBuffer: SharedArrayBuffer, more? };
  [ProtocolEvents.AddLineToLink]: { id: number, y: number };
  [ProtocolEvents.CursorChange]: undefined;
  [ProtocolEvents.A11yChar]: { char: string };
}

worker: every aspect of the input chain up to the terminal buffer gets moved into a worker

I did consider this, but I think that would end up making a bunch of the API asynchronous?

addons/parser hooks: well, that's a very tricky topic; we probably would have to build a 2-stage addon interface for both the parser/buffer worker + mainthread stuff, further complicated by the need for synchronization primitives (atomics, mutexes etc.)

This is where the proposal above shines, as it feels like it could actually happen. If we're only moving the input processing, I don't think there is any impact on the existing API, where everything is synchronous and easy to use.

What can be workered without too much code bloat?

I wrote a nicely abstracted IPC layer for Luna Paint for going to/from the extension host to a webview; that's where I figured out the neat type-mapping trick above (ProtocolEventTypes[T]). I don't think it would be particularly large. One related benefit is that the main thread code bundle will be much smaller, which, including the webpack require (not sure what this is exactly), ends up at about 100ms:

[screenshot: startup profile showing the xterm.js bundle load cost]

We already dynamically import xterm.js to avoid this on startup unless absolutely required; reducing it will help with UI responsiveness and unblock other features that may be initializing if the terminal was open on launch.

At the beginning of the buffer rewrite back in 2018 I started with exactly that. It is really much better performance-wise than the current single-bufferline abstraction. (Cannot quite remember, but I think the speed benefit was ~2.5x for normal buffer actions, incl. resize.)

That's similar to the speed-up I was seeing for resizing larger buffers when I was trying to estimate the impact.

In the end we did not go that path back then because of the need for strict custom memory management, which felt way too C-ish, and we had no clean story for how to ensure that from ICircularList (the line recycling there is kind of the rest of that allocation/gc-lowering approach that made it into usable code).

I don't mind going more C-ish for stuff like this, as that lets the JIT do its thing much more effectively; plus it's pretty much the only way to squeeze more big gains out of JS without going WASM. We're basically doing this low-level stuff all over the place anyway with the buffer management. I think the ArrayBufferList came out quite nicely, as to the outside it works mostly like a 2D array, just that an event fires when the view expires (I also thought about having Pointer<Uint32Array>s which would handle this as well).

I put that little exploration down as something interesting we may want to do later; it doesn't need to happen for the web worker stuff, as all the worker needs to know about are the mutable ~15+ buffer row arrays. So a message over could be assembled quickly like:

postMessage({
  rows: [sab, sab, sab, ...]
});

I guess these will typically already be on the worker side anyway, so maybe we only need to send them main -> worker on startup?

jerch (Member) commented Sep 26, 2022

@Tyriar Oh well, sorry, I totally overlooked the SAB notion - yes, that would make the buffer protocol a no-brainer and allow direct writes. It has another intriguing advantage - in the worker we can use fully synchronous code up to atomic waits, so the parsing should be much faster again. Currently the stack unwinding is really painful (and the reason why the worker approach in the image addon takes so long), which would not be the case anymore with the parser in a worker - we can simply place a "HALT" atomic flag there, which stops at certain positions without screwing up stack state. (This could also be used to indicate a pending hook from outside waiting for completion.)
Since it's pretty late here, I'm going to read the other details tomorrow.

jerch (Member) commented Sep 26, 2022

I'm pretty convinced at this point that the rendering should remain on the main thread

Yes, I agree. To me it still seems almost impossible to achieve reliable off-main rendering across all engines. (Firefox only supports the webgl ctx type; Safari lacks offscreen canvas entirely, and proper webgl in particular.)

For the interface you stubbed out - looks about right. At this point I think we really should invest the time to check whether we can get ICircularList to work on a single-blob SAB, simply because exchanging new SABs between main <--> worker for single lines might eat a lot of performance (the engines have to place several mutex guards around those, which again adds runtime if we do that for every line separately). Ideally we have only one big SAB per terminal buffer, with several housekeeping entries at the beginning (like state flags, cursor pos etc.) and line contents at later offsets. The extended attrs and the combined characters complicate the setup a bit, but both can still be expressed on the same array buffer. I think it is possible to shape ICircularList that way if we close all the holes where a line ref might go out of scope (that's where we'd have to place a dtor). Then we almost automatically get a buffer interface on both ends capable of working concurrently with only tiny locking needs. (We will still face typical concurrency race conditions if we have to lock buffer parts for screen rendering, but that can be avoided by other means...)
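
A minimal sketch of that single-blob layout; the header fields and sizes are made up for illustration:

// housekeeping entries live at fixed slots at the start of the SAB,
// line contents follow at later offsets
const enum Header {
  StateFlags = 0, // e.g. a HALT flag could live here
  CursorX = 1,
  CursorY = 2,
  Size = 16       // reserved housekeeping slots
}

const ROWS = 100;
const COLS = 80;
const CELL_SIZE = 3; // uint32 slots per cell (content + fg + bg, for example)

const sab = new SharedArrayBuffer((Header.Size + ROWS * COLS * CELL_SIZE) * 4);
const u32 = new Uint32Array(sab);

// atomics work on the typed array view, usable from both threads
Atomics.store(u32, Header.CursorX, 0);

// a "line ref" is just a subarray view at a computed offset
function rowOffset(row: number): number {
  return Header.Size + row * COLS * CELL_SIZE;
}
const line0 = u32.subarray(rowOffset(0), rowOffset(1));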

The messaging to/from the worker is still somewhat painful - we have to take care that no message congestion is possible (doing most of it via SAB state flags will avoid most of that). And last but not least - we need to adapt the parser hooks to the worker concept; a simple, straightforward way would just place a "HALT" on the parser worker, execute the hooks on the mainthread as done currently, and unlock the worker afterwards. A more involved interface would allow placing hooks in the worker context (maybe a future extension).

Tyriar (Member, Author) commented Sep 26, 2022

Good point about parser hooks. Some ideas:

  • Make all parser hooks async
  • Come up with some clever way to install compatible sync hooks into the worker
    • Not-so-clever example: give it an eval() string that returns a value?
  • Add a setting to disable workers (this would happen anyway) for when things like this cause an issue

The handover from the worker was quite fast when I tested it, on the order of < 1ms, at least for my very simple PoC (it would be worse in a busy event loop). So I'm not actually sure this would impact the UX much with all-async hooks, unless you used a lot of hooks.

jerch (Member) commented Sep 27, 2022

Yes, processing foreign hooks on main while the worker is "HALT"ed might impose bad perf again if the hooks fire often (like for every single line). Though imho it will still be much faster done with a HALT atomic on sync worker code than what is currently possible with the stack unwinding. Certainly, being able to place the hooks into the worker would be best performance-wise, but that comes with several restrictions:

  • as you already noted, the code must be executable from eval (so no locally defined obj refs from a higher scope are allowed, etc.)
  • no async code allowed again (otherwise we'd have to do the same stack unwinding again, which we really should try to avoid)

This reduces the usability of worker-added hooks to tiny notifiers or pure reentrant inline functions operating only on symbols defined in the worker context - not very useful if you ask me. Imho the HALT flag and execution on the main side is the more likely use pattern for most foreign hooks. We should measure the impact of a HALT atomic; I think the perf overhead is pretty small (but true, it will create a tiny overhead for its synchronization needs).

About the HALT flag - this could be implemented schematically as follows:

// worker side - within sequence parser:
// (bufferSab here is an Int32Array view onto the SharedArrayBuffer -
// the Atomics functions operate on typed arrays, not on the raw SAB)
...
if (customHooksRegistered(currentSeq)) {
  notifyMain(currentSeq);
  Atomics.store(bufferSab, haltFlagPos, 1);
  // blocks the worker until the main thread resets the flag
  Atomics.wait(bufferSab, haltFlagPos, 1);
}
...

// mainthread
parserWorker.onSequence(currentSeq => {
  // walk all registered hooks and exec
  ...
  // finally: release the worker
  Atomics.store(bufferSab, haltFlagPos, 0);
  Atomics.notify(bufferSab, haltFlagPos, 1);
});

Big advantage here - if we ensure we always pull values from the SAB (no local variables in the parser), we don't have to repair any stack variables; the blocking/unblocking works directly by issuing these 4 atomics instructions. (Another advantage: a hook can still be async; the final part can also be issued by a thenable.)

Tyriar (Member, Author) commented Sep 27, 2022

@jerch 👍 I haven't played with Atomics yet, but this seems promising. I also just realized the behavior of onCursorMove/onLineFeed/onScroll/onWriteParsed would change if we didn't do those sync, so we could either change the behavior to batch those events (eg. onLineFeed @ [x, y, z], onScroll 4 times) or use atomics to keep them sync.

Some of the events, like onCursorMove, arguably shouldn't fire for every single move anyway. I don't see any issue in deferring onWriteParsed either.
