text/v2: Draw performances using DrawImage #2976

Zyko0 · 2024-05-01T17:48:21Z

Note: We're talking about a CPU bottleneck mostly
Note2: Old related issue #1880

Currently, text.Draw makes a call to DrawImage for each glyph of a string which requires a traversal of the internal images' DrawTriangles pipeline (transformations, copy, triangles merging mechanisms) before resolving to a draw command.

This can be an issue with large texts, as the performance cost would scale based on the number of glyphs to render.

Since the glyphs share a source, they could benefit from a higher level atlas (owned as an image by text/v2 or each source).
Then, triangles can be built and submitted all at once in a DrawTriangles call instead of many DrawImage calls, which would trigger the internal draw pipeline only once.

Potential hidden but unmeasured benefit (discussed on Discord):
It could become difficult for Ebitengine to reliably affect each glyph sharing the same text command to the same single atlas, because at this stage Ebitengine's internals are not aware of the glyphs' meaning and their parent context.
For example, this could potentially happen if: many images are created by a user, as well as many glyphs (from different font for example), or if some new glyphs from a registered font are used for the first time later in the game execution.

Here is a gist of a beginning of an implementation (profiling code is included because that's how I could measure): https://gist.github.com/Zyko0/413c536b625b7a7dc27ef2030ddfa027

Here is the output of pprof of the Game.Draw function for a 95sec profile:

(TextDraw is the triangles implementation from the gist above)

Extra notes:

The textv2-owned glyph atlas might not be trivial to implement and manage at runtime, and could make the overall triangles solutions harder to maintain / less beneficial than the DrawImage in terms of performances, if not done correctly.
If the string is too big and produces more indices than MaxIndicesCount, the triangles commands from this function would have to be split maybe?

The text was updated successfully, but these errors were encountered:

tinne26 · 2024-05-06T08:00:36Z

which requires a traversal of the internal images' DrawTriangles pipeline

Can you point to the code? Intuitively, it looks to me like each potential draw command in progress could cache a few values that could be compared quickly to know if we are possibly breaking the draw command or not, I don't see why there's any need to iterate. E.g., we can quickly access the image's atlas pointer to know if it's the same we have had throughout the draw or not? I'm probably missing many things, but it would be nice to contextualize first. What cases make traversals really necessary?

Zyko0 · 2024-05-06T14:02:49Z

@tinne26

By "traversal", I mean the path each DrawImage is taking (involving loops, slices copies, many branches):

ebiten: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/image.go#L213-L266
ui: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/internal/ui/image.go#L80-L117
mipmap: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/internal/mipmap/mipmap.go#L68-L128 (possibly "skipped" for images, but still)
buffered: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/internal/buffered/image.go#L186-L212
atlas: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/internal/atlas/image.go#L420-L537
graphicscommand: https://github.com/hajimehoshi/ebiten/blob/v2.8.0-alpha.1/internal/graphicscommand/image.go#L133-L146
graphicscommand (triangles merging logic): https://github.com/hajimehoshi/ebiten/blob/main/internal/graphicscommand/commandqueue.go#L108-L166

So, I assume the performance penalty comes from traversing this for each glyph of a big text, even though they get merged in the end.

By doing DrawTriangles you also go through this, but only once.

I made another local test by having the glyphs slice already available for both:

         .     11.34s    194:   TextDraw(screen, glyphs) // Single draw triangle with len(glyphs) triangles
         .          .    195:
         .     55.88s    196:   TextDrawImage(screen, glyphs) // len(glyphs) DrawImage loop

Zyko0 · 2024-05-06T14:09:09Z

So, in the end I don't think DrawImage performances need to be improved, or maybe it does, but if it does, then again it will also benefit DrawTriangles which takes the same path, meaning single DrawTriangles will still be slightly faster.

What I'm suggesting, is that in such scenarios where it seems possible to have the control over the triangles to be drawn and the source image, it should be done this way (since DrawImageS** doesn't exist, it's not just about triangles here).

I just see it as a "more advanced" usage of Ebitengine API, which can happen in user projects as well (think many sprites for example).
text/v2.Draw implementation does not depend on internals, so if it doesn't change at ebitengine level, a user can still roll its own solution.

Note: Also again, we should not forget about the fact that glyphs' source atlas cannot be controlled since Ebitengine treat them as random images. Which means it's possible that if a user is manipulating many source images as well as some glyphs, that glyphs images wouldn't be on the same source atlas => which means that a single text.Draw might have to be split into multiple draw commands due to the necessity of binding another source atlas texture containing the required glyphs.

edit: Just another tldr clarification, it's not triangles vs image, it's "many images at once" vs "many times 1 image". Just that the only way to submit many images at once is by using DrawTriangles+a source atlas

Zyko0 · 2024-05-06T15:02:56Z

E.g., we can quickly access the image's atlas pointer to know if it's the same we have had throughout the draw or not? I'm probably missing many things, but it would be nice to contextualize first. What cases make traversals really necessary?

As for this question specifically, I think @hajimehoshi would have a better answer, but mipmap, buffered (context restoration on android?), ui (first internal implementation), atlas (atlas => 1 image wrapping many), graphicscommand (merging commands, storing them for replay (?) and delayed execution) all seem mandatory and by design.
Not sure if that's what you were asking though

hajimehoshi · 2024-05-06T15:24:40Z

My understanding is that @tinne26's question is 'why not merging DrawImage calls in an upper layer rather than graphicscommand'. Is that correct?

hajimehoshi · 2024-05-06T15:32:14Z

Assuming this is the question, yeah this is possible, but I don't want to do that since the merging logic would be split and distributed. If the performance would be much improved, we could consider this.

tinne26 · 2024-05-06T17:15:58Z

Ok, a few miscellaneous observations:

DrawTriangles() and manual atlasing is being around twice as fast as DrawImage() in these results. We are missing two critical things here to make a proper performance assessment: first, we really need a graph showing performance of DrawImage vs DrawTriangles throughout an increasing number of DrawImage calls to see how does this scale (e.g., start at "M" and end with "My big sentence [... 1000 glyphs more ...]"). Second, we would need to contextualize these times in % of frame time budget. Let's say 60TPS, no screen clearing and skipping draws without update changes. This would make everything pretty crystal clear.
So, testing performance on Ebitengine is rather tricky. If we really want to make progress on these issues, maybe we should eventually invest in building a bench subpackage for Ebitengine or something, because otherwise making meaningful measurements and having others try to reproduce on their end is a big practical obstacle.
If there's any function in Ebitengine deserving optimization effort, then DrawImage() is probably one of the leading contenders... but at the same time, I also agree that we don't want things to become too complex and messy. In fact, the current pipeline is already very deep and complex. If anything, before attempting any optimization, I'd rather get a document/diagram explaining all the required steps and special cases and conditions involved in the process of going from "draw image" to "quad appended to graphics command". Because what's already there is already too much for most mortals to visualize, and there are tons of conditions and branches for technical details.

So, I'm not saying we try to do any of this, but I definitely feel like meaningful progress on this type of issues would require much more groundwork. Ok, Hajime would be able to do it on his own, sure, but my main point is that if only he is able to do it / visualize it / follow it / understand it, then it's probably not an optimization we want to add (obviously, talking about DrawImage() here, not DrawText()).

Regarding the DrawText() optimization, I think we need more investigation into the DrawImage() vs DrawTriangles() before being able to make a call. Actually, there are some additional issues, like this optimization forcing an extra atlas usage, which might not be ideal neither on very small games (where everything could fit into a single atlas otherwise), nor big games that have to be more meticulous about the resources used. It's also far from impossible to overflow the atlas, especially if you try to be conservative with its size (e.g. at 1024x1024, an application with zoom or where the user can select fonts from the system list, + non perfect atlas filling, +4 subpositions per glyph... you can get in trouble earlier than expected). Otherwise you might be wasting resources. While I tend to agree that the optimization is reasonable, the implications in practical contexts are not so clear to me, and there are a fair bit more complications than the gist lets on.

hajimehoshi · 2024-05-06T18:02:25Z

Ebitengine should be dead simple for users, but unfortunately not so simple for implementers 😄

If anything, before attempting any optimization, I'd rather get a document/diagram explaining all the required steps and special cases and conditions involved in the process of going from "draw image" to "quad appended to graphics command".

I'm fine to draw a big diagram to explain the flow how an image is rendered from DrawImage to the graphics driver. Let me have time.

Zyko0 · 2024-05-06T19:51:32Z

@tinne26

We are missing two critical things here [...]

For your first point, yeah I took a few shortcuts obviously, but I think the data in its current form is already relevant. I was hoping for Hajime or someone else to assess if I missed something, by having the gist shared and for people wanting to join to check against it.
It is true that, if the context is vsync=on and 10 strings on screen, everything is irrelevant because performances would be acceptable in any case.

So, testing performance on Ebitengine is rather tricky. If we really want to make progress on these issues, maybe we should eventually invest in building a bench subpackage

For your second point, I suggested this once, but it can probably be made by a user (I wanted to do it) as an external repository to test scenarios against each version of ebitengine.

If anything, before attempting any optimization, I'd rather get a document/diagram explaining all the required steps and special cases and conditions involved in the process of going from "draw image" to "quad appended to graphics command". Because what's already there is already too much for most mortals to visualize, and there are tons of conditions and branches for technical details.

I agree that the way this issue is written is mostly in destination to Hajime, because I believe I'm starting to have some knowledge regarding the things you mention, so this is addressed to either Hajime or people knowing what I'm refering to.
But I agree with the fact that there should be a document, in fact when Hajime talked about his desire to write more documentation on Ebitengine, that's the exact path (DrawImage) that I said would be beneficial to contributors.

So, I'm not saying we try to do any of this, but I definitely feel like meaningful progress on this type of issues would require much more groundwork. Ok, Hajime would be able to do it on his own, sure, but my main point is that if only he is able to do it / visualize it / follow it / understand it, then it's probably not an optimization we want to add (obviously, talking about DrawImage() here, not DrawText()).

I mentioned optimizing DrawText over DrawImage because (I linked a previous issue at the top of this issue post regarding DebugPrint), optimizations have already been made on DrawImage and the function itself seems quite "stacked" already.

Regarding the DrawText() optimization, I think we need more investigation into the DrawImage() vs DrawTriangles() before being able to make a call. Actually, there are some additional issues, like this optimization forcing an extra atlas usage, which might not be ideal neither on very small games (where everything could fit into a single atlas otherwise)

And this point, is actually the point of the issue, imo the only consideration is: how difficult would it be to manage an extra atlas at text level.
Regarding "extra atlas usage", what I'm talking about is not an "internal atlas", but rather a simple image (managed) where the glyphs would stay on (that could itself be part of the single source atlas from ebitengine you're talking about).
^ => Just how a user would create its own user atlas to ensure that they control the source of their images, and have them all at the same place with uses of SubImages from their user-owned atlas (doesn't have to be unmanaged again, I'm talking about a logical user atlas).

And I know there would be complexity involved in managing an extra atlas at an higher level, and I'm not really knowledgeable on the topic, I know @hajimehoshi implemented this logic internally, so that's why I left this unsolved (and actually mentionned about it in the original post), I wanted his input on this because I know this can be a critical point (if managing a text-level image atlas of glyphs taxes more resources than simply using DrawImage, but I said that in the OP already).

So, just to be completely clear: whether DrawImage could be optimized or not, just know that DrawTriangles takes the same path, so they have the same "issue" of going through too many steps (if that's an issue).
The only thing is that we can batch it at software level thanks to DrawTriangles and a user-owned atlas, which is not possible with a DrawImage call since it will only resolve to a single destination (single triangle) => so it has to be called many times which in the case of text might not be suitable.

I made another profiling, by having the glyphs cached (text.AppendGlyphs) and doing a simple for loop with DrawImage and a simple for loop with DrawTriangles (this time two triangles by image, and one call by image):

We can see that the DrawImage loop is actually faster (not by much), probably thanks to mipmap skipping, but this is not what matters.
When looping through each sprite/glyphs, you have to traverse everything, when this could be batched at "user" (text) level, like an advanced ebitengine usage.
To me, this is a reason why DrawTriangles exists for example, ebitengine is "dead simple" but still gives you ways to have more control (unmanaged image is another example) and do more complex things/optimizations.
To me, text and big texts (in a UI app) that could consist in more glyphs than sprites a traditional 2D rpg game could have, would be too many images, and could benefit from this extra control.

This is just a general suggestion, I can already implement my own text.Draw, supposing I figure a decent glyph atlas solution, and I will probably do if the situation doesn't change.
So, it's not critical for ebitengine not to adopt this, but I think it should be considered for the long-term health of the Ebitengine library.
DrawTriangles is not only a geometry feature, to me it's also a feature giving users (such as user text/v2) a way to control the vertices, the source texture, and rely/annoy ebitengine the least possible.

hajimehoshi · 2024-05-07T04:45:57Z

For a user atlas:

If you use a managed image, the user atlas will be a part of an internal atlas. This might not be efficient in terms of atlas usage, but the graphics commands are well batched.
If you use an unmanged image, the user atlas will be an independent image from other internal atlases. Then, the graphics commands are not batched when you render texts.

IMO we should not go with 2 unless we find a strong reason.

tinne26 · 2024-05-07T07:07:40Z

Regarding what Zyko said (agree with the rest of the comment on general context, so focusing only on the technicalities of this particular issue):

supposing I figure a decent glyph atlas solution

So, the solution for preallocating all necessary space for a font atlas would go like this: for each font "face" (combination of size + font, you don't really need all glyphs ever into a single atlas), you iterate from 0 to font.NumGlyphs() - 1, getting the font.GlyphBounds() of each glyph (you also need to take into account quantization, because you need x4 space if you are either using subpixel positioning or allowing changing it later). This is actually a very expensive "preloading" operation. If you don't do this you will eventually incur on other expensive reallocations of glyphs in the middle of drawing, which can be done relatively efficiently too, but it gets more annoying, and you have to deal with restarting draws and so on.

To be honest, precomputing font atlas size is not a good general solution, because in many cases you would end up creating a big "face" for only a few letters, and reserving a lot of space and iterating all glyphs and computing their bounds (which unlike general text positioning with advance and so on requires iterating the whole glyph outline) is too wasteful.

Partial recap:

We only need the masks of a single "face" to be on the same atlas.
We can compute the full font "face" atlas size ahead, but it's not a cheap process and it will be wasteful in many cases, both in terms of GPU memory and glyph "analysis".
So, that only leaves the idea of "please keep all glyphs of the same font "face" on the same image so we can draw them all together directly".

And for that last case, what we have is a general problem that should be solved with an orthogonal API, which can be externally written as a sub-atlas that basically does everything an Ebitengine atlas does, plus being able to reassign and reorganize itself if more space is needed, without breaking contents into multiple separate atlases, and then a QueueImageDraw(target, src, x, y) and a FlushDraws(target) pair of methods. Seems useful, it's pretty doable as a separate package, and it's the general underlying mechanism required to solve this problem and make many others easier to optimize. Textbook single region manual draw queueing.

Zyko0 · 2024-05-07T10:00:03Z

@tinne26 I agree with every concerns and ideas.

@hajimehoshi

IMO we should not go with 2 unless we find a strong reason.

Yeah, I had 1 in mind so that glyphs share the same source, but a text.Draw can also share the same source with sprites and other rendering operations (also not sure what would be the need for unmanaged, so regular would be the way to go).
What did you mean by "This might not be efficient in terms of atlas usage" for 1, if it needs to be grown, for example?

hajimehoshi · 2024-05-07T13:43:37Z

What did you mean by "This might not be efficient in terms of atlas usage" for 1, if it needs to be grown, for example?

If a user atlas is too big, there would be a big unused space on a user atlas and also on an internal atlas. If a user atlas is too small, you might have to recreate an atlas by extending it, and such extending an atlas is not so efficient: a new user atlas image would not belong to the original image for a while, since such an image is recognized as a newly created offscreen image for a rendering destination.

frolosofsky · 2024-06-28T12:56:35Z

I'm not sure if my issue is related because I didn't dig into it too much but my observation is that text rendering is too expensive in ebiten.

Running an empty window eats ~10% CPU.
Adding 100-1000 images in the Draw method is ~10-14% CPU.
Drawing 30-40 short text messages like "hello" on top of the screen (both, directly or via intermediate Image objects) results in 100-130% CPU.

I use Intel Mac with integrated GPU. According to profiling, 95-99% CPU time spent inside CGO, for the every case.

My solution/workaround so far is caching. I render every unique text once, cache the Images, and then draw them from the cache. It gives me regular ~10-14% CPU load.

tinne26 · 2024-06-28T14:05:19Z

@frolosofsky While you are all talking about text rendering performance, there are different issues here:

What Zyko was discussing was mainly to optimize many DrawImage calls into fewer, more direct DrawTriangles calls.
What you are talking about is general text rendering performance.

If you are worried about text rendering performance, I think a reasonable expectation would be that rendering text takes an amount of time in the same order of magnitude as N DrawImage calls, given as N matching the number of glyphs you have in the text (and assuming the glyphs have already been cached and the image sizes are similar). You seem to be saying that this is not happening, and that you are seeing it take x10 the time instead. If this is really the case, you should write some code that showcases this and open a separate issue presenting your findings, so we can test on different computers to confirm it.

Zyko0 added feature request labels May 1, 2024

hajimehoshi added this to the v2.8.0 milestone May 1, 2024

hajimehoshi added performance and removed feature request labels May 1, 2024

Zyko0 mentioned this issue Jul 27, 2024

PoC: text/v2 glyph atlas #3051

Draft

hajimehoshi modified the milestones: v2.8.0, v2.9.0 Sep 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

text/v2: Draw performances using DrawImage #2976

text/v2: Draw performances using DrawImage #2976

Zyko0 commented May 1, 2024 •

edited

Loading

tinne26 commented May 6, 2024

Zyko0 commented May 6, 2024

Zyko0 commented May 6, 2024 •

edited

Loading

Zyko0 commented May 6, 2024 •

edited

Loading

hajimehoshi commented May 6, 2024

hajimehoshi commented May 6, 2024

tinne26 commented May 6, 2024

hajimehoshi commented May 6, 2024

Zyko0 commented May 6, 2024 •

edited

Loading

hajimehoshi commented May 7, 2024 •

edited

Loading

tinne26 commented May 7, 2024 •

edited

Loading

Zyko0 commented May 7, 2024 •

edited

Loading

hajimehoshi commented May 7, 2024 •

edited

Loading

frolosofsky commented Jun 28, 2024

tinne26 commented Jun 28, 2024

text/v2: Draw performances using DrawImage #2976

text/v2: Draw performances using DrawImage #2976

Comments

Zyko0 commented May 1, 2024 • edited Loading

tinne26 commented May 6, 2024

Zyko0 commented May 6, 2024

Zyko0 commented May 6, 2024 • edited Loading

Zyko0 commented May 6, 2024 • edited Loading

hajimehoshi commented May 6, 2024

hajimehoshi commented May 6, 2024

tinne26 commented May 6, 2024

hajimehoshi commented May 6, 2024

Zyko0 commented May 6, 2024 • edited Loading

hajimehoshi commented May 7, 2024 • edited Loading

tinne26 commented May 7, 2024 • edited Loading

Zyko0 commented May 7, 2024 • edited Loading

hajimehoshi commented May 7, 2024 • edited Loading

frolosofsky commented Jun 28, 2024

tinne26 commented Jun 28, 2024

Zyko0 commented May 1, 2024 •

edited

Loading

Zyko0 commented May 6, 2024 •

edited

Loading

Zyko0 commented May 6, 2024 •

edited

Loading

Zyko0 commented May 6, 2024 •

edited

Loading

hajimehoshi commented May 7, 2024 •

edited

Loading

tinne26 commented May 7, 2024 •

edited

Loading

Zyko0 commented May 7, 2024 •

edited

Loading

hajimehoshi commented May 7, 2024 •

edited

Loading