Vulkan: Heavy 3D scene (25M+ primitive indices) performs significantly slower in 4.0 compared to 3.5 #68959
Comments
It's a pretty pathological / unoptimized scene, but I can confirm the observed performance difference. With an AMD Radeon RX VegaM on Linux (Mesa drivers), I get:
Godot 4 reports 26 million primitive indices in the scene, with 730 MeshInstances of the same heavy mesh.
I compared the scenes as well. Initially I got similar results, but then I did two things:
V-Sync throws off frame time measurements and makes them inaccurate. Running a scene from the editor means that you are measuring the performance of the scene running on top of the entire editor and whatever scene is visible in the Viewport. Right now the Vulkan 2D renderer is more performance-intensive than the 3.x 2D renderer, which may explain why your GPU is getting saturated running both the editor and the scene.
That being said, after taking both of those steps 3.x still performed faster for me: This is a pretty big difference, and looking at the visual profiler it appears that all the time is spent in draw calls (depth prepass and opaque pass). This highlights the 4.0 forward_plus scene shader being slower than the 3.x scene shader.
Altogether, a difference of 1 ms is not unexpected, as the forward_plus renderer is designed to scale much better to high numbers of objects and high numbers of lights. As a trade-off, it has a higher base cost, so simple scenes may perform worse. I am not convinced that we are seeing a performance regression outside of the expected boundaries.
Here is v2 with 9 lights in the scene. Performance scaling is much worse: 17 fps vs 54 fps (4.0 vs 3.5). As for the scene itself, it's intentionally built to maximize stress while staying as small and minimal as possible, so that it's easy to share.
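For readers comparing the numbers in this thread: FPS figures are easier to reason about once converted to frame times, since render cost adds up in milliseconds rather than in frames per second. A minimal sketch, assuming only that `awk` is available; `fps_to_ms` is a hypothetical helper name, not something Godot provides:

```shell
#!/bin/bash
# Convert a frames-per-second reading into milliseconds per frame.
# fps_to_ms is a hypothetical helper, not part of Godot.
fps_to_ms() {
  LC_ALL=C awk -v fps="$1" 'BEGIN { printf "%.1f\n", 1000 / fps }'
}

fps_to_ms 54   # 3.5 on the v2 scene -> 18.5
fps_to_ms 17   # 4.0 on the v2 scene -> 58.8
```

Seen this way, the v2 gap is roughly 40 ms per frame, which is a much larger difference than the 1 ms discussed for the first scene.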
Try using the Vulkan Mobile renderer – it likely renders simple scenes faster, at the cost of rendering complex scenes slower.
In v2, that one runs at 9 fps, and that's with only 8 lights working... This scene falls into the complex scene category. OpenGL in 4.0 runs as well as the clustered renderer, albeit with no shadows in v2.
I don't have a current master build (I spent all my free time moving instead of programming recently), but it looks like something is indeed wrong when using a slightly outdated master version. I hope someone with more time than I have can dig deeper into this; I would love to know what exactly is going on.
When switching the meshes to unshaded, I'm still getting the massive difference: 50 fps vs 95 fps.
I have been doing porting tests of my games over to 4.0 to get a feel for performance issues. I am generally finding them to be about 30% slower in 4.0 compared to 3.5. Has anyone tried porting the official demos and seen a similar decrease in raw performance? Presumably there are performance optimisations planned for the renderer before 4.0 is released.
If you have access to a GPU that supports both Vulkan and OpenGL profiling, please look into using your GPU vendor's profiling tool on the same scene on both versions. While the capture files are not portable across GPUs, you can post several screenshots of the resulting graphs (or even record a video of yourself going through the captures).
Adding one more set of data points for performance comparison with the MRPs provided here: OS: Fedora 37 1280×720
3840×2160
Script used for benchmarking
Make sure all four projects are configured to be in windowed (not fullscreen) mode and with V-Sync disabled.
#!/bin/bash
set -xuo pipefail
IFS=$'\n\t'
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720 --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720 --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720 --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720 --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen --rendering-method gl_compatibility
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --resolution 1280x720
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --resolution 1280x720
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --fullscreen
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --fullscreen
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --resolution 1280x720 --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --resolution 1280x720 --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --fullscreen --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --fullscreen --video-driver GLES2
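To turn the raw `--print-fps` output of a script like this into one number per run, the readings can be averaged. The following is only a sketch: `average_fps` is a hypothetical helper, and it assumes each reading is printed as a line containing `FPS: <integer>` (for example `Project FPS: 123 (8.13 mspf)`), which should be checked against the output of the build being tested.

```shell
#!/bin/bash
# Average the FPS readings found on standard input.
# ASSUMPTION: each reading is a line containing "FPS: <integer>", e.g.
# "Project FPS: 123 (8.13 mspf)". Verify this against your build's output.
# average_fps is a hypothetical helper, not part of Godot.
average_fps() {
  LC_ALL=C awk '
    {
      # Find the "FPS:" token and take the integer that follows it.
      for (i = 1; i < NF; i++) {
        if ($i == "FPS:" && $(i + 1) ~ /^[0-9]+$/) { sum += $(i + 1); n++ }
      }
    }
    END {
      if (n > 0) printf "%.1f\n", sum / n
      else print "no samples"
    }
  '
}

# Demo on made-up sample lines (not real measurements).
printf '%s\n' \
  'Project FPS: 100 (10.00 mspf)' \
  'some unrelated log line' \
  'Project FPS: 50 (20.00 mspf)' | average_fps
```

Under the same assumption, a single run could be piped through it, e.g. `timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720 | average_fps`. The first reading after startup may be worth discarding, since it can include loading time.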
I ran a test and found that Forward+ gets 11 fps while the 3.5 version gets 17 fps; Forward Mobile gives the same 17 fps. Is this due to the Forward+ base cost, or to something that isn't optimized yet? I also tried it on clayjohn's PR.
His PR is, AFAIK, for the Compatibility renderer only.
@Calinou you did a very informative benchmark, but it's now outdated, considering that Godot 4.3 comes out soon and many contributed changes could partially solve this problem. Could you redo it with Godot 4.3 dev 5, testing the DirectX 12 backend and testing Forward+ with the depth prepass turned OFF (so: Vulkan, Vulkan without depth prepass, DirectX 12, DirectX 12 without depth prepass)? I believe it could show where the performance regression is hidden; for example, if DirectX is faster than Vulkan, the problem is in the rendering driver. It's a bold assumption, but I believe performance might be worse in the rasterization stage, and the benchmark will prove or disprove it.
There are a lot of CPU-related bottlenecks I have in mind first that have been identified in the Forward+ and Mobile renderers rather than in the lower-level drivers. There will be a lot of work towards optimizing those. If anything, out of the two, the D3D12 driver is likely to perform slower because it resolves barriers on its own, something the Vulkan driver doesn't do. Once I'm done with other tasks, it is very likely I'll prioritize taking a look at the CPU performance of the renderer. I don't know if this scene exposes said bottlenecks, but out of the two areas it's the one where I'm running into limitations right now with the heavier projects. If this scene is GPU-bottlenecked instead, it might be safe to assume the cost is inside the rasterization components (e.g. associated buffers, shaders, pipeline config, uniforms, etc.). As far as I understand, 3.5 and 4.0 are completely different things in this scenario, so you could very well have a higher base cost but better scaling elsewhere due to supporting different features.
This is a heavily GPU-bottlenecked scene, and it's just a spam of triangles and lights, with no advanced features past that.
@DarioSamo With this scene the entire cost is GPU and it comes from the depth prepass and opaque rendering, so CPU optimizations won't help. Also worth noting: the entire scene is drawn in 1 draw call due to automatic batching.
Checking now on my laptop with an integrated GPU, the GPU time is 47 ms in the MRP (16 ms from the depth prepass, 30 ms from opaque rendering, and 0.5 ms from tonemapping). Upgrading the meshes to use the compressed format reduces that to 30 ms (11 ms from the depth prepass, 19 ms from opaque rendering, and 0.5 ms from tonemapping).
Clearly then, the MRP is bandwidth-bound to begin with. It might still be bandwidth-bound even after mesh compression. So we should look into the following:
That's good to know. So, to answer @Capewearer's question, it doesn't sound like there's much to be gained from the latest improvements other than mesh compression, as bandwidth seems to be the main limitation.
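To put the pass timings quoted above into perspective, the per-pass figures can be summed and the saving from mesh compression expressed as a percentage. This is just arithmetic on the numbers reported in this thread; `pass_total` and `percent_saved` are hypothetical helper names.

```shell
#!/bin/bash
# Sum per-pass GPU times given in milliseconds.
# pass_total and percent_saved are hypothetical helpers for illustration.
pass_total() {
  LC_ALL=C awk 'BEGIN { s = 0; for (i = 1; i < ARGC; i++) s += ARGV[i]; printf "%.1f\n", s }' "$@"
}

# Percentage of GPU time saved going from "before" to "after".
percent_saved() {
  LC_ALL=C awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

before=$(pass_total 16 30 0.5)   # depth prepass, opaque, tonemap (uncompressed meshes)
after=$(pass_total 11 19 0.5)    # same passes with compressed meshes
echo "before: ${before} ms, after: ${after} ms"
echo "saved:  $(percent_saved "$before" "$after")%"
```

The sums (46.5 ms and 30.5 ms) match the rounded totals of 47 ms and 30 ms quoted above, and the saving works out to roughly a third of the GPU time on that integrated GPU.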
I just did a quick test using a mat3x4 for the instance transform instead of a mat4 and shaved another 1.5 ms of GPU time off. This is definitely something to investigate further. We can shave off another 0.75 ms by deleting the branch for multimesh. In practice, we will do that by making it a specialization constant instead of using a per-instance flag. Keep in mind that these numbers are being recorded on a laptop integrated GPU; I do not expect to see similar performance gains across all devices.
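As a rough illustration of why the matrix type matters in a bandwidth-bound scene, the size of the per-instance transform can be compared directly. This sketch assumes 32-bit float components with no extra padding and uses the 730 instances reported earlier in the thread; `matrix_bytes` is a hypothetical helper. How often the transform is actually fetched depends on the hardware and driver, so this only shows the relative size of the data.

```shell
#!/bin/bash
# Size in bytes of a matrix with the given number of 32-bit float components.
# ASSUMPTION: 4 bytes per component and no padding; matrix_bytes is hypothetical.
matrix_bytes() {
  echo $(( $1 * 4 ))
}

mat4_bytes=$(matrix_bytes 16)    # 4 columns x 4 rows
mat3x4_bytes=$(matrix_bytes 12)  # 3 columns x 4 rows
instances=730                    # instance count reported earlier in the thread

echo "mat4:   ${mat4_bytes} bytes per instance, $(( mat4_bytes * instances )) bytes total"
echo "mat3x4: ${mat3x4_bytes} bytes per instance, $(( mat3x4_bytes * instances )) bytes total"
echo "saved:  $(( (mat4_bytes - mat3x4_bytes) * 100 / mat4_bytes ))% of the transform data"
```

The smaller matrix carries 25% less transform data per instance (48 bytes instead of 64).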
I've tested this patch on the MRP (with meshes upgraded to 4.2 format), and I don't notice any performance difference on a RTX 4090 in 4K in Vulkan (380 FPS with and without the patch) and Direct3D 12 (330 FPS, also with and without the patch). What's strange is that Direct3D 12 will occasionally reach 440 FPS (and stay around that framerate) on On other GPUs, I got largely identical results:
Godot version
4.0 beta 5 & 3.5.1 stable
System information
Windows 11, Vulkan, GTX 1050 Ti 526.98
Issue description
The stable release seems to run about twice as fast on average as 4.0 across various identical scenes, using only the features that are common to both branches (so no mesh LOD, SSAO, SSR, occlusion culling or anything like that; just meshes and lights with no effects).
In the MRPs provided below, the scene converted to 4.0 and adjusted to match the original gets 30-34 FPS depending on the renderer used, at 100% GPU utilization, while 3.5 gets 60 FPS at 85% GPU utilization.
Steps to reproduce
Test the same scenes across 3.5 and 4.0.
Minimal reproduction project
4.0.zip
3.5.zip