Final results of collaboration with Google and The Forge #90284
base: master
Conversation
- vulkan implementation to allocate with a persistent memory address - getter implementation - new member added for vulkan BufferInfo
…rtex and index buffer
…uffer creation to use them
… function pointers if found
…/ setobject name + a conversion function
…ice, not the instance
…being used on android
… all associated modules have been created
I get that "this is how git works" and "technically there is nothing wrong", but it makes the diff shown in Github impossible to review online. It needs to be done offline. I'm asking if there is a way to publish online something user friendly that everyone can take a look at. Because right now Github (and git) says there's 300k LOC to review, but the actual amount that needs reviewing is nowhere remotely close to that. |
What are you basing that on? It's right there: the changes in the list here are the changes, it's not different.
If I type (using my fork which stripped thirdparty changes):

git merge-base master TF_final
# Yields bb6b06c81343073f10cbbd2af515cf0dac1e6549
git diff bb6b06c81343073f10cbbd2af515cf0dac1e6549..TF_final

I get lots of irrelevant stuff in the diff, like this:

diff --git a/thirdparty/thorvg/update-thorvg.sh b/thirdparty/thorvg/update-thorvg.sh
index 2811a43339..7a754c09b9 100755
--- a/thirdparty/thorvg/update-thorvg.sh
+++ b/thirdparty/thorvg/update-thorvg.sh
@@ -1,6 +1,6 @@
 #!/bin/bash -e
-VERSION=0.12.5
+VERSION=0.12.9

That change to update-thorvg.sh is in both latest master and latest TF_final. The same problem happens if I type:

git diff master..TF_final

I get that "technically" this is how git works.
I brute-forced the problem!!! THESE ARE THE ACTUAL CHANGES THAT NEED REVIEWING. 75 files, 4366 additions, 717 deletions. Much cleaner. |
Well, that still requires this PR to be cleaned up; there's no way to review it in that form.
The PR will likely not be merged "as is" but rather split up into smaller chunks for integration. I know that the link I posted can't be reviewed using GitHub's interface, but it gives an overview of what to expect: the changes are nowhere near that scary. In fact very reasonable. But right now GitHub is showing a 10:1 noise-to-signal ratio.
And that's why I asked for it for the purposes of reviewing :) I'm fully aware; that's why I asked for that rather than trying to break down this one.
Missing thirdparty/README.md and COPYRIGHT.txt updates and documentation for the new methods. Since this is not intended to be merged as is, I'm not commenting on various code style issues. But it probably should be rebased at least once to run CI and make reviewing easier.
@@ -50,7 +50,7 @@
	create_info.pLayer = *wpd->layer_ptr;

	VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
-	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, nullptr, &vk_surface);
+	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);
The same is missing from macOS and Wayland code.
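For reference, a rough sketch of what the analogous Wayland-side change might look like (assuming the same get_allocation_callbacks() helper is reachable there; the actual create_info type and file differ per platform):

// Hypothetical sketch only, mirroring the Metal change above for the Wayland surface path.
VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
VkResult err = vkCreateWaylandSurfaceKHR(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);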
@@ -342,6 +343,11 @@ class DisplayServer : public Object {
		INVALID_INDICATOR_ID = -1
	};

+private:
+	ScreenOrientation native_orientation = ScreenOrientation::SCREEN_LANDSCAPE;
Seems to be unused.
if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
	return NULL;
Suggested change:

-if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
-	return NULL;
+if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == nullptr) {
+	return nullptr;
+}
Any reason it's not using C11 aligned_alloc?
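For context, the quoted code appears to follow the classic over-allocate-and-store-offset pattern; one likely reason aligned_alloc isn't used is that MSVC does not provide the C11 function, and C11 requires the size to be a multiple of the alignment. A minimal illustrative sketch of that pattern, with made-up names and assuming a power-of-two alignment:

#include <stdint.h>
#include <stdlib.h>

// Illustrative only: over-allocate, align the returned pointer, and stash the
// offset back to the real allocation right before the aligned pointer.
static void *aligned_alloc_manual(size_t p_bytes, size_t p_alignment) {
	void *p1 = malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t));
	if (p1 == nullptr) {
		return nullptr;
	}
	uintptr_t base = (uintptr_t)p1 + sizeof(uint32_t);
	uintptr_t aligned = (base + p_alignment - 1) & ~(uintptr_t)(p_alignment - 1);
	((uint32_t *)aligned)[-1] = (uint32_t)(aligned - (uintptr_t)p1); // Offset needed by the matching free.
	return (void *)aligned;
}

static void aligned_free_manual(void *p_ptr) {
	uint32_t offset = ((uint32_t *)p_ptr)[-1];
	free((uint8_t *)p_ptr - offset);
}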
if (p_memory == NULL)
	return alloc_aligned_static(p_bytes, p_alignment);
Suggested change:

-if (p_memory == NULL)
-	return alloc_aligned_static(p_bytes, p_alignment);
+if (p_memory == nullptr) {
+	return alloc_aligned_static(p_bytes, p_alignment);
+}
	glslang::GlslangToSpv(*program.getIntermediate(stages[p_stage]), SpirV, &logger, &spvOptions);

	ret.resize(SpirV.size() * sizeof(uint32_t));
	{
		uint8_t *w = ret.ptrw();
		memcpy(w, &SpirV[0], SpirV.size() * sizeof(uint32_t));
		char buffer[256];
Unused.
@@ -123,6 +123,7 @@ class DisplayServerAndroid : public DisplayServer {

	virtual void screen_set_orientation(ScreenOrientation p_orientation, int p_screen = SCREEN_OF_MAIN_WINDOW) override;
	virtual ScreenOrientation screen_get_orientation(int p_screen = SCREEN_OF_MAIN_WINDOW) const override;
+	virtual int screen_get_current_rotation(int p_screen) const override;
This should be implemented for the rest of the platforms (should be easy to do on all except Web).
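For platforms that never rotate the display, a stub is probably all that's needed. A hypothetical sketch, not the actual implementation (as noted further down in this thread, the value must be 0 on most desktop platforms):

// Hypothetical stub for platforms without display rotation (e.g. typical desktop setups).
virtual int screen_get_current_rotation(int p_screen) const override {
	return 0; // No rotation to report.
}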
@@ -184,13 +184,16 @@ def configure(env: "SConsEnvironment"):
	if env["arch"] == "x86_32":
		# The NDK adds this if targeting API < 24, so we can drop it when Godot targets it at least
		env.Append(CCFLAGS=["-mstackrealign"])
+		env.Append(LIBPATH=["../../thirdparty/swappy-frame-pacing/x86"])
Binary libs in the tree? This should be either built in-tree or an optional external dependency.
#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
	ClassDB::bind_method(D_METHOD("get_tracked_object_name"), &Engine::get_tracked_object_name);
	ClassDB::bind_method(D_METHOD("get_tracked_object_type_count"), &Engine::get_tracked_object_type_count);
#endif

#if defined(VK_TRACK_DRIVER_MEMORY)
	ClassDB::bind_method(D_METHOD("get_driver_total_memory"), &Engine::get_driver_total_memory);
	ClassDB::bind_method(D_METHOD("get_driver_allocation_count"), &Engine::get_driver_allocation_count);
	ClassDB::bind_method(D_METHOD("get_driver_memory_by_object_type"), &Engine::get_driver_memory_by_object_type);
	ClassDB::bind_method(D_METHOD("get_driver_allocs_by_object_type"), &Engine::get_driver_allocs_by_object_type);
#endif

#if defined(VK_TRACK_DEVICE_MEMORY)
	ClassDB::bind_method(D_METHOD("get_device_total_memory"), &Engine::get_device_total_memory);
	ClassDB::bind_method(D_METHOD("get_device_allocation_count"), &Engine::get_device_allocation_count);
	ClassDB::bind_method(D_METHOD("get_device_memory_by_object_type"), &Engine::get_device_memory_by_object_type);
	ClassDB::bind_method(D_METHOD("get_device_allocs_by_object_type"), &Engine::get_device_allocs_by_object_type);
#endif
This will mess up documentation generation, so these should not be ifdefed but instead always active: call the methods from the RenderingContext instance (RenderingDevice::get_singleton()->...) and have dummy implementations in the base RenderingContext class (since the same can later be implemented for Metal and DX12).
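A rough sketch of the suggested shape (illustrative only; the real class, method names and return types would need to match the engine):

// Hypothetical: always-present virtuals with dummy implementations in the base class,
// so the ClassDB bindings and generated documentation are identical in every build.
class RenderingContextDriver {
public:
	virtual uint64_t get_driver_total_memory() const { return 0; }
	virtual uint64_t get_driver_allocation_count() const { return 0; }
	virtual uint64_t get_device_total_memory() const { return 0; }
	virtual uint64_t get_device_allocation_count() const { return 0; }
	virtual ~RenderingContextDriver() {}
};
// The Vulkan driver would override these only when its tracking macros are defined;
// other builds and other backends fall back to the zero-returning stubs.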
@@ -36,6 +36,11 @@
 #include "core/templates/list.h"
 #include "core/templates/vector.h"

+#if defined(VULKAN_ENABLED) && (defined(DEBUG_ENABLED) || defined(DEV_ENABLED))
Should be moved to drivers/vulkan.
I'm looking into this PR.
It seems like it can't be moved there.
It appears that VK_TRACK_DEVICE_MEMORY is a debug-only macro (it's only defined in debug & dev builds).
When that macro is defined, objects from core inform the Vulkan subsystem what subsystem they belong to.
e.g. in core/core_bind.cpp we have things like:
#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
String Engine::get_tracked_object_name(uint32_t typeIndex) const {
return RenderingContextDriverVulkan::get_tracked_object_name(typeIndex);
}
uint64_t Engine::get_tracked_object_type_count() const {
return RenderingContextDriverVulkan::get_tracked_object_type_count();
}
#endif
Where get_tracked_object_name returns a string like "INSTANCE", "PHYSICAL_DEVICE", "DESCRIPTOR_SET_LAYOUT".
I don't know if there is a better way yet, but it's not a simple thing of moving that line to a different header.
@@ -28,6 +28,8 @@
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
 /**************************************************************************/

+#pragma optimize("", off)
?
The reported performance numbers are compared to 4.3. Essentially it is Godot with/without these changes.
There are other performance improvements coming to 4.3, so yes, the difference is bigger when compared to 4.2 as it would include all the performance improvements made by others. |
Shared Changes

Memory Allocation code. Lots of changes interact in one way or another with GPU memory allocation. The most prominent one is for debug tracking.

Specific Changes
I am strongly leaning toward submitting these as a single PR due to how closely they are tied to each other:
For example "Persistent Mapping" could be made into its own PR. However such PR would be criticized for just adding being dead code; because it's not until "Replace Push Constants with UBOs" gets integrated that it would be used. And "Replace Push Constants with UBOs" depends on the first three. Although it could be broken down; it would be a lot more effort. Noteworthy or slightly controversial changesPreviously device lost were handled generically by The purpose of device_queue.submit_mutex.lock();
err = vkQueueSubmit(device_queue.queue, 1, &submit_info, vk_fence);
device_queue.submit_mutex.unlock();
if (err == VK_ERROR_DEVICE_LOST) {
print_lost_device_info();
CRASH_NOW_MSG("Vulkan device was lost.");
}
ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED); It think it may be controversial because Godot convention is to use I'm not sure how Godot Team wants to handle integrating this block of code, but what is for sure is that really want to call Bugs / Odditiesshader_destroy_modulesTheForge introduced In void RenderingDeviceDriverVulkan::shader_free(ShaderID p_shader) {
ShaderInfo *shader_info = (ShaderInfo *)p_shader.id;
for (uint32_t i = 0; i < shader_info->vk_descriptor_set_layouts.size(); i++) {
vkDestroyDescriptorSetLayout(vk_device, shader_info->vk_descriptor_set_layouts[i], VKC::get_allocation_callbacks(VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT));
}
vkDestroyPipelineLayout(vk_device, shader_info->vk_pipeline_layout, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_PIPELINE_LAYOUT));
for (uint32_t i = 0; i < shader_info->vk_stages_create_info.size(); i++) {
vkDestroyShaderModule(vk_device, shader_info->vk_stages_create_info[i].module, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_SHADER_MODULE));
}
// <TF>
// @ShadyTF unload shader modules
// Was :
//for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
// if (si.vk_stages_create_info[i].module) {
// vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
// }
//}
shader_destroy_modules( p_shader );
// </TF>
VersatileResource::free(resources_allocator, shader_info);
}
// <TF>
// @ShadyTF unload shader modules
void RenderingDeviceDriverVulkan::shader_destroy_modules(ShaderID p_shader) {
auto *E = (RBSet<ShaderInfo>::Element *)p_shader.id;
ShaderInfo &si = E->get();
for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
if (si.vk_stages_create_info[i].module) {
vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
si.vk_stages_create_info[i].module = VK_NULL_HANDLE;
}
}
si.vk_stages_create_info.clear();
}
// </TF>

I think the shader modules end up being destroyed twice here: the loop above already calls vkDestroyShaderModule on them, and shader_destroy_modules(p_shader) then tries to destroy them again.

Another problem is that the function shader_destroy_modules casts p_shader.id to RBSet<ShaderInfo>::Element *, while shader_free casts the same id to ShaderInfo *.

Update: Another oddity is that they added

Unused dynamic uniform sets

They added:
But unless I made a mistake, it doesn't appear like it's actually being used, i.e. it's dead code.

Update: We got word from TheForge. The dynamic set code is/was meant as the final step in transitioning from push constants to UBOs. Current code (from TheForge) uses multiple cycled UBOs as a stepping stone for the migration.

Unknown initialization of max_descriptor_sets_per_pool

TheForge added the following:

max_descriptor_sets_per_pool = GLOBAL_GET("rendering/rendering_device/vulkan/max_descriptors_per_pool");

But this exact same line of code is already there in the existing code. Or perhaps it needs to live in both places?

Changed the default value of window_orientation in main.cpp

// Before:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_LANDSCAPE;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_LANDSCAPE)));
// After:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_SENSOR;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_SENSOR))); I don't know why, I don't know if it matters (i.e. if it later gets overwritten anyway), I don't know what SCREEN_SENSOR is (this enum value predates TheForge's changes). I also don't know if this is relevant for backwards compatibility. ThermalThe PR adds functions like The new screen_get_current_rotation() mixes internal state with public stateTheForge adds a new function: int screen_get_current_rotation(int p_screen) const override; User bruvzg mentioned:
Well, the problem is that this function is poorly named. It should be renamed to something like: // For internal use.
int screen_get_internal_current_rotation(int p_screen) const override; This is because the return value has deep impact in internal implementation details and in most other platforms (e.g. Windows, Linux, probably iOS too, etc) the return value must be 0. Race condition in BUFFER_CREATION_PERSISTENT_BIT usageThe biggest for last. Every time the new code does something like this, it is in principle, safe: RID* directional_light_buffer;
directional_light_buffer = (RID *)memalloc(sizeof(RID) * RD::get_singleton()->get_frame_delay());
for (uint32_t i=0; i<RD::get_singleton()->get_frame_delay(); i++)
directional_light_buffer[i] = RD::get_singleton()->uniform_buffer_create(...,
RD::BUFFER_CREATION_LINEAR_BIT | RD::BUFFER_CREATION_PERSISTENT_BIT);
RD::get_singleton()->buffer_update(directional_light_buffer[RD::get_singleton()->get_current_frame_index()], ...);

In short: it creates N buffers (where N = get_frame_delay(), and that value is usually 2 or 3), then every frame it cycles through them. This works very well for code that promises to wholly fill the buffer from scratch every frame. Thus for buffers like directional_light_buffer, this is ideal.

However, there are other sections of code updated by TheForge that do the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

Notice that only one buffer is created (not one per frame in flight like in the directional_light_buffer example above).

Since these buffers rarely change, it's a bug that's probably going to go unnoticed for a long time; until there is a project that changes material data very often.

Note that this also affects

But it also affects

BUFFER_CREATION_LINEAR_BIT: Weird internal cycling

Important UPDATE: TheForge acknowledged this as a bug and issued a fix. See my follow up comment for further details on that fix.

The flag

When the new code does the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

As I mentioned in the previous point, this code causes a race condition. But it doesn't stop here. If we dig deeper, we find out that there is an internal cycle already going on! (but that cycling... doesn't fix the race condition).

First, let me start with this highly misleading variable name:

RID_Owner<PersistentBuffer> persistent_buffer_owner;

Unlike what the name suggests, it doesn't hold the RIDs of all persistent buffers; it holds the RIDs of all "linear" buffers (all linear buffers are persistent, but not all persistent buffers are linear). Linear buffers are put into persistent_buffer_owner and cycled via persistent_uniform_buffer_advance():

void RenderingDevice::persistent_uniform_buffer_advance(RID p_buffer) {
PersistentBuffer *linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
if (linear_buffer) {
if (linear_buffer->buffers.size() <= (linear_buffer->usage_index + 1)) {
Buffer buffer;
buffer.size = linear_buffer->size;
buffer.usage = linear_buffer->usage;
buffer.driver_id = driver->buffer_create(buffer.size, buffer.usage, RDD::MEMORY_ALLOCATION_TYPE_GPU);
buffer_memory += buffer.size;
linear_buffer->buffers.append(buffer);
}
linear_buffer->usage_index++;
}
}

Then when retrieving the buffer via RID, the usage_index is taken into account:

RenderingDevice::Buffer *RenderingDevice::_get_buffer_from_owner(RID p_buffer) {
if( /*...*/ )
{
}
else if (persistent_buffer_owner.owns(p_buffer)) {
PersistentBuffer* linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
DEV_ASSERT(linear_buffer->usage_index != -1);
buffer = linear_buffer->buffers.ptrw() + linear_buffer->usage_index; // ! NOTICE: linear_buffer->usage_index.
// </TF>
}
}
RID RenderingDevice::uniform_set_create( ... )
{
// ! NOTICE: linear_buffer->usage_index.
// ! NOTICE: The MAX() call because usage_index can be -1.
buffer = linear_buffer->buffers.ptrw() + MAX(0, linear_buffer->usage_index);
}

And finally:

void RenderingDevice::persistent_uniform_buffers_reset() {
List<RID> owned;
persistent_buffer_owner.get_owned_list(&owned);
for (const RID& curr : owned) {
PersistentBuffer* curr_linear_buffer = persistent_buffer_owner.get_or_null(curr);
curr_linear_buffer->usage_index = -1;
}
}

Remember that I said..?
The two common problems are:
Common Problem I: Preventing Race Conditions between multiple frames

OK so if the main problem being addressed is the race condition between multiple frames... However, if that's what this is trying to achieve, then the call to persistent_uniform_buffers_reset() gets in the way:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data ); // Send data to GPU. buff->usage_index = 0
draw(); // Draw reading from buff->usage_index = 0
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data ); // Send data to GPU. buff->usage_index = 0 OOPS!
draw(); // Draw reading from buff->usage_index = 0

Uh oh, in Frame 1 the buffer_update() call updated the buffer with usage_index = 0 again, while the GPU may still be reading that same region from Frame 0.

Common Problem II: Updating the same buffer twice within the same frame.

The other common problem is when we update the buffer twice within the same frame when we have already issued rendering commands (or a variation of this problem: knowing when we haven't yet issued rendering commands). Consider the following scenario, all happening within the same frame:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, offset = 0, size = 512, data ); // buff->usage_index = 0
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 512, size = 1024, data ); // buff->usage_index = 1
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1024, size = 128, data ); // buff->usage_index = 2
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1152, size = 256, data ); // buff->usage_index = 3
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1408, size = 400, data ); // buff->usage_index = 4
draw(); // Draw reading from buff
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1

Question for the audience! At the end of frame 0 what do you expect to find in the buffer?
The answer is "2. Valid data in byte range [1408; 1808) only".

This is because when buffer_update() is called again, usage_index advances to a fresh internal buffer, and that fresh buffer never received any of the earlier partial writes.

The alternative is to not do any sort of internal cycling, but perform internal debug validation to make absolutely sure that further calls to buffer_update() on the same buffer do not happen once rendering commands using it have been issued in the same frame.

If the purpose of

Conclusion

Unless I made a mistake (which is plausible), I can only conclude the logic behind this code is buggy. And whatever was intended for
I saw some conversation somewhere that was confused by the extremely large positive line count (340,000 ish) on this PR, and had to scroll down and skim the thread to learn that it's not actually that large of a changeset, the diff is just a bit busted because the main branch has diverged so much, and the real changeset is more like +4000. I think adding a note to this effect in the opening post (in bold!) would be a good idea and slow down confusion in the peanut gallery. |
Would be good to just get this rebased and cleaned up so it's easier to review; it's unmergeable in its current state anyway.
I'm already splitting it into a new PR with all the misc stuff that can be merged right away. I think it's going to be the right call. There's a lot of trivial stuff that can be analyzed separately, and once that's in, the mental load is a lot lower. Furthermore my goal is that by having 2-4 separate commits, it becomes easier to bisect. |
	uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size());
	memset(ubo_data.ptrw(), 0, ubo_data.size()); //clear
}
p_uniform_dirty = true;
Well spotted! Thanks!
This appears to be a simple merge error. However I am a bit hesitant because, suspiciously, this bug would have hidden another bug that was fixed by TheForge. I'll have to look into this more deeply.
@@ -1615,7 +1622,7 @@ void ParticlesStorage::update_particles() {
	RD::get_singleton()->compute_list_bind_uniform_set(compute_list, particles->trail_bind_pose_uniform_set, 2);
	RD::get_singleton()->compute_list_set_push_constant(compute_list, &copy_push_constant, sizeof(ParticlesShader::CopyPushConstant));

-	RD::get_singleton()->compute_list_dispatch_threads(compute_list, total_amount, 1, 1);
+	RD::get_singleton()->compute_list_dispatch_threads(compute_list, (uint32_t)ceil((float)total_amount / 64.0), 1, 1);
Magic 64?
Already marked as a bug (see my comment "Particle FX compute shader").
That must not be merged.
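For context on the 64: compute_list_dispatch_threads() takes a thread count and derives the group count from the shader's local workgroup size internally, while compute_list_dispatch() takes group counts directly. A sketch of the two call shapes, assuming the shader's local size really is 64 (which is the assumption baked into the magic number):

// Thread-count form: the API divides by the local group size itself.
RD::get_singleton()->compute_list_dispatch_threads(compute_list, total_amount, 1, 1);
// Group-count form: only here would a manual round-up by 64 belong.
RD::get_singleton()->compute_list_dispatch(compute_list, (total_amount + 63) / 64, 1, 1);
// Pre-dividing by 64 and then calling dispatch_threads divides twice, which may be
// why this hunk is flagged as a bug above.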
OK TheForge fixed the issues regarding "BUFFER_CREATION_LINEAR_BIT: Weird internal cycling". Now it's clear what it is supposed to do. As I suspected the actual fix was only a couple of lines.
In other words, it solves the "Common Problem I: Preventing Race Conditions between multiple frames" from my previous report. Now, that directional_light_buffer[i] change is redundant. Having

Updating the same buffer twice within the same frame is not safe

This is "Common Problem II" and remains unaddressed. I can implement two solutions:

Option 1 is the easiest to implement so I will go with that. However Option 2 is the most efficient if many (non-overlapping) partial updates happen too often per frame. Always
Possible Fix Queue Synchronization could fix #94177 |
@darksylinc is my understanding correct that we can call the This will be a huge win for the batch 2D rendering optimisations I've been working on, when UMA is available, as I'll be able to write the instance data directly to the buffer and avoid:
This is a potentially significant win for CPU (and GPU) when running on UMA systems. |
I had to splice the persistent mapping stuff into its own branch because it was riddled with race conditions. I still want to add it but it's not easy
By race conditions I mean CPU writes to areas of memory currently in use by the GPU. At best such data races cause small glitches; at worst they cause a device lost (e.g. imagine if a shader gets stuck in an infinite loop because its input data is in an inconsistent state).
Those scenarios sound entirely reasonable for YOLO mode – an advanced feature with appropriate warnings. I wouldn't expect any guard rails, and understand there would be limited tooling to assist me when I use the APIs incorrectly. The upside is potentially significant performance benefits. Race conditions aren't new to Godot devs, who are exposed to those issues with GDScript and multi-threading. Are you alluding to something more subtle which I am missing?
CPU race conditions will result in wrong data that can be observed with a simple printf(). Some CPU race conditions are really hard to find because the error manifests very far away from where the problem occurred, but ultimately you can trace your way backwards. There are also tools like Clang's ThreadSanitizer. Another important thing is that nobody suspects the CPU HW or the compiler of being faulty. Such faults do happen, but they're rare, and when they do happen it is pretty obvious the crashes come from a specific platform/compiler or from a specific CPU model. Another important point is that if you are able to turn multi-threaded code into single-threaded and the problem goes away, then you know the problem is a race condition. On CPU -> GPU race conditions, it's MUCH harder to spot:
Ultimately the problem is that it boils down to 2 rules:
If you follow those three rules, then you're safe. But Godot is incredibly flexible. If the user added extra passes, or stereo rendering is involved; we may break the first rule (write to the buffer once); but it will only happen in specific scene settings. It's even harder to approach because Godot by default in many places tries to only update dirty regions. For example in material or mesh updates, it only updates materials/meshes that were modified. This breaks the second rule. |
Firstly, my questions are not doubting your reasoning – primarily out of curiosity and to better understand the context and the problem. Are we only talking about UMA architecture?
Presumably we can write to the buffer incrementally, but this rule is stating you can't write to it after you've submitted the command that uses the buffer?
Not questioning it, but why is this the case? What would be overwriting the memory that requires this rule?
Seems like 2. and 3. are related, suggesting that the contents of the buffer are invalidated by something at some point after the frame? Curious why that is? |
Technically no, but I understand your confusion, they'll be addressed with the next answer.
OK apologies for not explaining correctly. Technically you could have a buffer shared by CPU and GPU, fill it once or multiple times, and "as long as you're careful" everything will be alright. The problem is that "as long as you're careful" is very brittle, and there are various solutions to the problem. One of them is to impose specific rules (imposed by us, not the HW) to ensure everything goes alright. The scheme is simple: if we have 1MB of data, we create a 2MB buffer (if double-buffered) or a 3MB buffer (if triple-buffered). On frame 0 we write to [0; 1024), on frame 1 we write to [1024; 2048), on frame 2 to [2048; 3072). On frame 3 we cycle back to [0; 1024). This ensures the GPU is reading from a region that is not being touched by the CPU. But this also means you can't do partial uploads, because everything is offset every frame. So having this in mind:
Once you're done writing to it and use it in a draw/compute command, we must advance the offset. Trying to write to it again means you'd have to be careful to only write to a region you didn't already write. This is hard to guarantee or verify. Because the buffer is now "hot", it may be being read by the GPU.
Because it's a different region every time.
This has to do more with descriptors. Godot bakes descriptors. So let's say we have 1 UMA buffer. We only need 3 baked descriptor sets. This restriction can be lifted if we use Vulkan's dynamic descriptor slots, but Godot doesn't support them (TheForge tried to add it and left some code, but left it unused or abandoned). |
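To make the region-cycling idea above concrete, a minimal sketch (illustrative names only, not engine API), assuming a persistently mapped buffer sized for all frames in flight:

// Hypothetical helper: the CPU writes only into the region belonging to the
// current frame, while the GPU reads regions written in earlier frames.
struct FrameRingMapping {
	uint8_t *mapped_base = nullptr; // Persistently mapped CPU pointer.
	size_t bytes_per_frame = 0;     // Size of the data written each frame.
	uint32_t frames_in_flight = 3;  // 2 for double buffering, 3 for triple.

	size_t offset_for_frame(uint64_t frame_index) const {
		return (frame_index % frames_in_flight) * bytes_per_frame;
	}
	uint8_t *write_ptr_for_frame(uint64_t frame_index) const {
		return mapped_base + offset_for_frame(frame_index);
	}
};
// Descriptors (or dynamic offsets) must point at the same per-frame region,
// which is why the offset changes every frame and partial updates don't compose.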
Edit: Just a quick note on the huge line count. Most of the added lines of code come from new third-party dependencies. The actual change is about 4,000 lines of code.
As previously announced we have been working with Google and The Forge to bring a bunch of performance improvements and other enhancements to our mobile renderer.
The work has concluded so it is time to share the results, test it, and integrate it into upstream Godot. This PR is huge and touches a lot of areas. I expect that this branch will not be merged as-is. Instead this serves as a place for us to test the full work product, and cherry-pick safe commits. This branch was rebased on upstream recently and should not be too far out of date. Given its tremendous size though, I expect that it will fall out of date quickly. So we should aim to start merging what we can sooner rather than later.
Results
We tracked performance using two devices (Samsung S23 and Google Pixel 7) and two test scenes (a modified version of the Sun Temple scene and a modified version of the TPS demo).
TPS_demo.zip
Sun_temple.zip
Most importantly, GPU time has been reduced across the board, which addresses our biggest bottleneck on mobile devices.
Pixel 7
Samsung S23
Conclusion
Thank you very much to Google and The Forge, who made this work possible! The results are very exciting and we are looking forward to getting everything integrated into upstream Godot.