Final results of collaboration with Google and The Forge #90284
base: master
Conversation
- vulkan implementation to allocate with a persistent memory address - getter implementation - new member added for vulkan BufferInfo
…rtex and index buffer
…uffer creation to use them
… function pointers if found
…/ setobject name + a conversion function
…ice, not the instance
…being used on android
… all associated modules have been created
I get that "this is how git works" and "technically there is nothing wrong", but it makes the diff shown in Github impossible to review online. It needs to be done offline. I'm asking if there is a way to publish online something user friendly that everyone can take a look at. Because right now Github (and git) says there's 300k LOC to review, but the actual amount that needs reviewing is nowhere remotely close to that. |
What are you basing that on? It's right there: the changes in the list here are the changes, it's not different.
If I type (using my fork which stripped thirdparty changes):

git merge-base master TF_final
# Yields bb6b06c81343073f10cbbd2af515cf0dac1e6549
git diff bb6b06c81343073f10cbbd2af515cf0dac1e6549..TF_final

I get lots of irrelevant stuff in the diff, like this:

diff --git a/thirdparty/thorvg/update-thorvg.sh b/thirdparty/thorvg/update-thorvg.sh
index 2811a43339..7a754c09b9 100755
--- a/thirdparty/thorvg/update-thorvg.sh
+++ b/thirdparty/thorvg/update-thorvg.sh
@@ -1,6 +1,6 @@
 #!/bin/bash -e
-VERSION=0.12.5
+VERSION=0.12.9

That change to update-thorvg.sh is in both latest master and latest TF_final. The same problem happens if I type:

git diff master..TF_final

I get that "technically" this is how git works.
I brute-forced the problem!!! THESE ARE THE ACTUAL CHANGES THAT NEED REVIEWING. 75 files, 4366 additions, 717 deletions. Much cleaner. |
Well, that still requires this PR to be cleaned up; there's no way to review it in that form.
The PR will likely not be merged "as is" but rather split up into smaller chunks for integration. I know that the link I posted can't be reviewed using GitHub's interface, but it gives an overview of what to expect: the changes are nowhere near that scary. In fact very reasonable. But right now GitHub is showing a 10:1 noise-to-signal ratio.
And that's why I asked for it for the purposes of reviewing :) I'm fully aware; that's why I asked for that rather than trying to break down this one.
Missing thirdparty/README.md and COPYRIGHT.txt updates and documentation for the new methods. Since this is not intended to be merged as is, I'm not commenting on various code style issues. But it probably should be rebased at least once to run CI and make reviewing easier.
@@ -50,7 +50,7 @@
	create_info.pLayer = *wpd->layer_ptr;

	VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
-	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, nullptr, &vk_surface);
+	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);
The same is missing from macOS and Wayland code.
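For reference, a rough sketch of what the analogous Wayland-side change might look like (assuming the same get_allocation_callbacks() helper is reachable there; the actual create_info type and file differ per platform):

// Hypothetical sketch only, mirroring the Metal change above for the Wayland surface path.
VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
VkResult err = vkCreateWaylandSurfaceKHR(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);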
@@ -342,6 +343,11 @@ class DisplayServer : public Object {
		INVALID_INDICATOR_ID = -1
	};

+private:
+	ScreenOrientation native_orientation = ScreenOrientation::SCREEN_LANDSCAPE;
Seems to be unused.
if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
	return NULL;
Suggested change:

-if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
-	return NULL;
+if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == nullptr) {
+	return nullptr;
+}
Any reason it's not using C11 aligned_alloc?
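For context, the quoted code appears to follow the classic over-allocate-and-store-offset pattern; one likely reason aligned_alloc isn't used is that MSVC does not provide the C11 function, and C11 requires the size to be a multiple of the alignment. A minimal illustrative sketch of that pattern, with made-up names and assuming a power-of-two alignment:

#include <stdint.h>
#include <stdlib.h>

// Illustrative only: over-allocate, align the returned pointer, and stash the
// offset back to the real allocation right before the aligned pointer.
static void *aligned_alloc_manual(size_t p_bytes, size_t p_alignment) {
	void *p1 = malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t));
	if (p1 == nullptr) {
		return nullptr;
	}
	uintptr_t base = (uintptr_t)p1 + sizeof(uint32_t);
	uintptr_t aligned = (base + p_alignment - 1) & ~(uintptr_t)(p_alignment - 1);
	((uint32_t *)aligned)[-1] = (uint32_t)(aligned - (uintptr_t)p1); // Offset needed by the matching free.
	return (void *)aligned;
}

static void aligned_free_manual(void *p_ptr) {
	uint32_t offset = ((uint32_t *)p_ptr)[-1];
	free((uint8_t *)p_ptr - offset);
}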
if (p_memory == NULL)
	return alloc_aligned_static(p_bytes, p_alignment);
Suggested change:

-if (p_memory == NULL)
-	return alloc_aligned_static(p_bytes, p_alignment);
+if (p_memory == nullptr) {
+	return alloc_aligned_static(p_bytes, p_alignment);
+}
	glslang::GlslangToSpv(*program.getIntermediate(stages[p_stage]), SpirV, &logger, &spvOptions);

	ret.resize(SpirV.size() * sizeof(uint32_t));
	{
		uint8_t *w = ret.ptrw();
		memcpy(w, &SpirV[0], SpirV.size() * sizeof(uint32_t));
		char buffer[256];
Unused.
@@ -123,6 +123,7 @@ class DisplayServerAndroid : public DisplayServer {

	virtual void screen_set_orientation(ScreenOrientation p_orientation, int p_screen = SCREEN_OF_MAIN_WINDOW) override;
	virtual ScreenOrientation screen_get_orientation(int p_screen = SCREEN_OF_MAIN_WINDOW) const override;
+	virtual int screen_get_current_rotation(int p_screen) const override;
This should be implemented for the rest of the platforms (should be easy to do on all except Web).
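For platforms that never rotate the display, a stub is probably all that's needed. A hypothetical sketch, not the actual implementation (as noted further down in this thread, the value must be 0 on most desktop platforms):

// Hypothetical stub for platforms without display rotation (e.g. typical desktop setups).
virtual int screen_get_current_rotation(int p_screen) const override {
	return 0; // No rotation to report.
}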
@@ -184,13 +184,16 @@ def configure(env: "SConsEnvironment"):
	if env["arch"] == "x86_32":
		# The NDK adds this if targeting API < 24, so we can drop it when Godot targets it at least
		env.Append(CCFLAGS=["-mstackrealign"])
+		env.Append(LIBPATH=["../../thirdparty/swappy-frame-pacing/x86"])
Binary libs in the tree? This should be either built in-tree or an optional external dependency.
#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
	ClassDB::bind_method(D_METHOD("get_tracked_object_name"), &Engine::get_tracked_object_name);
	ClassDB::bind_method(D_METHOD("get_tracked_object_type_count"), &Engine::get_tracked_object_type_count);
#endif

#if defined(VK_TRACK_DRIVER_MEMORY)
	ClassDB::bind_method(D_METHOD("get_driver_total_memory"), &Engine::get_driver_total_memory);
	ClassDB::bind_method(D_METHOD("get_driver_allocation_count"), &Engine::get_driver_allocation_count);
	ClassDB::bind_method(D_METHOD("get_driver_memory_by_object_type"), &Engine::get_driver_memory_by_object_type);
	ClassDB::bind_method(D_METHOD("get_driver_allocs_by_object_type"), &Engine::get_driver_allocs_by_object_type);
#endif

#if defined(VK_TRACK_DEVICE_MEMORY)
	ClassDB::bind_method(D_METHOD("get_device_total_memory"), &Engine::get_device_total_memory);
	ClassDB::bind_method(D_METHOD("get_device_allocation_count"), &Engine::get_device_allocation_count);
	ClassDB::bind_method(D_METHOD("get_device_memory_by_object_type"), &Engine::get_device_memory_by_object_type);
	ClassDB::bind_method(D_METHOD("get_device_allocs_by_object_type"), &Engine::get_device_allocs_by_object_type);
#endif
This will mess up documentation generation, so these should not be ifdefed but instead always active: call the methods from the RenderingContext instance (RenderingDevice::get_singleton()->...) and have dummy implementations in the base RenderingContext class (since the same can later be implemented for Metal and DX12).
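A rough sketch of the suggested shape (illustrative only; the real class, method names and return types would need to match the engine):

// Hypothetical: always-present virtuals with dummy implementations in the base class,
// so the ClassDB bindings and generated documentation are identical in every build.
class RenderingContextDriver {
public:
	virtual uint64_t get_driver_total_memory() const { return 0; }
	virtual uint64_t get_driver_allocation_count() const { return 0; }
	virtual uint64_t get_device_total_memory() const { return 0; }
	virtual uint64_t get_device_allocation_count() const { return 0; }
	virtual ~RenderingContextDriver() {}
};
// The Vulkan driver would override these only when its tracking macros are defined;
// other builds and other backends fall back to the zero-returning stubs.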
@@ -36,6 +36,11 @@
 #include "core/templates/list.h"
 #include "core/templates/vector.h"

+#if defined(VULKAN_ENABLED) && (defined(DEBUG_ENABLED) || defined(DEV_ENABLED))
Should be moved to drivers/vulkan.
I'm looking into this PR.
It seems like it can't be moved there.
It appears that VK_TRACK_DEVICE_MEMORY is a debug-only macro (it's only defined in debug & dev builds).
When that macro is defined, objects from core inform the Vulkan subsystem what subsystem they belong to.
e.g. in core/core_bind.cpp we have things like:
#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
String Engine::get_tracked_object_name(uint32_t typeIndex) const {
return RenderingContextDriverVulkan::get_tracked_object_name(typeIndex);
}
uint64_t Engine::get_tracked_object_type_count() const {
return RenderingContextDriverVulkan::get_tracked_object_type_count();
}
#endif
Where get_tracked_object_name returns a string like "INSTANCE", "PHYSICAL_DEVICE", "DESCRIPTOR_SET_LAYOUT".
I don't know if there is a better way yet, but it's not a simple thing of moving that line to a different header.
@@ -28,6 +28,8 @@
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
 /**************************************************************************/

+#pragma optimize("", off)
?
The reported performance numbers are compared to 4.3. Essentially it is Godot with/without these changes.
There are other performance improvements coming to 4.3, so yes, the difference is bigger when compared to 4.2 as it would include all the performance improvements made by others. |
Shared Changes

Memory Allocation code. Lots of changes interact in one way or another with GPU memory allocation. The most prominent one is for debug tracking.

Specific Changes
I am strongly leaning toward submitting these as a single PR due to how closely they are tied to each other:
For example "Persistent Mapping" could be made into its own PR. However such PR would be criticized for just adding being dead code; because it's not until "Replace Push Constants with UBOs" gets integrated that it would be used. And "Replace Push Constants with UBOs" depends on the first three. Although it could be broken down; it would be a lot more effort. Noteworthy or slightly controversial changesPreviously device lost were handled generically by The purpose of device_queue.submit_mutex.lock();
err = vkQueueSubmit(device_queue.queue, 1, &submit_info, vk_fence);
device_queue.submit_mutex.unlock();
if (err == VK_ERROR_DEVICE_LOST) {
print_lost_device_info();
CRASH_NOW_MSG("Vulkan device was lost.");
}
ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED); It think it may be controversial because Godot convention is to use I'm not sure how Godot Team wants to handle integrating this block of code, but what is for sure is that really want to call Bugs / Odditiesshader_destroy_modulesTheForge introduced In void RenderingDeviceDriverVulkan::shader_free(ShaderID p_shader) {
ShaderInfo *shader_info = (ShaderInfo *)p_shader.id;
for (uint32_t i = 0; i < shader_info->vk_descriptor_set_layouts.size(); i++) {
vkDestroyDescriptorSetLayout(vk_device, shader_info->vk_descriptor_set_layouts[i], VKC::get_allocation_callbacks(VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT));
}
vkDestroyPipelineLayout(vk_device, shader_info->vk_pipeline_layout, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_PIPELINE_LAYOUT));
for (uint32_t i = 0; i < shader_info->vk_stages_create_info.size(); i++) {
vkDestroyShaderModule(vk_device, shader_info->vk_stages_create_info[i].module, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_SHADER_MODULE));
}
// <TF>
// @ShadyTF unload shader modules
// Was :
//for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
// if (si.vk_stages_create_info[i].module) {
// vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
// }
//}
shader_destroy_modules( p_shader );
// </TF>
VersatileResource::free(resources_allocator, shader_info);
}
// <TF>
// @ShadyTF unload shader modules
void RenderingDeviceDriverVulkan::shader_destroy_modules(ShaderID p_shader) {
auto *E = (RBSet<ShaderInfo>::Element *)p_shader.id;
ShaderInfo &si = E->get();
for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
if (si.vk_stages_create_info[i].module) {
vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
si.vk_stages_create_info[i].module = VK_NULL_HANDLE;
}
}
si.vk_stages_create_info.clear();
}
// </TF>

I think the shader modules end up being destroyed twice here: the loop above already calls vkDestroyShaderModule on them, and shader_destroy_modules(p_shader) then tries to destroy them again.

Another problem is that the function shader_destroy_modules casts p_shader.id to RBSet<ShaderInfo>::Element *, while shader_free casts the same id to ShaderInfo *.

Update: Another oddity is that they added

Unused dynamic uniform sets

They added:
But unless I made a mistake, it doesn't appear like it's actually being used, i.e. it's dead code.

Update: We got word from TheForge. The dynamic set code is/was meant as the final step in transitioning from push constants to UBOs. Current code (from TheForge) uses multiple cycled UBOs as a stepping stone for the migration.

Unknown initialization of max_descriptor_sets_per_pool

TheForge added the following:

max_descriptor_sets_per_pool = GLOBAL_GET("rendering/rendering_device/vulkan/max_descriptors_per_pool");

But this exact same line of code is already there in the existing code. Or perhaps it needs to live in both places?

Changed the default value of window_orientation in main.cpp

// Before:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_LANDSCAPE;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_LANDSCAPE)));
// After:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_SENSOR;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_SENSOR))); I don't know why, I don't know if it matters (i.e. if it later gets overwritten anyway), I don't know what SCREEN_SENSOR is (this enum value predates TheForge's changes). I also don't know if this is relevant for backwards compatibility. ThermalThe PR adds functions like The new screen_get_current_rotation() mixes internal state with public stateTheForge adds a new function: int screen_get_current_rotation(int p_screen) const override; User bruvzg mentioned:
Well, the problem is that this function is poorly named. It should be renamed to something like: // For internal use.
int screen_get_internal_current_rotation(int p_screen) const override; This is because the return value has deep impact in internal implementation details and in most other platforms (e.g. Windows, Linux, probably iOS too, etc) the return value must be 0. Race condition in BUFFER_CREATION_PERSISTENT_BIT usageThe biggest for last. Every time the new code does something like this, it is in principle, safe: RID* directional_light_buffer;
directional_light_buffer = (RID *)memalloc(sizeof(RID) * RD::get_singleton()->get_frame_delay());
for (uint32_t i=0; i<RD::get_singleton()->get_frame_delay(); i++)
directional_light_buffer[i] = RD::get_singleton()->uniform_buffer_create(...,
RD::BUFFER_CREATION_LINEAR_BIT | RD::BUFFER_CREATION_PERSISTENT_BIT);
RD::get_singleton()->buffer_update(directional_light_buffer[RD::get_singleton()->get_current_frame_index()], ...);

In short: it creates N buffers (where N = get_frame_delay(), and that value is usually 2 or 3), then every frame it cycles through them. This works very well for code that promises to wholly fill the buffer from scratch every frame. Thus for buffers like directional_light_buffer, this is ideal.

However, there are other sections of code updated by TheForge that do the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

Notice that only one buffer is created (not one per frame in flight like in the directional_light_buffer example above).

Since these buffers rarely change, it's a bug that's probably going to go unnoticed for a long time; until there is a project that changes material data very often.

Note that this also affects

But it also affects

BUFFER_CREATION_LINEAR_BIT: Weird internal cycling

Important UPDATE: TheForge acknowledged this as a bug and issued a fix. See my follow up comment for further details on that fix.

The flag

When the new code does the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

As I mentioned in the previous point, this code causes a race condition. But it doesn't stop here. If we dig deeper, we find out that there is an internal cycle already going on! (but that cycling... doesn't fix the race condition).

First, let me start with this highly misleading variable name:

RID_Owner<PersistentBuffer> persistent_buffer_owner;

Unlike what the name suggests, it doesn't hold the RIDs of all persistent buffers; it holds the RIDs of all "linear" buffers (all linear buffers are persistent, but not all persistent buffers are linear). Linear buffers are put into persistent_buffer_owner and cycled via persistent_uniform_buffer_advance():

void RenderingDevice::persistent_uniform_buffer_advance(RID p_buffer) {
PersistentBuffer *linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
if (linear_buffer) {
if (linear_buffer->buffers.size() <= (linear_buffer->usage_index + 1)) {
Buffer buffer;
buffer.size = linear_buffer->size;
buffer.usage = linear_buffer->usage;
buffer.driver_id = driver->buffer_create(buffer.size, buffer.usage, RDD::MEMORY_ALLOCATION_TYPE_GPU);
buffer_memory += buffer.size;
linear_buffer->buffers.append(buffer);
}
linear_buffer->usage_index++;
}
}

Then when retrieving the buffer via RID, the usage_index is taken into account:

RenderingDevice::Buffer *RenderingDevice::_get_buffer_from_owner(RID p_buffer) {
if( /*...*/ )
{
}
else if (persistent_buffer_owner.owns(p_buffer)) {
PersistentBuffer* linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
DEV_ASSERT(linear_buffer->usage_index != -1);
buffer = linear_buffer->buffers.ptrw() + linear_buffer->usage_index; // ! NOTICE: linear_buffer->usage_index.
// </TF>
}
}
RID RenderingDevice::uniform_set_create( ... )
{
// ! NOTICE: linear_buffer->usage_index.
// ! NOTICE: The MAX() call because usage_index can be -1.
buffer = linear_buffer->buffers.ptrw() + MAX(0, linear_buffer->usage_index);
}

And finally:

void RenderingDevice::persistent_uniform_buffers_reset() {
List<RID> owned;
persistent_buffer_owner.get_owned_list(&owned);
for (const RID& curr : owned) {
PersistentBuffer* curr_linear_buffer = persistent_buffer_owner.get_or_null(curr);
curr_linear_buffer->usage_index = -1;
}
}

Remember that I said..?
The two common problems are:
Common Problem I: Preventing Race Conditions between multiple frames

OK so if the main problem being addressed is the race condition between multiple frames... However, if that's what this is trying to achieve, then the call to persistent_uniform_buffers_reset() gets in the way:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data ); // Send data to GPU. buff->usage_index = 0
draw(); // Draw reading from buff->usage_index = 0
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data ); // Send data to GPU. buff->usage_index = 0 OOPS!
draw(); // Draw reading from buff->usage_index = 0

Uh oh, in Frame 1 the buffer_update() call updated the buffer with usage_index = 0 again, while the GPU may still be reading that same region from Frame 0.

Common Problem II: Updating the same buffer twice within the same frame.

The other common problem is when we update the buffer twice within the same frame when we have already issued rendering commands (or a variation of this problem: knowing when we haven't yet issued rendering commands). Consider the following scenario, all happening within the same frame:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, offset = 0, size = 512, data ); // buff->usage_index = 0
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 512, size = 1024, data ); // buff->usage_index = 1
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1024, size = 128, data ); // buff->usage_index = 2
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1152, size = 256, data ); // buff->usage_index = 3
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1408, size = 400, data ); // buff->usage_index = 4
draw(); // Draw reading from buff
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1

Question for the audience! At the end of frame 0 what do you expect to find in the buffer?
The answer is "2. Valid data in byte range [1408; 1808) only".

This is because when buffer_update() is called again, usage_index advances to a fresh internal buffer, and that fresh buffer never received any of the earlier partial writes.

The alternative is to not do any sort of internal cycling, but perform internal debug validation to make absolutely sure that further calls to buffer_update() on the same buffer do not happen once rendering commands using it have been issued in the same frame.

If the purpose of

Conclusion

Unless I made a mistake (which is plausible), I can only conclude the logic behind this code is buggy. And whatever was intended for
I saw some conversation somewhere that was confused by the extremely large positive line count (340,000 ish) on this PR, and had to scroll down and skim the thread to learn that it's not actually that large of a changeset, the diff is just a bit busted because the main branch has diverged so much, and the real changeset is more like +4000. I think adding a note to this effect in the opening post (in bold!) would be a good idea and slow down confusion in the peanut gallery. |
Would be good to just get this rebased and cleaned up so it's easier to review; it's unmergeable in its current state anyway.
I'm already splitting it into a new PR with all the misc stuff that can be merged right away. I think it's going to be the right call. There's a lot of trivial stuff that can be analyzed separately, and once that's in, the mental load is a lot lower. Furthermore my goal is that by having 2-4 separate commits, it becomes easier to bisect. |
	uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size());
	memset(ubo_data.ptrw(), 0, ubo_data.size()); //clear
}
p_uniform_dirty = true;
Well spotted! Thanks!
This appears to be a simple merge error. However I am a bit hesitant because, suspiciously, this bug would have hidden another bug that was fixed by TheForge. I'll have to look into this more deeply.
@@ -1615,7 +1622,7 @@ void ParticlesStorage::update_particles() {
	RD::get_singleton()->compute_list_bind_uniform_set(compute_list, particles->trail_bind_pose_uniform_set, 2);
	RD::get_singleton()->compute_list_set_push_constant(compute_list, &copy_push_constant, sizeof(ParticlesShader::CopyPushConstant));

-	RD::get_singleton()->compute_list_dispatch_threads(compute_list, total_amount, 1, 1);
+	RD::get_singleton()->compute_list_dispatch_threads(compute_list, (uint32_t)ceil((float)total_amount / 64.0), 1, 1);
Magic 64?
Already marked as a bug (see my comment "Particle FX compute shader").
That must not be merged.
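For context on the 64: compute_list_dispatch_threads() takes a thread count and derives the group count from the shader's local workgroup size internally, while compute_list_dispatch() takes group counts directly. A sketch of the two call shapes, assuming the shader's local size really is 64 (which is the assumption baked into the magic number):

// Thread-count form: the API divides by the local group size itself.
RD::get_singleton()->compute_list_dispatch_threads(compute_list, total_amount, 1, 1);
// Group-count form: only here would a manual round-up by 64 belong.
RD::get_singleton()->compute_list_dispatch(compute_list, (total_amount + 63) / 64, 1, 1);
// Pre-dividing by 64 and then calling dispatch_threads divides twice, which may be
// why this hunk is flagged as a bug above.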
OK TheForge fixed the issues regarding "BUFFER_CREATION_LINEAR_BIT: Weird internal cycling". Now it's clear what it is supposed to do. As I suspected the actual fix was only a couple of lines.
In other words, it solves the "Common Problem I: Preventing Race Conditions between multiple frames" from my previous report. Now, that directional_light_buffer[i] change is redundant. Having

Updating the same buffer twice within the same frame is not safe

This is "Common Problem II" and remains unaddressed. I can implement two solutions:

Option 1 is the easiest to implement so I will go with that. However Option 2 is the most efficient if many (non-overlapping) partial updates happen too often per frame. Always
Possible Fix Queue Synchronization could fix #94177 |
@darksylinc is my understanding correct that we can call the This will be a huge win for the batch 2D rendering optimisations I've been working on, when UMA is available, as I'll be able to write the instance data directly to the buffer and avoid:
This is a potentially significant win for CPU (and GPU) when running on UMA systems. |
I had to splice the persistent mapping stuff into its own branch because it was riddled with race conditions. I still want to add it but it's not easy
By race conditions I mean CPU writes to areas of memory currently in use by the GPU. At best such data races cause small glitches; at worst they cause a device lost (e.g. imagine if a shader gets stuck in an infinite loop because its input data is in an inconsistent state).
Those scenarios sound entirely reasonable for YOLO mode – an advanced feature with appropriate warnings. I wouldn't expect any guard rails, and understand there would be limited tooling to assist me when I use the APIs incorrectly. The upside is potentially significant performance benefits. Race conditions aren't new to Godot devs, who are exposed to those issues with GDScript and multi-threading. Are you alluding to something more subtle which I am missing?
CPU race conditions will result in wrong data that can be observed with a simple printf(). Some CPU race conditions are really hard to find because the error manifests very far away from where the problem occurred, but ultimately you can trace your way backwards. There are also tools like Clang's ThreadSanitizer. Another important thing is that nobody suspects the CPU HW or the compiler of being faulty. Such faults do happen, but they're rare, and when they do happen it is pretty obvious the crashes come from a specific platform/compiler or from a specific CPU model. Another important point is that if you are able to turn multi-threaded code into single-threaded and the problem goes away, then you know the problem is a race condition. On CPU -> GPU race conditions, it's MUCH harder to spot:
Ultimately the problem is that it boils down to 2 rules:
If you follow those three rules, then you're safe. But Godot is incredibly flexible. If the user added extra passes, or stereo rendering is involved; we may break the first rule (write to the buffer once); but it will only happen in specific scene settings. It's even harder to approach because Godot by default in many places tries to only update dirty regions. For example in material or mesh updates, it only updates materials/meshes that were modified. This breaks the second rule. |
Firstly, my questions are not doubting your reasoning – primarily out of curiosity and to better understand the context and the problem. Are we only talking about UMA architecture?
Presumably we can write to the buffer incrementally, but this rule is stating you can't write to it after you've submitted the command that uses the buffer?
Not questioning it, but why is this the case? What would be overwriting the memory that requires this rule?
Seems like 2. and 3. are related, suggesting that the contents of the buffer are invalidated by something at some point after the frame? Curious why that is? |
Technically no, but I understand your confusion, they'll be addressed with the next answer.
OK apologies for not explaining correctly. Technically you could have a buffer shared by CPU and GPU, fill it once or multiple times, and "as long as you're careful" everything will be alright. The problem is that "as long as you're careful" is very brittle, and there are various solutions to the problem. One of them is to impose specific rules (imposed by us, not the HW) to ensure everything goes alright. The scheme is simple: if we have 1MB of data, we create a 2MB buffer (if double-buffered) or a 3MB buffer (if triple-buffered). On frame 0 we write to [0; 1024), on frame 1 we write to [1024; 2048), on frame 2 to [2048; 3072). On frame 3 we cycle back to [0; 1024). This ensures the GPU is reading from a region that is not being touched by the CPU. But this also means you can't do partial uploads, because everything is offset every frame. So having this in mind:
Once you're done writing to it and use it in a draw/compute command, we must advance the offset. Trying to write to it again means you'd have to be careful to only write to a region you didn't already write. This is hard to guarantee or verify. Because the buffer is now "hot", it may be being read by the GPU.
Because it's a different region every time.
This has to do more with descriptors. Godot bakes descriptors. So let's say we have 1 UMA buffer. We only need 3 baked descriptor sets. This restriction can be lifted if we use Vulkan's dynamic descriptor slots, but Godot doesn't support them (TheForge tried to add it and left some code, but left it unused or abandoned). |
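To make the region-cycling idea above concrete, a minimal sketch (illustrative names only, not engine API), assuming a persistently mapped buffer sized for all frames in flight:

// Hypothetical helper: the CPU writes only into the region belonging to the
// current frame, while the GPU reads regions written in earlier frames.
struct FrameRingMapping {
	uint8_t *mapped_base = nullptr; // Persistently mapped CPU pointer.
	size_t bytes_per_frame = 0;     // Size of the data written each frame.
	uint32_t frames_in_flight = 3;  // 2 for double buffering, 3 for triple.

	size_t offset_for_frame(uint64_t frame_index) const {
		return (frame_index % frames_in_flight) * bytes_per_frame;
	}
	uint8_t *write_ptr_for_frame(uint64_t frame_index) const {
		return mapped_base + offset_for_frame(frame_index);
	}
};
// Descriptors (or dynamic offsets) must point at the same per-frame region,
// which is why the offset changes every frame and partial updates don't compose.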
Edit: Just a quick note on the huge line count. Most of the added lines of code come from new third-party dependencies. The actual change is about 4,000 lines of code.
As previously announced we have been working with Google and The Forge to bring a bunch of performance improvements and other enhancements to our mobile renderer.
The work has concluded so it is time to share the results, test it, and integrate it into upstream Godot. This PR is huge and touches a lot of areas. I expect that this branch will not be merged as-is. Instead this serves as a place for us to test the full work product, and cherry-pick safe commits. This branch was rebased on upstream recently and should not be too far out of date. Given its tremendous size though, I expect that it will fall out of date quickly. So we should aim to start merging what we can sooner rather than later.
Results
We tracked performance using two devices (Samsung S23 and Google Pixel 7) and two test scenes (a modified version of the Sun Temple scene and a modified version of the TPS demo).
TPS_demo.zip
Sun_temple.zip
Most importantly, GPU time has been reduced across the board, which addresses our biggest bottleneck on mobile devices.
Pixel 7
Samsung S23
Conclusion
Thank you very much to Google and The Forge, who made this work possible! The results are very exciting and we are looking forward to getting everything integrated into upstream Godot.