
Fix multi-threading issue with using real Vulkan handles #1235

Conversation

mizhen
Contributor

@mizhen mizhen commented Aug 23, 2023

The problem

With the default GFXR settings, an app will sometimes destroy Vulkan handles in one thread (A) and create handles of the same type in another thread (B). There is a window between the moment the real handle is destroyed and the moment its wrapper is deleted from the map in thread A. If thread B runs during this window and creates handles of the same type, and any of the newly created handles has the same value as one destroyed by thread A, capture will crash because the handle wrapper for the newly created handle is handled incorrectly.

The solution

The fix acquires a shared lock for all wrapped-handle creation, and an exclusive lock spanning from the destruction of the real Vulkan handle to the deletion of its wrapper.
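The pattern can be sketched with a `std::shared_mutex`: creation paths take the lock shared (so they can run concurrently with each other), while the destroy path holds it exclusively across both the real-handle destruction and the wrapper-map erase, closing the window described above. All names below are illustrative, not the actual GFXReconstruct symbols.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical sketch of the locking scheme; all names are illustrative,
// not the actual GFXReconstruct symbols.
using Handle = std::uint64_t;
struct Wrapper
{
    Handle handle;
};

static std::shared_mutex                   destroy_mutex; // closes the destroy/create window
static std::mutex                          map_mutex;     // protects the wrapper map itself
static std::unordered_map<Handle, Wrapper> wrapper_map;

// Creation takes the lock *shared*: creators may run concurrently with
// each other, but never while a destroy is in progress.
void CreateWrappedHandle(Handle handle)
{
    std::shared_lock<std::shared_mutex> lock(destroy_mutex);
    std::lock_guard<std::mutex>         map_lock(map_mutex);
    wrapper_map[handle] = Wrapper{ handle };
}

// Destruction takes the lock *exclusively* across both the real-handle
// destruction and the wrapper erase, so no creator can observe the gap
// where the real handle is gone but its wrapper is still in the map.
void DestroyWrappedHandle(Handle handle)
{
    std::unique_lock<std::shared_mutex> lock(destroy_mutex);
    // ... the real vkDestroy* call would happen here ...
    std::lock_guard<std::mutex> map_lock(map_mutex);
    wrapper_map.erase(handle);
}
```

Because the exclusive section covers both steps, a creating thread that receives a recycled handle value always finds the old wrapper already removed.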

Result
Tested full-trace and trimmed-trace capture/playback of the target title with the build from this pull request; all tests passed, and the commit fixes the issue.

@ci-tester-lunarg

CI gfxreconstruct build queued with queue ID 28867.

@ci-tester-lunarg

CI gfxreconstruct build # 3114 running.

@ci-tester-lunarg

CI gfxreconstruct build # 3114 passed.

@bradgrantham-lunarg
Contributor

bradgrantham-lunarg commented Aug 28, 2023

Please keep PRs as separated as possible. Could you take a minute to move the copyright updates that aren't in files related to this bug fix into a separate PR?

I see: you changed the copyright in the generator in order to update the one affected file, and so it also changed in 40 other files... I understand why that would be necessary.

@bradgrantham-lunarg bradgrantham-lunarg added the P1 Prevents an important capture from being replayed label Oct 30, 2023
@panos-lunarg
Contributor

panos-lunarg commented Nov 6, 2023

Good catch noticing the race condition.

CaptureManager already defines a shared_mutex with very similar functionality, used to apply locking on each API call entry and to differentiate between shared and unique access.
My suggestion is to investigate whether you can declare a new shared_mutex there for the purpose of resource creation/destruction and, similarly to the existing AcquireSharedApiCallLock() / AcquireExclusiveApiCallLock(), define two functions that create and return unique/shared locks using the new mutex. This way you should not need to create two separate source files to isolate the new mutex.

Since you will need to use the new mutex in framework/encode/vulkan_handle_wrapper_util.h, I suspect you will face compilation errors due to cyclic header dependencies. See if you can work around those; if not, resort to creating the new source files. In that case, use the ScopedDestroyLock in both places that use the shared_mutex.

@mizhen
Contributor Author

mizhen commented Nov 6, 2023

> Good catch noticing the race condition.
>
> CaptureManager is already defining a shared_mutex with very similar functionality used to apply locking upon each api call entry and differentiate between shared and unique access. My suggestion is to investigate if you can declare there a new shared_mutex for the purpose of resource creation/destruction and similarly to the existing AcquireSharedApiCallLock() / AcquireExclusiveApiCallLock() define two functions to create and return unique/shared locks using the new mutex. This way you should not need to create two separate source files to isolate the new mutex.
>
> Since you will need to use the new mutex in framework/encode/vulkan_handle_wrapper_util.h I am suspecting that you will face compilation errors due to cyclic dependencies of the headers. See if you can work around those and if not then resolve into creating the new source files. If that's the case then use the ScopedDestroyLock in both places that use the shared_mutex

I'll try it out, will keep you updated. Thanks.

@mizhen
Contributor Author

mizhen commented Nov 24, 2023

> Good catch noticing the race condition.
>
> CaptureManager is already defining a shared_mutex with very similar functionality used to apply locking upon each api call entry and differentiate between shared and unique access. My suggestion is to investigate if you can declare there a new shared_mutex for the purpose of resource creation/destruction and similarly to the existing AcquireSharedApiCallLock() / AcquireExclusiveApiCallLock() define two functions to create and return unique/shared locks using the new mutex. This way you should not need to create two separate source files to isolate the new mutex.
>
> Since you will need to use the new mutex in framework/encode/vulkan_handle_wrapper_util.h I am suspecting that you will face compilation errors due to cyclic dependencies of the headers. See if you can work around those and if not then resolve into creating the new source files. If that's the case then use the ScopedDestroyLock in both places that use the shared_mutex

I'll try it out, will keep you updated. Thanks.

Yes, I did hit the compilation errors caused by cyclic dependencies with the change that moves class ScopedDestroyLock to capture_manager.h (1357ec0). Unfortunately, I didn't find a workaround.

@panos-lunarg
Contributor

I think I prefer 1357ec0.
The new class is defined inside the capture manager header. If you can make the existing mutex (api_call_mutex_) use the new class, it would be a 👍 from me.
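For reference, a ScopedDestroyLock along these lines can be sketched as a small RAII wrapper over a single static shared_mutex; this is an illustrative reading, not the actual code from 1357ec0.

```cpp
#include <cassert>
#include <shared_mutex>

// Illustrative sketch of a ScopedDestroyLock RAII helper (the real class
// may differ in detail): one static shared_mutex, locked shared or
// exclusive for the lifetime of the object, always released on scope exit.
class ScopedDestroyLock
{
  public:
    explicit ScopedDestroyLock(bool shared = true) : shared_(shared)
    {
        if (shared_)
            mutex_.lock_shared(); // creation paths
        else
            mutex_.lock();        // destroy + wrapper-delete path
    }

    ~ScopedDestroyLock()
    {
        if (shared_)
            mutex_.unlock_shared();
        else
            mutex_.unlock();
    }

    // Non-copyable: the lock is tied to exactly one scope.
    ScopedDestroyLock(const ScopedDestroyLock&) = delete;
    ScopedDestroyLock& operator=(const ScopedDestroyLock&) = delete;

  private:
    bool                     shared_;
    static std::shared_mutex mutex_;
};

std::shared_mutex ScopedDestroyLock::mutex_;
```

With RAII the mutex cannot be leaked on early returns or exceptions, which is the main advantage over manual lock/unlock calls.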

@locke-lunarg
Contributor

locke-lunarg commented Nov 28, 2023

Thank you @mizhen for fixing it.

However, I think it's not enough if the Create protection is only in CreateWrappedDispatchHandle. Look at VulkanCaptureManager::OverrideCreateBuffer as an example: after CreateWrappedHandle, the mutex is unlocked and the buffer is available to be destroyed, but the BufferWrapper is still used after CreateWrappedHandle. CustomEncoderPostCall could also use the wrapper. In this case, the Create protection has to span from CreateWrappedDispatchHandle to CustomEncoderPostCall. It's easier to lock between OverrideCreateBuffer (or the Vulkan function call) and CustomEncoderPostCall.

I have a small sample for the protection: locke-lunarg@bf0a776
I posted an issue about it: #1189

Further, maybe we could give each kind of handle its own mutex, so CreateBuffer doesn't lock out CreateImage. Or, even further, one CreateBuffer should not lock another CreateBuffer, or CreateBuffer should lock only its corresponding DestroyBuffer. But we can leave that for the future.

@locke-lunarg locke-lunarg self-requested a review December 1, 2023 17:42
@mizhen
Contributor Author

mizhen commented Jan 4, 2024

> Thank you @mizhen for fixing it.
>
> However, I think it's not enough if the Create protection is only in CreateWrappedDispatchHandle. We can look at an example in VulkanCaptureManager::OverrideCreateBuffer. After CreateWrappedHandle, the mutex is unlocked, and the buffer is available to be destroyed. But the BufferWrapper is still used after CreateWrappedHandle. Also, CustomEncoderPostCall could use the Wrapper. In this case, the Create protection has to be wider between CreateWrappedDispatchHandle and CustomEncoderPostCall. It's easier to lock it between OverrideCreateBuffer (or Vulkan function call) and CustomEncoderPostCall.
>
> I have a small sample for the protection. locke-lunarg@bf0a776 I posted an issue about it. #1189
>
> Further, maybe we could give an individual mutex to every kind of handle, so CreateBuffer doesn't lock CreateImage. Or even further, one CreateBuffer should not lock another CreateBuffer, or CreateBuffer lock only its DestroyBuffer. But we could leave it in the future.

Thank you for your review and providing the sample.

#1189 is the issue that shows up in the target title. I think the key of the issue is that the driver returns a previously deleted handle (the driver reuses destroyed resource handles). The issue only happens when the driver returns a previously deleted handle before GFXR has deleted that handle's wrapper. So this pull request makes destroying a Vulkan object and deleting its wrapper an atomic operation with respect to wrapper creation.

For the case mentioned in the comment, which relates to VulkanCaptureManager::OverrideCreateBuffer, the behavior of the target game and GFXR is as follows:
```
[Target game]:
    1> vkCreateBuffer (similar for any other call)

[GFXR captures the call]:
    2> CustomEncoderPreCall<format::ApiCallId::ApiCall_vkCreateBuffer>
    3> VulkanCaptureManager::Get()->OverrideCreateBuffer
    4> CustomEncoderPostCall<format::ApiCallId::ApiCall_vkCreateBuffer>
    5> return result

[Target game]:
    6> The target game gets the returned buffer handle and uses it
       (to access, delete, or do something else)
```

Therefore, before control returns to the target game, the game cannot use the returned buffer handle or delete the buffer. In other words, if another thread of the game gets control and deletes a buffer, it will be some other buffer. Deleting another buffer during the creation handling of this buffer does not cause the multi-threading issue addressed by this pull request.

While debugging and testing the pull request, I also compared performance with different lock scopes. The lock scope heavily affects the loading process of the target title, so the pull request selects the minimum scope.

Contributor

@locke-lunarg locke-lunarg left a comment


Yes, you are right. It's impossible for the buffer to get destroyed while it is being created. The PR looks good to me. Please rebase it, and it's ready to merge.

close: #1189

@mizhen
Contributor Author

mizhen commented Jan 8, 2024

> Yes, you are right. It's impossible for the buffer to get destroyed while it is being created. The PR looks good to me. Please rebase it, and it's ready to merge.
>
> close: #1189

Ok, will do, thanks.

mizhen added 6 commits January 8, 2024 14:31
With the default GFXR settings, a target title may destroy Vulkan handles
in one thread (A) and create handles of the same type in another thread (B).
In the window after the real handles are destroyed but before their wrappers
are deleted from the map in thread A, if thread B gets the CPU and creates
handles of the same type, and any of the newly created handles has the same
value as a previously destroyed handle, capture will fail because there is
no correct handle wrapper for the newly created handle. This commit fixes
the multi-threading issue.
Update copyright info for all changed files. Please note that the change
in gencode.py only adds the additional copyright info to auto-generated
files.

Change-Id: I211e4433da32682804d6138aa8c0820feb1ce7e6
Changed the description of the mutex and related functions used in the
commit to address review comments.
Changed the locking mechanism of the mutex in the commit to RAII.
Added class ScopedDestroyLock to streamline locking and unlocking of the
corresponding mutex.
Moved the ScopedDestroyLock class into its own dedicated source file.
@mizhen mizhen force-pushed the ming-vk-fix-multi-threading-issue-with-real-handle branch from 00853fa to e8a3582 Compare January 9, 2024 21:14
@ci-tester-lunarg

CI gfxreconstruct build queued with queue ID 113868.

@ci-tester-lunarg

CI gfxreconstruct build # 3656 running.

@mizhen
Contributor Author

mizhen commented Jan 9, 2024

> Yes, you are right. It's impossible for the buffer to get destroyed while it is being created. The PR looks good to me. Please rebase it, and it's ready to merge.
>
> close: #1189

> Ok, will do, thanks.

Done.

@ci-tester-lunarg

CI gfxreconstruct build # 3656 failed.

@ci-tester-lunarg

CI gfxreconstruct build queued with queue ID 113946.

@lunarpapillo
Contributor

FYI, your LunarG CI build #3656 seems to have failed due to an unrelated GPU crash on one machine (it passed on all the others). I've restarted the run.

@ci-tester-lunarg

CI gfxreconstruct build # 3658 running.

@ci-tester-lunarg

CI gfxreconstruct build # 3658 passed.

@locke-lunarg locke-lunarg merged commit 278ceb7 into LunarG:dev Jan 10, 2024
6 checks passed
Labels
P1 Prevents an important capture from being replayed

6 participants