
[SYCL] Fix reductions not working inside graph #50

Merged (3 commits) Dec 13, 2022

Conversation

@Bensuo (Collaborator) commented Dec 8, 2022

Graph submission now properly creates a host-visible event on the command list, allowing auxiliary resources to be cleaned up.
The code is taken from `_pi_queue::executeCommandList`. Copy-pasting that much code is not ideal, but there is no immediately obvious way to reuse that function when flushing.

Updated to remove the code duplication, instead relying on the `OKToBatchCommand` parameter of `executeCommandList` to determine when to actually submit the command list and when to block execution.

This at least makes the examples that use reductions run, though it doesn't solve the related problems with buffers and other regular operations on the queue.

Closes #24

- Graph submission now properly creates a host-visible event on the command list, allowing auxiliary resources to be cleaned up
@reble (Owner) commented Dec 8, 2022

@Bensuo Great! Thanks for fixing that. We should just add a note with a TODO to remove the code duplication.

Now I'm wondering why we shouldn't just call `executeCommandList(CommandList, false, false)` and make sure that we opt in to the event creation when in lazy mode. Like:

```cpp
if (!Device->useImmediateCommandLists() ||
    (Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION)) {
```

The idea was to decouple closing and executing the CommandList. But that is not what is happening now.

@EwanC (Collaborator) left a comment

Nice work figuring this out ⭐
I think we should merge this to fix the current issues, then try to rework it as Pablo suggests in a follow-up.

sycl/plugins/level_zero/pi_level_zero.cpp (review thread resolved, outdated)
@EwanC added the "Graph Implementation" (Related to DPC++ implementation and testing) label Dec 9, 2022
@Bensuo (Collaborator, Author) commented Dec 9, 2022

> @Bensuo Great! Thanks for fixing that. We should just add a note with a TODO to remove the code duplication.
>
> Now I'm wondering why we should not just call `executeCommandList(CommandList, false, false)` and make sure that we opt in the event creation when in lazy mode? Like: `if( !Device->useImmediateCommandLists() || Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION ) {`
>
> The idea was to decouple closing and executing the CommandList. But that is not what is happening now.

So I've been looking into this idea, and I think I have something along those lines. We only need to modify the initial check in `executeCommandList` to also consider whether the command list is allowed to batch. We can reasonably assume that kernels will always be allowed to batch (and thus be blocked from executing immediately), so if a list is not allowed to batch we let it execute; that makes the call work when it comes from the flush. This doesn't address any of the additional issues beyond reductions, and relying on kernels always being batchable could be problematic if that changes in the future. It does, however, remove all the code duplication from the flush.

I'll push it as an update to this PR but can always roll back to the already approved duplication approach if need be.

- Remove code duplication in piQueueFlush in favour of calling executeCommandList
- executeCommandList slightly modified to block execution only for command lists not allowed to be batched.
@Bensuo Bensuo requested a review from reble December 13, 2022 10:55
@EwanC EwanC merged commit c99bdca into sycl-graph-poc-v2 Dec 13, 2022
reble pushed a commit that referenced this pull request May 15, 2023
…callback

The `TypeSystemMap::m_mutex` guards against concurrent modifications
of members of `TypeSystemMap`. In particular, `m_map`.

`TypeSystemMap::ForEach` iterates through the entire `m_map` calling
a user-specified callback for each entry. This is all done while
`m_mutex` is locked. However, there's nothing that guarantees that
the callback itself won't call back into `TypeSystemMap` APIs on the
same thread. This led to double-locking `m_mutex`, which is undefined
behaviour. We've seen this cause a deadlock in the Swift plugin with
the following backtrace:

```

int main() {
    std::unique_ptr<int> up = std::make_unique<int>(5);

    volatile int val = *up;
    return val;
}

clang++ -std=c++2a -g -O1 main.cpp

./bin/lldb -o "br se -p return" -o run -o "v *up" -o "expr *up" -b
```

```
frame #4: std::lock_guard<std::mutex>::lock_guard
frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2
frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage
frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage
...
frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths
frame #27: lldb_private::SwiftASTContext::LoadModule
frame #30: swift::ModuleDecl::collectLinkLibraries
frame #31: lldb_private::SwiftASTContext::LoadModule
frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl
frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports
frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext
frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState
frame #38: lldb_private::Target::GetPersistentSymbol
frame #41: lldb_private::TypeSystemMap::ForEach                 <<<< Lock #1
frame #42: lldb_private::Target::GetPersistentSymbol
frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols
frame #44: lldb_private::IRExecutionUnit::FindSymbol
frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence
frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #48: llvm::LinkingSymbolResolver::findSymbol
frame #49: llvm::LegacyJITSymbolResolver::lookup
frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols
frame #51: llvm::RuntimeDyldImpl::resolveRelocations
frame #52: llvm::MCJIT::finalizeLoadedModules
frame #53: llvm::MCJIT::finalizeObject
frame #54: lldb_private::IRExecutionUnit::ReportAllocations
frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo
frame #56: lldb_private::ClangExpressionParser::PrepareForExecution
frame #57: lldb_private::ClangUserExpression::TryParse
frame #58: lldb_private::ClangUserExpression::Parse
```

Our solution is to simply iterate over a local copy of `m_map`.

**Testing**

* Confirmed on manual reproducer (would reproduce 100% of the time
  before the patch)

Differential Revision: https://reviews.llvm.org/D149949
@Bensuo Bensuo deleted the ben/reduction-fix branch August 7, 2023 16:26