
Refactor WaitForThreadExit #752
Merged: pvelesko merged 8 commits into main from WaitForPerThread on Jan 18, 2024
Conversation

@pvelesko (Collaborator) commented Jan 13, 2024

In the case that a user creates a bunch of threads and main() exits without calling join() on those threads, we must delay backend destruction so that the other threads can finish their work and clean up via the thread_local ctor/dtor sequence.

Otherwise, the main thread may attempt to destroy the device and context handles while they are still in use by the detached threads.

This is an example of a bad program, since the main thread should call join(), but it's better to handle it gracefully than to let it segfault.

  • Add a test
  • Refactor using atomic operations in the thread_local chipstar::Queue ctor/dtor (see the sketch below)
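
A minimal sketch of the counting scheme described above, assuming hypothetical scaffolding (PerThreadQueue, waitForThreadExit) around an atomic counter named after the PR's NumQueuesAlive; the real logic lives in chipstar::Queue and CHIPBackend.cc:

#include <atomic>
#include <chrono>
#include <thread>

// Counter named after the PR's NumQueuesAlive; the surrounding types are
// hypothetical stand-ins for chipstar::Queue and the backend teardown path.
std::atomic<int> NumQueuesAlive{0};

struct PerThreadQueue {
  PerThreadQueue() { NumQueuesAlive.fetch_add(1, std::memory_order_relaxed); }
  ~PerThreadQueue() { NumQueuesAlive.fetch_sub(1, std::memory_order_release); }
};

// Constructed lazily, on first use in each thread.
thread_local PerThreadQueue Queue;

void waitForThreadExit() {
  // Backend teardown on the main thread: poll until every per-thread
  // queue's thread_local destructor has run.
  while (NumQueuesAlive.load(std::memory_order_acquire) != 0)
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main() {
  std::thread T([] { (void)&Queue; }); // worker touches its per-thread queue
  T.detach();                          // simulate a program that never join()s
  waitForThreadExit();                 // delay teardown until the dtor has run
}

Note the remaining window: if main reaches waitForThreadExit() before the worker has constructed its queue, the counter is still zero and the wait returns early; this is the race discussed in the review below.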

@pvelesko pvelesko marked this pull request as draft January 13, 2024 18:52
Commits:
  • add TestThreadDetachCleanup test
  • remove some dead PerThread code
  • module use
  • init counter to 0
  • fmt
  • update UnitTests.cmake
  • unload oneapi/compiler before running
  • undo some changes
Commit description: This commit updates the thread-tracking variables in the `CHIPBackend` class. The variable `NumPerThreadQueues` has been renamed to `NumQueuesAlive` for clarity. The change spans both the source file `CHIPBackend.cc` and the header file `CHIPBackend.hh`.
@pvelesko pvelesko marked this pull request as ready for review January 16, 2024 09:38
{
auto NumPerThreadQueuesActive = ::Backend->getPerThreadQueuesActive();
if (!NumPerThreadQueuesActive)
// go through all devices checking their NumQueuesAlive until all they're all
Collaborator (suggested change):
- // go through all devices checking their NumQueuesAlive until all they're all
+ // go through all devices checking their NumQueuesAlive until they're all

logWarn("Waiting for all per-thread queues to exit... This condition "
"would indicate that the main thread didn't call "
"join()");
sleep(1);
Collaborator:
The sleep here feels wrong/hacky. Could it be a condition variable signaled when the device's ref count goes to zero?

Collaborator Author (pvelesko):

That's what we do above: we have an atomic counter in the queue ctor/dtor.

The issue is that the main thread can exit before all the threads have reached the ctor. I agree that this is not optimal, but I can't think of a better way to solve it. I tried creating another thread-counter class and making it a public thread_local, hoping that its constructor would run as soon as a thread is created, but it still gets called later down the call stack, as indicated in the comment.

Perhaps there's a clang trick to force a certain object's constructor to execute before others? @pjaaskel
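
For background on why the early-registration attempt fails: C++ only guarantees that a thread_local with dynamic initialization is constructed before its first odr-use in that thread, and clang/gcc construct it lazily on first access rather than at thread start. A standalone illustration, with a hypothetical Announcer type (not chipstar code):

#include <cstdio>
#include <thread>

struct Announcer {
  Announcer() { std::puts("Announcer constructed"); }
};

// Dynamic initialization, so construction is deferred per thread.
thread_local Announcer A;

int main() {
  std::thread T([] {
    std::puts("thread body entered"); // prints first
    (void)&A; // A is constructed here, at first odr-use, not at thread start
  });
  T.join();
}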

Collaborator Author (pvelesko):

Sorry, I thought you meant the first sleep statement, but my question still stands.

As for this one, the sleep is part of the loop where we check the counter. I think we want this debug statement, and if we remove the sleep here it will spam stdout. This section of the code only runs when a user has written a bad HIP program whose threads are not explicitly join()ed.

Collaborator:

Yes, I just meant a cleaner approach for the wait instead of sleep-polling with a rather long sleep time: a (pthread) condition-variable kind of approach. Not critical if you think it's good enough.
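
For reference, a sketch of the condition-variable approach being suggested, assuming hypothetical onQueueCreated/onQueueDestroyed hooks called from the queue ctor/dtor (this is not what the PR merged):

#include <condition_variable>
#include <mutex>

std::mutex Mtx;
std::condition_variable AllQueuesGone;
int NumQueuesAlive = 0; // guarded by Mtx

void onQueueCreated() {
  std::lock_guard<std::mutex> Lock(Mtx);
  ++NumQueuesAlive;
}

void onQueueDestroyed() {
  {
    std::lock_guard<std::mutex> Lock(Mtx);
    --NumQueuesAlive;
  }
  AllQueuesGone.notify_all(); // wake the waiting main thread
}

void waitForThreadExit() {
  // Block instead of sleep-polling; wakes as soon as the count hits zero.
  std::unique_lock<std::mutex> Lock(Mtx);
  AllQueuesGone.wait(Lock, [] { return NumQueuesAlive == 0; });
}

This removes the polling and the one-second latency, but not the race described above: the main thread can still reach waitForThreadExit() before a newly spawned thread has registered its queue.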

@pvelesko (Collaborator Author):

Since sleep-polling only happens in a rare edge case, merging.

@pvelesko pvelesko merged commit b18a3af into main Jan 18, 2024
28 checks passed
@pvelesko pvelesko deleted the WaitForPerThread branch January 18, 2024 13:40