-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Signal timeout when locking task in getLongRunningOpCalls #8511
Conversation
Summary: If there are some problem (e.g. deadlock) with the mutex in `Task`, `getLongRunningOpCalls` would block forever waiting for it. In this case we should signal the error and let the caller to generate some form of alerts. This is important as we rely on `getLongRunningOpCalls` to detect blocking and the function itself should not block. Differential Revision: D53048088
✅ Deploy Preview for meta-velox canceled.
|
This pull request was exported from Phabricator. Differential Revision: D53048088 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta LGTM. Thanks!
@@ -194,7 +194,7 @@ void BlockingState::setResume(std::shared_ptr<BlockingState> state) { | |||
auto& driver = state->driver_; | |||
auto& task = driver->task(); | |||
|
|||
std::lock_guard<std::mutex> l(task->mutex()); | |||
std::lock_guard<std::timed_mutex> l(task->mutex()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta we do see the problem in the production? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's for preventive measure, first to rule out the possibility that we get deadlock here during the detection (collecting run time of op call)
/// Return false when the lock cannot be taken within the timeout, in that | ||
/// case the result is not populated. Return true if everything works well. | ||
bool getLongRunningOpCalls( | ||
std::chrono::nanoseconds lockTimeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the timeout? Can't find where is it specific for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the timeout for locking the per task mutex. I will update the comment to make it more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what value do we use for this timeout? Where can I see that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be in D52921877 but that is still work in progress. I am planning to use 10s.
This pull request has been merged in 147ff30. |
Conbench analyzed the 1 benchmark run on commit There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
Summary:
If there are some problem (e.g. deadlock) with the mutex in
Task
,getLongRunningOpCalls
would block forever waiting for it. In this case weshould signal the error and let the caller to generate some form of alerts.
This is important as we rely on
getLongRunningOpCalls
to detect blocking andthe function itself should not block.
Differential Revision: D53048088