-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Executors have deterministic shutdown #5547
Conversation
Mutable references are now allowed in the style guide.
Also: * Rename to just Cancel since it doesn't return a boolean indicating success or not. * Make Cancel private, since it's an implementation detail of `DelayedOperation`.
The RunSynchronized approach was fundamentally flawed when wrapped around an arbitrary dispatch queue supplied by a user. Users could supply a concurrent queue, which wouldn't provide any mutual exclusion guarantee.
Also: * Add tests Schedule of Tasks * Remove tests for template class Schedule
* Fix data race between Execute and Cancel * Fix multiple execution completing on destroyed executors
ece98e5
to
3af0346
Compare
This avoids needing to pepper all the tests with additional hacks that avoid this through other means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of review (didn't yet look at defer
and ExecutorLibdispatch
).
At a high level, this looks good. I do have some feedback on implementation details of Task
, though.
// An opaque, monotonically increasing identifier for each operation that does | ||
// not depend on their address. Where the `Tag` identifies the kind of | ||
// operation, the `Id` identifies the specific instance. | ||
using Id = uint32_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: I think it's slightly strange that Executor::Id
refers to the id of an operation, not the id of the executor. On the other hand, I can understand the desire for brevity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just promoting ExecutorStd::Id
to Executor
(and removing the separate implementation from TimeSlot
). I figured this would be noncontroversial without considering this too deeply.
If you'd like to revisit the original name from ExecutorStd
I'm happy to do so, but need some ideas about what you'd find acceptable. Is this a sequence number? A task id? Should I move Id
and Tag
to Task
? One argument against that is that nested types cannot be forward declared, which means that suddenly Task
would get much wider exposure.
FWIW I'm OK with this as-is, but again, I'm happy to work to make this better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any particular preference re. moving these to Task
-- this might be a little more natural, but I don't think it justifies the practical concerns you mention. I thought about simply making these Executor::TaskId/TaskTag
. If you have any objection to that, I'm absolutely fine with keeping the status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me defer this for now. I'd like to address this naming concern along with the inability of most consumers of Executor/AsyncQueue being unable to forward declare them in a holistic fashion.
// Called by `DelayedOperation` when its user calls `Cancel`. Implementations | ||
// of `Cancel` should also `Dispose` the underlying `Task` to actually prevent | ||
// execution. | ||
virtual void Cancel(Id operation_id) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, Try
in the function name indicated that this function is not guaranteed to succeed (because it is unable to cancel an already started operation). This seems like something worth reminding the caller about, what is the rationale for removing it? (Note: I don't think there's any general requirement that a TryFoo
function must return a boolean)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While there's no requirement that methods with a "try" prefix return a boolean they typically are designed to allow the caller to handle failure. For example, std::mutex::try_lock
, clearly makes a style of programming possible wherein the caller can take action based on whether the lock was acquired or not. My take is that "try" in a method name isn't an indicator of possible failure or idempotency (because many methods have that property without being prefixed with "try"), but rather that "try" intends for the caller to discriminate. In this case I felt that the caller shouldn't care and so the "try" wasn't helpful.
Meanwhile, this was particularly apparent to me because DelayedOperation::Cancel
is the actual API for cancelation used in the rest of our code. I made TryCancel
private, accessible only to friends, and renamed it to match its caller.
Also FWIW, note that in google3, CancellableClosure
has a Cancel
even though it does allow discrimination.
Firestore/core/src/util/task.cc
Outdated
{ | ||
// Invoke the operation without holding mutex_ to avoid deadlocks where | ||
// the current task can trigger its own cancellation. | ||
InverseLockGuard unlock(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use Defer
as well (in which case you can delete InverseLockGuard
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Firestore/core/src/util/task.h
Outdated
namespace util { | ||
|
||
// For testing | ||
class TrackingTask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, a friend declaration serves as a forward declaration as well, so this can be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Local references that are used outside the executor lock can be deleted as the task executes concurrently. The locals need to be retained to keep them alive while interrogating/awaiting the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL: I've fixed several additional races and merged master.
Firestore/core/src/util/task.cc
Outdated
{ | ||
// Invoke the operation without holding mutex_ to avoid deadlocks where | ||
// the current task can trigger its own cancellation. | ||
InverseLockGuard unlock(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Firestore/core/src/util/task.h
Outdated
namespace util { | ||
|
||
// For testing | ||
class TrackingTask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Generated by 🚫 Danger |
|
||
bool tag_scheduled = false; | ||
for (Task* task : matches) { | ||
bool task_completed = task->AwaitIfRunning(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: does potentially waiting for each task affect the semantics of this function? It seems like it could break out of the loop as soon as a single task returned !task_completed
(taking care to release every task regardless). Not sure if it matters either way, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it really changes the semantics but it seems reasonable to only wait until we find a scheduled task.
// Stores non-owned pointers to `TimeSlot`s. | ||
// Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer. | ||
|
||
std::unordered_set<Task*> async_tasks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schedule_
is an index of tasks by their Id, and only contains scheduled tasks. Cancellation of scheduled tasks is common (e.g. the idle timer task is canceled on every write), and forcing this to be a linear search of everything in the async queue is more expensive than I want it to be.
Honestly, I'm a little skeptical about the efficiency argument. I have a vague memory of adding debug logging to see the number of tasks on the queue at regular intervals and seeing a pretty low, never-really-growing value, for which linearly searching a vector would probably be just as fast. I don't feel strongly about it, though, given that the current version is well-tested at this point.
Firestore/core/src/util/task.h
Outdated
* | ||
* If the task is currently executing while it is invoked, `Cancel` will await | ||
* the completion of the Task. This makes `Cancel` safe to call in the | ||
* destructor of an Executor: any currently executing tasks will extend the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this was discussed before and I forgot: how does a Task
extend the lifetime of an executor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Through the behavior in this paragraph: the destructor/Dispose calls Cancel, but the task is currently running, so Cancel blocks, waiting for completion. This prevents the destructor from proceeding until the task is complete.
I made this "... any currently executing tasks will effectively extend the lifetime of the executor."
should_release = true; | ||
async_tasks_.erase(found); | ||
|
||
if (!task->is_immediate()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps ExecutorLibdispatch
should start numbering at one or Task
should default to -1
? No strong feelings, though.
// atomic. `RunSynchronized` may result in a deadlock. | ||
for (const auto& entry : schedule_) { | ||
entry.second->MarkDone(); | ||
// Release this method's ownership. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment need to be changed? The destructor didn't call Retain
, so it would seem it releases the class's, not the method's ownership.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a comment above the swap into local_tasks
where it describes how swapping out of the member into the local will prevent OnCompletion
from finding anything. That effectively transfers ownership to the method.
I've added a parenthetical to this comment:
Release this method's ownership (obtained when
local_tasks
swapped withtasks_
).
} | ||
|
||
if (should_release) { | ||
task->Release(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why is Release
called without the lock? Is this necessary for correctness, or is it an optimization? A short comment could be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the presence of calls between objects that each have their own locks, a general deadlock avoidance protocol is to avoid making calls to others while holding your own lock.
Release
does not acquire the Task
's lock and does not call back to the executor so I guess it's safe to call while holding the executor's lock, but I'm wary of abandoning the general rule on the basis of this knowledge. Task could change in such a way that it called back into executor, or even worse could change such that it blocked on some other thread's callback into the executor and then it would introduce the possibility of deadlock (the former case can be solved by using recursive locks, the latter cannot).
I've added a comment here stating:
Avoid calling potentially locking methods on the task while holding the executor's lock.
ExecutorLibdispatch::PopFromSchedule() { | ||
absl::optional<Executor::TaggedOperation> result; | ||
bool ExecutorLibdispatch::IsIdScheduled(Id id) const { | ||
Task* found_task = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at these functions, it seems like Schedule
is a clearly "separatable" abstraction (similar to ExecutorStd
) that could wrap the mutex and the two containers and potentially return TaskRef
s. It might be overkill, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this too. For now I've left it as is just to avoid further iterating on this PR.
// Stores non-owned pointers to `TimeSlot`s. | ||
// Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer. | ||
|
||
std::unordered_set<Task*> async_tasks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Do you think it's worth appending this to the existing comment? Rough sketch: (while under normal operation the number of tasks on the schedule is small and doesn't grow much, unusual usage of the SDK could result in thousands of tasks being added)
.
Now that the Executor makes strong guarantees that tasks submitted to it will either be completely run or not start at all once destruction starts (#5547), build a Dispose protocol on top of this to allow Firestore to shut itself down leaf first. Now that Firestore calls the Dispose chain down to the Executor, it's no longer possible for tasks to run while the instance is partially destructed nor can user callbacks execute after the instance has been destroyed. This is the other half of the iOS changes required to address firebase/quickstart-unity#638. There are a few follow-on tasks that this PR does not address: * Ownership of FirestoreClient and AsyncQueue are still shared though they no longer need to be. * The tests are still overly defensive about keeping different test cases isolated from each other. Much of this can be removed now that we're guaranteed to release resources by the time the test case shuts down (at least in GoogleTest; XCTest remains to be seen).
In particular:
Note:
Executor::Dispose
or clean up any of the higher level componentry to take advantage of these changes.UserCallbackExecutor
to handle this), but in the end it turns out to be simpler to just cancel everything and make cancel handle tasks-in-progress in a reasonable way. LMK if you have grave objections to this, in which case I can rework this.I've tried to pull this apart into comprehensible commits, but the cutover from executor-specific tracking to
Task
in ceb3aa9 wasn't something I could split without a lot of additional scaffolding. This commit remains a bear; sorry.This is the foundation for addressing firebase/quickstart-unity#638.