Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Executors have deterministic shutdown #5547

Merged
merged 54 commits into from
May 18, 2020

Conversation

wilhuff
Copy link
Contributor

@wilhuff wilhuff commented May 8, 2020

In particular:

  • All tasks are guaranteed to either run to completion or not run at all;
  • Executors can now be destroyed from the threads they own (if applicable); and
  • Many common concepts between the Executor implementations have been harmonized and shared.

Note:

  • This change does not yet implement Executor::Dispose or clean up any of the higher level componentry to take advantage of these changes.
  • The executors implement shutdown by canceling any tasks that haven't started yet. This diverges from the plan (which called for a separate UserCallbackExecutor to handle this), but in the end it turns out to be simpler to just cancel everything and make cancel handle tasks-in-progress in a reasonable way. LMK if you have grave objections to this, in which case I can rework this.

I've tried to pull this apart into comprehensible commits, but the cutover from executor-specific tracking to Task in ceb3aa9 wasn't something I could split without a lot of additional scaffolding. This commit remains a bear; sorry.

This is the foundation for addressing firebase/quickstart-unity#638.

wilhuff added 17 commits May 6, 2020 18:14
Mutable references are now allowed in the style guide.
Also:

  * Rename to just Cancel since it doesn't return a boolean indicating
    success or not.
  * Make Cancel private, since it's an implementation detail of
    `DelayedOperation`.
The RunSynchronized approach was fundamentally flawed when wrapped
around an arbitrary dispatch queue supplied by a user. Users could
supply a concurrent queue, which wouldn't provide any mutual exclusion
guarantee.
Also:
  * Add tests Schedule of Tasks
  * Remove tests for template class Schedule
wilhuff added 2 commits May 8, 2020 09:03
  * Fix data race between Execute and Cancel
  * Fix multiple execution completing on destroyed executors
@wilhuff wilhuff force-pushed the wilhuff/deterministic-executor branch from ece98e5 to 3af0346 Compare May 8, 2020 16:06
Copy link
Contributor

@var-const var-const left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of review (didn't yet look at defer and ExecutorLibdispatch).

At a high level, this looks good. I do have some feedback on implementation details of Task, though.

Firestore/core/src/util/task.cc Outdated Show resolved Hide resolved
Firestore/core/src/util/executor.h Outdated Show resolved Hide resolved
Firestore/core/src/util/executor.h Outdated Show resolved Hide resolved
// An opaque, monotonically increasing identifier for each operation that does
// not depend on their address. Where the `Tag` identifies the kind of
// operation, the `Id` identifies the specific instance.
using Id = uint32_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: I think it's slightly strange that Executor::Id refers to the id of an operation, not the id of the executor. On the other hand, I can understand the desire for brevity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just promoting ExecutorStd::Id to Executor (and removing the separate implementation from TimeSlot). I figured this would be noncontroversial without considering this too deeply.

If you'd like to revisit the original name from ExecutorStd I'm happy to do so, but need some ideas about what you'd find acceptable. Is this a sequence number? A task id? Should I move Id and Tag to Task? One argument against that is that nested types cannot be forward declared, which means that suddenly Task would get much wider exposure.

FWIW I'm OK with this as-is, but again, I'm happy to work to make this better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any particular preference re. moving these to Task -- this might be a little more natural, but I don't think it justifies the practical concerns you mention. I thought about simply making these Executor::TaskId/TaskTag. If you have any objection to that, I'm absolutely fine with keeping the status quo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me defer this for now. I'd like to address this naming concern along with the inability of most consumers of Executor/AsyncQueue being unable to forward declare them in a holistic fashion.

// Called by `DelayedOperation` when its user calls `Cancel`. Implementations
// of `Cancel` should also `Dispose` the underlying `Task` to actually prevent
// execution.
virtual void Cancel(Id operation_id) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, Try in the function name indicated that this function is not guaranteed to succeed (because it is unable to cancel an already started operation). This seems like something worth reminding the caller about, what is the rationale for removing it? (Note: I don't think there's any general requirement that a TryFoo function must return a boolean)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While there's no requirement that methods with a "try" prefix return a boolean they typically are designed to allow the caller to handle failure. For example, std::mutex::try_lock, clearly makes a style of programming possible wherein the caller can take action based on whether the lock was acquired or not. My take is that "try" in a method name isn't an indicator of possible failure or idempotency (because many methods have that property without being prefixed with "try"), but rather that "try" intends for the caller to discriminate. In this case I felt that the caller shouldn't care and so the "try" wasn't helpful.

Meanwhile, this was particularly apparent to me because DelayedOperation::Cancel is the actual API for cancelation used in the rest of our code. I made TryCancel private, accessible only to friends, and renamed it to match its caller.

Also FWIW, note that in google3, CancellableClosure has a Cancel even though it does allow discrimination.

Firestore/core/src/util/task.h Show resolved Hide resolved
Firestore/core/test/unit/util/schedule_test.cc Outdated Show resolved Hide resolved
Firestore/core/test/unit/util/executor_test.cc Outdated Show resolved Hide resolved
Firestore/core/test/unit/util/task_test.cc Outdated Show resolved Hide resolved
Firestore/core/test/unit/util/task_test.cc Outdated Show resolved Hide resolved
@var-const var-const assigned wilhuff and unassigned var-const May 9, 2020
@wilhuff wilhuff assigned var-const and unassigned wilhuff May 12, 2020
{
// Invoke the operation without holding mutex_ to avoid deadlocks where
// the current task can trigger its own cancellation.
InverseLockGuard unlock(mutex_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use Defer as well (in which case you can delete InverseLockGuard)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace util {

// For testing
class TrackingTask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, a friend declaration serves as a forward declaration as well, so this can be omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@var-const var-const assigned wilhuff and unassigned var-const May 13, 2020
Copy link
Contributor Author

@wilhuff wilhuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL: I've fixed several additional races and merged master.

{
// Invoke the operation without holding mutex_ to avoid deadlocks where
// the current task can trigger its own cancellation.
InverseLockGuard unlock(mutex_);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace util {

// For testing
class TrackingTask;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@google-oss-bot
Copy link

1 Warning
⚠️ Did you forget to add a changelog entry? (Add #no-changelog to the PR description to silence this warning.)

Generated by 🚫 Danger


bool tag_scheduled = false;
for (Task* task : matches) {
bool task_completed = task->AwaitIfRunning();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: does potentially waiting for each task affect the semantics of this function? It seems like it could break out of the loop as soon as a single task returned !task_completed (taking care to release every task regardless). Not sure if it matters either way, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it really changes the semantics but it seems reasonable to only wait until we find a scheduled task.

Firestore/core/src/util/task.cc Show resolved Hide resolved
// Stores non-owned pointers to `TimeSlot`s.
// Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer.

std::unordered_set<Task*> async_tasks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schedule_ is an index of tasks by their Id, and only contains scheduled tasks. Cancellation of scheduled tasks is common (e.g. the idle timer task is canceled on every write), and forcing this to be a linear search of everything in the async queue is more expensive than I want it to be.

Honestly, I'm a little skeptical about the efficiency argument. I have a vague memory of adding debug logging to see the number of tasks on the queue at regular intervals and seeing a pretty low, never-really-growing value, for which linearly searching a vector would probably be just as fast. I don't feel strongly about it, though, given that the current version is well-tested at this point.

*
* If the task is currently executing while it is invoked, `Cancel` will await
* the completion of the Task. This makes `Cancel` safe to call in the
* destructor of an Executor: any currently executing tasks will extend the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this was discussed before and I forgot: how does a Task extend the lifetime of an executor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Through the behavior in this paragraph: the destructor/Dispose calls Cancel, but the task is currently running, so Cancel blocks, waiting for completion. This prevents the destructor from proceeding until the task is complete.

I made this "... any currently executing tasks will effectively extend the lifetime of the executor."

Firestore/core/src/util/executor_libdispatch.h Outdated Show resolved Hide resolved
Firestore/core/src/util/executor_libdispatch.h Outdated Show resolved Hide resolved
should_release = true;
async_tasks_.erase(found);

if (!task->is_immediate()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps ExecutorLibdispatch should start numbering at one or Task should default to -1? No strong feelings, though.

// atomic. `RunSynchronized` may result in a deadlock.
for (const auto& entry : schedule_) {
entry.second->MarkDone();
// Release this method's ownership.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment need to be changed? The destructor didn't call Retain, so it would seem it releases the class's, not the method's ownership.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a comment above the swap into local_tasks where it describes how swapping out of the member into the local will prevent OnCompletion from finding anything. That effectively transfers ownership to the method.

I've added a parenthetical to this comment:

Release this method's ownership (obtained when local_tasks swapped with tasks_).

}

if (should_release) {
task->Release();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why is Release called without the lock? Is this necessary for correctness, or is it an optimization? A short comment could be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the presence of calls between objects that each have their own locks, a general deadlock avoidance protocol is to avoid making calls to others while holding your own lock.

Release does not acquire the Task's lock and does not call back to the executor so I guess it's safe to call while holding the executor's lock, but I'm wary of abandoning the general rule on the basis of this knowledge. Task could change in such a way that it called back into executor, or even worse could change such that it blocked on some other thread's callback into the executor and then it would introduce the possibility of deadlock (the former case can be solved by using recursive locks, the latter cannot).

I've added a comment here stating:

Avoid calling potentially locking methods on the task while holding the executor's lock.

ExecutorLibdispatch::PopFromSchedule() {
absl::optional<Executor::TaggedOperation> result;
bool ExecutorLibdispatch::IsIdScheduled(Id id) const {
Task* found_task = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at these functions, it seems like Schedule is a clearly "separatable" abstraction (similar to ExecutorStd) that could wrap the mutex and the two containers and potentially return TaskRefs. It might be overkill, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this too. For now I've left it as is just to avoid further iterating on this PR.

Firestore/core/src/util/task.cc Show resolved Hide resolved
// Stores non-owned pointers to `TimeSlot`s.
// Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer.

std::unordered_set<Task*> async_tasks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Do you think it's worth appending this to the existing comment? Rough sketch: (while under normal operation the number of tasks on the schedule is small and doesn't grow much, unusual usage of the SDK could result in thousands of tasks being added).

@wilhuff wilhuff merged commit e462d04 into master May 18, 2020
@wilhuff wilhuff deleted the wilhuff/deterministic-executor branch May 18, 2020 16:06
wilhuff added a commit that referenced this pull request May 18, 2020
Now that the Executor makes strong guarantees that tasks submitted to it will either be completely run or not start at all once destruction starts (#5547), build a Dispose protocol on top of this to allow Firestore to shut itself down leaf first.

Now that Firestore calls the Dispose chain down to the Executor, it's no longer possible for tasks to run while the instance is partially destructed nor can user callbacks execute after the instance has been destroyed.

This is the other half of the iOS changes required to address firebase/quickstart-unity#638.

There are a few follow-on tasks that this PR does not address:

  * Ownership of FirestoreClient and AsyncQueue are still shared though they no longer need to be.
  * The tests are still overly defensive about keeping different test cases isolated from each other. Much of this can be removed now that we're guaranteed to release resources by the time the test case shuts down (at least in GoogleTest; XCTest remains to be seen).
@firebase firebase locked and limited conversation to collaborators Jun 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants