Make Executors have deterministic shutdown #5547

wilhuff · 2020-05-08T02:52:30Z

In particular:

All tasks are guaranteed to either run to completion or not run at all;
Executors can now be destroyed from the threads they own (if applicable); and
Many common concepts between the Executor implementations have been harmonized and shared.

Note:

This change does not yet implement Executor::Dispose or clean up any of the higher level componentry to take advantage of these changes.
The executors implement shutdown by canceling any tasks that haven't started yet. This diverges from the plan (which called for a separate UserCallbackExecutor to handle this), but in the end it turns out to be simpler to just cancel everything and make cancel handle tasks-in-progress in a reasonable way. LMK if you have grave objections to this, in which case I can rework this.

I've tried to pull this apart into comprehensible commits, but the cutover from executor-specific tracking to Task in ceb3aa9 wasn't something I could split without a lot of additional scaffolding. This commit remains a bear; sorry.

This is the foundation for addressing firebase/quickstart-unity#638.

Mutable references are now allowed in the style guide.

Also: * Rename to just Cancel since it doesn't return a boolean indicating success or not. * Make Cancel private, since it's an implementation detail of `DelayedOperation`.

The RunSynchronized approach was fundamentally flawed when wrapped around an arbitrary dispatch queue supplied by a user. Users could supply a concurrent queue, which wouldn't provide any mutual exclusion guarantee.

Also: * Add tests Schedule of Tasks * Remove tests for template class Schedule

* Fix data race between Execute and Cancel * Fix multiple execution completing on destroyed executors

This avoids needing to pepper all the tests with additional hacks that avoid this through other means.

var-const

First round of review (didn't yet look at defer and ExecutorLibdispatch).

At a high level, this looks good. I do have some feedback on implementation details of Task, though.

Firestore/core/src/util/task.cc

Firestore/core/src/util/executor.h

var-const · 2020-05-08T18:13:11Z

Firestore/core/src/util/executor.h

+  // An opaque, monotonically increasing identifier for each operation that does
+  // not depend on their address. Where the `Tag` identifies the kind of
+  // operation, the `Id` identifies the specific instance.
+  using Id = uint32_t;


Optional: I think it's slightly strange that Executor::Id refers to the id of an operation, not the id of the executor. On the other hand, I can understand the desire for brevity.

I was just promoting ExecutorStd::Id to Executor (and removing the separate implementation from TimeSlot). I figured this would be noncontroversial without considering this too deeply.

If you'd like to revisit the original name from ExecutorStd I'm happy to do so, but need some ideas about what you'd find acceptable. Is this a sequence number? A task id? Should I move Id and Tag to Task? One argument against that is that nested types cannot be forward declared, which means that suddenly Task would get much wider exposure.

FWIW I'm OK with this as-is, but again, I'm happy to work to make this better.

I don't have any particular preference re. moving these to Task -- this might be a little more natural, but I don't think it justifies the practical concerns you mention. I thought about simply making these Executor::TaskId/TaskTag. If you have any objection to that, I'm absolutely fine with keeping the status quo.

Let me defer this for now. I'd like to address this naming concern along with the inability of most consumers of Executor/AsyncQueue being unable to forward declare them in a holistic fashion.

var-const · 2020-05-08T18:26:32Z

Firestore/core/src/util/executor.h

+  // Called by `DelayedOperation` when its user calls `Cancel`. Implementations
+  // of `Cancel` should also `Dispose` the underlying `Task` to actually prevent
+  // execution.
+  virtual void Cancel(Id operation_id) = 0;


Previously, Try in the function name indicated that this function is not guaranteed to succeed (because it is unable to cancel an already started operation). This seems like something worth reminding the caller about, what is the rationale for removing it? (Note: I don't think there's any general requirement that a TryFoo function must return a boolean)

While there's no requirement that methods with a "try" prefix return a boolean they typically are designed to allow the caller to handle failure. For example, std::mutex::try_lock, clearly makes a style of programming possible wherein the caller can take action based on whether the lock was acquired or not. My take is that "try" in a method name isn't an indicator of possible failure or idempotency (because many methods have that property without being prefixed with "try"), but rather that "try" intends for the caller to discriminate. In this case I felt that the caller shouldn't care and so the "try" wasn't helpful.

Meanwhile, this was particularly apparent to me because DelayedOperation::Cancel is the actual API for cancelation used in the rest of our code. I made TryCancel private, accessible only to friends, and renamed it to match its caller.

Also FWIW, note that in google3, CancellableClosure has a Cancel even though it does allow discrimination.

Firestore/core/src/util/task.h

Firestore/core/test/unit/util/schedule_test.cc

Firestore/core/test/unit/util/executor_test.cc

Firestore/core/test/unit/util/task_test.cc

var-const · 2020-05-13T00:49:11Z

Firestore/core/src/util/task.cc

+      {
+        // Invoke the operation without holding mutex_ to avoid deadlocks where
+        // the current task can trigger its own cancellation.
+        InverseLockGuard unlock(mutex_);


Should this use Defer as well (in which case you can delete InverseLockGuard)?

var-const · 2020-05-13T01:00:25Z

Firestore/core/src/util/task.h

+namespace util {
+
+// For testing
+class TrackingTask;


IIRC, a friend declaration serves as a forward declaration as well, so this can be omitted.

Local references that are used outside the executor lock can be deleted as the task executes concurrently. The locals need to be retained to keep them alive while interrogating/awaiting the task.

wilhuff

PTAL: I've fixed several additional races and merged master.

wilhuff · 2020-05-14T19:56:56Z

Firestore/core/src/util/task.cc

+      {
+        // Invoke the operation without holding mutex_ to avoid deadlocks where
+        // the current task can trigger its own cancellation.
+        InverseLockGuard unlock(mutex_);


wilhuff · 2020-05-14T20:30:35Z

Firestore/core/src/util/task.h

+namespace util {
+
+// For testing
+class TrackingTask;


google-oss-bot · 2020-05-15T16:43:04Z

	1 Warning
⚠️	Did you forget to add a changelog entry? (Add #no-changelog to the PR description to silence this warning.)

Generated by 🚫 Danger

var-const · 2020-05-15T20:19:03Z

Firestore/core/src/util/executor_libdispatch.mm

+
+  bool tag_scheduled = false;
+  for (Task* task : matches) {
+    bool task_completed = task->AwaitIfRunning();


Question: does potentially waiting for each task affect the semantics of this function? It seems like it could break out of the loop as soon as a single task returned !task_completed (taking care to release every task regardless). Not sure if it matters either way, though.

I don't think it really changes the semantics but it seems reasonable to only wait until we find a scheduled task.

Firestore/core/src/util/task.cc

var-const · 2020-05-15T21:20:53Z

Firestore/core/src/util/executor_libdispatch.h

-  // Stores non-owned pointers to `TimeSlot`s.
-  // Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer.
+
+  std::unordered_set<Task*> async_tasks_;


schedule_ is an index of tasks by their Id, and only contains scheduled tasks. Cancellation of scheduled tasks is common (e.g. the idle timer task is canceled on every write), and forcing this to be a linear search of everything in the async queue is more expensive than I want it to be.

Honestly, I'm a little skeptical about the efficiency argument. I have a vague memory of adding debug logging to see the number of tasks on the queue at regular intervals and seeing a pretty low, never-really-growing value, for which linearly searching a vector would probably be just as fast. I don't feel strongly about it, though, given that the current version is well-tested at this point.

var-const · 2020-05-15T21:25:02Z

Firestore/core/src/util/task.h

+   *
+   * If the task is currently executing while it is invoked, `Cancel` will await
+   * the completion of the Task. This makes `Cancel` safe to call in the
+   * destructor of an Executor: any currently executing tasks will extend the


Sorry if this was discussed before and I forgot: how does a Task extend the lifetime of an executor?

Through the behavior in this paragraph: the destructor/Dispose calls Cancel, but the task is currently running, so Cancel blocks, waiting for completion. This prevents the destructor from proceeding until the task is complete.

I made this "... any currently executing tasks will effectively extend the lifetime of the executor."

Firestore/core/src/util/executor_libdispatch.h

Firestore/core/src/util/executor_libdispatch.mm

var-const · 2020-05-15T21:51:06Z

Firestore/core/src/util/executor_libdispatch.mm

+      should_release = true;
+      async_tasks_.erase(found);
+
+      if (!task->is_immediate()) {


Perhaps ExecutorLibdispatch should start numbering at one or Task should default to -1? No strong feelings, though.

var-const · 2020-05-15T21:55:12Z

Firestore/core/src/util/executor_libdispatch.mm

-  // atomic. `RunSynchronized` may result in a deadlock.
-  for (const auto& entry : schedule_) {
-    entry.second->MarkDone();
+    // Release this method's ownership.


Does this comment need to be changed? The destructor didn't call Retain, so it would seem it releases the class's, not the method's ownership.

There's a comment above the swap into local_tasks where it describes how swapping out of the member into the local will prevent OnCompletion from finding anything. That effectively transfers ownership to the method.

I've added a parenthetical to this comment:

Release this method's ownership (obtained when local_tasks swapped with tasks_).

var-const · 2020-05-15T21:59:50Z

Firestore/core/src/util/executor_libdispatch.mm

+  }
+
+  if (should_release) {
+    task->Release();


Question: why is Release called without the lock? Is this necessary for correctness, or is it an optimization? A short comment could be useful.

In the presence of calls between objects that each have their own locks, a general deadlock avoidance protocol is to avoid making calls to others while holding your own lock.

Release does not acquire the Task's lock and does not call back to the executor so I guess it's safe to call while holding the executor's lock, but I'm wary of abandoning the general rule on the basis of this knowledge. Task could change in such a way that it called back into executor, or even worse could change such that it blocked on some other thread's callback into the executor and then it would introduce the possibility of deadlock (the former case can be solved by using recursive locks, the latter cannot).

I've added a comment here stating:

Avoid calling potentially locking methods on the task while holding the executor's lock.

var-const · 2020-05-15T23:08:46Z

Firestore/core/src/util/executor_libdispatch.mm

-ExecutorLibdispatch::PopFromSchedule() {
-  absl::optional<Executor::TaggedOperation> result;
+bool ExecutorLibdispatch::IsIdScheduled(Id id) const {
+  Task* found_task = nullptr;


Looking at these functions, it seems like Schedule is a clearly "separatable" abstraction (similar to ExecutorStd) that could wrap the mutex and the two containers and potentially return TaskRefs. It might be overkill, though.

I thought about this too. For now I've left it as is just to avoid further iterating on this PR.

Firestore/core/src/util/task.cc

var-const · 2020-05-17T02:20:30Z

Firestore/core/src/util/executor_libdispatch.h

-  // Stores non-owned pointers to `TimeSlot`s.
-  // Invariant: if a `TimeSlot` is in `schedule_`, it's a valid pointer.
+
+  std::unordered_set<Task*> async_tasks_;


That's a good point. Do you think it's worth appending this to the existing comment? Rough sketch: (while under normal operation the number of tasks on the schedule is small and doesn't grow much, unusual usage of the SDK could result in thousands of tasks being added).

Now that the Executor makes strong guarantees that tasks submitted to it will either be completely run or not start at all once destruction starts (#5547), build a Dispose protocol on top of this to allow Firestore to shut itself down leaf first. Now that Firestore calls the Dispose chain down to the Executor, it's no longer possible for tasks to run while the instance is partially destructed nor can user callbacks execute after the instance has been destroyed. This is the other half of the iOS changes required to address firebase/quickstart-unity#638. There are a few follow-on tasks that this PR does not address: * Ownership of FirestoreClient and AsyncQueue are still shared though they no longer need to be. * The tests are still overly defensive about keeping different test cases isolated from each other. Much of this can be removed now that we're guaranteed to release resources by the time the test case shuts down (at least in GoogleTest; XCTest remains to be seen).

wilhuff added 17 commits May 6, 2020 18:14

Remove checks for non-const references

2ab887b

Mutable references are now allowed in the style guide.

Fix typo

0e7b490

Move DelayedOperation below Executor

2f5b8fd

Use a common Executor::Id type

7581567

Use a common Executor::TimePoint type

832791f

Make kNoTag common

5542371

Harmonize Executor::TryCancel between implementations

b510bad

Also: * Rename to just Cancel since it doesn't return a boolean indicating success or not. * Make Cancel private, since it's an implementation detail of `DelayedOperation`.

Convert ExecutorLibdispatch to use an explicit mutex

752a828

The RunSynchronized approach was fundamentally flawed when wrapped around an arbitrary dispatch queue supplied by a user. Users could supply a concurrent queue, which wouldn't provide any mutual exclusion guarantee.

Remove exposed Dispatch{Async,Sync} helpers

6aae53e

Normalize delay, tag, id, operation argument ordering.

0919154

Remove DelayedOperation's internal notion of cancellation

e57f647

Make executors for testing include the test name, if available.

5d094d0

Add Task, representing a task for an Executor to execute

65f0490

Add Schedule, an ordered list of Tasks to execute

a4f1272

Also: * Add tests Schedule of Tasks * Remove tests for template class Schedule

Move MakeTargetTime to task.h

31bd970

Migrate Executors to using Tasks

ceb3aa9

Shorten executor test wait times

f608cb0

wilhuff added the api: firestore label May 8, 2020

wilhuff assigned var-const May 8, 2020

googlebot added the cla: yes label May 8, 2020

wilhuff added 2 commits May 8, 2020 09:03

Fix TSAN issues

69e8876

* Fix data race between Execute and Cancel * Fix multiple execution completing on destroyed executors

Make tracing facility permanently available

3af0346

wilhuff force-pushed the wilhuff/deterministic-executor branch from ece98e5 to 3af0346 Compare May 8, 2020 16:06

wilhuff added 4 commits May 8, 2020 16:03

Add defer, a general facility for ad-hoc RAII.

bf30968

Merge branch 'wilhuff/defer' into wilhuff/deterministic-executor

1fd787a

Fix typo

faf1d51

Fix the IsScheduled problem in the general case

4ad7557

This avoids needing to pepper all the tests with additional hacks that avoid this through other means.

var-const suggested changes May 9, 2020

View reviewed changes

var-const assigned wilhuff and unassigned var-const May 9, 2020

More comments

619c039

wilhuff assigned var-const and unassigned wilhuff May 12, 2020

var-const approved these changes May 13, 2020

View reviewed changes

var-const assigned wilhuff and unassigned var-const May 13, 2020

wilhuff added 8 commits May 14, 2020 10:23

Add an assertion that Release is not called more than necessary

3edbaec

Merge branch 'master' into wilhuff/deterministic-executor

43eefae

Remove InverseLockGuard

3f52fc6

Remove excess forward declaration

ed7d38a

Clarify ownership commentary in methods that affect it.

8c2c9a6

Retain locals held outside the executor lock.

48e3aea

Local references that are used outside the executor lock can be deleted as the task executes concurrently. The locals need to be retained to keep them alive while interrogating/awaiting the task.

Avoid deadlocking if the task cancels itself

d0ee6f4

Correctly build on Apple platforms even if not using libdispatch

3e404a8

wilhuff commented May 14, 2020

View reviewed changes

wilhuff added 4 commits May 14, 2020 16:21

Fix missing lock in Schedule::Clear

34d7876

style.sh generated changes

b341c17

Fix definition of HAVE_LIBDISPATCH under CocoaPods

786f29a

Include config.h

4f6c82e

wilhuff mentioned this pull request May 15, 2020

Implement safe, synchronous destruction in Firestore #5620

Merged

var-const approved these changes May 15, 2020

View reviewed changes

var-const reviewed May 15, 2020

View reviewed changes

More comments

f534aaa

var-const approved these changes May 17, 2020

View reviewed changes

wilhuff merged commit e462d04 into master May 18, 2020

wilhuff deleted the wilhuff/deterministic-executor branch May 18, 2020 16:06

wilhuff mentioned this pull request May 18, 2020

Force synchronous AsyncQueue shutdown #4912

Closed

firebase locked and limited conversation to collaborators Jun 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Executors have deterministic shutdown #5547

Make Executors have deterministic shutdown #5547

wilhuff commented May 8, 2020 •

edited

Loading

var-const left a comment

var-const May 8, 2020

wilhuff May 9, 2020

var-const May 11, 2020

wilhuff May 12, 2020

var-const May 8, 2020

wilhuff May 9, 2020

var-const May 13, 2020

wilhuff May 14, 2020

var-const May 13, 2020

wilhuff May 14, 2020

wilhuff left a comment

wilhuff May 14, 2020

wilhuff May 14, 2020

google-oss-bot commented May 15, 2020

var-const May 15, 2020

wilhuff May 16, 2020

var-const May 15, 2020

var-const May 15, 2020

wilhuff May 16, 2020

var-const May 15, 2020

var-const May 15, 2020

wilhuff May 16, 2020

var-const May 15, 2020

wilhuff May 16, 2020

var-const May 15, 2020

wilhuff May 16, 2020

var-const May 17, 2020

Make Executors have deterministic shutdown #5547

Make Executors have deterministic shutdown #5547

Conversation

wilhuff commented May 8, 2020 • edited Loading

var-const left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wilhuff left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

google-oss-bot commented May 15, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wilhuff commented May 8, 2020 •

edited

Loading