Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. #13103

Merged
merged 17 commits into from
Oct 20, 2020

Conversation

antoniovicente
Copy link
Contributor

Commit Message:
watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog.
Additional Description:
Use an atomic counter to track if the watchdog was touched since the last time the guarddog checked. This avoids expensive calls to get the current monotonic time every time the watchdog is touched, and thus reduces the overhead incurred by touching the watchdog after every event loop callback.

This PR also includes pre-requisite fixes to the Guarddog's test interlock mechanism so it operates correctly in cases where simulated monotonic time does not move forward between calls to GuardDogImpl::forceCheckForTest()
Risk Level: low, slight change to how watchdog timeouts are computed.
Testing: existing tests.
Docs Changes: n/a
Release Notes: n/a
Issue #11391

Interlocking based on time can result in confusion when using simulated time and without calls to advanceAndWait between calls to GuarddogImpl::forceCheckForTest.

Signed-off-by: Antonio Vicente <avd@google.com>
…t petting of the watchdog.

Signed-off-by: Antonio Vicente <avd@google.com>
Copy link
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Thanks for doing this.

: dog_(watch_dog),
last_touch_count_(dog_->touchCount()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you should set this to 0 (the last touch count)

otherwise we'll fall through if (watched_dog->last_touch_count_ != touch_count) on the first iteration of the new watchdog in step. We'll end up using const auto ltt = watched_dog->last_touch_time_; without it being explicitly set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice line 179-180 above, where the WatchedDog is created and touched in the following line to guarantee that the first guarddog step after creation goes down the "watched_dog->last_touch_count_ != touch_count" branch.

}

private:
const Thread::ThreadId thread_id_;
TimeSource& time_source_;
std::atomic<std::chrono::steady_clock::duration> latest_touch_time_since_epoch_;
std::atomic<uint64_t> touch_count_ = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we support 32 bit ARM? Will this work there?
I sort of expect the std::atomic vs absl::Mutex discussion here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide more context? I assume ARM supports atomic ops. Are they particularly expensive in ARM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be acceptable to make this a 32 bit type, wrap around behavior would be acceptable since we're just checking that the numbers are different. Another implementation option would be to have the guarddog reset the counter to 0 when stepping through which should prevent overflow.

run format

Signed-off-by: Antonio Vicente <avd@google.com>
@yanavlasov yanavlasov self-assigned this Sep 15, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
}
// Accessors for use by GuardDogImpl to tell if the watchdog was touched recently and reset the
// count back to 0 when it has.
uint64_t touchCount() const { return touch_count_.load(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're going with uint32_t internally, perhaps we should return that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed.

if (watched_dog->dog_->touchCount() > 0) {
// Watchdog was touched since guarddog last checked; update time and reset the count.
watched_dog->dog_->resetTouchCount();
watched_dog->last_touch_time_ = now;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we'll lose some accuracy since the last touch time is set here instead of when the watchdog is actually touched. Is this ok?

Should we then rename it to something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still a timestamp from which the most recent touch was detected. I agree that there's a change in precision; watchdog miss/megamiss/kill/multikill would be delayed by up to min(miss_timeout / 2, megamiss_timeout / 2, kill_timeout / 2, multikill_timeout / 2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to last_checkin_ for consistency after your recent change

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
KBaichoo
KBaichoo previously approved these changes Sep 16, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
yanavlasov
yanavlasov previously approved these changes Sep 17, 2020
@antoniovicente
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13103 (comment) was created by @antoniovicente.

see: more, trace.

@mattklein123
Copy link
Member

Can you merge main and see if that fixes CI? Not sure what is broken here.

/wait

Signed-off-by: Antonio Vicente <avd@google.com>
mattklein123
mattklein123 previously approved these changes Sep 21, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few optional drive by comments.

Comment on lines +87 to +89
Thread::LockGuard guard(mutex_);
if (!run_thread_) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why this lock now needs to be held for the entire function. Can you possibly add some more comments? Should there by GUARDED_BY annotations to make this more clear? Maybe this is just to simplify the logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. This is part of the fix to test interlock; we need a full run of the step function after forceCheckForTest. Performance shouldn't suffer since the use of mutex_ in the step() function should never be contended; the other uses of this mutex are GuardDogImpl::start, stop and forceCheckForTest.

Comment on lines 106 to 108
if (watched_dog->dog_->touchCount() > 0) {
// Watchdog was touched since the guard dog last checked; update time and reset the count.
watched_dog->dog_->resetTouchCount();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf nit: I think unless you do a relaxed load it's probably faster to just always do a reset and return the previous count, and then branch on that. Up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the watchdog API so it provides a single method that returns if the dog was touched recently and reset touch state. I also changed the type to bool since it's more appropriate and switched to a looser memory order for performance.

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
KBaichoo
KBaichoo previously approved these changes Sep 22, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with small comment but needs format fix.

/wait

void touch() override {
// Set touched_ if not already set.
bool expected = false;
touched_.compare_exchange_strong(expected, true, std::memory_order_release,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an atomics expert, but I'm pretty sure you can use store with memory_order_relaxed on this one, and use memory_order_relaxed on the exchange above.

The whole thing is eventually consistent so I'm not sure any ordering matters and that would probably be faster?

The only thing I'm not sure of is whether this would effect test determinism, however, if you are using locks to serialize the tests that should force consistency of all previous operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no expert either. After thinking about it some more, I think you're totally right - relaxed memory order is sufficient here. Switched to that.

Fix spelling

Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Copy link
Contributor Author

LGTM with small comment but needs format fix.

/wait

Oops, spellcheck issue. Need to remember to run that before pushing changes.

Signed-off-by: Antonio Vicente <avd@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry one more comment otherwise LGTM.

/wait


// Server::WatchDog
void startWatchdog(Event::Dispatcher& dispatcher) override;
void touch() override {
// Set touched_ if not already set.
bool expected = false;
touched_.compare_exchange_strong(expected, true, std::memory_order_release,
std::memory_order_acquire);
touched_.compare_exchange_strong(expected, true, std::memory_order_relaxed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need the CAS here. I think you can just do a relaxed store. It doesn't matter if it's already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I'm using CAS to avoid write in cases where touched_ is already true, since atomic reads are cheaper than atomic writes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK sounds good.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@stale
Copy link

stale bot commented Oct 4, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 4, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Oct 12, 2020
@mattklein123
Copy link
Member

Merge main once #13598 merges. Thanks!

/wait

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Copy link
Contributor Author

Merged master, I see an old approval from you. Let me know if you have any newer comments or I should just merge in the PR.

@mattklein123
Copy link
Member

Go ahead and merge!

@antoniovicente antoniovicente merged commit 4797e7a into envoyproxy:master Oct 20, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 21, 2020
* master: (22 commits)
  ci: various improvements (envoyproxy#13660)
  dns: fix defunct fd bug in apple resolver (envoyproxy#13641)
  build: support ppc64le with wasm (envoyproxy#13657)
  [fuzz] Added random load balancer fuzz (envoyproxy#13400)
  dependencies: compute and check release dates via GitHub API. (envoyproxy#13582)
  mac ci: try ignoring update failure (envoyproxy#13658)
  watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103)
  typos: fix a couple 'enovy' mispellings (envoyproxy#13645)
  lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536)
  tap: fix upstream streamed transport socket taps (envoyproxy#13638)
  Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639)
  Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523)
  [fuzz] Fixed divide by zero bug (envoyproxy#13545)
  wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621)
  fix: record recovered local address (envoyproxy#13581)
  docs: fix incorrect compressor filter doc (envoyproxy#13611)
  docs: clean up docs for azp migration (envoyproxy#13558)
  wasm: fix building Wasm example. (envoyproxy#13619)
  test: Refactor flood tests into a separate test file (envoyproxy#13556)
  wasm: re-enable tests with precompiled modules. (envoyproxy#13583)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants