Add backoff to watcher reconnection #577

Closed · nightkr opened this issue Jul 4, 2021 · 11 comments · Fixed by #703
Assignees: clux
Labels: errors (error handling or error ergonomics) · runtime (controller runtime related)

nightkr (Member) commented Jul 4, 2021

Currently, watcher will always keep a reconnection attempt in-flight when polled (if there is no active connection already). That's a bit excessive, and pretty unkind to the apiserver.

This isn't too bad for watcher itself since you can add a delay on the consumer side instead, but we should at least add something for Controller...
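
For reference, a minimal sketch of that consumer-side workaround - assuming a kube version of this era where watcher takes an Api plus ListParams (with the runtime feature enabled); the fixed 5s sleep is arbitrary:

```rust
use std::time::Duration;

use futures::{pin_mut, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, runtime::watcher, Api, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    let stream = watcher(pods, ListParams::default());
    pin_mut!(stream);
    // Consumer-side workaround: sleep after each error so that reconnect
    // attempts are not fired back-to-back against the apiserver.
    while let Some(item) = stream.next().await {
        match item {
            Ok(event) => println!("{:?}", event),
            Err(err) => {
                eprintln!("watch error: {}", err);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
    Ok(())
}
```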

@nightkr added the runtime (controller runtime related) label Jul 4, 2021
@clux moved this to Defining in Kube Roadmap Nov 3, 2021
clux (Member) commented Nov 3, 2021

Pasting comments here for people off Discord:

pavelp: Out of pure curiosity, I noticed kube calls the K8s API with high frequency when it's not available (e.g. local cluster down). Is this the usual frequency exercised by kube-rs to call REST API watches? Is there any recommended rate? I was fiddling with the K8s API this weekend and did some throttling - I only ask twice a second maximum.
https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

teozkr: that said.. there are still some missing pieces (for example: exponential backoff, backoff for kube watchers, and so on)

Do we think the solution here is to try to bake an exponential backoff into watcher, or into the step_trampolined fn (the inner state machinery of watcher)?

(btw, one way I have gotten around this in the controller-rs example is to try to get the CRD first before trying to watch it - as that causes the same error - but that only works for the case where the controller's CRD was never installed)
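
For illustration, a sketch of that check-the-CRD-first workaround, assuming the apiextensions CustomResourceDefinition type from k8s-openapi; the CRD name here is hypothetical:

```rust
use k8s_openapi::apiextensions_apiserver::pkg::apis::apiextensions::v1::CustomResourceDefinition;
use kube::{Api, Client};

/// Fail fast with a clear error if the CRD is not installed, instead of
/// letting the watcher spin in a reconnect loop against a missing kind.
async fn ensure_crd_installed(client: Client) -> Result<(), kube::Error> {
    let crds: Api<CustomResourceDefinition> = Api::all(client);
    // Hypothetical CRD name - substitute the controller's own CRD here.
    crds.get("foos.example.com").await?;
    Ok(())
}
```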

nightkr (Member, Author) commented Nov 3, 2021

I think it would make sense to just have watcher (and applier/Controller in turn) take an arbitrary Backoff from the user.
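
To make "an arbitrary Backoff" concrete: that would be anything implementing the backoff crate's Backoff trait. A minimal, purely illustrative fixed-delay policy a user could hand in:

```rust
use std::time::Duration;

use backoff::backoff::Backoff;

// A trivial fixed-delay, never-give-up policy, as an example of what a
// caller could supply to watcher/Controller once they accept a Backoff.
struct FixedDelay(Duration);

impl Backoff for FixedDelay {
    fn next_backoff(&mut self) -> Option<Duration> {
        Some(self.0) // always wait the same amount, never give up
    }
    fn reset(&mut self) {}
}

fn main() {
    let mut policy = FixedDelay(Duration::from_secs(2));
    assert_eq!(policy.next_backoff(), Some(Duration::from_secs(2)));
}
```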

clux (Member) commented Nov 3, 2021

If that's the case, then I think we should consider creating builders for watcher to pass it in (#276); it's already a bit strange to have to disambiguate the import paths runtime::watcher (module) vs runtime::watcher() (fn) or runtime::watcher::watcher (fn).

clux (Member) commented Nov 3, 2021

Rewording a bit: the watcher import problem is a separate issue, and could be fixed by renaming the watcher module to watch.

I do still think there's room for a Watcher<K>. There are a couple of other naming problems here that I would like to improve separately (the module name, plus flatten helpers/naming - maybe something akin to #296). I'll open an issue separately.

nightkr (Member, Author) commented Nov 3, 2021

To be honest, I don't really see the problem with the import; Rust has been smart enough to do the right thing in all the cases I've tried.

clux (Member) commented Nov 3, 2021

Rustdoc needs explicit disambiguation when linking to either the module or the fn.
And although that's more of a problem for us, people still have to do a double-take here on the name being re-used in imports.

Generally, I just think we should organise it in the way that is less confusing:

-use runtime::{reflector, reflector::Store};
+use runtime::cache::{reflector, Store};

Relatively minor point, though.

nightkr (Member, Author) commented Nov 3, 2021

That makes sense to me.

@clux added the errors (error handling or error ergonomics) label Nov 3, 2021
@clux moved this from Defining to Backlog in Kube Roadmap Nov 3, 2021
@clux added and then removed the errors (error handling or error ergonomics) label Nov 3, 2021
clux added a commit that referenced this issue Nov 7, 2021
moves the imports of kube_runtime reflector around a bit first as per
comments in #577

then implements a slightly more ergonomic builder struct `Cache<K>` that
wraps `reflector`. This encapsulates the writer and result unpacking
within the struct, and creates 3 ways to consume it:

- `Cache::run(self)` - if wanting to run forever
- `Cache::applies(self)` - also exposes flattened applies stream
- `Cache::touches(self)` - also exposes flattened touches stream

Signed-off-by: clux <sszynrae@gmail.com>
clux (Member) commented Nov 8, 2021

Coming back to this after thinking more about it in #698

If we put a backoff into watcher, propagated it into step, and slowed down progression on errors, then this would fix this particular issue (being kind to the apiserver when it's having trouble). But there are other issues here that are hard errors, which cannot be fixed by retrying:

if RBAC is wrong, then no amount of retrying should occur - we need to bubble this up from the watcher.

I can't think of any other non-ephemeral issues that can cause a watcher retry loop to happen, but if there are we should also catch them.

So, what we should probably try to do:

  • catch the RBAC permission issue (that will cause a retry loop within watcher)
  • have watcher take a mandatory backoff that controls non-critical error retries
  • create a builder for watcher that creates a sensible default backoff (builders for kube-runtime reflector #276)
  • document the error cases that are actually bubbled up from watcher to the user

The last point is here because most users think watcher/reflector never actually fails and just use ? without thinking (and I've personally never seen it crash the app). So if we suddenly add a common error case here, we should document it.
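
For context, this is roughly the shape it ended up taking: a backoff applied as a stream wrapper around watcher (the StreamBackoff from #720/#703). A sketch under the assumption of a kube-runtime version that ships it - module paths vary between releases, so treat them as illustrative:

```rust
use backoff::ExponentialBackoff;
use futures::{pin_mut, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::ListParams,
    runtime::{utils::StreamBackoff, watcher},
    Api, Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // Errors delay the next reconnect attempt, but are still yielded to the
    // consumer so that misconfiguration (RBAC, missing CRD, ...) stays visible.
    let stream = StreamBackoff::new(watcher(pods, ListParams::default()), ExponentialBackoff::default());
    pin_mut!(stream);
    while let Some(item) = stream.next().await {
        match item {
            Ok(event) => println!("{:?}", event),
            Err(err) => eprintln!("watch error (backing off before retrying): {}", err),
        }
    }
    Ok(())
}
```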

nightkr (Member, Author) commented Nov 8, 2021

> I can't think of any other non-ephemeral issues that can cause a watcher retry loop to happen, but if there are we should also catch them.

I'm not so sure that we can make a hard distinction here. For example:

  • RBAC errors may be caused by misconfiguration (permanent) or a redeploy that reorganizes RBAC roles but ultimately gives the same permissions in the end (ephemeral)
  • 404 errors may be caused by a K8s version mismatch or missing CRD (permanent) or a new CRD that hasn't been picked up by the API server yet (ephemeral)
  • Connection errors may be caused by firewall misconfiguration (permanent) or the load balancer not having picked up on the new cluster state yet (ephemeral)

And so on, and so forth. Ultimately, we need to percolate the problem up to the administrator and let them handle it.

And even if it is a human-caused misconfiguration error, it'd still be useful to retry since we probably won't get notified once the issue is resolved.

clux (Member) commented Nov 9, 2021

True. These configuration errors can happen, and it's nice if we can just keep retrying through them. But it's also important for us to provide that first signal that something has gone wrong upstream (like in the load balancer policy example - if we silently retry on that, then we'll never get correct percolation).

I checked watcher with some custom RBAC limiting my watch and list permissions, to verify this, and the following errors are not retried and are bubbled up to the stream consumer:

  • ErrorResponse 403 (missing/insufficient rbac)
  • ErrorResponse 404 (missing/invalid api kind)

So these would be up to users to handle atm.

Pre-backoff, I think this is good. If the user wants to be resilient towards upstream configuration errors (rather than crash and slowly go into CrashLoopBackOff), then they should be explicit about that and percolate signals in other ways (most clusters have notification systems for when pods go into crashloop, after all, so it's a pretty safe default).

But this means the RBAC retry we get from configmapgen_controller (if we delete the configmapgenerators.nullable.se CRD) is because the controller discards the watcher's errors. So we are currently inconsistent in our strategy.

If we inject backoffs into watcher, then it probably also needs some sort of default threshold for how long the watcher (and ultimately the Controller using it) is allowed to exist in a non-functioning state, because otherwise we run the risk of the default behaviour not percolating signals upstream at all.
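
One way to express such a threshold with the backoff crate (values purely illustrative): a bounded max_elapsed_time makes the policy give up after a period of continuous failure, so the process can exit and the problem percolates upstream via CrashLoopBackOff/alerting:

```rust
use std::time::Duration;

use backoff::ExponentialBackoff;

fn main() {
    // Retry with growing delays, but give up (next_backoff starts returning
    // None) after ten minutes of continuous failure.
    let bounded = ExponentialBackoff {
        initial_interval: Duration::from_millis(800),
        max_interval: Duration::from_secs(30),
        max_elapsed_time: Some(Duration::from_secs(600)),
        ..ExponentialBackoff::default()
    };
    let _ = bounded; // would be handed to watcher / Controller
}
```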

clux added a commit that referenced this issue Nov 11, 2021
Signed-off-by: clux <sszynrae@gmail.com>
@clux self-assigned this Nov 11, 2021
@clux moved this from Backlog to In Progress in Kube Roadmap Nov 11, 2021
@clux linked a pull request Nov 11, 2021 that will close this issue
clux added a commit that referenced this issue Nov 16, 2021
Signed-off-by: clux <sszynrae@gmail.com>
@clux closed this as completed in #703 Dec 21, 2021
clux added a commit that referenced this issue Dec 21, 2021
* implement backoff for watcher - for #577
* move magic number into strategy
* expose backoff from watcher and semi-propagate into controller (awkward. will write a comment)
* potential abstraction
* another builder layer; allow eliding ListParams
* forgot to add file
* easy parts of code review
* rewrite as a helper (take N) (jesus this stuff is hard.)
* rename as suggested
* Reimplement watcher backoff as FSM (#720)
* Fix clippy warnings
* Reimplement watch backoff as FSM
* Remove useless lifetime bounds
* Silence clippy size warning
* Silence clippy properly this time around
* Split StreamBackoff into a separate utils module
* Backoff tests
* Add stream close test
* remove backoff pin, fix docs
* newline
* Add `Backoff` wrapper that implements client-go's reset timer behaviour (#729)
* use new reset backoff and replicate client-go reflector values
* fix node watcher example
* Use released `backoff`
* Factor out default `Backoff`
* Add note to `watcher` about backoff
* Added backoff to Controller
* Changelog
* Revert `Observer` for now
* The clippyman comes for us all, eventually (And we must all pay our due respects, or pay the price.)
* Fix build warnings
* remove backoff_watch
* doc tweaks
* sentence
* upgrading backoff is not actually breaking

Signed-off-by: clux <sszynrae@gmail.com>
Co-authored-by: Teo Klestrup Röijezon <teo.roijezon@stackable.de>
Co-authored-by: Teo Klestrup Röijezon <teo@nullable.se>
Repository owner moved this from In Progress to Done in Kube Roadmap Dec 21, 2021