Add backoff to watcher reconnection #577

Closed · nightkr opened this issue Jul 4, 2021 · 11 comments · Fixed by #703
Assignees: clux
Labels: errors (error handling or error ergonomics) · runtime (controller runtime related)

nightkr (Member) commented Jul 4, 2021

Currently, watcher will always keep a reconnection attempt in-flight when polled (if there is no active connection already). That's a bit excessive, and pretty unkind to the apiserver.

This isn't too bad for watcher itself since you can add a delay on the consumer side instead, but we should at least add something for Controller...
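
For reference, a minimal sketch of that consumer-side workaround - assuming a kube version of this era where watcher takes an Api plus ListParams (with the runtime feature enabled); the fixed 5s sleep is arbitrary:

```rust
use std::time::Duration;

use futures::{pin_mut, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, runtime::watcher, Api, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    let stream = watcher(pods, ListParams::default());
    pin_mut!(stream);
    // Consumer-side workaround: sleep after each error so that reconnect
    // attempts are not fired back-to-back against the apiserver.
    while let Some(item) = stream.next().await {
        match item {
            Ok(event) => println!("{:?}", event),
            Err(err) => {
                eprintln!("watch error: {}", err);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
    Ok(())
}
```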

@nightkr added the runtime (controller runtime related) label Jul 4, 2021
@clux moved this to Defining in Kube Roadmap Nov 3, 2021
clux (Member) commented Nov 3, 2021

Pasting comments here for people off Discord:

pavelp: Out of pure curiosity, I noticed kube calls the K8s API with high frequency when it's not available (e.g. local cluster down). Is this the usual frequency exercised by kube-rs to call REST API watches? Is there any recommended rate? I was fiddling with the K8s API this weekend and did some throttling - I only ask twice a second maximum.
https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

teozkr: that said.. there are still some missing pieces (for example: exponential backoff, backoff for kube watchers, and so on)

Do we think the solution here is to try to bake an exponential backoff into watcher, or into the step_trampolined fn (the inner state machinery of watcher)?

(btw, one way I have gotten around this in the controller-rs example is to try to get the CRD first before trying to watch it - as that causes the same error - but that only works for the case where the controller's CRD was never installed)
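
For illustration, a sketch of that check-the-CRD-first workaround, assuming the apiextensions CustomResourceDefinition type from k8s-openapi; the CRD name here is hypothetical:

```rust
use k8s_openapi::apiextensions_apiserver::pkg::apis::apiextensions::v1::CustomResourceDefinition;
use kube::{Api, Client};

/// Fail fast with a clear error if the CRD is not installed, instead of
/// letting the watcher spin in a reconnect loop against a missing kind.
async fn ensure_crd_installed(client: Client) -> Result<(), kube::Error> {
    let crds: Api<CustomResourceDefinition> = Api::all(client);
    // Hypothetical CRD name - substitute the controller's own CRD here.
    crds.get("foos.example.com").await?;
    Ok(())
}
```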

nightkr (Member, Author) commented Nov 3, 2021

I think it would make sense to just have watcher (and applier/Controller in turn) take an arbitrary Backoff from the user.
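
To make "an arbitrary Backoff" concrete: that would be anything implementing the backoff crate's Backoff trait. A minimal, purely illustrative fixed-delay policy a user could hand in:

```rust
use std::time::Duration;

use backoff::backoff::Backoff;

// A trivial fixed-delay, never-give-up policy, as an example of what a
// caller could supply to watcher/Controller once they accept a Backoff.
struct FixedDelay(Duration);

impl Backoff for FixedDelay {
    fn next_backoff(&mut self) -> Option<Duration> {
        Some(self.0) // always wait the same amount, never give up
    }
    fn reset(&mut self) {}
}

fn main() {
    let mut policy = FixedDelay(Duration::from_secs(2));
    assert_eq!(policy.next_backoff(), Some(Duration::from_secs(2)));
}
```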

clux (Member) commented Nov 3, 2021

If that's the case, then I think we should consider creating builders for watcher to pass it in (#276); it's already a bit strange to have to disambiguate the import paths runtime::watcher (module) vs runtime::watcher() (fn) or runtime::watcher::watcher (fn).

clux (Member) commented Nov 3, 2021

Rewording a bit: the watcher import problem is a separate issue, and could be fixed by renaming the watcher module to watch.

I do still think there's room for a Watcher<K>. There are a couple of other naming problems here that I would like to improve separately (the module name, plus flatten helpers/naming - maybe something akin to #296). I'll open an issue separately.

nightkr (Member, Author) commented Nov 3, 2021

To be honest, I don't really see the problem with the import; Rust has been smart enough to do the right thing in all the cases I've tried.

clux (Member) commented Nov 3, 2021

Rustdoc needs explicit disambiguation when linking to either the module or the fn.
And although that's more of a problem for us, people still have to do a double-take here on the name being re-used in imports.

Generally, I just think we should organise it in the way that is less confusing:

-use runtime::{reflector, reflector::Store};
+use runtime::cache::{reflector, Store};

Relatively minor point, though.

nightkr (Member, Author) commented Nov 3, 2021

That makes sense to me.

@clux added the errors (error handling or error ergonomics) label Nov 3, 2021
@clux moved this from Defining to Backlog in Kube Roadmap Nov 3, 2021
@clux added and then removed the errors (error handling or error ergonomics) label Nov 3, 2021
clux added a commit that referenced this issue Nov 7, 2021
moves the imports of kube_runtime reflector around a bit first as per
comments in #577

then implements a slightly more ergonomic builder struct `Cache<K>` that
wraps `reflector`. This encapsulates the writer and result unpacking
within the struct, and creates 3 ways to consume it:

- `Cache::run(self)` - if wanting to run forever
- `Cache::applies(self)` - also exposes flattened applies stream
- `Cache::touches(self)` - also exposes flattened touches stream

Signed-off-by: clux <sszynrae@gmail.com>
clux (Member) commented Nov 8, 2021

Coming back to this after thinking more about it in #698

If we put a backoff into watcher, propagated it into step, and slowed down progression on errors, then this would fix this particular issue (being kind to the apiserver when it's having trouble). But there are other issues here that are hard errors, which cannot be fixed by retrying:

if RBAC is wrong, then no amount of retrying should occur - we need to bubble this up from the watcher.

I can't think of any other non-ephemeral issues that can cause a watcher retry loop to happen, but if there are we should also catch them.

So, what we should probably try to do:

  • catch the RBAC permission issue (that will cause a retry loop within watcher)
  • have watcher take a mandatory backoff that controls non-critical error retries
  • create a builder for watcher that creates a sensible default backoff (builders for kube-runtime reflector #276)
  • document the error cases that are actually bubbled up from watcher to the user

The last point is here because most users think watcher/reflector never actually fails and just use ? without thinking (and I've personally never seen it crash the app). So if we suddenly add a common error case here, we should document it.
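
For context, this is roughly the shape it ended up taking: a backoff applied as a stream wrapper around watcher (the StreamBackoff from #720/#703). A sketch under the assumption of a kube-runtime version that ships it - module paths vary between releases, so treat them as illustrative:

```rust
use backoff::ExponentialBackoff;
use futures::{pin_mut, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::ListParams,
    runtime::{utils::StreamBackoff, watcher},
    Api, Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // Errors delay the next reconnect attempt, but are still yielded to the
    // consumer so that misconfiguration (RBAC, missing CRD, ...) stays visible.
    let stream = StreamBackoff::new(watcher(pods, ListParams::default()), ExponentialBackoff::default());
    pin_mut!(stream);
    while let Some(item) = stream.next().await {
        match item {
            Ok(event) => println!("{:?}", event),
            Err(err) => eprintln!("watch error (backing off before retrying): {}", err),
        }
    }
    Ok(())
}
```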

nightkr (Member, Author) commented Nov 8, 2021

> I can't think of any other non-ephemeral issues that can cause a watcher retry loop to happen, but if there are we should also catch them.

I'm not so sure that we can make a hard distinction here. For example:

  • RBAC errors may be caused by misconfiguration (permanent) or a redeploy that reorganizes RBAC roles but ultimately gives the same permissions in the end (ephemeral)
  • 404 errors may be caused by a K8s version mismatch or missing CRD (permanent) or a new CRD that hasn't been picked up by the API server yet (ephemeral)
  • Connection errors may be caused by firewall misconfiguration (permanent) or the load balancer not having picked up on the new cluster state yet (ephemeral)

And so on, and so forth. Ultimately, we need to percolate the problem up to the administrator and let them handle it.

And even if it is a human-caused misconfiguration error, it'd still be useful to retry since we probably won't get notified once the issue is resolved.

clux (Member) commented Nov 9, 2021

True. These configuration errors can happen, and it's nice if we can just keep retrying through them. But it's also important for us to provide that first signal that something has gone wrong upstream (like in the load balancer policy example - if we silently retry on that, then we'll never get correct percolation).

I checked watcher with some custom RBAC limiting my watch and list permissions, to verify this, and the following errors are not retried and are bubbled up to the stream consumer:

  • ErrorResponse 403 (missing/insufficient rbac)
  • ErrorResponse 404 (missing/invalid api kind)

So these would be up to users to handle atm.

Pre-backoff, I think this is good. If the user wants to be resilient towards upstream configuration errors (rather than crash and slowly go into CrashLoopBackOff), then they should be explicit about that and percolate signals in other ways (most clusters have notification systems for when pods go into crashloop, after all, so it's a pretty safe default).

But this means the RBAC retry we get from configmapgen_controller (if we delete the configmapgenerators.nullable.se CRD) is because the controller discards the watcher's errors. So we are currently inconsistent in our strategy.

If we inject backoffs into watcher, then it probably also needs some sort of default threshold for how long the watcher (and ultimately the Controller using it) is allowed to exist in a non-functioning state, because otherwise we run the risk of the default behaviour not percolating signals upstream at all.
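
One way to express such a threshold with the backoff crate (values purely illustrative): a bounded max_elapsed_time makes the policy give up after a period of continuous failure, so the process can exit and the problem percolates upstream via CrashLoopBackOff/alerting:

```rust
use std::time::Duration;

use backoff::ExponentialBackoff;

fn main() {
    // Retry with growing delays, but give up (next_backoff starts returning
    // None) after ten minutes of continuous failure.
    let bounded = ExponentialBackoff {
        initial_interval: Duration::from_millis(800),
        max_interval: Duration::from_secs(30),
        max_elapsed_time: Some(Duration::from_secs(600)),
        ..ExponentialBackoff::default()
    };
    let _ = bounded; // would be handed to watcher / Controller
}
```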

clux added a commit that referenced this issue Nov 11, 2021
Signed-off-by: clux <sszynrae@gmail.com>
@clux self-assigned this Nov 11, 2021
@clux moved this from Backlog to In Progress in Kube Roadmap Nov 11, 2021
@clux linked a pull request Nov 11, 2021 that will close this issue
clux added a commit that referenced this issue Nov 16, 2021
Signed-off-by: clux <sszynrae@gmail.com>
@clux closed this as completed in #703 Dec 21, 2021
clux added a commit that referenced this issue Dec 21, 2021
* implement backoff for watcher - for #577
* move magic number into strategy
* expose backoff from watcher and semi-propagate into controller (awkward. will write a comment)
* potential abstraction
* another builder layer; allow eliding ListParams
* forgot to add file
* easy parts of code review
* rewrite as a helper (take N) (jesus this stuff is hard.)
* rename as suggested
* Reimplement watcher backoff as FSM (#720)
* Fix clippy warnings
* Reimplement watch backoff as FSM
* Remove useless lifetime bounds
* Silence clippy size warning
* Silence clippy properly this time around
* Split StreamBackoff into a separate utils module
* Backoff tests
* Add stream close test
* remove backoff pin, fix docs
* newline
* Add `Backoff` wrapper that implements client-go's reset timer behaviour (#729)
* use new reset backoff and replicate client-go reflector values
* fix node watcher example
* Use released `backoff`
* Factor out default `Backoff`
* Add note to `watcher` about backoff
* Added backoff to Controller
* Changelog
* Revert `Observer` for now
* The clippyman comes for us all, eventually (And we must all pay our due respects, or pay the price.)
* Fix build warnings
* remove backoff_watch
* doc tweaks
* sentence
* upgrading backoff is not actually breaking

Signed-off-by: clux <sszynrae@gmail.com>
Co-authored-by: Teo Klestrup Röijezon <teo.roijezon@stackable.de>
Co-authored-by: Teo Klestrup Röijezon <teo@nullable.se>
Repository owner moved this from In Progress to Done in Kube Roadmap Dec 21, 2021