
Agent crashes when API server is not available #222

Closed
jiria opened this issue Jan 25, 2021 · 6 comments · Fixed by #568
Labels: bug (Something isn't working), keep-alive

Comments

jiria (Contributor) commented Jan 25, 2021

Agent crashes when API server is not available.

Here is an example:

akri.sh Agent start
akri.sh KUBERNETES_PORT found ... env_logger::init
[2021-01-23T05:40:51Z TRACE agent] akri.sh KUBERNETES_PORT found ... env_logger::init finished
[2021-01-23T05:40:51Z INFO  akri_shared::akri::metrics] starting metrics server on port 8080 at /metrics
[2021-01-23T05:40:51Z INFO  agent::util::config_action] do_config_watch - enter
[2021-01-23T05:40:51Z TRACE akri_shared::k8s] Loading in-cluster config
[2021-01-23T05:40:51Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - start
[2021-01-23T05:40:51Z TRACE akri_shared::k8s] Loading in-cluster config
[2021-01-23T05:40:51Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration pre delay_for
[2021-01-23T05:40:51Z TRACE akri_shared::akri::configuration] get_configurations enter
[2021-01-23T05:40:51Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request::<KubeAkriInstanceList>(akri_config_type.list(...)?).await?
[2021-01-23T05:40:52Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request error: ReqwestError(reqwest::Error { kind: Request, url: Url { scheme: "https", host: Some(Ipv4(10.43.0.1)), port: None, path: "/apis/akri.sh/v0/configurations", query: Some(""), fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 113, kind: Other, message: "No route to host" })) })
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ReqwestError(reqwest::Error { kind: Request, url: Url { scheme: "https", host: Some(Ipv4(10.43.0.1)), port: None, path: "/apis/akri.sh/v0/configurations", query: Some(""), fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 113, kind: Other, message: "No route to host" })) })', agent/src/main.rs:68:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: JoinError::Panic(...)

Consider using the cached configuration instead of crashing.
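
For illustration, a rough sketch of that fallback, using hypothetical fetch_configurations and Configuration stand-ins rather than the agent's real types (only the pattern matters; tokio is assumed since the agent already runs on it):

use std::error::Error;

#[derive(Clone, Debug)]
struct Configuration {
    name: String,
}

// Stand-in for the kube client request that is currently `.unwrap()`ed in
// agent/src/main.rs; here it just simulates the "No route to host" failure.
async fn fetch_configurations() -> Result<Vec<Configuration>, Box<dyn Error>> {
    Err("tcp connect error: No route to host".into())
}

// On success, refresh the cache; on failure, log and serve the last known
// good copy instead of panicking.
async fn configurations_or_cached(cache: &mut Vec<Configuration>) -> Vec<Configuration> {
    match fetch_configurations().await {
        Ok(configs) => {
            *cache = configs.clone();
            configs
        }
        Err(e) => {
            eprintln!("get_configurations failed ({}); using cached configurations", e);
            cache.clone()
        }
    }
}

#[tokio::main]
async fn main() {
    let mut cache = vec![Configuration { name: "akri-config".into() }];
    let configs = configurations_or_cached(&mut cache).await;
    println!("acting on cached configurations: {:?}", configs);
}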

kate-goldenring added the bug (Something isn't working) and documentation (Improvements or additions to documentation) labels on Feb 2, 2021
jiayihu (Contributor) commented Feb 2, 2021

Ideally, the agent should stay alive for about 5 minutes before doing anything as extreme as crashing. Five minutes is, for instance, the standard eviction timeout for a node: https://kubernetes.io/docs/concepts/architecture/nodes/#condition

Another idea is to retry a fixed number of times, perhaps with exponential backoff. Krustlet does something similar when it needs to register itself as a new node with the apiserver: https://github.com/deislabs/krustlet/blob/master/crates/kubelet/src/node/mod.rs#L80
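
For illustration, a minimal retry-with-backoff sketch (list_configurations is a placeholder for the real apiserver request, and tokio::time::sleep from tokio 1.x is assumed):

use std::error::Error;
use std::time::Duration;

// Placeholder for the real kube client request.
async fn list_configurations() -> Result<(), Box<dyn Error>> {
    Err("No route to host".into())
}

// Retry a fixed number of times, doubling the delay each attempt and capping
// it, instead of crashing on the first failure.
async fn list_with_backoff(max_retries: u32) -> Result<(), Box<dyn Error>> {
    let mut delay = Duration::from_secs(1);
    for attempt in 0..=max_retries {
        match list_configurations().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_retries => {
                eprintln!("apiserver request failed (attempt {}): {}; retrying in {:?}", attempt, e, delay);
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(60));
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop always returns on the last attempt")
}

#[tokio::main]
async fn main() {
    if let Err(e) = list_with_backoff(5).await {
        eprintln!("giving up after retries: {}", e);
    }
}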

github-actions bot commented Sep 4, 2021

Issue has been automatically marked as stale due to 45 days of inactivity. Update the issue to remove the label; otherwise it will be automatically closed.

github-actions bot added the stale label on Sep 4, 2021
kate-goldenring removed the documentation (Improvements or additions to documentation) label on Sep 22, 2021
bfjelds added the keep-alive label and removed the stale label on Nov 2, 2021
adithyaj (Collaborator) commented Jan 3, 2023

@kate-goldenring did #374 resolve this issue? Should we close it?

kate-goldenring (Contributor) commented

@adithyaj I am not sure we ever tested whether it resolved it, but I think it is safe to close this and reopen it if someone still experiences it.

kate-goldenring (Contributor) commented

This issue is still not resolved according to #557

diconico07 (Contributor) commented

From what I understand of the kube-rs documentation:

"Errors from the underlying watch are propagated, after which the stream will go into recovery mode on the next poll."

So my guess is that this construct is at fault:

let watcher = watcher(resource, ListParams::default());
let mut informer = watcher.boxed();
...
while let Some(event) = informer.try_next().await? {
...

When doing this we exit the loop on any error, which probably ends in the panic; we should instead check for errors within the loop and handle them locally (maybe also with a backoff mechanism), roughly as sketched below.
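
Roughly what I have in mind, keeping the shape of the snippet above (exact watcher/Event names vary between kube-rs releases, so treat this as the shape of the fix rather than a drop-in patch):

use futures::StreamExt;
use std::time::Duration;

let watcher = watcher(resource, ListParams::default());
let mut informer = watcher.boxed();

// Use `next()` instead of `try_next()?` so an error no longer ends the loop.
while let Some(event) = informer.next().await {
    match event {
        Ok(event) => {
            // handle the watch event as before
        }
        Err(e) => {
            // Log, wait a bit, and let the watcher recover on the next poll
            // instead of propagating the error out (and eventually panicking).
            log::warn!("watch stream error: {}; will retry", e);
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    }
}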

I'll try to do a PR for this.
