
Agent crashes when API server is not available #222

Closed
jiria opened this issue Jan 25, 2021 · 6 comments · Fixed by #568
Labels: bug (Something isn't working), keep-alive

Comments

jiria (Contributor) commented Jan 25, 2021

Agent crashes when API server is not available.

Here is an example:

akri.sh Agent start
akri.sh KUBERNETES_PORT found ... env_logger::init
[2021-01-23T05:40:51Z TRACE agent] akri.sh KUBERNETES_PORT found ... env_logger::init finished
[2021-01-23T05:40:51Z INFO  akri_shared::akri::metrics] starting metrics server on port 8080 at /metrics
[2021-01-23T05:40:51Z INFO  agent::util::config_action] do_config_watch - enter
[2021-01-23T05:40:51Z TRACE akri_shared::k8s] Loading in-cluster config
[2021-01-23T05:40:51Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - start
[2021-01-23T05:40:51Z TRACE akri_shared::k8s] Loading in-cluster config
[2021-01-23T05:40:51Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration pre delay_for
[2021-01-23T05:40:51Z TRACE akri_shared::akri::configuration] get_configurations enter
[2021-01-23T05:40:51Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request::<KubeAkriInstanceList>(akri_config_type.list(...)?).await?
[2021-01-23T05:40:52Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request error: ReqwestError(reqwest::Error { kind: Request, url: Url { scheme: "https", host: Some(Ipv4(10.43.0.1)), port: None, path: "/apis/akri.sh/v0/configurations", query: Some(""), fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 113, kind: Other, message: "No route to host" })) })
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ReqwestError(reqwest::Error { kind: Request, url: Url { scheme: "https", host: Some(Ipv4(10.43.0.1)), port: None, path: "/apis/akri.sh/v0/configurations", query: Some(""), fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 113, kind: Other, message: "No route to host" })) })', agent/src/main.rs:68:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: JoinError::Panic(...)

Consider using the cached configuration instead of crashing.
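
For illustration, a rough sketch of that fallback, using hypothetical fetch_configurations and Configuration stand-ins rather than the agent's real types (only the pattern matters; tokio is assumed since the agent already runs on it):

use std::error::Error;

#[derive(Clone, Debug)]
struct Configuration {
    name: String,
}

// Stand-in for the kube client request that is currently `.unwrap()`ed in
// agent/src/main.rs; here it just simulates the "No route to host" failure.
async fn fetch_configurations() -> Result<Vec<Configuration>, Box<dyn Error>> {
    Err("tcp connect error: No route to host".into())
}

// On success, refresh the cache; on failure, log and serve the last known
// good copy instead of panicking.
async fn configurations_or_cached(cache: &mut Vec<Configuration>) -> Vec<Configuration> {
    match fetch_configurations().await {
        Ok(configs) => {
            *cache = configs.clone();
            configs
        }
        Err(e) => {
            eprintln!("get_configurations failed ({}); using cached configurations", e);
            cache.clone()
        }
    }
}

#[tokio::main]
async fn main() {
    let mut cache = vec![Configuration { name: "akri-config".into() }];
    let configs = configurations_or_cached(&mut cache).await;
    println!("acting on cached configurations: {:?}", configs);
}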

kate-goldenring added the bug (Something isn't working) and documentation (Improvements or additions to documentation) labels on Feb 2, 2021
jiayihu (Contributor) commented Feb 2, 2021

Ideally, the agent should stay alive for about 5 minutes before doing anything as extreme as crashing. Five minutes is, for instance, the standard eviction timeout for a node: https://kubernetes.io/docs/concepts/architecture/nodes/#condition

Another idea is to retry a fixed number of times, perhaps with exponential backoff. Krustlet does something similar when it needs to register itself as a new node with the apiserver: https://github.com/deislabs/krustlet/blob/master/crates/kubelet/src/node/mod.rs#L80
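
For illustration, a minimal retry-with-backoff sketch (list_configurations is a placeholder for the real apiserver request, and tokio::time::sleep from tokio 1.x is assumed):

use std::error::Error;
use std::time::Duration;

// Placeholder for the real kube client request.
async fn list_configurations() -> Result<(), Box<dyn Error>> {
    Err("No route to host".into())
}

// Retry a fixed number of times, doubling the delay each attempt and capping
// it, instead of crashing on the first failure.
async fn list_with_backoff(max_retries: u32) -> Result<(), Box<dyn Error>> {
    let mut delay = Duration::from_secs(1);
    for attempt in 0..=max_retries {
        match list_configurations().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_retries => {
                eprintln!("apiserver request failed (attempt {}): {}; retrying in {:?}", attempt, e, delay);
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(60));
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop always returns on the last attempt")
}

#[tokio::main]
async fn main() {
    if let Err(e) = list_with_backoff(5).await {
        eprintln!("giving up after retries: {}", e);
    }
}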

github-actions bot commented Sep 4, 2021

Issue has been automatically marked as stale due to 45 days of inactivity. Update the issue to remove the label; otherwise it will be automatically closed.

github-actions bot added the stale label on Sep 4, 2021
kate-goldenring removed the documentation (Improvements or additions to documentation) label on Sep 22, 2021
bfjelds added the keep-alive label and removed the stale label on Nov 2, 2021
adithyaj (Collaborator) commented Jan 3, 2023

@kate-goldenring did #374 resolve this issue? Should we close it?

kate-goldenring (Contributor) commented

@adithyaj I am not sure we ever tested whether it resolved it, but I think it is safe to close this and reopen it if someone still experiences it.

kate-goldenring (Contributor) commented

This issue is still not resolved according to #557

diconico07 (Contributor) commented

From what I understand of the kube-rs documentation:

"Errors from the underlying watch are propagated, after which the stream will go into recovery mode on the next poll."

So my guess is that this construct is at fault:

let watcher = watcher(resource, ListParams::default());
let mut informer = watcher.boxed();
...
while let Some(event) = informer.try_next().await? {
...

When doing this we exit the loop on any error, which probably ends in the panic; we should instead check for errors within the loop and handle them locally (maybe also with a backoff mechanism), roughly as sketched below.
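
Roughly what I have in mind, keeping the shape of the snippet above (exact watcher/Event names vary between kube-rs releases, so treat this as the shape of the fix rather than a drop-in patch):

use futures::StreamExt;
use std::time::Duration;

let watcher = watcher(resource, ListParams::default());
let mut informer = watcher.boxed();

// Use `next()` instead of `try_next()?` so an error no longer ends the loop.
while let Some(event) = informer.next().await {
    match event {
        Ok(event) => {
            // handle the watch event as before
        }
        Err(e) => {
            // Log, wait a bit, and let the watcher recover on the next poll
            // instead of propagating the error out (and eventually panicking).
            log::warn!("watch stream error: {}; will retry", e);
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    }
}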

I'll try to do a PR for this.
