
kubernetes pod validation can get a statefulset stuck in a state that requires manual intervention to repair #3801

Closed
mikeh-elastic opened this issue Oct 6, 2020 · 5 comments
Labels: >bug

Comments

@mikeh-elastic

Bug Report

What did you do?
Creating a pod validation problem, such as a resource request set higher than the limit for memory or CPU in a nodeSet of the Elasticsearch YAML, starts an update of the StatefulSet, but the StatefulSet gets stuck. Fixing the Elasticsearch YAML does not update the StatefulSet; a manual edit is required to repair it and let the operator regain control over the StatefulSet.

Steps:
Create an Elasticsearch cluster of 3 nodes with the CPU limit and request set to 1 CPU and 500m respectively.
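For reference, a minimal sketch of what that initial manifest could look like (the cluster and nodeSet names are inferred from the pod names later in this report, the Elasticsearch version is an assumption, and only the resources block matters here):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample-newheap   # name inferred from the pod names below
spec:
  version: 7.9.0                       # assumed version, not stated in the report
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 500m                # request: 500m CPU
            limits:
              cpu: "1"                 # limit: 1 CPU
```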

The cluster forms and is healthy with 3 nodes.

Update the Elasticsearch YAML to set the CPU limit to 400m while keeping the request at 500m.
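In the nodeSet podTemplate this change amounts to roughly:

```yaml
resources:
  requests:
    cpu: 500m
  limits:
    cpu: 400m   # now lower than the request, which Pod validation rejects
```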

The StatefulSet is updated to those values; the first pod is taken down but is never relaunched:

kubectl describe statefulset.apps/elasticsearch-sample
Warning FailedCreate 4m32s (x18 over 15m) statefulset-controller create Pod elasticsearch-sample-newheap-es-default-2 in StatefulSet elasticsearch-sample-newheap-es-default failed error: Pod "elasticsearch-sample-newheap-es-default-2" is invalid: spec.containers[0].resources.requests: Invalid value: "500m": must be less than or equal to cpu limit

Update the Elasticsearch YAML back to the original values of 1000m and 500m and apply it.

Nothing happens to the StatefulSet: a describe still shows the 400m and 500m values, even though the Elasticsearch resource has been correctly repaired and a describe of it shows the 1000m and 500m values.

To repair:

Manually edit the StatefulSet to set the CPU values back to anything that passes validation, such as 600m and 500m.
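A sketch of one way to do this, assuming the StatefulSet and container names from the error above: run `kubectl edit statefulset elasticsearch-sample-newheap-es-default` and set the CPU values of the `elasticsearch` container to something that validates again, for example:

```yaml
resources:
  requests:
    cpu: 500m
  limits:
    cpu: 600m   # any limit >= the request passes Pod validation again
```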

The pod starts and is immediately terminated, because the operator now applies its changes to the StatefulSet to match the Elasticsearch definition of 1000m and 500m.

The cluster restarts with the new values from the Elasticsearch YAML.

What did you expect to see?
The StatefulSet updated to the latest values in the Elasticsearch YAML without a manual edit.

What did you see instead? Under which circumstances?
The StatefulSet became stuck with values that required a manual edit to repair.

Environment
AWS EKS 1.16

  • ECK version: 1.2

  • Kubernetes information: AWS EKS (see `kubectl version` output below)

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.7-eks-1861c5", GitCommit:"1861c597586f84f1498a9f2151c78d8a6bf47814", GitTreeState:"clean", BuildDate:"2019-09-24T22:12:08Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

botelastic bot added the triage label Oct 6, 2020
@sebgl (Contributor) commented Oct 7, 2020

I cannot reproduce this locally on master against a GKE cluster.
The StatefulSet does not get updated with the wrong resource requirements; we catch the error in the dry-run check that was added in ECK 1.2.0 (PR).

According to Michael's comment here, EKS 1.16 does not allow dry-run requests on the pod-identity-webhook :(, which probably explains why the server-side check is skipped in that environment.

Looking at the EKS webhook, it looks like this has been fixed very recently: aws/amazon-eks-pod-identity-webhook#79. I'm not sure how that translates to EKS releases though.

I don't think there's much more we can do.
It would be fairly straightforward for us to validate that resource requests are not higher than resource limits (duplicating the validation done at the StatefulSet level), which would catch this particular case but not all potential podTemplate errors.

@asfalots commented Oct 7, 2020

I'm not sure it's the same issue as mine in #3799. @mikeh-elastic, do you get an error when updating your Elasticsearch manifest?

@asfalots

@sebgl, do you know how I can update the resource while getting the "context deadline exceeded" error? It prevents me from updating my cluster through the operator.

@sebgl (Contributor) commented Oct 12, 2020

@asfalots I think the problem you're experiencing is unrelated to that issue.

@pebrc (Collaborator) commented Feb 18, 2021

Closing this: we have a theory of what happened, there seems to be a mitigation, and there is nothing we need to do on the ECK side.

pebrc closed this as completed Feb 18, 2021