kubernetes pod validation can get a statefulset stuck in a state that requires manual intervention to repair #3801
Comments
I cannot reproduce locally on [...]. According to Michael's comment here, EKS 1.16 does not allow dry-run requests on the [...]. Looking at the EKS webhook, it looks like this has been fixed very recently: aws/amazon-eks-pod-identity-webhook#79. I'm not sure how that translates to EKS releases, though. I don't think there's much more we could do.
I'm not sure it's the same issue as mine in #3799. @mikeh-elastic Do you get an error when updating your Elasticsearch manifest?
@sebgl, do you know how I can update the resource while getting the "context deadline exceeded" error? It prevents me from updating my cluster through the operator.
@asfalots I think the problem you're experiencing is unrelated to that issue.
Closing this as we have a theory of what happened, there seems to be a mitigation, and there is nothing we need to do on the ECK side.
Bug Report
What did you do?
Introducing a pod validation problem in the Elasticsearch YAML, such as a memory or CPU request set higher than the limit for a nodeSet, causes the operator to start updating the StatefulSet, which then gets stuck. Fixing the Elasticsearch YAML does not update the StatefulSet; a manual edit is required to repair it and let the operator regain control over it.
Steps:
Create an Elasticsearch cluster of 3 nodes with the CPU limit and request set to 1 CPU (1000m) and 500m respectively.
The cluster forms and is healthy with 3 nodes.
Update the Elasticsearch YAML to set the limit to 400m and keep the request at 500m.
The StatefulSet is updated to those values; the first pod is taken down and is not relaunched.
kubectl describe statefulset.apps/elasticsearch-sample
Warning FailedCreate 4m32s (x18 over 15m) statefulset-controller create Pod elasticsearch-sample-newheap-es-default-2 in StatefulSet elasticsearch-sample-newheap-es-default failed error: Pod "elasticsearch-sample-newheap-es-default-2" is invalid: spec.containers[0].resources.requests: Invalid value: "500m": must be less than or equal to cpu limit
Update the Elasticsearch YAML back to the original values of 1000m and 500m and apply it.
Nothing happens to the StatefulSet: a describe still shows the 400m and 500m values, even though a describe of the Elasticsearch resource shows the corrected values of 1000m and 500m.
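The FailedCreate event in the steps above comes from the rule that a container's CPU request must not exceed its CPU limit. The following is a minimal Python sketch of that rule, not the actual API server code. The helper names `parse_cpu` and `cpu_request_error` are hypothetical, and only plain core counts and the `m` (millicore) suffix are handled.

```python
from decimal import Decimal
from typing import Optional


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m" or "1" into cores.

    Assumption: only plain numbers and the "m" (millicore) suffix are
    supported, which covers the values used in this report.
    """
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def cpu_request_error(resources: dict) -> Optional[str]:
    """Return an error string if the CPU request exceeds the CPU limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if "cpu" not in requests or "cpu" not in limits:
        return None  # nothing to compare
    if parse_cpu(requests["cpu"]) > parse_cpu(limits["cpu"]):
        return (
            f'Invalid value: "{requests["cpu"]}": '
            "must be less than or equal to cpu limit"
        )
    return None


# The three resource settings that appear in this report.
original = {"limits": {"cpu": "1000m"}, "requests": {"cpu": "500m"}}
broken = {"limits": {"cpu": "400m"}, "requests": {"cpu": "500m"}}
manual_fix = {"limits": {"cpu": "600m"}, "requests": {"cpu": "500m"}}

for name, res in [("original", original), ("broken", broken), ("manual_fix", manual_fix)]:
    print(name, cpu_request_error(res))
```

Running a check like this against the nodeSet resources before applying the Elasticsearch YAML would catch the 400m/500m combination before the StatefulSet is touched.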
To repair:
Manually edit the StatefulSet to set the CPU values to anything that passes validation, such as 600m and 500m.
The pod starts and is immediately terminated, because the operator now updates the StatefulSet to match the Elasticsearch definition (1000m and 500m).
The cluster restarts with the new values from the Elasticsearch YAML.
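The manual edit in the repair steps could also be expressed as a JSON patch instead of an interactive edit. The sketch below only builds the `kubectl patch` command and does not run it. It is an alternative to what was actually done in this report, and it assumes the StatefulSet name and container index 0 taken from the FailedCreate event, the current namespace, and an existing CPU limit on that container. The function names are hypothetical.

```python
import json
import shlex


def build_cpu_limit_patch(container_index: int, cpu_limit: str) -> list:
    """Build a JSON patch that replaces the CPU limit in the pod template."""
    return [
        {
            "op": "replace",
            "path": (
                f"/spec/template/spec/containers/{container_index}"
                "/resources/limits/cpu"
            ),
            "value": cpu_limit,
        }
    ]


def kubectl_patch_args(statefulset: str, patch: list) -> list:
    """Return the argument list for a `kubectl patch` call with a JSON patch."""
    return [
        "kubectl", "patch", "statefulset", statefulset,
        "--type=json", "-p", json.dumps(patch),
    ]


# Name and index come from the event message; 600m is the value used in
# the manual repair (any limit at or above the 500m request would pass).
args = kubectl_patch_args(
    "elasticsearch-sample-newheap-es-default",
    build_cpu_limit_patch(0, "600m"),
)
print(shlex.join(args))
```

As described above, the operator is expected to overwrite this temporary value with the one from the Elasticsearch definition once the StatefulSet passes validation again.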
What did you expect to see?
The StatefulSet is updated to the latest values in the Elasticsearch YAML without a manual edit.
What did you see instead? Under which circumstances?
The StatefulSet became stuck with values that required a manual edit to repair.
Environment
AWS EKS 1.16
ECK version:
1.2
Kubernetes information:
kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.7-eks-1861c5", GitCommit:"1861c597586f84f1498a9f2151c78d8a6bf47814", GitTreeState:"clean", BuildDate:"2019-09-24T22:12:08Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}