
kubernetes pod validation can get a statefulset stuck in a state that requires manual intervention to repair #3801

Closed
mikeh-elastic opened this issue Oct 6, 2020 · 5 comments
Labels: >bug

Comments

@mikeh-elastic

Bug Report

What did you do?
Creating a pod validation problem, such as a resource request set higher than the limit for memory or CPU in a nodeSet of the Elasticsearch YAML, starts an update of the StatefulSet, but the StatefulSet gets stuck. Fixing the Elasticsearch YAML does not update the StatefulSet; a manual edit is required to repair it and let the operator regain control over the StatefulSet.

Steps:
Create an Elasticsearch cluster of 3 nodes with the CPU limit and request set to 1 CPU and 500m respectively.
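For reference, a minimal sketch of what that initial manifest could look like (the cluster and nodeSet names are inferred from the pod names later in this report, the Elasticsearch version is an assumption, and only the resources block matters here):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample-newheap   # name inferred from the pod names below
spec:
  version: 7.9.0                       # assumed version, not stated in the report
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 500m                # request: 500m CPU
            limits:
              cpu: "1"                 # limit: 1 CPU
```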

The cluster forms and is healthy with 3 nodes.

Update the Elasticsearch YAML to set the CPU limit to 400m while keeping the request at 500m.
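In the nodeSet podTemplate this change amounts to roughly:

```yaml
resources:
  requests:
    cpu: 500m
  limits:
    cpu: 400m   # now lower than the request, which Pod validation rejects
```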

The StatefulSet is updated to those values; the first pod is taken down but is never relaunched:

kubectl describe statefulset.apps/elasticsearch-sample
Warning FailedCreate 4m32s (x18 over 15m) statefulset-controller create Pod elasticsearch-sample-newheap-es-default-2 in StatefulSet elasticsearch-sample-newheap-es-default failed error: Pod "elasticsearch-sample-newheap-es-default-2" is invalid: spec.containers[0].resources.requests: Invalid value: "500m": must be less than or equal to cpu limit

Update the Elasticsearch YAML back to the original values of 1000m and 500m and apply it.

Nothing happens to the StatefulSet: a describe still shows the 400m and 500m values, even though the Elasticsearch resource has been correctly repaired and a describe of it shows the 1000m and 500m values.

To repair:

Manually edit the StatefulSet to set the CPU values back to anything that passes validation, such as 600m and 500m.
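A sketch of one way to do this, assuming the StatefulSet and container names from the error above: run `kubectl edit statefulset elasticsearch-sample-newheap-es-default` and set the CPU values of the `elasticsearch` container to something that validates again, for example:

```yaml
resources:
  requests:
    cpu: 500m
  limits:
    cpu: 600m   # any limit >= the request passes Pod validation again
```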

The pod starts and is immediately terminated, because the operator now applies its changes to the StatefulSet to match the Elasticsearch definition of 1000m and 500m.

The cluster restarts with the new values from the Elasticsearch YAML.

What did you expect to see?
The StatefulSet updated to the latest values in the Elasticsearch YAML without a manual edit.

What did you see instead? Under which circumstances?
The StatefulSet became stuck with values that required a manual edit to repair.

Environment
AWS EKS 1.16

  • ECK version: 1.2

  • Kubernetes information: AWS EKS (see `kubectl version` output below)

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.7-eks-1861c5", GitCommit:"1861c597586f84f1498a9f2151c78d8a6bf47814", GitTreeState:"clean", BuildDate:"2019-09-24T22:12:08Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

botelastic bot added the triage label Oct 6, 2020
@sebgl (Contributor) commented Oct 7, 2020

I cannot reproduce this locally on master against a GKE cluster.
The StatefulSet does not get updated with the wrong resource requirements; we catch the error in the dry-run check that was added in ECK 1.2.0 (PR).

According to Michael's comment here, EKS 1.16 does not allow dry-run requests on the pod-identity-webhook :(, which probably explains why the server-side check is skipped in that environment.

Looking at the EKS webhook, it looks like this has been fixed very recently: aws/amazon-eks-pod-identity-webhook#79. I'm not sure how that translates to EKS releases though.

I don't think there's much more we can do.
It would be fairly straightforward for us to validate that resource requests are not higher than resource limits (duplicating the validation done at the StatefulSet level), which would catch this particular case but not all potential podTemplate errors.

@asfalots commented Oct 7, 2020

I'm not sure it's the same issue as mine in #3799. @mikeh-elastic, do you get an error when updating your Elasticsearch manifest?

@asfalots

@sebgl, do you know how I can update the resource while getting the "context deadline exceeded" error? It prevents me from updating my cluster through the operator.

@sebgl (Contributor) commented Oct 12, 2020

@asfalots I think the problem you're experiencing is unrelated to that issue.

@pebrc (Collaborator) commented Feb 18, 2021

Closing this: we have a theory of what happened, there seems to be a mitigation, and there is nothing we need to do on the ECK side.

pebrc closed this as completed Feb 18, 2021