Deleting a pod while cluster is red results in pod not being replaced. #850
Comments
I don't know how this situation happened. All the data above is mostly correlating, not determining a cause. The pod, when deleted, did not return after waiting for 10-15 minutes. I gave up waiting and deleted the elasticsearch resource and recreated it fully.
Where there should be 6 pods, there are now only 5.
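A minimal sketch of the recovery described above, with hypothetical resource and manifest names (none of these identifiers come from the original report):

```bash
kubectl delete pod <stuck-es-pod>            # the deleted pod never came back on its own
kubectl get pods                             # still only 5 of the expected 6 pods
kubectl delete elasticsearch <cluster-name>  # give up: remove the Elasticsearch resource entirely
kubectl apply -f elasticsearch.yaml          # recreate it from the original manifest
```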
I think we're seeing more of these errors now than before, since our readiness health check is too permissive for several of the write operations we do at the moment: Ready = true whenever ES returns a successful HTTP response.

When it comes to the ready check implementation, I think we also have to consider it as something the user might want to configure, as the ready check itself could depend on the specific use case (and node type). To reap particularly good benefits of this, additional smarts would have to exist on the Service(s) layer though (e.g. routing reads to data nodes that are ready, even if they are partitioned from master nodes?). In any case, there are other semantically valid ready checks that could work, such as the Helm-chart one: the node has seen a cluster state of green/yellow/red once, and after that readiness is just whether it is still responding to HTTP requests.
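For illustration only, here is a minimal sketch of that Helm-chart style of check written as a readiness-probe script. The marker-file path, the localhost:9200 address, and the use of curl are assumptions for the example, not the operator's actual probe.

```bash
#!/usr/bin/env bash
# Hedged sketch of a "Helm-chart style" readiness check (not the operator's code).
# Assumptions: the script runs inside the Elasticsearch container, the node
# listens on localhost:9200, and /tmp is writable for a marker file.
set -euo pipefail

START_FILE=/tmp/.es_seen_cluster_state

if [ -f "${START_FILE}" ]; then
  # The node has reported a cluster health (green/yellow/red) at least once;
  # from now on, "ready" only means it still answers HTTP.
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:9200/")
  [ "${code}" = "200" ]
else
  # First time: ready only once the node can report some cluster health,
  # then remember that with a marker file.
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:9200/_cluster/health?local=true")
  if [ "${code}" = "200" ]; then
    touch "${START_FILE}"
  else
    exit 1
  fi
fi
```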
This will be fixed by moving to StatefulSets: the StatefulSet controller will automatically recreate deleted pods with the same PVCs.
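As a rough illustration of why this helps (hypothetical pod and PVC names, not taken from this issue): a StatefulSet pod that is deleted comes back under the same name and re-binds the PVC derived from that name, so the node keeps its data.

```bash
# Hypothetical names; illustrative only.
kubectl get pvc                # e.g. elasticsearch-data-es-data-0, used by pod es-data-0
kubectl delete pod es-data-0   # delete one Elasticsearch pod
kubectl get pods -w            # the StatefulSet controller recreates es-data-0 automatically
kubectl get pvc                # the recreated pod re-binds the same PVC, so its data is preserved
```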
I am closing this as we are moving to StatefulSets.
Bug Report
What did you do?
Deleted an Elasticsearch pod.
What did you expect to see?
The loss of a pod should be reconciled by replacing the lost pod.
What did you see instead? Under which circumstances?
Controller logged 503 Service Unavailable inside the reconciler.

Circumstances: cluster health was red at the time, as reported by kubectl get elasticsearch.
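For reference, the state could be observed with standard kubectl commands (cluster and namespace names omitted; the comments reflect only what the report above describes):

```bash
kubectl get elasticsearch   # the Elasticsearch resource reported red health
kubectl get pods            # 5 pods listed where 6 were expected
```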
Environment
Version information:
https://github.com/elastic/cloud-on-k8s/blob/pre-release/docs/k8s-quickstart.asciidoc
Kubernetes information:
GKE master 1.11.8-gke.6, nodes 1.11.7-gke.12
I don't have logs from the session, but we did observe a reconciler error message reporting 503 Service Unavailable at the time of the issue.