Packing complex logic (and API calls!) into a pre-stop hook is a bit of an anti-pattern. However, the ECK operator currently does not handle k8s node maintenance gracefully: evictions due to node maintenance are not orchestrated by the operator, so none of the pre-shutdown logic that we run on regular scale-downs or ES Pod upgrades is executed.
This becomes a problem on clusters with a lot of data, where Pods are evicted due to node maintenance and shard recovery kicks in almost immediately.
A possible solution would be to add a pre-stop script that queries the ES API to find out whether a node shutdown is currently in progress. If so, it does nothing more. If not, it issues a `_node/shutdown` request of type `restart` to the ES API (which is a guess of course, because we cannot know in the pre-stop hook what kind of shutdown is actually happening).
Downsides of this approach are:
- exposure of additional API credentials (`cluster_admin`) in the script
- implementing solid retry logic and dealing with unavailability of ES (the overall pre-stop hook timeout helps here)
- implementing a loop to wait for the shutdown `complete` condition from the ES side
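For illustration, here is a minimal sketch of what such a pre-stop script could look like. The `_nodes/<node-id>/shutdown` endpoints are the public ES node shutdown API; everything else is an assumption: the credential environment variables, the availability of `curl` and `jq` in the container, the polling interval, and the idea that the pre-stop hook's own deadline bounds the wait loop.

```bash
#!/usr/bin/env bash
# Hypothetical pre-stop hook sketch -- not the operator's actual script.
set -uo pipefail

ES_URL="${ES_URL:-https://localhost:9200}"    # assumed in-Pod ES endpoint
AUTH="pre-stop-user:${PRE_STOP_PASSWORD}"     # hypothetical cluster_admin credentials

# The node shutdown API is keyed by node ID, so resolve the local node's ID first.
NODE_ID=$(curl -sk -u "$AUTH" "$ES_URL/_nodes/_local" | jq -r '.nodes | keys[0]')

# 1. If a shutdown record already exists (e.g. registered by the operator), do nothing more.
if curl -sk -u "$AUTH" "$ES_URL/_nodes/$NODE_ID/shutdown" | jq -e '.nodes | length > 0' >/dev/null; then
  exit 0
fi

# 2. Otherwise register a shutdown of type "restart" ourselves (a guess, as noted above).
curl -sk -u "$AUTH" -X PUT "$ES_URL/_nodes/$NODE_ID/shutdown" \
  -H 'Content-Type: application/json' \
  -d '{"type": "restart", "reason": "k8s node maintenance (pre-stop hook)"}'

# 3. Poll until ES reports the shutdown as COMPLETE; the overall pre-stop hook
#    timeout bounds this loop, so we never wait forever.
for _ in $(seq 1 60); do
  STATUS=$(curl -sk -u "$AUTH" "$ES_URL/_nodes/$NODE_ID/shutdown" | jq -r '.nodes[0].status // "UNKNOWN"')
  [ "$STATUS" = "COMPLETE" ] && exit 0
  sleep 5
done
exit 0   # give up; the eviction proceeds regardless
```

Even in this rough form the downsides listed above are visible: the script needs privileged credentials, and both the retry behaviour and the wait loop have to be hand-rolled against a possibly unavailable ES.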
cc @SpencerLN