Rolling updates nullify use of readiness probes #1028
Comments
cc @davidvossel
cc @tchughesiv
We're using rolling updates and upgrades are working as expected.
I think this would only impact you, maybe I need to update the impact 😎, if you need your new operator to do work during an upgrade. Even then, this issue would not really prevent that work from being done; it would just be done after OLM had said the operator update was complete.
To let the upgrades go smoothly and make good use of condition reporting, we need to make sure that k8s doesn't change things behind our back. To do that, we force the update strategy to "Recreate". Relevant OLM issues: operator-framework/operator-lifecycle-manager#1028 operator-framework/operator-lifecycle-manager#952 Signed-off-by: Francesco Romani <fromani@redhat.com>
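For reference, a minimal sketch of what forcing the Recreate strategy looks like on an operator Deployment. The names and image below are placeholders, not taken from the issue:

```yaml
# Sketch: operator Deployment pinned to the Recreate strategy, so the old pod
# is terminated before the new one starts and the two never run concurrently.
# Names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
spec:
  replicas: 1
  strategy:
    type: Recreate   # instead of the default RollingUpdate
  selector:
    matchLabels:
      name: example-operator
  template:
    metadata:
      labels:
        name: example-operator
    spec:
      containers:
        - name: operator
          image: example.com/example-operator:v2.0.0
```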
If rolling updates or other deployment-level options don't work for your case, you can change the name of the deployment in your install strategy. This will spin up the new deployment and remove the old deployment without any rolling.
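If I understand the suggestion, the rename happens in the CSV's install strategy. A rough sketch, assuming the standard `deployment` install strategy layout and using placeholder names:

```yaml
# Sketch of a ClusterServiceVersion install strategy where the deployment name
# changes between CSV versions; OLM creates the new deployment and removes the
# old one rather than patching it in place. Names and image are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v2.0.0
spec:
  replaces: example-operator.v1.0.0
  install:
    strategy: deployment
    spec:
      deployments:
        - name: example-operator-v2   # was "example-operator-v1" in the previous CSV
          spec:
            replicas: 1
            selector:
              matchLabels:
                name: example-operator
            template:
              metadata:
                labels:
                  name: example-operator
              spec:
                containers:
                  - name: operator
                    image: example.com/example-operator:v2.0.0
```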
@ecordell I thought that if you changed the name of the Deployment, OLM would still wait for the new Pod to become ready before removing the old Deployment. Is that not the case?
@ecordell Changing the name of the deployment in the install strategy is the most problematic of all the options available to an operator developer. In that scenario, OLM will bring up all of the new deployments, wait for them to be ready, and then take down the old deployments.

You may not call this a rolling update, but for all practical purposes it is one. There is no correlation, from an OLM perspective, between … I don't understand why this issue was closed without any further discussion. At a minimum this is a doc issue where we should be warning operator developers of potential pitfalls related to upgrades.
@ecordell shouldn't this be reopened? Recreate should not be the first option if you want your operators to be available. Just consider the case where the new pods are not schedulable for a longer time; your operator will be down then. This can of course also happen otherwise (e.g. the cluster is out of capacity and the operator is evicted or crashes), but it is an unnecessary additional risk during normal cluster operation.
Problem
When rolling updates are used for operator upgrades, it is not possible for the operator to perform any initialization/transition tasks within the scope of an OLM update.
Background
Let us assume that the operator is configured to delay readiness in order to perform some set of initialization tasks during an upgrade (upgrading operands, transitioning resource versions, etc.) and that these operators use leader election. When the rollout begins there are two operator instances that want to own the same set of resources: the old operator instance has no work to do and stays "ready", while the new operator is not able to do any work because it isn't in control. At this point, the use of readiness is not effective.
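To make the scenario concrete, a hedged sketch of the relevant pod spec fragment: the new pod's readiness is gated on its upgrade work (here via a hypothetical `/readyz` endpoint, which is an assumption, not something from the issue), but it cannot do that work until it wins leader election, which the still-ready old pod holds.

```yaml
# Sketch only: pod spec fragment where readiness is gated on upgrade work.
# The /readyz endpoint and port are hypothetical; the point is that the probe
# fails until the operator has finished its initialization tasks, which it
# cannot start while the old instance holds the leader lock.
spec:
  containers:
    - name: operator
      image: example.com/example-operator:v2.0.0
      readinessProbe:
        httpGet:
          path: /readyz   # hypothetical endpoint, returns 200 only after migrations complete
          port: 8081
        initialDelaySeconds: 5
        periodSeconds: 10
```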
Impact
Only operators that are using a rolling update strategy are affected (either by changing the names of deployments in the CSV or by using the "RollingUpdate" Kubernetes deployment strategy). This is regardless of whether they are using the "Many Version" or "Tightly Coupled" approach.
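For comparison, the affected case is simply the Deployment default, spelled out explicitly below with the Kubernetes default surge/unavailability values:

```yaml
# The implicit default for a Deployment: during an update a new pod is surged
# in while the old pod keeps running, which is exactly the overlap described above.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
```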
Solution
It seems that, in the majority of scenarios, operator developers should be encouraged to use update/patch (keeping deployment names the same) plus the Recreate strategy type to grant the updated operator control over the resources. This makes #952 more of an issue for operator developers.
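Put together, the suggestion amounts to roughly the following CSV fragment (a sketch with placeholder names): the deployment name stays the same across CSV versions so OLM patches it in place, and the embedded deployment spec forces Recreate so the old operator pod is gone before the new one starts.

```yaml
# Sketch: deployment name unchanged between CSV versions, strategy forced to
# Recreate inside the CSV-embedded deployment spec. Names and image are placeholders.
spec:
  install:
    strategy: deployment
    spec:
      deployments:
        - name: example-operator        # same name as in the previous CSV
          spec:
            replicas: 1
            strategy:
              type: Recreate
            selector:
              matchLabels:
                name: example-operator
            template:
              metadata:
                labels:
                  name: example-operator
              spec:
                containers:
                  - name: operator
                    image: example.com/example-operator:v2.0.0
```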
At the same time this appears, to me at least, to shine a light on the discussion in #922, where a more robust mechanism for communicating operator status back to OLM, whether during install, continuous operation, or upgrades, would be beneficial to operators being managed by OLM.