
deployment errors leave deployments paused in OCP #860

filex opened this issue Feb 25, 2022 · 5 comments
Labels
bug Something isn't working

Comments


filex commented Feb 25, 2022

Describe the bug

When a deployment triggered by odsComponentStageRolloutOpenShiftDeployment() fails, DeploymentConfigs in OCP might be left in a paused state.

There is code to catch and fix that: https://github.com/opendevstack/ods-jenkins-shared-library/blob/4.x/src/org/ods/component/RolloutOpenShiftDeploymentStage.groovy#L156

But I think it doesn't work as expected.

To Reproduce

Our component has more than one DeploymentConfig. The rollout starts with all DCs being paused. Then, the latest built images from the -cd namespace are tagged over into the ImageStreams of my current namespace. (This would trigger the DCs, but they are paused.) After all images are tagged, the DCs are resumed one by one.
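To illustrate that sequence in plain oc terms (a rough sketch only; the DC names and the ahdm-cd namespace are assumed placeholders, and the actual logic lives in RolloutOpenShiftDeploymentStage.groovy and may use different calls):

# pause all DCs of the component
oc -n ahdm-dev rollout pause dc/my-service-1
oc -n ahdm-dev rollout pause dc/my-service-2
oc -n ahdm-dev rollout pause dc/my-service-3
# tag the freshly built images from the -cd namespace into the current namespace
oc tag ahdm-cd/my-service-1:latest ahdm-dev/my-service-1:latest
# ... repeated for the other images ...
# resume the DCs one by one and watch each rollout
oc -n ahdm-dev rollout resume dc/my-service-1
oc -n ahdm-dev rollout status dc/my-service-1 --watch=true
# ... repeated for the other DCs ...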

Let's say we have 3 DCs and one DC rollout fails (e.g. a container does not come up because of misconfiguration elsewhere). Let's say it was the second DC:

  • The first DC is unpaused and running the new version
  • The second DC is unpaused and running the old version (because the rollout failed)
  • The third DC is still in a paused state and running the old version (because it was never resumed).

This is how it looks in the Jenkins log:

+ oc -n ahdm-dev rollout status DeploymentConfig/my-service-2 --watch=true
Waiting for rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for rollout to finish: 1 old replicas are pending termination...
error: replication controller "…" has failed progressing

Now the bulkResume() mentioned above is run in the finally block. But it fails, too:

+ oc rollout resume DeploymentConfig/my-service-1 -n ahdm-dev
error: deploymentconfigs.apps.openshift.io "my-service-1" is not paused

The bulkResume fails because the first DC was already unpaused. This leaves the last DC (3) in a paused state.

This state will also make the next Jenkins job fail: after building everything, the next call to odsComponentStageRolloutOpenShiftDeployment() will fail, too. This time because pausing all DCs fails, since the last DC (3) is already paused.

Workaround

If I manually resume the paused DC after a rollout failure, the next Jenkins job will work as expected.
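For reference, the manual workaround boils down to something like this (the DC name is just the third DC from the example above):

oc -n ahdm-dev rollout resume dc/my-service-3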

One side-effect of this workaround could be problematic: The ODS rollout has imported the image from the -cd namespace as :latest. That means that resuming the DCs will actually continue the failed rollout. The DC state then reads like this:

  • The first DC is running the new version
  • The second DC is running the old version (because the rollout failed)
  • The third DC is running the new version (because it was updated/triggered by manually resuming the DC).

Expected behavior

A failed rollout should not leave behind a state that makes the next rollout fail, too.

However, we must be aware that gracefully resuming all DCs after a failure will continue the rollout in OCP without ODS/Jenkins monitoring it.

A better solution might be to not let the ODS rollout fail when a single DC rollout fails. Then, ODS still has control over the rollout of the remaining DCs and can track their state.

Affected version (please complete the following information):

  • OpenShift: 3.11
  • OpenDevStack 4.x
filex added the bug label Feb 25, 2022
clemensutschig (Member) commented:

I guess the trick is to query only the non-paused DCs :) in both code blocks ...
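A rough sketch of that idea in plain oc/shell terms (the shared library would do this in Groovy; namespace and resource names as in the log above), resuming only DCs that are actually paused:

for dc in $(oc -n ahdm-dev get dc -o name); do
  if [ "$(oc -n ahdm-dev get "$dc" -o jsonpath='{.spec.paused}')" = "true" ]; then
    oc -n ahdm-dev rollout resume "$dc"
  fi
done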

filex (Author) commented Feb 25, 2022

Yes, but this leads to "unattended" rollouts.

clemensutschig (Member) commented:

"The rollout starts with all DCs being paused." - this is where we should be smarter ... and similarly at "This time because pausing all DCs fails, since the last DC (3) is already paused."
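The pause side of that, again only as a rough shell sketch of the idea (the real implementation would live in the Groovy stage code), pausing only DCs that are not already paused:

for dc in $(oc -n ahdm-dev get dc -o name); do
  if [ "$(oc -n ahdm-dev get "$dc" -o jsonpath='{.spec.paused}')" != "true" ]; then
    oc -n ahdm-dev rollout pause "$dc"
  fi
done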

clemensutschig (Member) commented:

@michaelsauter

michaelsauter (Member) commented:

Phew, tricky!

The pausing/resuming logic seems to have been added in #686, in order to add further labels to the resources without causing multiple rollouts.

In general, I personally think there are a couple of underlying issues:

  • The use of DeploymentConfig resources allows the use of image triggers, and the shared lib is using them, applying the latest tag, which is assumed to cause a rollout. I've argued for quite some time now that I think this is not ideal: we should be using Deployment resources instead, not use any latest tag, and instead update the spec to point to specific tags (see the sketch after this list). That is more predictable and easier to debug when things go wrong.
  • Pausing and unpausing to apply labels is prone to cause trouble: I think it would be safer to embed labels into the templates themselves instead of patching them onto the resources.
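To illustrate the first point (a hypothetical sketch only; the registry, container name and tag below are placeholders, and this is not what the shared lib does today): instead of relying on an image trigger on :latest, the pipeline could update the Deployment spec to a specific, immutable tag and then watch the rollout:

oc -n ahdm-dev set image deployment/my-service-2 my-service-2=<registry>/ahdm-dev/my-service-2:<specific-tag>
oc -n ahdm-dev rollout status deployment/my-service-2 --watch=true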

@clemensutschig so ... I think your idea of "query only the non-paused DCs" works, but it feels like a band-aid to me ... as an aside: because of this we intentionally side-step all of those issues in the ODS pipeline by not interfering at all with the helm upgrade, not iterating over any DeploymentConfig/Deployment resources, and not applying any labels.

> A better solution might be to not let the ODS rollout fail when a single DC rollout fails. Then, ODS still has control over the rollout of the remaining DCs and can track their state.

Whether we fail right away or continue and fail at the end, we still leave the deployments in an inconsistent state. This might be the more pressing issue to solve? Ideally we would have some kind of atomic operation - either all succeed, or all stay on the old version ...
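As an aside, where the chart-based flow mentioned above is used, helm itself offers an all-or-nothing mode via the --atomic flag, which rolls the release back if the upgrade fails (release name and chart path below are placeholders):

helm upgrade --install --atomic my-component ./chart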
