fix irrecoverable errors in async operations #2665

Merged: 1 commit into kubernetes-sigs:main on Sep 28, 2022

Conversation

@nojnhuh (Contributor) commented Sep 23, 2022:

What type of PR is this?
/kind bug

What this PR does / why we need it:
Reconciliations involving asynchronous operations can fall into a loop where an eventual "Failed" result blocks future reconciliations from making any further changes to that particular resource to fix the problem.

This change sets `longRunningOperationStates` for agent pools on the corresponding AzureManagedMachinePool instead of the AzureManagedControlPlane, since changes to that resource were not being persisted. It also blocks new operations based only on the status of existing operations of the same type: in-progress PUT operations will no longer block new DELETEs, and in-progress DELETEs will no longer block new PUTs.

In cases where polling a future from the Azure API would eventually return both `isDone==true` and a non-nil error, a "failed checking if the operation was complete" message would be logged even though the error referred to the operation's ultimate failure, not to a problem polling the future itself. This change treats every `isDone==true` polling check as successful and relies on the operation's error being captured in the future's `Result`. The future is always deleted from the status once the operation is done, so a failed operation can be retried on the next reconciliation.
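
For illustration, a rough sketch of the resulting polling flow is below. The interface and helper names here are assumptions made for this sketch, not the actual CAPZ async API.

```go
// Hypothetical sketch of the polling flow described above; the types and
// method names approximate CAPZ's async helpers but are not the exact API.
package asyncsketch

import (
	"context"
	"errors"
	"fmt"
)

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name, ServiceName, Type string // Type is e.g. "PUT" or "DELETE"
}

// FutureScope is an assumed interface for reading and clearing stored futures
// from the owning object's status (e.g. the AzureManagedMachinePool).
type FutureScope interface {
	GetLongRunningOperationState(resourceName, serviceName, futureType string) *Future
	DeleteLongRunningOperationState(resourceName, serviceName, futureType string)
}

// FuturePoller is an assumed interface wrapping the Azure SDK polling calls.
type FuturePoller interface {
	IsDone(ctx context.Context, f *Future) (bool, error)
	Result(ctx context.Context, f *Future) (interface{}, error)
}

var ErrOperationNotDone = errors.New("operation still in progress")

// ProcessOngoingOperation treats a poll that reports done as successful, takes
// the operation's own error from Result, and always deletes the stored future
// once the operation is done so a failure can be retried next reconciliation.
func ProcessOngoingOperation(ctx context.Context, scope FutureScope, poller FuturePoller, resourceName, serviceName, futureType string) (interface{}, error) {
	future := scope.GetLongRunningOperationState(resourceName, serviceName, futureType)
	if future == nil {
		return nil, nil // no operation of this type in flight
	}

	done, err := poller.IsDone(ctx, future)
	if err != nil {
		// Polling itself failed (e.g. a transient API error): keep the future and retry later.
		return nil, fmt.Errorf("failed checking if the operation was done: %w", err)
	}
	if !done {
		return nil, ErrOperationNotDone // requeue and poll again next reconciliation
	}

	result, opErr := poller.Result(ctx, future)
	scope.DeleteLongRunningOperationState(resourceName, serviceName, futureType)
	return result, opErr
}
```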

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
I found these issues while working on #2664 and testing failure scenarios like creating an AKS cluster with more nodes than available public IP addresses. Incorporating that feature here will exercise these changes.

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Sep 23, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 23, 2022
@k8s-ci-robot (Contributor) commented:

Hi @nojnhuh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 23, 2022
-	}
-
-	return isDone, nil
+	return future.DoneWithContext(ctx, ac.managedclusters)

@nojnhuh (author):

I imagine this is a change we'd make to all the service clients at once. I'll only include the change to managedclusters for now and once we decide how these should look then I'll make any necessary changes to the others.

@nojnhuh (author):

Updated the rest.

@@ -281,6 +288,14 @@ func (s *ManagedMachinePoolScope) PatchCAPIMachinePoolObject(ctx context.Context
)
}

// PatchControlPlaneObject persists the AzureManagedControlPlane configuration and status.
func (s *ManagedMachinePoolScope) PatchControlPlaneObject(ctx context.Context) error {

Reviewer (Contributor):

I was wondering why you had to do this and then I saw https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2665/files#diff-5607bc7afbe8f18686e0e62796b8da6371e999006ddb598c5034dd7ef7040a47R234... I think the bug is actually that we're setting/getting the future on the wrong object above no? Probably a bad copy paste :(

@nojnhuh (author):

I wondered if that was the case. That seems reasonable though so I'll change this to set the future on the AzureManagedMachinePool.

@@ -85,7 +94,7 @@ func processOngoingOperation(ctx context.Context, scope FutureScope, client Futu
 		// This can happen if the resource was deleted by another process before we could get the result.
 		scope.DeleteLongRunningOperationState(resourceName, serviceName)
 	}
-	return result, err
+	return result, resultErr{err}

Reviewer (Contributor):

instead of having to differentiate the error here, it might be easier to call DeleteLongRunningOperationState in all cases where done is true (regardless of the error return); that way we'll attempt a new operation on the next loop if the current one finished, and we won't keep the erroneous future in Status.

@nojnhuh (author):

Sounds good, I'll try that.

_, err := processOngoingOperation(ctx, s.Scope, s.Deleter, resourceName, serviceName)
return err
// PUT operations still in progress should not be waited on if a delete was requested.
if future != nil && future.Type == infrav1.DeleteFuture {

Reviewer (Contributor):

we also want the reverse to be true (don't block a PUT on a delete operation in progress). What do you think of modifying GetLongRunningOperationState to only return futures for the right operation type so this applies to all operations?

@nojnhuh (author):

That makes sense, I'll add that.

I noticed there are a couple of places this is called outside of this file for scalesets:

Since it looks like this is looking for a future of any type, is it worth baking in some capability to search for multiple types? Or maybe just do that inline here?

future = s.Scope.GetLongRunningOperationState(s.Scope.ScaleSetSpec().Name, serviceName)

The second one is in a Delete method, so I'm guessing we'd only want delete futures here too?

future := s.Scope.GetLongRunningOperationState(vmssSpec.Name, serviceName)

Reviewer (Contributor):

The first use (L94) is in Reconcile so that's either a PUT or PATCH future, and the second use (L185) should be just DELETE futures.

Scalesets still needs to be refactored as part of #1702; after that it will also be using async.go, so I think we can just do it inline for now and check both.
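
A minimal sketch of the type-filtered lookup discussed in this thread; the `Future` fields and the variadic type filter are illustrative assumptions for this sketch, not the actual CAPZ signature.

```go
package asyncsketch

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name        string // resource name
	ServiceName string // owning service, e.g. "agentpools"
	Type        string // operation type, e.g. "PUT", "PATCH", or "DELETE"
}

// getLongRunningOperationState returns only a stored future whose operation
// type matches, so an in-progress DELETE no longer blocks a new PUT and an
// in-progress PUT no longer blocks a new DELETE for the same resource.
func getLongRunningOperationState(states []Future, name, service string, futureTypes ...string) *Future {
	for i := range states {
		f := &states[i]
		if f.Name != name || f.ServiceName != service {
			continue
		}
		for _, t := range futureTypes {
			if f.Type == t {
				return f
			}
		}
	}
	return nil
}
```

With a shape like this, a Reconcile caller could pass both PUT and PATCH inline (as suggested above for scalesets), while a Delete caller passes only DELETE.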

@CecileRobertMichon (Contributor) commented:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2022
@nojnhuh (author) commented Sep 27, 2022:

Testing this a bit more, I see some less-than-ideal behavior when I manage to create an agent pool in a "Failed" state. When the create operation finishes, the appropriate error is logged and set on the AzureManagedMachinePool status as expected. Then, in the next reconciliation, the resource is found to exist with the same parameters, which is assumed to mean success, so the error status gets cleared out and everything indicates success even though the agent pool is still in a failed state. The only sign that something is wrong is a transient error log saying the underlying VMSS doesn't exist.

That seems like it might be a separate issue from what these changes attempt to address, so I'll leave it as a gap here for now and tackle it separately.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 27, 2022
@nojnhuh nojnhuh marked this pull request as ready for review September 27, 2022 19:47
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2022
@nojnhuh (author) commented Sep 27, 2022:

/retest

@nojnhuh nojnhuh force-pushed the async branch 2 times, most recently from 4a8d950 to cf3e347 Compare September 27, 2022 20:52
@CecileRobertMichon (Contributor) commented:

> Testing this a bit more, I see some less-than-ideal behavior when I manage to create an agent pool in a "Failed" state

We might need to handle agent pools the same way we're handling VMs: in addition to the operation result, which might be "failed" if the create failed altogether (in which case we do want to retry the operation the current way), there is also the case of the resource itself being in a "failed" state. That requires us to inspect the provisioning state of the resource on every reconcile and mark the object as not ready if the resource is not healthy: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/virtualmachines/virtualmachines.go#L113
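
As a rough sketch of that idea (the names and state strings below are illustrative only; the linked virtualmachines.go code is the real reference), the per-reconcile check could look like:

```go
package asyncsketch

import "fmt"

// Illustrative provisioning-state values; shown here only for the sketch.
const (
	provisioningSucceeded = "Succeeded"
	provisioningFailed    = "Failed"
)

// checkProvisioningState inspects the resource's own provisioning state on
// every reconcile, independent of any in-flight operation, so a resource that
// exists but is unhealthy still surfaces an error and marks the object not ready.
func checkProvisioningState(resourceName, provisioningState string) error {
	switch provisioningState {
	case provisioningSucceeded:
		return nil
	case provisioningFailed:
		return fmt.Errorf("agent pool %s is in a failed provisioning state", resourceName)
	default:
		return fmt.Errorf("agent pool %s is not done provisioning (state %q)", resourceName, provisioningState)
	}
}
```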

@CecileRobertMichon (Contributor) commented:

@nojnhuh also tangent for managed cluster resource health: #2673

// If the operation failed, this will allow it to be retried during the next reconciliation.
// If the resource is not found, we also reset the long-running operation state so we can attempt to create it again.
// This can happen if the resource was deleted by another process before we could get the result.
scope.DeleteLongRunningOperationState(resourceName, serviceName)

Reviewer (Contributor):

Should we also add futureType to delete to make sure that we are differentiating between delete and put futures and not blindly deleting the first one we find?

@nojnhuh (author):

Yeah that makes sense. I've made that change.
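
A small sketch of what a type-aware delete could look like; the `Future` fields and function shape are illustrative assumptions for this sketch, not the actual CAPZ method.

```go
package asyncsketch

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name, ServiceName, Type string
}

// deleteLongRunningOperationState removes only the stored future that matches
// the requested operation type, so clearing a finished DELETE does not blindly
// drop an unrelated in-flight PUT future for the same resource.
func deleteLongRunningOperationState(states []Future, name, service, futureType string) []Future {
	filtered := states[:0]
	for _, f := range states {
		if f.Name == name && f.ServiceName == service && f.Type == futureType {
			continue // drop the matching future
		}
		filtered = append(filtered, f)
	}
	return filtered
}
```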

@CecileRobertMichon (Contributor) left a comment:

lgtm pending comment on DeleteLongRunningOperationState

/cc @jackfrancis @Jont828

@jackfrancis (Contributor) commented:

cherry-pick release-1.4
cherry-pick release-1.5

@k8s-ci-robot (Contributor) commented:

@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-cluster-api-provider-azure-apidiff | ef6f663 | link | false | /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nojnhuh (author) commented Sep 28, 2022:

/cherry-pick release-1.4
/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@nojnhuh: only kubernetes-sigs org members may request cherry-picks. You can still do the cherry-pick manually.

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon (Contributor) commented:

/cherry-pick release-1.4
/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@CecileRobertMichon: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon (Contributor) left a comment:

/lgtm
/assign @jackfrancis

Great work @nojnhuh 🙌

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2022

@jackfrancis (Contributor) left a comment:

/approve

in addition to addressing pathologies, this is a nice collection of clearer, simplified flows

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit a38912d into kubernetes-sigs:main Sep 28, 2022
@k8s-infra-cherrypick-robot

@CecileRobertMichon: new pull request created: #2679

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@CecileRobertMichon: #2665 failed to apply on top of branch "release-1.4":

Applying: fix irrecoverable errors in async operations
Using index info to reconstruct a base tree...
M	azure/scope/cluster.go
M	azure/scope/machine.go
M	azure/scope/machinepool.go
M	azure/scope/machinepoolmachine.go
M	azure/scope/managedcontrolplane.go
M	azure/scope/managedmachinepool.go
M	azure/services/agentpools/client.go
M	azure/services/agentpools/mock_agentpools/agentpools_mock.go
M	azure/services/bastionhosts/client.go
M	azure/services/groups/groups.go
M	azure/services/inboundnatrules/client.go
M	azure/services/loadbalancers/client.go
M	azure/services/natgateways/client.go
M	azure/services/networkinterfaces/client.go
M	azure/services/publicips/client.go
M	azure/services/routetables/client.go
M	azure/services/scalesets/scalesets.go
M	azure/services/scalesets/scalesets_test.go
M	azure/services/scalesetvms/scalesetvms_test.go
M	azure/services/securitygroups/client.go
M	azure/services/subnets/client.go
M	azure/services/virtualnetworks/client.go
M	azure/services/virtualnetworks/virtualnetworks.go
M	azure/services/vnetpeerings/client.go
Falling back to patching base and 3-way merge...
Auto-merging azure/services/vnetpeerings/client.go
Auto-merging azure/services/virtualnetworks/virtualnetworks.go
Auto-merging azure/services/virtualnetworks/client.go
Auto-merging azure/services/subnets/client.go
Auto-merging azure/services/securitygroups/client.go
Auto-merging azure/services/scalesetvms/scalesetvms_test.go
Auto-merging azure/services/scalesets/scalesets_test.go
Auto-merging azure/services/scalesets/scalesets.go
Auto-merging azure/services/routetables/client.go
Auto-merging azure/services/publicips/client.go
Auto-merging azure/services/networkinterfaces/client.go
Auto-merging azure/services/natgateways/client.go
Auto-merging azure/services/loadbalancers/client.go
Auto-merging azure/services/inboundnatrules/client.go
Auto-merging azure/services/groups/groups.go
Auto-merging azure/services/bastionhosts/client.go
Auto-merging azure/services/agentpools/mock_agentpools/agentpools_mock.go
CONFLICT (content): Merge conflict in azure/services/agentpools/mock_agentpools/agentpools_mock.go
Auto-merging azure/services/agentpools/client.go
CONFLICT (content): Merge conflict in azure/services/agentpools/client.go
Auto-merging azure/scope/managedmachinepool.go
Auto-merging azure/scope/managedcontrolplane.go
Auto-merging azure/scope/machinepoolmachine.go
Auto-merging azure/scope/machinepool.go
Auto-merging azure/scope/machine.go
Auto-merging azure/scope/cluster.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 fix irrecoverable errors in async operations
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackfrancis (Contributor) commented:

@CecileRobertMichon @nojnhuh We can't cherry-pick this into release-1.4 because the async refactor of agentpools was not cherry-picked back. See:

#2479

I think this makes sense, as the refactor work performed in #2479 isn't a candidate for a patch release cherry-pick.

We could cherry-pick the changes into release-1.4 only for those services that are using the async pattern, but that might be more frankenstein than we want for a patch release two versions ago. Wdyt?
