fix irrecoverable errors in async operations #2665

Merged: 1 commit into kubernetes-sigs:main on Sep 28, 2022

Conversation

@nojnhuh (Contributor) commented Sep 23, 2022:

What type of PR is this?
/kind bug

What this PR does / why we need it:
Reconciliations involving asynchronous operations can fall into a loop where an eventual "Failed" result blocks future reconciliations from making any further changes to that particular resource to fix the problem.

This change sets `longRunningOperationStates` for agent pools on the corresponding AzureManagedMachinePool instead of the AzureManagedControlPlane, since changes to that resource were not being persisted. It also blocks new operations based only on the status of existing operations of the same type: in-progress PUT operations will no longer block new DELETEs, and in-progress DELETEs will no longer block new PUTs.

In cases where polling a future from the Azure API would eventually return both `isDone==true` and a non-nil error, a "failed checking if the operation was complete" message would be logged even though the error referred to the operation's ultimate failure, not to a problem polling the future itself. This change treats every `isDone==true` polling check as successful and relies on the operation's error being captured in the future's `Result`. The future is always deleted from the status once the operation is done, so a failed operation can be retried on the next reconciliation.
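
For illustration, a rough sketch of the resulting polling flow is below. The interface and helper names here are assumptions made for this sketch, not the actual CAPZ async API.

```go
// Hypothetical sketch of the polling flow described above; the types and
// method names approximate CAPZ's async helpers but are not the exact API.
package asyncsketch

import (
	"context"
	"errors"
	"fmt"
)

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name, ServiceName, Type string // Type is e.g. "PUT" or "DELETE"
}

// FutureScope is an assumed interface for reading and clearing stored futures
// from the owning object's status (e.g. the AzureManagedMachinePool).
type FutureScope interface {
	GetLongRunningOperationState(resourceName, serviceName, futureType string) *Future
	DeleteLongRunningOperationState(resourceName, serviceName, futureType string)
}

// FuturePoller is an assumed interface wrapping the Azure SDK polling calls.
type FuturePoller interface {
	IsDone(ctx context.Context, f *Future) (bool, error)
	Result(ctx context.Context, f *Future) (interface{}, error)
}

var ErrOperationNotDone = errors.New("operation still in progress")

// ProcessOngoingOperation treats a poll that reports done as successful, takes
// the operation's own error from Result, and always deletes the stored future
// once the operation is done so a failure can be retried next reconciliation.
func ProcessOngoingOperation(ctx context.Context, scope FutureScope, poller FuturePoller, resourceName, serviceName, futureType string) (interface{}, error) {
	future := scope.GetLongRunningOperationState(resourceName, serviceName, futureType)
	if future == nil {
		return nil, nil // no operation of this type in flight
	}

	done, err := poller.IsDone(ctx, future)
	if err != nil {
		// Polling itself failed (e.g. a transient API error): keep the future and retry later.
		return nil, fmt.Errorf("failed checking if the operation was done: %w", err)
	}
	if !done {
		return nil, ErrOperationNotDone // requeue and poll again next reconciliation
	}

	result, opErr := poller.Result(ctx, future)
	scope.DeleteLongRunningOperationState(resourceName, serviceName, futureType)
	return result, opErr
}
```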

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
I found these issues while working on #2664 and testing failure scenarios like creating an AKS cluster with more nodes than available public IP addresses. Incorporating that feature here will exercise these changes.

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Sep 23, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 23, 2022
@k8s-ci-robot (Contributor) commented:

Hi @nojnhuh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 23, 2022
-	}
-
-	return isDone, nil
+	return future.DoneWithContext(ctx, ac.managedclusters)

@nojnhuh (author):

I imagine this is a change we'd make to all the service clients at once. I'll only include the change to managedclusters for now and once we decide how these should look then I'll make any necessary changes to the others.

@nojnhuh (author):

Updated the rest.

@@ -281,6 +288,14 @@ func (s *ManagedMachinePoolScope) PatchCAPIMachinePoolObject(ctx context.Context
)
}

// PatchControlPlaneObject persists the AzureManagedControlPlane configuration and status.
func (s *ManagedMachinePoolScope) PatchControlPlaneObject(ctx context.Context) error {

Reviewer (Contributor):

I was wondering why you had to do this and then I saw https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2665/files#diff-5607bc7afbe8f18686e0e62796b8da6371e999006ddb598c5034dd7ef7040a47R234... I think the bug is actually that we're setting/getting the future on the wrong object above no? Probably a bad copy paste :(

@nojnhuh (author):

I wondered if that was the case. That seems reasonable though so I'll change this to set the future on the AzureManagedMachinePool.

@@ -85,7 +94,7 @@ func processOngoingOperation(ctx context.Context, scope FutureScope, client Futu
 		// This can happen if the resource was deleted by another process before we could get the result.
 		scope.DeleteLongRunningOperationState(resourceName, serviceName)
 	}
-	return result, err
+	return result, resultErr{err}

Reviewer (Contributor):

instead of having to differentiate the error here, it might be easier to call DeleteLongRunningOperationState in all cases where done is true (regardless of the error return); that way we'll attempt a new operation on the next loop if the current one finished, and we won't keep the erroneous future in Status.

@nojnhuh (author):

Sounds good, I'll try that.

_, err := processOngoingOperation(ctx, s.Scope, s.Deleter, resourceName, serviceName)
return err
// PUT operations still in progress should not be waited on if a delete was requested.
if future != nil && future.Type == infrav1.DeleteFuture {

Reviewer (Contributor):

we also want the reverse to be true (don't block a PUT on a delete operation in progress). What do you think of modifying GetLongRunningOperationState to only return futures for the right operation type so this applies to all operations?

@nojnhuh (author):

That makes sense, I'll add that.

I noticed there are a couple of places this is called outside of this file for scalesets:

Since it looks like this is looking for a future of any type, is it worth baking in some capability to search for multiple types? Or maybe just do that inline here?

future = s.Scope.GetLongRunningOperationState(s.Scope.ScaleSetSpec().Name, serviceName)

The second one is in a Delete method, so I'm guessing we'd only want delete futures here too?

future := s.Scope.GetLongRunningOperationState(vmssSpec.Name, serviceName)

Reviewer (Contributor):

The first use (L94) is in Reconcile so that's either a PUT or PATCH future, and the second use (L185) should be just DELETE futures.

Scalesets still needs to be refactored as part of #1702; after that it will also be using async.go, so I think we can just do it inline for now and check both.
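
A minimal sketch of the type-filtered lookup discussed in this thread; the `Future` fields and the variadic type filter are illustrative assumptions for this sketch, not the actual CAPZ signature.

```go
package asyncsketch

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name        string // resource name
	ServiceName string // owning service, e.g. "agentpools"
	Type        string // operation type, e.g. "PUT", "PATCH", or "DELETE"
}

// getLongRunningOperationState returns only a stored future whose operation
// type matches, so an in-progress DELETE no longer blocks a new PUT and an
// in-progress PUT no longer blocks a new DELETE for the same resource.
func getLongRunningOperationState(states []Future, name, service string, futureTypes ...string) *Future {
	for i := range states {
		f := &states[i]
		if f.Name != name || f.ServiceName != service {
			continue
		}
		for _, t := range futureTypes {
			if f.Type == t {
				return f
			}
		}
	}
	return nil
}
```

With a shape like this, a Reconcile caller could pass both PUT and PATCH inline (as suggested above for scalesets), while a Delete caller passes only DELETE.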

@CecileRobertMichon (Contributor) commented:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2022
@nojnhuh (author) commented Sep 27, 2022:

Testing this a bit more, I see some less-than-ideal behavior when I manage to create an agent pool in a "Failed" state. When the create operation finishes, the appropriate error is logged and set on the AzureManagedMachinePool status as expected. Then, in the next reconciliation, the resource is found to exist with the same parameters, which is assumed to mean success, so the error status gets cleared out and everything indicates success even though the agent pool is still in a failed state. The only sign that something is wrong is a transient error log saying the underlying VMSS doesn't exist.

That seems like it might be a separate issue from what these changes attempt to address, so I'll leave it as a gap here for now and tackle it separately.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 27, 2022
@nojnhuh nojnhuh marked this pull request as ready for review September 27, 2022 19:47
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2022
@nojnhuh (author) commented Sep 27, 2022:

/retest

@nojnhuh nojnhuh force-pushed the async branch 2 times, most recently from 4a8d950 to cf3e347 Compare September 27, 2022 20:52
@CecileRobertMichon (Contributor) commented:

> Testing this a bit more, I see some less-than-ideal behavior when I manage to create an agent pool in a "Failed" state

We might need to handle agent pools the same way we're handling VMs: in addition to the operation result, which might be "failed" if the create failed altogether (in which case we do want to retry the operation the current way), there is also the case of the resource itself being in a "failed" state. That requires us to inspect the provisioning state of the resource on every reconcile and mark the object as not ready if the resource is not healthy: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/virtualmachines/virtualmachines.go#L113
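
As a rough sketch of that idea (the names and state strings below are illustrative only; the linked virtualmachines.go code is the real reference), the per-reconcile check could look like:

```go
package asyncsketch

import "fmt"

// Illustrative provisioning-state values; shown here only for the sketch.
const (
	provisioningSucceeded = "Succeeded"
	provisioningFailed    = "Failed"
)

// checkProvisioningState inspects the resource's own provisioning state on
// every reconcile, independent of any in-flight operation, so a resource that
// exists but is unhealthy still surfaces an error and marks the object not ready.
func checkProvisioningState(resourceName, provisioningState string) error {
	switch provisioningState {
	case provisioningSucceeded:
		return nil
	case provisioningFailed:
		return fmt.Errorf("agent pool %s is in a failed provisioning state", resourceName)
	default:
		return fmt.Errorf("agent pool %s is not done provisioning (state %q)", resourceName, provisioningState)
	}
}
```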

@CecileRobertMichon (Contributor) commented:

@nojnhuh also tangent for managed cluster resource health: #2673

// If the operation failed, this will allow it to be retried during the next reconciliation.
// If the resource is not found, we also reset the long-running operation state so we can attempt to create it again.
// This can happen if the resource was deleted by another process before we could get the result.
scope.DeleteLongRunningOperationState(resourceName, serviceName)

Reviewer (Contributor):

Should we also add futureType to delete to make sure that we are differentiating between delete and put futures and not blindly deleting the first one we find?

@nojnhuh (author):

Yeah that makes sense. I've made that change.
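
A small sketch of what a type-aware delete could look like; the `Future` fields and function shape are illustrative assumptions for this sketch, not the actual CAPZ method.

```go
package asyncsketch

// Future is an illustrative stand-in for a stored long-running operation state.
type Future struct {
	Name, ServiceName, Type string
}

// deleteLongRunningOperationState removes only the stored future that matches
// the requested operation type, so clearing a finished DELETE does not blindly
// drop an unrelated in-flight PUT future for the same resource.
func deleteLongRunningOperationState(states []Future, name, service, futureType string) []Future {
	filtered := states[:0]
	for _, f := range states {
		if f.Name == name && f.ServiceName == service && f.Type == futureType {
			continue // drop the matching future
		}
		filtered = append(filtered, f)
	}
	return filtered
}
```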

@CecileRobertMichon (Contributor) left a comment:

lgtm pending comment on DeleteLongRunningOperationState

/cc @jackfrancis @Jont828

@jackfrancis (Contributor) commented:

cherry-pick release-1.4
cherry-pick release-1.5

@k8s-ci-robot (Contributor) commented:

@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-cluster-api-provider-azure-apidiff | ef6f663 | link | false | /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nojnhuh (author) commented Sep 28, 2022:

/cherry-pick release-1.4
/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@nojnhuh: only kubernetes-sigs org members may request cherry-picks. You can still do the cherry-pick manually.

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon (Contributor) commented:

/cherry-pick release-1.4
/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@CecileRobertMichon: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon (Contributor) left a comment:

/lgtm
/assign @jackfrancis

Great work @nojnhuh 🙌

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2022

@jackfrancis (Contributor) left a comment:

/approve

in addition to addressing pathologies, this is a nice collection of clearer, simplified flows

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit a38912d into kubernetes-sigs:main Sep 28, 2022
@k8s-infra-cherrypick-robot

@CecileRobertMichon: new pull request created: #2679

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@CecileRobertMichon: #2665 failed to apply on top of branch "release-1.4":

Applying: fix irrecoverable errors in async operations
Using index info to reconstruct a base tree...
M	azure/scope/cluster.go
M	azure/scope/machine.go
M	azure/scope/machinepool.go
M	azure/scope/machinepoolmachine.go
M	azure/scope/managedcontrolplane.go
M	azure/scope/managedmachinepool.go
M	azure/services/agentpools/client.go
M	azure/services/agentpools/mock_agentpools/agentpools_mock.go
M	azure/services/bastionhosts/client.go
M	azure/services/groups/groups.go
M	azure/services/inboundnatrules/client.go
M	azure/services/loadbalancers/client.go
M	azure/services/natgateways/client.go
M	azure/services/networkinterfaces/client.go
M	azure/services/publicips/client.go
M	azure/services/routetables/client.go
M	azure/services/scalesets/scalesets.go
M	azure/services/scalesets/scalesets_test.go
M	azure/services/scalesetvms/scalesetvms_test.go
M	azure/services/securitygroups/client.go
M	azure/services/subnets/client.go
M	azure/services/virtualnetworks/client.go
M	azure/services/virtualnetworks/virtualnetworks.go
M	azure/services/vnetpeerings/client.go
Falling back to patching base and 3-way merge...
Auto-merging azure/services/vnetpeerings/client.go
Auto-merging azure/services/virtualnetworks/virtualnetworks.go
Auto-merging azure/services/virtualnetworks/client.go
Auto-merging azure/services/subnets/client.go
Auto-merging azure/services/securitygroups/client.go
Auto-merging azure/services/scalesetvms/scalesetvms_test.go
Auto-merging azure/services/scalesets/scalesets_test.go
Auto-merging azure/services/scalesets/scalesets.go
Auto-merging azure/services/routetables/client.go
Auto-merging azure/services/publicips/client.go
Auto-merging azure/services/networkinterfaces/client.go
Auto-merging azure/services/natgateways/client.go
Auto-merging azure/services/loadbalancers/client.go
Auto-merging azure/services/inboundnatrules/client.go
Auto-merging azure/services/groups/groups.go
Auto-merging azure/services/bastionhosts/client.go
Auto-merging azure/services/agentpools/mock_agentpools/agentpools_mock.go
CONFLICT (content): Merge conflict in azure/services/agentpools/mock_agentpools/agentpools_mock.go
Auto-merging azure/services/agentpools/client.go
CONFLICT (content): Merge conflict in azure/services/agentpools/client.go
Auto-merging azure/scope/managedmachinepool.go
Auto-merging azure/scope/managedcontrolplane.go
Auto-merging azure/scope/machinepoolmachine.go
Auto-merging azure/scope/machinepool.go
Auto-merging azure/scope/machine.go
Auto-merging azure/scope/cluster.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 fix irrecoverable errors in async operations
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.4
/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackfrancis (Contributor) commented:

@CecileRobertMichon @nojnhuh We can't cherry-pick this into release-1.4 because the async refactor of agentpools was not cherry-picked back. See:

#2479

I think this makes sense, as the refactor work performed in #2479 isn't a candidate for a patch release cherry-pick.

We could cherry-pick the changes into release-1.4 only for those services that are using the async pattern, but that might be more frankenstein than we want for a patch release two versions ago. Wdyt?
