fix irrecoverable errors in async operations #2665
Conversation
Hi @nojnhuh. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
 }

-	return isDone, nil
+	return future.DoneWithContext(ctx, ac.managedclusters)
I imagine this is a change we'd make to all the service clients at once. I'll only include the change to managedclusters for now, and once we decide how these should look, I'll make any necessary changes to the others.
Updated the rest.
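For context, a minimal sketch of what the per-client change could look like, with a simplified `azureClient` type standing in for the real service client (in the actual code the `managedclusters` field is the SDK client, which satisfies `autorest.Sender`):

```go
package managedclusters

import (
	"context"

	"github.com/Azure/go-autorest/autorest"
	azureautorest "github.com/Azure/go-autorest/autorest/azure"
)

// azureClient is simplified here for illustration; in the real service the
// managedclusters field is the SDK client, which implements autorest.Sender.
type azureClient struct {
	managedclusters autorest.Sender
}

// IsDone delegates to the future itself: DoneWithContext reports completion and
// surfaces the operation's terminal error in one call, so there is no separate
// isDone bookkeeping that can disagree with the returned error.
func (ac *azureClient) IsDone(ctx context.Context, future azureautorest.FutureAPI) (bool, error) {
	return future.DoneWithContext(ctx, ac.managedclusters)
}
```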
azure/scope/managedmachinepool.go (Outdated)
@@ -281,6 +288,14 @@ func (s *ManagedMachinePoolScope) PatchCAPIMachinePoolObject(ctx context.Context
 	)
 }

+// PatchControlPlaneObject persists the AzureManagedControlPlane configuration and status.
+func (s *ManagedMachinePoolScope) PatchControlPlaneObject(ctx context.Context) error {
I was wondering why you had to do this, and then I saw https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2665/files#diff-5607bc7afbe8f18686e0e62796b8da6371e999006ddb598c5034dd7ef7040a47R234... I think the bug is actually that we're setting/getting the future on the wrong object above, no? Probably a bad copy-paste :(
I wondered if that was the case. That seems reasonable though, so I'll change this to set the future on the AzureManagedMachinePool.
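A rough sketch of that direction (illustrative only, assuming the scope's `InfraMachinePool` field and the `LongRunningOperationStates` status field this PR adds to the AzureManagedMachinePool):

```go
package scope

import (
	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// SetLongRunningOperationState stores an agent pool future on the
// AzureManagedMachinePool status rather than on the AzureManagedControlPlane,
// so the future is persisted when this scope patches its own object.
func (s *ManagedMachinePoolScope) SetLongRunningOperationState(future *infrav1.Future) {
	states := s.InfraMachinePool.Status.LongRunningOperationStates
	for i, f := range states {
		if f.Name == future.Name && f.ServiceName == future.ServiceName {
			states[i] = *future
			return
		}
	}
	s.InfraMachinePool.Status.LongRunningOperationStates = append(states, *future)
}
```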
azure/services/async/async.go (Outdated)
@@ -85,7 +94,7 @@ func processOngoingOperation(ctx context.Context, scope FutureScope, client Futu
 		// This can happen if the resource was deleted by another process before we could get the result.
 		scope.DeleteLongRunningOperationState(resourceName, serviceName)
 	}
-	return result, err
+	return result, resultErr{err}
Instead of having to differentiate the error here, it might be easier to call `DeleteLongRunningOperationState` in all cases where done is true (regardless of the error return). That way we'll attempt a new operation on the next loop if the current one finished, and we won't keep the erroneous future in Status.
Sounds good, I'll try that.
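Roughly what that suggestion could look like, using pared-down stand-ins for the real FutureScope/FutureHandler interfaces (the names and signatures below are simplifications, not the repository's exact ones):

```go
package async

import (
	"context"
	"fmt"
)

// future, futureScope, and futureHandler are pared-down stand-ins for the real
// types; only the calls relevant to this suggestion are included.
type future struct{ Name, ServiceName, Type string }

type futureScope interface {
	GetLongRunningOperationState(name, service string) *future
	DeleteLongRunningOperationState(name, service string)
}

type futureHandler interface {
	IsDone(ctx context.Context, f *future) (bool, error)
	Result(ctx context.Context, f *future) (interface{}, error)
}

// processOngoingOperation sketch: once the future reports done, delete it from
// Status unconditionally, so a failed operation is retried from scratch on the
// next reconcile instead of the stale future being re-polled forever.
func processOngoingOperation(ctx context.Context, scope futureScope, client futureHandler, resourceName, serviceName string) (interface{}, error) {
	f := scope.GetLongRunningOperationState(resourceName, serviceName)
	if f == nil {
		return nil, nil // nothing in flight
	}
	done, err := client.IsDone(ctx, f)
	if err != nil {
		return nil, fmt.Errorf("failed checking if the operation was complete: %w", err)
	}
	if !done {
		// In the real code this is a transient "operation not done" error that
		// causes a requeue; a plain error stands in for it here.
		return nil, fmt.Errorf("operation %s/%s is still in progress", serviceName, resourceName)
	}
	// Done, whether it succeeded or failed: drop the future before returning the result.
	scope.DeleteLongRunningOperationState(resourceName, serviceName)
	return client.Result(ctx, f)
}
```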
azure/services/async/async.go (Outdated)
	_, err := processOngoingOperation(ctx, s.Scope, s.Deleter, resourceName, serviceName)
	return err
	// PUT operations still in progress should not be waited on if a delete was requested.
	if future != nil && future.Type == infrav1.DeleteFuture {
We also want the reverse to be true (don't block a PUT on a delete operation in progress). What do you think of modifying `GetLongRunningOperationState` to only return futures for the right operation type, so this applies to all operations?
That makes sense, I'll add that.
I noticed there are a couple of places this is called outside of this file for scalesets. Since it looks like this is looking for a future of any type, is it worth baking in some capability to search for multiple types? Or maybe just do that inline here?

	future = s.Scope.GetLongRunningOperationState(s.Scope.ScaleSetSpec().Name, serviceName)

The second one is in a `Delete` method, so I'm guessing we'd only want delete futures here too?

	future := s.Scope.GetLongRunningOperationState(vmssSpec.Name, serviceName)
The first use (L94) is in Reconcile, so that's either a PUT or PATCH future, and the second use (L185) should be just DELETE futures.
Scalesets still need to be refactored as part of #1702; after that they will also be using async.go, so I think we can just do it inline for now and check both.
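A sketch of the inline version for the scale sets Reconcile, assuming `GetLongRunningOperationState` grows a `futureType` argument as discussed (the `futureGetter` interface below is a narrowed, hypothetical stand-in for the scope):

```go
package scalesets

import (
	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// futureGetter is a narrowed stand-in for the scale set scope once
// GetLongRunningOperationState is keyed by operation type.
type futureGetter interface {
	GetLongRunningOperationState(name, service, futureType string) *infrav1.Future
}

// reconcileFuture checks both operation types a Reconcile can have in flight
// (PUT for create, PATCH for update), until scale sets move onto async.go as
// part of #1702; the Delete path would ask only for infrav1.DeleteFuture.
func reconcileFuture(scope futureGetter, name, service string) *infrav1.Future {
	if f := scope.GetLongRunningOperationState(name, service, infrav1.PutFuture); f != nil {
		return f
	}
	return scope.GetLongRunningOperationState(name, service, infrav1.PatchFuture)
}
```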
/ok-to-test
Testing this a bit more, I see some less-than-ideal behavior when I manage to create an agent pool in a "Failed" state. When the create operation finishes, the appropriate error is logged and set on the AzureManagedMachinePool status as expected. Then in the next reconciliation, the resource is found to exist with the same parameters, which is assumed to be successful, so the error status gets cleared out and everything indicates success even though the agent pool is still in a failed state. The only sign something is wrong is a transient error log saying the underlying VMSS doesn't exist. That seems like it might be a separate issue from what these changes attempt to address, so I'll leave that as a gap here for now and tackle it separately.
/retest
Force-pushed from 4a8d950 to cf3e347
We might need to handle agent pools the same way we're handling VMs: in addition to the operation result, which might be "failed" if the create failed altogether (in which case we do want to retry the operation the current way), there is also the case of the resource itself being in a "failed" state. That requires us to inspect the provisioning state of the resource on every reconcile and mark the object as not ready if the resource is not healthy: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/virtualmachines/virtualmachines.go#L113
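A hypothetical sketch of that direction for agent pools, modeled on the linked VM handling (the `GetProvisioningState` helper and its signature are assumptions for illustration, not the real client API):

```go
package agentpools

import (
	"context"

	"github.com/pkg/errors"
)

// provisioningStateGetter is a hypothetical narrow interface for fetching the
// agent pool's current provisioning state from Azure.
type provisioningStateGetter interface {
	GetProvisioningState(ctx context.Context, resourceGroup, cluster, agentPool string) (string, error)
}

// checkProvisioningState mirrors the VM pattern linked above: even when the
// create/update operation itself has completed, surface an error if the
// resource ended up in a "Failed"/"Canceled" state so the object is not
// reported as ready.
func checkProvisioningState(ctx context.Context, client provisioningStateGetter, rg, cluster, pool string) error {
	state, err := client.GetProvisioningState(ctx, rg, cluster, pool)
	if err != nil {
		return errors.Wrap(err, "failed to get agent pool provisioning state")
	}
	switch state {
	case "Succeeded":
		return nil
	case "Failed", "Canceled":
		return errors.Errorf("agent pool %s is in provisioning state %q", pool, state)
	default:
		// Creating/Updating/Deleting etc.: still converging, let the caller requeue.
		return errors.Errorf("agent pool %s is not done provisioning (state %q)", pool, state)
	}
}
```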
azure/services/async/async.go (Outdated)
	// If the operation failed, this will allow it to be retried during the next reconciliation.
	// If the resource is not found, we also reset the long-running operation state so we can attempt to create it again.
	// This can happen if the resource was deleted by another process before we could get the result.
	scope.DeleteLongRunningOperationState(resourceName, serviceName)
Should we also add `futureType` to delete to make sure that we are differentiating between delete and put futures, and not blindly deleting the first one we find?
Yeah that makes sense. I've made that change.
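An illustrative sketch of the type-aware delete (not the repository's exact code): only the future matching name, service, and type is removed.

```go
package scope

import (
	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// deleteFuture removes only the future matching name, service, and futureType,
// so e.g. a finished DELETE never removes a PUT future for the same resource.
func deleteFuture(states *[]infrav1.Future, name, service, futureType string) {
	for i, f := range *states {
		if f.Name == name && f.ServiceName == service && f.Type == futureType {
			*states = append((*states)[:i], (*states)[i+1:]...)
			return
		}
	}
}
```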
lgtm pending comment on DeleteLongRunningOperationState
/cc @jackfrancis @Jont828
cherry-pick release-1.4

Reconciliations involving asynchronous operations can fall into a loop where an eventual "Failed" result can block future reconciliations from making any further changes to that particular resource to fix the problem.

This change sets `longRunningOperationStates` for agent pools on the corresponding AzureManagedMachinePool instead of the AzureManagedControlPlane, since changes to that resource were not being persisted. It also only blocks starting new operations on the status of existing operations of the same type. In-progress PUT operations will no longer block new DELETEs and in-progress DELETEs will not block in-progress PUTs.

In cases where polling a future from the Azure API would eventually return both `isDone==true` and a non-nil error, a "failed checking if the operation was complete" message would be logged and the error would refer to the ultimate failure unrelated to polling the future itself. This change treats all `isDone==true` polling checks as successful and relies on the operation's error to be captured in the future's `Result`. The future will always be deleted from the status when the operation is done so it can be retried the next reconciliation if it failed.
@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cherry-pick release-1.4
@nojnhuh: only kubernetes-sigs org members may request cherry-picks. You can still do the cherry-pick manually. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-1.4
@CecileRobertMichon: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
/assign @jackfrancis
Great work @nojnhuh 🙌
/approve
In addition to addressing pathologies, this is a nice collection of clearer, simplified flows.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel.
@CecileRobertMichon: new pull request created: #2679 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@CecileRobertMichon: #2665 failed to apply on top of branch "release-1.4":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@CecileRobertMichon @nojnhuh We can't cherry-pick this into release-1.4 cleanly. I think this makes sense, as the refactor work performed in #2479 isn't a candidate for a patch release cherry-pick. We could cherry-pick the changes into release-1.4 only for those services that are already using the async flow in that branch.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Reconciliations involving asynchronous operations can fall into a loop
where an eventual "Failed" result can block future reconciliations from
making any further changes to that particular resource to fix the
problem.
This change sets `longRunningOperationStates` for agent pools on the
corresponding AzureManagedMachinePool instead of the
AzureManagedControlPlane, since changes to that resource were not being
persisted. It also only blocks starting new operations on the status of
existing operations of the same type. In-progress PUT operations will no
longer block new DELETEs and in-progress DELETEs will not block
in-progress PUTs.
In cases where polling a future from the Azure API would eventually
return both `isDone==true` and a non-nil error, a "failed checking if
the operation was complete" message would be logged and the error would
refer to the ultimate failure unrelated to polling the future itself.
This change treats all `isDone==true` polling checks as successful and
relies on the operation's error to be captured in the future's `Result`.
The future will always be deleted from the status when the operation is
done so it can be retried the next reconciliation if it failed.
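To make the same-type scoping concrete, a compressed illustration (simplified, hypothetical interface rather than the actual service code) of how the create and delete paths each consult only their own kind of future:

```go
package async

import (
	infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// typedFutureScope is a simplified stand-in for the scope after this change:
// future lookups are keyed by operation type as well as resource and service.
type typedFutureScope interface {
	GetLongRunningOperationState(name, service, futureType string) *infrav1.Future
}

// ongoingPut is what a create/update path resumes; ongoingDelete is what a
// delete path resumes. Because each asks only for its own type, an unfinished
// PUT no longer blocks a DELETE and vice versa.
func ongoingPut(scope typedFutureScope, name, service string) *infrav1.Future {
	return scope.GetLongRunningOperationState(name, service, infrav1.PutFuture)
}

func ongoingDelete(scope typedFutureScope, name, service string) *infrav1.Future {
	return scope.GetLongRunningOperationState(name, service, infrav1.DeleteFuture)
}
```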
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
I found these issues while working on #2664 and testing failure scenarios like creating an AKS cluster with more nodes than available public IP addresses. Incorporating that feature here will exercise these changes.
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: