DRA: fix scheduler/resource claim controller race #124931
Conversation
// JSON patch can only append to a non-empty array. An empty reservedFor gets
// omitted and even if it didn't, it would be null and not an empty array.
// Therefore we have to test and add if it's currently empty.
reservedForEntry := fmt.Sprintf(`{"resource": "pods", "name": %q, "uid": %q}`, pod.Name, pod.UID)
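To make the quirk described in that comment concrete, here is a minimal sketch of the two patch variants it implies. The paths follow RFC 6902 JSON Patch, but the surrounding names and the exact patch the commit built are assumptions, not the PR's code:

```go
// Sketch only: "add" on /status/reservedFor/- appends to an existing
// array but fails when the field is omitted or null, so the variant
// has to be chosen based on the claim's current state.
var patch string
if len(claim.Status.ReservedFor) == 0 {
	// Field is absent (omitempty) or null: create the array outright.
	patch = fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor", "value": [%s]}]`, reservedForEntry)
} else {
	// Field exists with entries: "-" appends after the last element.
	patch = fmt.Sprintf(`[{"op": "add", "path": "/status/reservedFor/-", "value": %s}]`, reservedForEntry)
}
```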
While it was fun playing around with JSON patch, I think this is taking it too far...
A simpler, more obvious approach would be to add a retry loop which uses normal Update calls and gets the latest claim on a conflict.
Implemented, ready for review again.
Force-pushed from 843dca1 to 434e786.
/retest
/triage accepted
/assign @kerthcet Can you help review?
/test pull-kubernetes-node-e2e-containerd-1-7-dra Testing kubernetes/test-infra#32774
/test pull-kubernetes-node-e2e-containerd-1-7-dra Testing another PR: kubernetes/test-infra#32776
/test pull-kubernetes-node-e2e-containerd-1-7-dra
Will finish the review tomorrow.
claim.Finalizers = append(claim.Finalizers, resourcev1alpha2.Finalizer)
updatedClaim, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
if err != nil {
	if apierrors.IsConflict(err) {
Will retryOnConflict help here? Then we can remove the for loop; it seems risky to me.
Thanks for the hint. It makes the code a bit simpler and adds exponential backoff.
I'm taking extra care to not do a GET in the first iteration because most of the time, the claim will be recent enough.
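For readers following along, a minimal sketch of that pattern, assuming k8s.io/client-go/util/retry; the helper name and surrounding structure are hypothetical, not the PR's exact implementation:

```go
package dra

import (
	"context"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateClaimWithRetry is hypothetical. RetryOnConflict re-runs the
// callback with exponential backoff whenever it returns a conflict
// error. The claim is only re-fetched after a conflict, so the common
// case costs no extra GET. Callers would pass a deep copy of a claim
// taken from the informer cache.
func updateClaimWithRetry(ctx context.Context, clientset kubernetes.Interface, claim *resourcev1alpha2.ResourceClaim, mutate func(*resourcev1alpha2.ResourceClaim)) error {
	refresh := false
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if refresh {
			// The previous attempt hit a conflict: get the latest claim.
			latest, err := clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Get(ctx, claim.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
			claim = latest
		}
		refresh = true
		mutate(claim)
		_, err := clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
		return err
	})
}
```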
Pushed.
I verified again that the retry loop works: when running _output/bin/ginkgo -v --focus="with translated parameters on single node.*supports sharing a claim sequentially" ./test/e2e, the scheduler had to retry once, as indicated by the log message when it gets the newer claim.
}

// The finalizer needs to be added in a normal update.
// If we were interrupted in the past, it might already be set and we simply continue.
Should we get the latest claim to check whether the finalizer has been removed?
If it has been removed and the claim instance is stale, then the UpdateStatus below will fail and we retry with a fresh copy of the claim.
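A hedged sketch of why no extra GET is needed; the function name and structure are assumptions, not the PR's exact code, and it would run inside the retry callback sketched earlier:

```go
package dra

import (
	"context"
	"slices"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// allocateClaim is hypothetical; a conflict returned here makes the
// surrounding RetryOnConflict loop refresh the claim and try again.
func allocateClaim(ctx context.Context, clientset kubernetes.Interface, claim *resourcev1alpha2.ResourceClaim, allocation *resourcev1alpha2.AllocationResult) error {
	// Add the finalizer if missing; idempotent across retries.
	if !slices.Contains(claim.Finalizers, resourcev1alpha2.Finalizer) {
		claim.Finalizers = append(claim.Finalizers, resourcev1alpha2.Finalizer)
		updated, err := clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
		if err != nil {
			return err
		}
		claim = updated
	}
	// If the controller removed the finalizer in the meantime and this
	// claim object is stale, UpdateStatus fails with a conflict because
	// the request carries the old ResourceVersion.
	claim.Status.Allocation = allocation
	_, err := clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```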
Force-pushed from 434e786 to 952e6aa.
Force-pushed from 952e6aa to caf544a.
There was a race caused by having to update claim finalizer and status in two different operations:
- Resource claim controller removes allocation, does not yet get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when run stand-alone. When running in parallel (as in the CI), the bad outcome of the race did not occur.

The fix is to check that the finalizer is still set when adding the allocation. The apiserver doesn't check that because it doesn't know which finalizer goes with the allocation result. It could check for "some finalizer", but that is not guaranteed to be correct (could be some unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the complications, having the ability to add multiple pods concurrently to ReservedFor seems worth it (avoids expensive rescheduling or a local retry loop). The resource claim controller doesn't need this; it can do a normal update, which implicitly checks ResourceVersion.
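To make "check that the finalizer is still set" concrete, a hedged sketch of such a patch follows. The RFC 6902 "test" op is real, but the paths, the fixed array index, and allocationJSON (assumed to hold the serialized allocation result) are illustrative guesses, not the commit's exact patch:

```go
// Sketch only (assumed imports: fmt, k8s.io/apimachinery/pkg/types).
// The "test" op aborts the whole patch server-side unless the DRA
// finalizer is still present, closing the race window. Testing a fixed
// index only works while the finalizer's position is known, which is
// one of the complications mentioned above.
patch := fmt.Sprintf(`[
  {"op": "test", "path": "/metadata/finalizers/0", "value": %q},
  {"op": "add", "path": "/status/allocation", "value": %s}
]`, resourcev1alpha2.Finalizer, allocationJSON)
_, err := clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Patch(
	ctx, claim.Name, types.JSONPatchType, []byte(patch), metav1.PatchOptions{}, "status")
```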
The JSON patch approach works, but it is complex. A retry loop is easier to understand (detect conflict, get new claim, try again). There is one additional API call (the get), but in practice this scenario is unlikely.
Force-pushed from caf544a to 4bddebc.
/lgtm
The two commits seem mutually exclusive; will you squash them?
LGTM label has been added.
Git tree hash: 8a6a4b040255d3614b0addbe44eaafa77377c41f
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I'd prefer to keep the two commits: the first one can serve as a reference for how this would have looked with JSON patching.
/hold cancel
What type of PR is this?
/kind bug
What this PR does / why we need it:
There was a race caused by having to update claim finalizer and status in two different operations:
- Resource claim controller removes allocation, does not yet get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when run stand-alone. When running in parallel (as in the CI), the bad outcome of the race did not occur.
Special notes for your reviewer:
The fix is to check that the finalizer is still set when adding the allocation. This can be done with a complicated JSON patch (see first commit, but only if you are really, really curious!), but a local retry loop with Update calls is simpler.
The resource claim controller doesn't need this; it can do a normal update, which implicitly checks ResourceVersion.
Does this PR introduce a user-facing change?